Dataset: WikiLinkGraphs' ResolvedRedirects
This dataset contains Wikipedia snapshots with resolved redirects: for each year from 2001 to 2018 (inclusive), the list of pages (each at a specific revision) of Wikipedia on March 1st, with redirects indicating which page each redirect pointed to at that moment. It has been produced by processing Wikimedia's history dumps for the languages de, en, es, fr, it, nl, pl, ru and sv.
- rawwikilinks
- rawwikilinks-snapshots
- revisionlist
- snapshots
- redirects
- resolved-redirects (this one)
- wikilinkgraphs
Description
page_id
: an integer, the page identifier used by MediaWiki. This identifier is not necessarily progressive; there may be gaps in the enumeration.

page_title
: a string, the title of the Wikipedia article.

revision_id
: an integer, the identifier of a revision of the article, also called a permanent id, because it can be used to link to that specific revision of a Wikipedia article.

revision_parent_id
: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki.

revision_timestamp
: the date and time of the edit that generated the revision under consideration.

redirect_id
: an integer, the identifier (as used by MediaWiki) of the page the redirect points to. This identifier is not necessarily progressive; there may be gaps in the enumeration.

redirect_title
: a string, the title of the Wikipedia article the redirect points to.

redirect_revision_id
: an integer, the identifier of a revision of the target article, also called a permanent id, because it can be used to link to that specific revision.

redirect_revision_parent_id
: an integer, the identifier of the parent revision of the target article's revision. In general, each revision has a unique parent; the oldest articles, however, present non-linear edit histories (see revision_parent_id above).

redirect_revision_timestamp
: the date and time of the edit that generated the target revision.
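The ten fields above can be modeled with a small typed record. This is only an illustrative sketch: the column order is assumed to follow the order of the description, and the sample values in the parser are hypothetical.

```python
from typing import List, NamedTuple

class ResolvedRedirect(NamedTuple):
    """One row of a resolved-redirects snapshot (fields as described above)."""
    page_id: int
    page_title: str
    revision_id: int
    revision_parent_id: int
    revision_timestamp: str  # e.g. an ISO 8601 timestamp
    redirect_id: int
    redirect_title: str
    redirect_revision_id: int
    redirect_revision_parent_id: int
    redirect_revision_timestamp: str

def parse_row(fields: List[str]) -> ResolvedRedirect:
    """Convert one CSV row (a list of strings) into a typed record.

    Assumes the columns appear in the order given in the Description
    section; adjust the indices if the actual files differ.
    """
    return ResolvedRedirect(
        int(fields[0]), fields[1], int(fields[2]), int(fields[3]), fields[4],
        int(fields[5]), fields[6], int(fields[7]), int(fields[8]), fields[9],
    )
```

The integer conversions will raise `ValueError` on malformed rows, which makes silent data corruption easier to catch when scanning a full snapshot.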
Sample
Extract of the file enwiki.snapshot.resolve_redirect.2018-03-01.csv.gz in enwiki/20180301/:
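A snapshot file like the one above can be streamed row by row without decompressing it to disk. This is a minimal sketch: it assumes a plain comma-separated file; adjust the delimiter (and skip a header row, if one is present) to match the actual files.

```python
import csv
import gzip

def read_snapshot(path):
    """Yield rows from a gzipped CSV snapshot without loading it in memory.

    Assumes comma-separated values encoded as UTF-8; tweak the
    csv.reader arguments if the real files use a different dialect.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        yield from csv.reader(fh)
```

Because the function is a generator, even the largest file (enwiki, 451MB compressed) can be scanned with constant memory, e.g. `for row in read_snapshot("enwiki.snapshot.resolve_redirect.2018-03-01.csv.gz"): ...`.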
Download
This dataset can be downloaded in two different ways:
HTTP (preferred method)
You can find the dataset at: cricca.disi.unitn.it/datasets/wikilinkgraphs-resolved-redirects.
You can download the dataset with the following command:
dataset='wikilinkgraphs-resolved-redirects'
adate=20180301
langs=( 'dewiki' 'enwiki' 'eswiki' 'frwiki' 'itwiki' 'nlwiki' 'plwiki' 'ruwiki' 'svwiki' )
for lang in "${langs[@]}"; do
    lynx -dump -listonly \
        "http://cricca.disi.unitn.it/datasets/${dataset}/${lang}/${adate}/" |
        awk '{print $2}' |
        grep -E "^http://cricca\.disi\.unitn\.it/datasets/${dataset}/" |
        xargs -L1 -I{} wget -R '\?C=' {}
done
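The per-language directory URLs traversed by the shell loop above can also be generated in Python (the dataset's own tooling language). This sketch only builds the URLs; listing and fetching the files inside each directory is left to an HTTP client of your choice.

```python
DATASET = "wikilinkgraphs-resolved-redirects"
ADATE = "20180301"
LANGS = ["dewiki", "enwiki", "eswiki", "frwiki", "itwiki",
         "nlwiki", "plwiki", "ruwiki", "svwiki"]

def dataset_urls(dataset=DATASET, adate=ADATE, langs=LANGS):
    """Build the per-language snapshot directory URLs used by the shell loop."""
    base = "http://cricca.disi.unitn.it/datasets"
    return [f"{base}/{dataset}/{lang}/{adate}/" for lang in langs]
```

Keeping the language list in one place makes it easy to download a subset, e.g. `dataset_urls(langs=["itwiki"])` for the Italian snapshots only.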
dat (experimental)
(coming soon)
Code
This dataset has been processed with Python, see the wikidump
project and the other repositories in the WikiLinkGraphs organization.
Authors
This dataset has been produced by:
- Cristian Consonni – DISI, University of Trento, Trento, Italy.
- David Laniado – Eurecat, Centre Tecnològic de Catalunya, Barcelona, Spain.
- Alberto Montresor – DISI, University of Trento, Trento, Italy.
This dataset has been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement no. 780643.
License
This dataset is released under Creative Commons Attribution 4.0 International.
The original dump is released under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License, see the legal info.
How to cite
If you use this dataset please cite the main WikiLinkGraphs paper:
Consonni, Cristian, David Laniado, and Alberto Montresor. "WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks." In Proceedings of the International AAAI Conference on Web and Social Media, vol. 13, 2019.
FAQs
What is the total size of the dataset, the number of files and the largest file in the dataset?
The total dataset size is 8.4GB, divided among the languages like this:
- 924M dewiki/
- 3.5G enwiki/
- 662M eswiki/
- 844M frwiki/
- 483M itwiki/
- 512M nlwiki/
- 416M plwiki/
- 659M ruwiki/
- 528M svwiki/
The dataset contains 162 files. The average file size is 56.1MB and the largest file is 451MB (enwiki’s latest snapshot on 2018-03-01).
How are files organized?
Files are organized into directories, one per language, each containing a subdirectory per snapshot date; for example, enwiki/20180301/ contains enwiki.snapshot.resolve_redirect.2018-03-01.csv.gz.
Who produced this dataset and why?
- This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
- Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
- This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement no. 780643.
Questions?
For further info send me an e-mail.