Cristian Consonni

Ph.D. in Computer Science, free software activist, physicist and storyteller

Dataset: WikiLinkGraphs' RawWikilinks Snapshots

This dataset contains wikilink snapshots, i.e. the links between Wikipedia articles, extracted by processing each revision of each Wikipedia article (namespace 0) from Wikimedia’s history dumps for the languages de, en, es, fr, it, nl, pl, ru, sv. One snapshot was taken on March 1st of each year from 2001 to 2018 (inclusive).

WikiLinkGraphs. This dataset is part of the WikiLinkGraphs family, a collection of datasets extracted from Wikipedia history dumps. See the other datasets:

Description

  • page_id: an integer, the page identifier used by MediaWiki. Identifiers are not necessarily consecutive: there may be gaps in the enumeration;
  • page_title: a string, the title of the Wikipedia article;
  • revision_id: an integer, the identifier of a revision of the article, also called a permanent id, because it can be used to link to that specific revision of a Wikipedia article;
  • revision_parent_id: an integer, the identifier of the parent revision. In general, each revision has a unique parent; going back in time before 2002, however, the oldest articles present non-linear edit histories. This is a consequence of the import process from the software previously used to power Wikipedia, UseModWiki, to MediaWiki;
  • revision_timestamp: date and time of the edit that generated the revision under consideration;
  • user_type: a string ("registered" or "anonymous"), specifying whether the user making the revision was logged-in or not;
  • user_username: a string, the username of the user that made the edit that generated the revision under consideration;
  • user_id: an integer, the identifier of the user that made the edit that generated the revision under consideration;
  • revision_minor: a boolean flag, with value 1 if the edit that generated the current revision was marked as minor by the user, 0 otherwise;
  • wikilink.link: a string, the page linked by the wikilink;
  • wikilink.anchor: a string, the anchor text of the wikilink;
  • wikilink.section_name: the name of the section wherein the wikilink appears;
  • wikilink.section_level: the level of the section wherein the wikilink appears;
  • wikilink.section_number: the number of the section wherein the wikilink appears;
  • wikilink.is_active: a boolean, indicating whether the page pointed to by the link existed at the time of the snapshot.
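
The field list above can be turned into a typed record. The following is only a minimal sketch: the names WikilinkRecord and parse_record are hypothetical and not part of the dataset tooling, and the handling of empty cells (mapped to None) and of padded section names (stripped) are assumptions. The sample header further down also contains a wikilink.tosection column, which is kept here as a raw string, and spells the last column wikinlink.is_active, so both spellings are accepted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


def _to_int(value: str) -> Optional[int]:
    """Convert a CSV cell to int; empty cells become None (assumed convention)."""
    return int(value) if value.strip() else None


@dataclass(frozen=True)
class WikilinkRecord:  # hypothetical name, for illustration only
    page_id: int
    page_title: str
    revision_id: int
    revision_parent_id: Optional[int]
    revision_timestamp: datetime
    user_type: str
    user_username: str
    user_id: Optional[int]
    revision_minor: bool
    link: str
    tosection: str
    anchor: str
    section_name: str
    section_level: Optional[int]
    section_number: Optional[int]
    is_active: bool


def parse_record(row: Dict[str, str]) -> WikilinkRecord:
    """Build a typed record from one CSV row given as a dict of strings."""
    # Accept both the documented column name and the spelling in the sample header.
    active = row.get("wikilink.is_active", row.get("wikinlink.is_active", ""))
    # Timestamps end in "Z", i.e. they are expressed in UTC.
    stamp = datetime.strptime(row["revision_timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    return WikilinkRecord(
        page_id=int(row["page_id"]),
        page_title=row["page_title"],
        revision_id=int(row["revision_id"]),
        revision_parent_id=_to_int(row["revision_parent_id"]),
        revision_timestamp=stamp.replace(tzinfo=timezone.utc),
        user_type=row["user_type"],
        user_username=row["user_username"],
        user_id=_to_int(row["user_id"]),
        revision_minor=row["revision_minor"] == "1",
        link=row["wikilink.link"],
        tosection=row.get("wikilink.tosection", ""),
        anchor=row["wikilink.anchor"],
        # Section names in the sample are padded with spaces; stripping is a choice.
        section_name=row["wikilink.section_name"].strip(),
        section_level=_to_int(row["wikilink.section_level"]),
        section_number=_to_int(row["wikilink.section_number"]),
        is_active=active == "1",
    )
```

Rows in the expected dict form can be obtained with Python's csv.DictReader.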

Sample

Extract of the file enwiki.link_snapshot.2018-03-01.csv.gz in enwiki/20180301/:

page_id,page_title,revision_id,revision_parent_id,revision_timestamp,user_type,user_username,user_id,revision_minor,wikilink.link,wikilink.tosection,wikilink.anchor,wikilink.section_name,wikilink.section_level,wikilink.section_number,wikinlink.is_active
10,AccessibleComputing,767284433,631144794,2017-02-25T00:30:28Z,registered,Godsy,23257138,0,Computer accessibility,,Computer accessibility,---~--- incipit ---~---,0,0,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,6 February 1934 crisis,,February 1934 riots, Conflicts with European fascist regimes ,3,8,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,abstentionism,,abstentionism, First International and the Paris Commune ,3,4,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,Adolf Brand,,Adolf Brand, Individualist anarchism ,3,18,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,Adolf Hitler,,Hitler, Spanish Revolution ,3,9,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,Adolphe Thiers,,Versailles, Propaganda of the deed and illegalism ,3,6,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,affinity group,,affinity group, Contemporary anarchism ,3,11,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,affinity group,,affinity group, Post-classical anarchist schools of thought ,3,19,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,Age of Enlightenment,,Enlightenment, Origins ,3,3,1
12,Anarchism,828135433,827702904,2018-02-28T19:35:35Z,registered,Hydrargyrum,291919,0,agriculture,,agrarian, Spanish Revolution ,3,9,1
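
A snapshot in this format can be streamed row by row with the Python standard library. This is a minimal sketch, assuming the files are UTF-8 encoded and follow the default comma-separated dialect of the csv module; the function names iter_rows and iter_snapshot are hypothetical. The SAMPLE string reproduces the header and the first row of the extract above.

```python
import csv
import gzip
import io
from typing import Dict, Iterator, TextIO

# Header and first data row copied from the extract above.
SAMPLE = (
    "page_id,page_title,revision_id,revision_parent_id,revision_timestamp,"
    "user_type,user_username,user_id,revision_minor,wikilink.link,"
    "wikilink.tosection,wikilink.anchor,wikilink.section_name,"
    "wikilink.section_level,wikilink.section_number,wikinlink.is_active\n"
    "10,AccessibleComputing,767284433,631144794,2017-02-25T00:30:28Z,"
    "registered,Godsy,23257138,0,Computer accessibility,,"
    "Computer accessibility,---~--- incipit ---~---,0,0,1\n"
)


def iter_rows(text_stream: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per wikilink, keyed by the column names of the header line."""
    yield from csv.DictReader(text_stream)


def iter_snapshot(path: str) -> Iterator[Dict[str, str]]:
    """Stream rows from a gzipped snapshot without decompressing it to disk."""
    # newline="" lets the csv module handle line endings itself.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as stream:
        yield from iter_rows(stream)


if __name__ == "__main__":
    for row in iter_rows(io.StringIO(SAMPLE)):
        print(row["page_title"], "->", row["wikilink.link"])
```

Because the rows are yielded lazily, the same loop works on the largest files without loading them into memory, e.g. iter_snapshot("enwiki.link_snapshot.2018-03-01.csv.gz").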

Download

This dataset can be downloaded in two different ways:

HTTP (preferred method)

You can find the dataset on: cricca.disi.unitn.it/datasets/wikilinkgraphs-rawwikilinks-snapshots.

You can download the dataset with the following command:

dataset='wikilinkgraphs-rawwikilinks-snapshots'
adate=20180301
langs=( 'dewiki' 'enwiki' 'eswiki' 'frwiki' 'itwiki' 'nlwiki' 'plwiki' 'ruwiki' 'svwiki' )
for lang in "${langs[@]}"; do
  # List the links in the directory index, keep only the dataset URLs,
  # and fetch them one at a time (-I{} already runs wget once per line).
  lynx \
    -dump \
    -listonly \
      "http://cricca.disi.unitn.it/datasets/${dataset}/${lang}/${adate}/" | \
  awk '{print $2}' | \
  grep -E "^http://cricca\.disi\.unitn\.it/datasets/${dataset}/" | \
  xargs -I{} wget -R '\?C=' {}
done
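
If lynx is not available, the file URLs can also be built directly. The sketch below assumes that the server layout mirrors the directory tree shown in the FAQs (language, then 20180301, then one file per year) and that every language has a file for every year; the function names snapshot_urls and download are hypothetical.

```python
import os
import shutil
import urllib.request
from typing import List, Sequence

BASE_URL = "http://cricca.disi.unitn.it/datasets/wikilinkgraphs-rawwikilinks-snapshots"
DUMP_DATE = "20180301"
LANGS = ("dewiki", "enwiki", "eswiki", "frwiki", "itwiki",
         "nlwiki", "plwiki", "ruwiki", "svwiki")


def snapshot_urls(langs: Sequence[str] = LANGS,
                  first_year: int = 2001,
                  last_year: int = 2018) -> List[str]:
    """Return the expected URL of every snapshot (assumed layout, see above)."""
    return [
        f"{BASE_URL}/{lang}/{DUMP_DATE}/{lang}.link_snapshot.{year}-03-01.csv.gz"
        for lang in langs
        for year in range(first_year, last_year + 1)
    ]


def download(url: str, dest_dir: str = ".") -> str:
    """Download one file into dest_dir and return the local path."""
    dest = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all in memory
    return dest


if __name__ == "__main__":
    # Example: fetch only the Italian snapshots.
    for url in snapshot_urls(langs=["itwiki"]):
        print("downloading", url)
        download(url)
```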

dat (experimental)

(coming soon)

Code

This dataset has been processed with Python; see the wikidump project and the other repositories in the WikiLinkGraphs organization.

Authors

This dataset has been produced by:

  • Cristian Consonni – DISI, University of Trento, Trento, Italy.
  • David Laniado – Eurecat, Centre Tecnològic de Catalunya, Barcelona, Spain.
  • Alberto Montresor – DISI, University of Trento, Trento, Italy.

This dataset has been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.

License

This dataset is released under Creative Commons Attribution 4.0 International.

The original dump is released under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License, see the legal info.

How to cite

If you use this dataset please cite the main WikiLinkGraphs paper:

Consonni, Cristian, David Laniado, and Alberto Montresor. “WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks.”

FAQs

What is the total size of the dataset, the number of files and the largest file in the dataset?

For each of the 9 languages you will find 18 gzipped files, one for each snapshot from 2001 to 2018 (inclusive). The total dataset size is 79GB, divided among the languages like this:

11G  dewiki/
29G  enwiki/
5.9G eswiki/
9.2G frwiki/
5.7G itwiki/
4.1G nlwiki/
4.8G plwiki/
6.8G ruwiki/
3.9G svwiki/

The dataset contains 162 files. The average file size is 0.5GB and the largest file is ~3.8GB (enwiki’s snapshot from 2018-03-01).
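
The figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in this answer: the per-language sizes are rounded, so their sum need not match the stated total exactly.

```python
# Per-language sizes in GB, as listed above (rounded values).
sizes_gb = {
    "dewiki": 11, "enwiki": 29, "eswiki": 5.9, "frwiki": 9.2, "itwiki": 5.7,
    "nlwiki": 4.1, "plwiki": 4.8, "ruwiki": 6.8, "svwiki": 3.9,
}

languages = len(sizes_gb)
snapshots_per_language = len(range(2001, 2018 + 1))  # one per year, 2001..2018
total_files = languages * snapshots_per_language
total_gb = sum(sizes_gb.values())

print(total_files)                         # 162
print(round(total_gb, 1))                  # 80.4
print(round(total_gb / total_files, 2))    # 0.5
```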

How are files organized?

Files are organized in directories, one per language. Each language directory contains a single 20180301 subdirectory holding 18 files, one for each year from 2001 to 2018 (inclusive):

.
├── dewiki
│   └── 20180301
│       ├── dewiki.link_snapshot.2001-03-01.csv.gz
│       ├── dewiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── dewiki.link_snapshot.2018-03-01.csv.gz
├── enwiki
│   └── 20180301
│       ├── enwiki.link_snapshot.2001-03-01.csv.gz
│       ├── enwiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── enwiki.link_snapshot.2018-03-01.csv.gz
├── eswiki
│   └── 20180301
│       ├── eswiki.link_snapshot.2001-03-01.csv.gz
│       ├── eswiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── eswiki.link_snapshot.2018-03-01.csv.gz
├── frwiki
│   └── 20180301
│       ├── frwiki.link_snapshot.2001-03-01.csv.gz
│       ├── frwiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── frwiki.link_snapshot.2018-03-01.csv.gz
├── itwiki
│   └── 20180301
│       ├── itwiki.link_snapshot.2001-03-01.csv.gz
│       ├── itwiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── itwiki.link_snapshot.2018-03-01.csv.gz
├── nlwiki
│   └── 20180301
│       ├── nlwiki.link_snapshot.2001-03-01.csv.gz
│       ├── nlwiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── nlwiki.link_snapshot.2018-03-01.csv.gz
├── plwiki
│   └── 20180301
│       ├── plwiki.link_snapshot.2001-03-01.csv.gz
│       ├── plwiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── plwiki.link_snapshot.2018-03-01.csv.gz
├── ruwiki
│   └── 20180301
│       ├── ruwiki.link_snapshot.2001-03-01.csv.gz
│       ├── ruwiki.link_snapshot.2002-03-01.csv.gz
│       ├── ...
│       └── ruwiki.link_snapshot.2018-03-01.csv.gz
└── svwiki
    └── 20180301
        ├── svwiki.link_snapshot.2001-03-01.csv.gz
        ├── svwiki.link_snapshot.2002-03-01.csv.gz
        ├── ...
        └── svwiki.link_snapshot.2018-03-01.csv.gz
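
Given this layout, a local copy can be checked for completeness. The following is a minimal sketch that assumes the downloaded files were stored in the same directory structure as shown above; the function names snapshots_by_language and missing_years are hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def snapshots_by_language(root: str) -> Dict[str, List[str]]:
    """Group the snapshot file names found under root by language directory."""
    found = defaultdict(list)
    # Layout: <root>/<lang>/<dump date>/<lang>.link_snapshot.<date>.csv.gz
    for path in Path(root).glob("*/*/*.link_snapshot.*.csv.gz"):
        found[path.parent.parent.name].append(path.name)
    return {lang: sorted(names) for lang, names in found.items()}


def missing_years(root: str, lang: str, dump_date: str = "20180301",
                  years: Iterable[int] = range(2001, 2019)) -> List[int]:
    """Return the years whose snapshot file is absent for the given language."""
    folder = Path(root) / lang / dump_date
    return [
        year for year in years
        if not (folder / f"{lang}.link_snapshot.{year}-03-01.csv.gz").exists()
    ]


if __name__ == "__main__":
    for lang, names in sorted(snapshots_by_language(".").items()):
        print(lang, len(names), "files, missing years:", missing_years(".", lang))
```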

Who produced this dataset and why?

  • This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
  • Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
  • This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.

Questions?

For further info send me an e-mail.