Cristian Consonni

Ph.D. in Computer Science, free software activist, physicist and storyteller

Dataset: WikiLinkGraphs - A complete, longitudinal and multi-language dataset of the Wikipedia link networks

WikiLinkGraphs is a dataset of the network of internal Wikipedia links for 9 language editions: de, en, es, fr, it, nl, pl, ru, sv. The dataset spans more than 17 years, from the creation of Wikipedia in 2001 to March 2018. It has been produced by processing Wikimedia’s history dumps.

This dataset is part of the WikiLinkGraphs family, a collection of datasets extracted from Wikipedia history dumps.

Description

  • page_id_from: an integer, the page identifier (used by MediaWiki) of the source article. Identifiers are not necessarily consecutive; there may be gaps in the enumeration;
  • page_title_from: a string, the title of the source Wikipedia article;
  • page_id_to: an integer, the page identifier (used by MediaWiki) of the target page. Identifiers are not necessarily consecutive; there may be gaps in the enumeration;
  • page_title_to: a string, the title of the target Wikipedia article.
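
As an illustration of the record layout, the following is a minimal Python sketch of one row as a typed tuple. The field names follow the header of the sample file; the class name `WikiLink`, the helper `parse_row`, and the tab separator are assumptions made for this example and are not part of the dataset tooling.

```python
from typing import NamedTuple


class WikiLink(NamedTuple):
    """One row of a snapshot: a link from a source page to a target page."""
    page_id_from: int
    page_title_from: str
    page_id_to: int
    page_title_to: str


def parse_row(line: str, sep: str = "\t") -> WikiLink:
    """Parse a single data line into a WikiLink (hypothetical helper)."""
    # Assumption: fields are tab-separated, so titles may contain spaces.
    id_from, title_from, id_to, title_to = line.rstrip("\n").split(sep)
    return WikiLink(int(id_from), title_from, int(id_to), title_to)


link = parse_row("12\tAnarchism\t5013592\t6 February 1934 crisis")
print(link.page_title_from, "->", link.page_title_to)
```

Because the identifiers may have gaps, they are better used as dictionary keys than as positions in an array.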

Sample

Extract of the file enwiki.wikilink_graph.2018-03-01.csv.gz:

page_id_from  page_title_from      page_id_to  page_title_to
10            AccessibleComputing  411964      Computer accessibility
12            Anarchism            5013592     6 February 1934 crisis
12            Anarchism            2181459     Abstentionism
12            Anarchism            839656      Adolf Brand
12            Anarchism            2731583     Adolf Hitler
12            Anarchism            192008      Adolphe Thiers
12            Anarchism            729048      Affinity group
12            Anarchism            30758       Age of Enlightenment
12            Anarchism            627         Agriculture
12            Anarchism            710931      AK Press
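
A snapshot like the one above can be read with the Python standard library. The sketch below assumes the files are tab-separated with a header row and can be parsed with the default settings of the `csv` module; neither the separator nor the quoting convention is stated on this page, so check both against the actual files. The function names `iter_links` and `out_degrees` are hypothetical.

```python
import csv
import gzip
from collections import Counter
from typing import Iterable, Iterator, Tuple


def iter_links(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (source id, target id) pairs from the lines of a snapshot."""
    # Assumption: tab-separated values with a header row naming the columns.
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        yield int(row["page_id_from"]), int(row["page_id_to"])


def out_degrees(path: str) -> Counter:
    """Count outgoing links per source page in a gzip-compressed snapshot."""
    # Text mode streams the file line by line without decompressing to disk.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return Counter(source for source, _ in iter_links(handle))


# Small in-memory example using rows from the extract above.
sample = [
    "page_id_from\tpage_title_from\tpage_id_to\tpage_title_to\n",
    "10\tAccessibleComputing\t411964\tComputer accessibility\n",
    "12\tAnarchism\t5013592\t6 February 1934 crisis\n",
    "12\tAnarchism\t2181459\tAbstentionism\n",
]
edges = list(iter_links(sample))
degree = Counter(source for source, _ in edges)
```

With a downloaded file, `out_degrees("enwiki.wikilink_graph.2018-03-01.csv.gz")` would give the number of outgoing links for each source page identifier.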

Download

This dataset can be downloaded from Zenodo: doi:10.5281/zenodo.2539424.

Code

This dataset has been processed with Python; see the wikidump project and the other repositories in the WikiLinkGraphs organization.

Authors

This dataset has been produced by:

  • Cristian Consonni – DISI, University of Trento, Trento, Italy.
  • David Laniado – Eurecat, Centre Tecnològic de Catalunya, Barcelona, Spain.
  • Alberto Montresor – DISI, University of Trento, Trento, Italy.

This dataset has been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.

License

This dataset is released under Creative Commons Attribution 4.0 International.

The original dump is released under the GNU Free Documentation License (GFDL) and the Creative Commons Attribution-Share-Alike 3.0 License, see the legal info.

How to cite

If you use this dataset, please cite the main WikiLinkGraphs paper:

Consonni, Cristian, David Laniado, and Alberto Montresor. “WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks.”

FAQs

What is the total size of the dataset, the number of files and the largest file in the dataset?

  • 5.7G dewiki.wikilink_graph.*.csv.gz
  • 17G enwiki.wikilink_graph.*.csv.gz
  • 3.0G eswiki.wikilink_graph.*.csv.gz
  • 4.8G frwiki.wikilink_graph.*.csv.gz
  • 3.1G itwiki.wikilink_graph.*.csv.gz
  • 2.0G nlwiki.wikilink_graph.*.csv.gz
  • 2.3G plwiki.wikilink_graph.*.csv.gz
  • 3.2G ruwiki.wikilink_graph.*.csv.gz
  • 2.0G svwiki.wikilink_graph.*.csv.gz

The total dataset size is 42 GB across approximately 172 files. The average file size is 244 MB and the largest file is approximately 2.4 GB.
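
The figures above can be cross-checked with a short computation. This sketch assumes decimal units (1 GB = 1000 MB), which reproduces the stated average. The per-language sizes are rounded, so their sum is not expected to match the reported total exactly.

```python
# Per-language sizes as listed above, in GB (rounded values).
sizes_gb = {
    "dewiki": 5.7, "enwiki": 17.0, "eswiki": 3.0,
    "frwiki": 4.8, "itwiki": 3.1, "nlwiki": 2.0,
    "plwiki": 2.3, "ruwiki": 3.2, "svwiki": 2.0,
}

# Sum of the rounded per-language figures.
listed_total_gb = round(sum(sizes_gb.values()), 1)

# Reported total and approximate file count, taken from the text.
reported_total_gb = 42
file_count = 172

# Assumption: decimal units, 1 GB = 1000 MB.
average_mb = reported_total_gb * 1000 / file_count

print(f"sum of listed sizes: {listed_total_gb} GB")
print(f"average file size:   {average_mb:.0f} MB")
```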

Who produced this dataset and why?

  • This dataset has been produced by Cristian Consonni, David Laniado and Alberto Montresor.
  • Cristian Consonni and Alberto Montresor are affiliated with the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; David Laniado is affiliated with Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain.
  • This dataset has also been produced as part of the research related to the ENGINEROOM project. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.

Questions?

For further info send me an e-mail.