Datasets: Wikipedia pagecounts sorted by page (year 2014)

This dataset is superseded by Wikipedia pagecounts-raw sorted by page (years 2007-2016).

The reference page for this dataset is also available at: doi:10.6084/m9.figshare.2085643.v1

This dataset contains the page view statistics for all the Wikimedia projects in the year 2014, ordered by (project, page, timestamp). It was generated from Wikimedia’s pagecounts-raw dataset.

The CSV files use a single space as the delimiter, with no escaping, since none is needed. Each record has 5 columns:

project: the project name
page: the page requested, url-escaped
timestamp: the timestamp of the hour (format: %Y%m%d-%H%M%S)
count: the number of times the page has been requested (in that hour)
bytes: the number of bytes transferred (in that hour)
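As an illustration of the record layout above, here is a minimal Python 3 sketch that parses one line into typed fields. The `Record` class, its attribute names and the `parse_record` helper are hypothetical names introduced only for this example; they are not part of the dataset or of any tool distributed with it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Record:
    """One line of the dataset (hypothetical container, for illustration)."""
    project: str
    page: str                # still url-escaped, exactly as stored
    timestamp: datetime      # start of the hour
    count: int               # requests in that hour
    bytes_transferred: int   # bytes transferred in that hour


def parse_record(line: str) -> Record:
    """Split one space-delimited line into its five typed fields."""
    project, page, timestamp, count, nbytes = line.rstrip("\n").split(" ")
    return Record(
        project=project,
        page=page,
        timestamp=datetime.strptime(timestamp, "%Y%m%d-%H%M%S"),
        count=int(count),
        bytes_transferred=int(nbytes),
    )


record = parse_record("en Albert_Einstein 20140101-000000 300 25645681")
print(record)
```

Splitting on a single space is enough here because the page column is url-escaped and therefore contains no literal spaces.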

The original dataset has been normalized in the following ways:

the project column has been converted to lowercase
the page column has been unquoted and then re-quoted according to RFC 1808, using the following equivalent Python 3 code:

import urllib.parse
page_unquoted = urllib.parse.unquote(page, encoding='utf-8',
    errors='replace')
page_requoted = urllib.parse.quote(page_unquoted)

if two lines end up with the same (project, page, timestamp) because of this normalization, their count and bytes columns are summed.
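The normalization steps above can be sketched end to end as follows. This is not the Spark script that produced the dataset, only a small in-memory approximation of the same rules; the `normalize_page` and `merge_records` helpers and the two input rows are made up for the example.

```python
import urllib.parse
from collections import defaultdict


def normalize_page(page):
    """Unquote and re-quote a page name, as described in the list above."""
    page_unquoted = urllib.parse.unquote(page, encoding="utf-8", errors="replace")
    return urllib.parse.quote(page_unquoted)


def merge_records(records):
    """Lowercase the project, normalize the page, and sum count and bytes
    for rows that collapse onto the same (project, page, timestamp)."""
    totals = defaultdict(lambda: [0, 0])
    for project, page, timestamp, count, nbytes in records:
        key = (project.lower(), normalize_page(page), timestamp)
        totals[key][0] += count
        totals[key][1] += nbytes
    # Emit rows ordered by (project, page, timestamp), like the dataset.
    return [(*key, count, nbytes) for key, (count, nbytes) in sorted(totals.items())]


# Hypothetical rows: the same page with different letter case in the
# project name and in the percent-escapes.
raw = [
    ("EN", "Caf%c3%a9", "20140101-000000", 2, 100),
    ("en", "Caf%C3%A9", "20140101-000000", 3, 250),
]
merged = merge_records(raw)
print(merged)
```

Both input rows normalize to the same key, so the result holds a single row whose count and bytes are the sums of the two.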

This dataset is split into many gzipped files, each of them containing 1,000,000 records.

An index file is included: each line is the first record of one partial file.
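Because the records are sorted, the index can be searched with a binary search to find which partial file holds a given record. The sketch below rests on several assumptions that this page does not confirm: that index lines use the same five-column layout as the data, that the n-th index line (counting from 0) corresponds to the n-th partial file, that file names follow the `part-<10 digits>.gz` pattern seen in the sample file name, and that plain string comparison matches the sort order used to build the dataset. The helper names and the toy index are hypothetical.

```python
import bisect


def load_index_keys(index_lines):
    """Keep only the (project, page, timestamp) key of each index line."""
    keys = []
    for line in index_lines:
        project, page, timestamp = line.rstrip("\n").split(" ")[:3]
        keys.append((project, page, timestamp))
    return keys


def find_part(keys, project, page, timestamp):
    """Return the position of the last index entry whose key is <= the
    requested key, or None if the key sorts before the first entry."""
    position = bisect.bisect_right(keys, (project, page, timestamp)) - 1
    return position if position >= 0 else None


def part_name(number):
    """Assumed naming scheme, inferred from 'part-0000011138.gz'."""
    return f"part-{number:010d}.gz"


# Toy index with made-up records, standing in for the real index file.
index_lines = [
    "aa Main_Page 20140101-000000 1 100",
    "en Albert_Einstein 20140601-120000 250 20000000",
    "en Zebra 20140101-000000 5 900",
]
keys = load_index_keys(index_lines)
first = find_part(keys, "en", "Albert_Einstein", "20140101-000000")
last = find_part(keys, "en", "Albert_Einstein", "20141231-230000")
print(part_name(first), part_name(last))
```

Looking up the first and last hour of the year, as in the usage lines, gives the range of partial files that may need to be read for a page whose records span a file boundary.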

Sample (taken from part-0000011138.gz, starting from line 411917):

en Albert_Einstein 20140101-000000 300 25645681
en Albert_Einstein 20140101-010000 246 21173395
en Albert_Einstein 20140101-020000 276 23558819
en Albert_Einstein 20140101-030000 234 17418623
en Albert_Einstein 20140101-040000 283 21449007
en Albert_Einstein 20140101-050000 289 23254304

You can find an extended sample in the file en_Albert_Einstein.pagecounts.txt.
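As a rough illustration of how a partial file could be consumed, the sketch below streams a gzipped file and sums the count column for one page. So that it runs without downloading anything, it first writes the six sample records shown above to a temporary gzip file; with real data you would pass the path of a downloaded partial file instead. The `total_views` helper is a hypothetical name for this example.

```python
import gzip
import os
import tempfile


def total_views(path, project, page):
    """Sum the count column over all records of one (project, page)."""
    total = 0
    with gzip.open(path, "rt", encoding="utf-8") as part:
        for line in part:
            fields = line.rstrip("\n").split(" ")
            if fields[0] == project and fields[1] == page:
                total += int(fields[3])
    return total


sample = """\
en Albert_Einstein 20140101-000000 300 25645681
en Albert_Einstein 20140101-010000 246 21173395
en Albert_Einstein 20140101-020000 276 23558819
en Albert_Einstein 20140101-030000 234 17418623
en Albert_Einstein 20140101-040000 283 21449007
en Albert_Einstein 20140101-050000 289 23254304
"""

# Stand-in for a real partial file: write the sample to a temporary .gz.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(sample)
    views = total_views(path, "en", "Albert_Einstein")

print(views)
```

Since the records are sorted by (project, page, timestamp), a real reader could stop as soon as it has passed the requested page; the full scan is kept here for simplicity.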

Download

You can download the data via torrent: each torrent contains 10,000 files (~60 GB, compressed). Each torrent comes with the index of the pages it contains.

We also plan to upload this data to the Internet Archive.

Code

This dataset was produced using Apache Spark on the Cisca-Cluster at the University of Trento with this script.
The Python module pagecounts-search provides a command-line utility to query this dataset.

Authors

Alessio Bogon (keybase.io/youtux), DISI - University of Trento
Cristian Consonni (cristian.consonni(at)unitn.it), DISI - University of Trento
Alberto Montresor, DISI - University of Trento

License

The original dataset was published in the Public Domain (Public Domain Mark 1.0).

You can reuse this dataset under the Creative Commons - Attribution (CC BY) 4.0 license.

How to cite

Please cite this dataset as:

Alessio Bogon, Cristian Consonni, Alberto Montresor. Wikipedia pagecounts by page. doi:10.6084/m9.figshare.2085643.v1

Questions?

For further info send me an e-mail.