Datasets: Wikipedia pagecounts sorted by page (year 2014)
This dataset is superseded by Wikipedia pagecounts-raw sorted by page (years 2007-2016).
The reference page for this dataset is also available at: doi:10.6084/m9.figshare.2085643.v1
This dataset contains the page view statistics for all the Wikimedia projects in the year 2014, ordered by (project, page, timestamp). It was generated from Wikimedia's pagecounts-raw dataset.
The CSV uses spaces as the delimiter, without any form of escaping, because none is needed. It has 5 columns:
project: the project name
page: the page requested, url-escaped
timestamp: the timestamp of the hour (format:
count: the number of times the page has been requested (in that hour)
bytes: the number of bytes transferred (in that hour)
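As an illustration of the column layout above, here is a minimal Python sketch that splits one line into its five fields. The names `Record`, `parse_record` and `nbytes` are hypothetical and not part of the dataset or its tooling; the timestamp is kept as a plain string because no particular format is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    project: str
    page: str        # still url-escaped, exactly as stored in the file
    timestamp: str   # left as text; its format is not assumed in this sketch
    count: int
    nbytes: int      # the "bytes" column, renamed to avoid shadowing the builtin


def parse_record(line: str) -> Record:
    """Split one space-delimited line into its five columns."""
    project, page, timestamp, count, nbytes = line.rstrip("\n").split(" ")
    return Record(project, page, timestamp, int(count), int(nbytes))
```

Since the fields are never escaped, a plain split on the space character is enough; a line that does not have exactly five fields raises a `ValueError`.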
The original dataset has been normalized in the following ways:
- the project column has been converted to lowercase
- the page column has been unquoted and then re-quoted according to RFC 1308, using the following equivalent Python 3 code:
- if two lines are now equal because of this normalization, their bytes columns are summed up.
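The three normalization steps can be sketched in Python as follows. This is not the code used to build the dataset: it assumes the default arguments of `urllib.parse.unquote` and `urllib.parse.quote`, and it assumes that the count column is summed together with the bytes column when lines are merged. The function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import quote, unquote


def normalize_page(page: str) -> str:
    """Unquote a url-escaped page title, then quote it again.

    Assumption: default quote()/unquote() arguments; the original code
    may have used different `safe`, `encoding` or `errors` settings.
    """
    return quote(unquote(page))


def normalize(records):
    """Lowercase the project, re-quote the page, merge equal lines.

    `records` yields (project, page, timestamp, count, bytes) tuples.
    Summing `count` along with `bytes` is an assumption of this sketch.
    """
    totals = defaultdict(lambda: [0, 0])
    for project, page, timestamp, count, nbytes in records:
        key = (project.lower(), normalize_page(page), timestamp)
        totals[key][0] += count
        totals[key][1] += nbytes
    # Keys are unique, so sorting never compares the per-key totals.
    return [(*key, count, nbytes)
            for key, (count, nbytes) in sorted(totals.items())]
```

For example, `normalize_page("caf%c3%a9")` returns `"caf%C3%A9"`, so two lines that differ only in the case of the escape sequences end up with the same key and are merged.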
This dataset is split into many gzip'd files, each of them containing 1,000,000 records.
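Because the records are ordered by (project, page, timestamp), all the hours of a page are contiguous, so per-page totals can be computed in a single streaming pass over a file. The sketch below assumes UTF-8 text inside the gzip'd files; `page_totals`, `page_totals_from_file` and the `path` argument are hypothetical names.

```python
import gzip
from itertools import groupby


def page_totals(lines):
    """Yield (project, page, total_count) from lines sorted by project, page."""
    rows = (line.split() for line in lines)
    for (project, page), group in groupby(rows, key=lambda row: (row[0], row[1])):
        # row[3] is the count column
        yield project, page, sum(int(row[3]) for row in group)


def page_totals_from_file(path):
    """Stream one gzip'd file; `path` is a hypothetical local file name."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        yield from page_totals(handle)
```

Note that the records of a page may continue in the next file, so totals for the first and last page of a file may be partial.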
An index file is included: every line represents the first record of every
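An index of first records makes it possible to locate a page with a binary search instead of scanning every file. The following is a sketch under stated assumptions: the index has already been loaded as a list of (project, page) pairs, one per file and in file order, and Python's string ordering matches the ordering used to sort the dataset. `files_for_page` is a hypothetical helper, not part of pagecounts-search.

```python
from bisect import bisect_left, bisect_right


def files_for_page(index_keys, project, page):
    """Return the positions of the files that may hold (project, page).

    `index_keys[i]` is the (project, page) of the first record of file i
    (assumed layout). A page can start at the end of one file and continue
    in the following ones, so a range of positions is returned.
    """
    key = (project, page)
    first = max(bisect_left(index_keys, key) - 1, 0)
    last = bisect_right(index_keys, key)  # exclusive upper bound
    return range(first, last)
```

The returned range is empty when the key sorts before the first record of the first file, which means the page cannot be in the dataset.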
You can find an extended sample in the file
You can download the data via torrent: each torrent file contains 10k files (~60 GB, compressed). Each torrent comes with the index of the pages it contains.
We are planning to upload this data also to the Internet Archive.
- index of pages
- MD5 sums for the files above
- This dataset has been produced using Apache Spark on the Cisca-Cluster at the University of Trento with this script;
- The Python module pagecounts-search provides a command-line utility to query this dataset.
- Alessio Bogon (keybase.io/youtux), DISI - University of Trento
- Cristian Consonni (cristian.consonni(at)unitn.it), DISI - University of Trento
- Alberto Montresor, DISI - University of Trento
You can reuse this dataset under the Creative Commons - Attribution (CC BY) 4.0 license.
How to cite
Please cite this dataset as:
Alessio Bogon, Cristian Consonni, Alberto Montresor. Wikipedia pagecounts by page. doi:10.6084/m9.figshare.2085643.v1
For further info send me an e-mail.