Datasets: Wikipedia pagecounts sorted by page (year 2014)
This dataset is superseded by Wikipedia pagecounts-raw sorted by page (years 2007-2016).
The reference page for this dataset is also available at: doi:10.6084/m9.figshare.2085643.v1
This dataset contains the page view statistics for all the Wikimedia projects in the year 2014, ordered by (project, page, timestamp). It has been generated from Wikimedia's pagecounts-raw dataset.
The CSV uses spaces as the delimiter, without any form of escaping because none is needed. It has 5 columns:
- project: the project name
- page: the page requested, url-escaped
- timestamp: the timestamp of the hour (format: %Y%m%d-%H%M%S)
- count: the number of times the page has been requested (in that hour)
- bytes: the number of bytes transferred (in that hour)
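As an illustration of the column layout, here is a minimal sketch of a parser for one record. The names Record, parse_record and nbytes are hypothetical and introduced only for this example, and the sample line is made up rather than taken from the dataset.

```python
from datetime import datetime
from typing import NamedTuple
from urllib.parse import unquote


class Record(NamedTuple):
    project: str
    page: str            # still url-escaped, exactly as stored in the file
    timestamp: datetime  # start of the hour the counts refer to
    count: int
    nbytes: int          # the "bytes" column


def parse_record(line: str) -> Record:
    """Split one space-delimited line into its five columns."""
    project, page, timestamp, count, nbytes = line.rstrip("\n").split(" ")
    return Record(
        project=project,
        page=page,
        timestamp=datetime.strptime(timestamp, "%Y%m%d-%H%M%S"),
        count=int(count),
        nbytes=int(nbytes),
    )


# Hypothetical line, invented for illustration (not a real record).
record = parse_record("en Albert_Einstein 20140101-000000 42 1234567\n")
print(record.project, unquote(record.page), record.timestamp, record.count)
```

Because the page column is url-escaped, it contains no literal spaces, so a plain split on the space character is enough to separate the columns.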
The original dataset has been normalized in the following ways:
- the project column has been converted to lowercase
- the page column has been unquoted and then re-quoted according to RFC 1308, using equivalent Python 3 code
- if two lines are now equal because of this normalization, their count and bytes columns are summed up.
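The three normalization steps can be sketched as follows. This is not the authors' code: it assumes the default arguments of urllib.parse.unquote and urllib.parse.quote, which may differ from the call actually used (for example in encoding, error handling or safe characters). The function names normalize_page and normalize are hypothetical.

```python
from collections import defaultdict
from urllib.parse import quote, unquote


def normalize_page(page: str) -> str:
    """Unquote a url-escaped page title, then quote it again.

    Assumption: default quote/unquote settings (UTF-8, "/" kept as is).
    """
    return quote(unquote(page))


def normalize(records):
    """Lowercase the project, re-quote the page, and merge equal keys.

    `records` yields (project, page, timestamp, count, bytes) tuples;
    counts and bytes of records that become equal are summed up.
    """
    totals = defaultdict(lambda: [0, 0])
    for project, page, timestamp, count, nbytes in records:
        key = (project.lower(), normalize_page(page), timestamp)
        totals[key][0] += count
        totals[key][1] += nbytes
    return sorted((*key, c, b) for key, (c, b) in totals.items())


# Hypothetical input: the two lines differ only in case and escaping.
raw = [
    ("EN", "Albert_Einstein", "20140101-000000", 3, 300),
    ("en", "Albert%5FEinstein", "20140101-000000", 2, 200),
]
print(normalize(raw))
```

In this sketch both input lines collapse to the same (project, page, timestamp) key, so their count and bytes values are added together.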
This dataset is split into many gzipped files, each containing 1,000,000 records.
An index file is included: each line is the first record of one of the partial files.
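Since the records are sorted, the index can be used to find which partial file to open for a given page. The sketch below rests on several assumptions that are not stated above: that line i of the index corresponds to partial file number i, that index lines have the same column layout as the records, that partial files are named like part-0000011138.gz, and that the sort order of the dataset agrees with Python's tuple and string comparison. All function names are hypothetical.

```python
import bisect


def load_index_keys(index_lines):
    """Return the (project, page) key of the first record of each partial file.

    Assumption: line i of the index describes partial file number i.
    """
    keys = []
    for line in index_lines:
        project, page = line.split(" ")[:2]
        keys.append((project, page))
    return keys


def first_candidate_part(keys, project, page):
    """Number of the first partial file that may contain (project, page).

    The records of a page can start at the end of the file that precedes
    the first index key >= (project, page), hence the step back by one.
    """
    i = bisect.bisect_left(keys, (project, page))
    return max(i - 1, 0)


def part_name(number):
    """Assumed naming scheme: 'part-' followed by a 10-digit number."""
    return f"part-{number:010d}.gz"


# Hypothetical index content, invented for illustration.
index_lines = [
    "aa Main_Page 20140101-000000 1 10",
    "en Albert_Einstein 20140101-000000 5 50",
    "en Zebra 20140101-000000 2 20",
]
keys = load_index_keys(index_lines)
print(part_name(first_candidate_part(keys, "en", "Berlin")))
```

A reader would then scan forward from the returned file, since the records of a single page may continue into the following partial files.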
Sample (taken from part-0000011138.gz, starting from line 411917).
You can find an extended sample in the file en_Albert_Einstein.pagecounts.txt.
Download
You can download the data via torrent: each torrent contains 10,000 files (~60 GB, compressed) and comes with the index of the pages it contains.
We are also planning to upload this data to the Internet Archive.
- index of pages
- pagecounts-2014-0000000000-0000009999.torrent
- pagecounts-2014-0000010000-0000019999.torrent
- pagecounts-2014-0000020000-0000029999.torrent
- pagecounts-2014-0000030000-0000039999.torrent
- pagecounts-2014-0000040000-0000049999.torrent
- pagecounts-2014-0000050000-0000059999.torrent
- pagecounts-2014-0000060000-0000069999.torrent
- MD5 sums for the files above
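After downloading, each file can be checked against its published MD5 sum. The layout of the MD5 sums file is not described here, so the following minimal sketch only computes the digest of a local file for manual comparison; md5_of_file is a hypothetical helper name.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can be compared with the corresponding entry in the MD5 sums file; a mismatch suggests an incomplete or corrupted download.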
Code
- This dataset has been produced using Apache Spark on the Cisca-Cluster at the University of Trento with this script;
- The Python module pagecounts-search provides a command-line utility to query this dataset.
Authors
- Alessio Bogon (keybase.io/youtux), DISI - University of Trento
- Cristian Consonni (cristian.consonni(at)unitn.it), DISI - University of Trento
- Alberto Montresor, DISI - University of Trento
License
The original dataset was published in the Public Domain (Public Domain Mark 1.0).
You can reuse this dataset under the Creative Commons - Attribution (CC BY) 4.0 license.
How to cite
Please cite this dataset as:
Alessio Bogon, Cristian Consonni, Alberto Montresor. Wikipedia pagecounts by page. doi:10.6084/m9.figshare.2085643.v1
Questions?
For further info send me an e-mail.