Datasets: Wikipedia pagecounts-all-sites sorted by page (years 2014 – 2016)
This dataset consists of hourly pagecounts for Wikipedia pages, sorted by (project, page, timestamp). It has been created by processing Wikimedia’s pagecounts-all-sites dataset.
The original dataset holds output from September 2014 to August 2016 that mimics the pagecounts-raw files, but is generated from Hadoop data using Hive. It contains the desktop sites’ pageview data with hourly granularity for all Wikipedia editions. More information about the original dataset is available on Wikitech. Note that these are not unique visits.
Note also that the timestamps in these files are shifted one hour later than those of any other dataset handled by the analytics team (particularly webrequest, pageview-hourly and projectview-hourly). For instance, for the hour of data ending at 2018-09-27T14:00:00, pagecounts-all-sites uses 2018-09-27T14:00:00, while the other datasets use a timestamp one hour earlier.
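As a minimal sketch of the one-hour shift, the following Python snippet converts a pagecounts-all-sites timestamp to the convention used by the other datasets by subtracting one hour. The function name `to_hour_start` is hypothetical, and the ISO 8601 format is assumed from the example timestamps in the text; the timestamps stored in the files themselves may use a different format.

```python
from datetime import datetime, timedelta

ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"  # assumed from the example in the text


def to_hour_start(pagecounts_ts: str) -> str:
    """Shift a pagecounts-all-sites timestamp back by one hour."""
    ts = datetime.strptime(pagecounts_ts, ISO_FORMAT)
    return (ts - timedelta(hours=1)).strftime(ISO_FORMAT)


print(to_hour_start("2018-09-27T14:00:00"))
```

Applying the shift in a single helper keeps joins with webrequest-style data from silently ending up off by one hour.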
The CSV uses spaces as delimiter, without any form of escaping because it is not needed. It has 5 columns:
project: the project name
page: the page requested, url-escaped
timestamp: the timestamp of the hour (format:
count: the number of times the page has been requested (in that hour)
bytes: the number of bytes transferred (in that hour)
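Since the columns are space-delimited and unescaped, a line can be split directly. The sketch below is one possible way to parse a record in Python; the `Record` class, the `parse_line` function and the sample line are hypothetical, and the timestamp is kept as a plain string because no particular format is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    project: str
    page: str        # still url-escaped
    timestamp: str   # kept as a string: the exact format is not assumed
    count: int
    nbytes: int


def parse_line(line: str) -> Record:
    """Split one space-delimited line into its 5 columns."""
    project, page, timestamp, count, nbytes = line.rstrip("\n").split(" ")
    return Record(project, page, timestamp, int(count), int(nbytes))


# Made-up sample line, for illustration only.
record = parse_line("en Main_Page 20140923-010000 42 123456\n")
print(record.page, record.count)
```

Unpacking into exactly five names makes a malformed line fail loudly with a `ValueError` instead of being silently misread.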
The original dataset has been normalized in the following ways:
- the project column has been converted to lowercase
- the page column has been unquoted and then re-quoted according to RFC 1808, using the following equivalent Python 3 code:
- if two lines are now equal because of this normalization, their count and bytes columns are summed up.
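The normalization steps listed above can be sketched as follows. This is an illustration, not the code used to build the dataset: calling `urllib.parse.unquote` and `urllib.parse.quote` with their default arguments is an assumption, as is summing the count together with the bytes, and the function names and sample rows (including the placeholder timestamp `t0`) are hypothetical.

```python
from collections import defaultdict
from urllib.parse import quote, unquote


def normalize_page(page: str) -> str:
    """Unquote and then re-quote a url-escaped page title."""
    # Default arguments are an assumption; the dataset's exact call may differ.
    return quote(unquote(page))


def normalize(records):
    """Lowercase the project, normalize the page, and merge duplicate keys.

    `records` is an iterable of (project, page, timestamp, count, nbytes).
    """
    merged = defaultdict(lambda: [0, 0])
    for project, page, timestamp, count, nbytes in records:
        key = (project.lower(), normalize_page(page), timestamp)
        merged[key][0] += count    # assumption: counts are summed too
        merged[key][1] += nbytes
    return [(*key, *totals) for key, totals in sorted(merged.items())]


# Two made-up rows that differ only in letter case.
rows = [
    ("EN", "Caf%C3%A9", "t0", 3, 100),
    ("en", "Caf%c3%a9", "t0", 2, 50),
]
print(normalize(rows))
```

Both sample rows collapse into a single record because re-quoting emits uppercase hex escapes, so the two spellings of the page become identical.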
This dataset is split into many gzip’d files, each of them containing 1,000,000 records.
An index folder is included: for every month there is a file describing the first record of every partial file.
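Because the records are sorted and the index lists the first record of every partial file, the file that may contain a given (project, page, timestamp) key can be located with a binary search. The sketch below assumes the index has already been loaded into a list of `(first_key, filename)` pairs; that in-memory layout, the function name `find_part_file` and the file names are hypothetical, and Python's tuple ordering is assumed to match the sort order of the dataset.

```python
from bisect import bisect_right


def find_part_file(index, key):
    """Return the part file whose range contains `key`, or None.

    `index` is a list of (first_key, filename) pairs sorted by first_key,
    where first_key is the (project, page, timestamp) of a file's first record.
    """
    first_keys = [first_key for first_key, _ in index]
    pos = bisect_right(first_keys, key) - 1
    return index[pos][1] if pos >= 0 else None


# Hypothetical index entries and file names, for illustration only.
index = [
    (("de", "Berlin", "t0"), "part-00000.gz"),
    (("en", "Main_Page", "t0"), "part-00001.gz"),
    (("it", "Roma", "t0"), "part-00002.gz"),
]
print(find_part_file(index, ("en", "Zebra", "t0")))
```

With such a lookup only one gzip'd file per query needs to be decompressed, rather than a whole month of data.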
Here’s an excerpt of the file pagecounts-20140923-010000.gz (lines 1612915–1612924):
This dataset can be downloaded in two different ways:
You can find the dataset on:
You can use the scripts at pagecounts-download-tools on GitHub.
How to download a month’s worth of data
clone the repository:
~$ git clone https://github.com/CristianCantoro/pagecounts-download-tools
go to the `sizes` directory and run the `download_sizes.sh` script:
~/pagecounts-download-tools/sizes$ ./download_sizes.sh http://cricca.disi.unitn.it/datasets/pagecounts-all-sites-sorted/
go to the `downloadlists` directory and run the `make_lists.sh` script:
~/pagecounts-download-tools/downloadlists$ ./make_lists.sh ../sizes/2014-09.txt http://cricca.disi.unitn.it/datasets/pagecounts-all-sites-sorted/
go back to the repository base directory and download the files:
~/pagecounts-download-tools$ ./download.sh -d 2014 9
You can download the dataset using
dat, the dataset is available at
Once you have installed dat, you can download the dataset with:
dat clone dat://d4ac75cda06e991b3181abb7365a1761581c2d54e962f14015f52ed5c8e9f6b2 ~/dat-wikipedia-pagecounts-all-sites-sorted
- This dataset has been produced using Apache Spark on Microsoft Azure with this script by Alessio Bogon.
- The Python module pagecounts-search provides a command-line utility to query this dataset.
- The repository wikipedia-pageviews-extraction contains a collection of utilities to extract pageviews data for groups of articles, taking redirects into account.
- Cristian Consonni (cristian.consonni(at)unitn.it), DISI - University of Trento
- Alberto Montresor, DISI - University of Trento
You can reuse this dataset under the same license.
How to cite
Please cite this dataset as:
Cristian Consonni, Alberto Montresor. Wikipedia pagecounts-raw sorted by article. doi coming soon
This dataset supersedes the previous version, which contained only the data from 2014: doi:10.6084/m9.figshare.2085643.v1, see also datasets/wikipedia-pagecounts-sorted-by-page-year-2014.
What is the total size of the dataset, the number of files and the largest file in the dataset?
The total dataset size is 1.1 TB, and it contains ~15,800 files. The average file size is about 70 MB (1.1 TB divided by ~15,800 files) and the largest file is 1.22 GB.
How are files organized?
Files are divided into directories, one for each month, like this:
The average directory size is ~45GB, and on average each contains 657 part files.
Who produced this dataset and why?
The dataset has been produced by Cristian Consonni and Alberto Montresor, from the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy.
This research has been supported by Microsoft Azure Research Award CRM:0518942 as part of the “Azure for Research Award: Data Science” program.
This dataset has also been utilized in the research related to the ENGINEROOM project, in collaboration with David Laniado of Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
Is this dataset used in papers that are currently in review or in preparation?
This dataset has been used for multiple papers that are currently in review or in preparation.
This dataset is published as part of the ENGINEROOM project and it will be cited in the related scientific publications.
For further info send me an e-mail.