Datasets: Wikipedia pagecounts-raw sorted by page (years 2007 – 2016)
This dataset consists of hourly pagecounts for Wikipedia pages sorted by article, ordered by (project, page, timestamp). It has been created by processing Wikimedia’s pagecounts-raw dataset.
The original dataset holds the desktop sites’ pageview data (separately for every page) for the period from December 2007 to July 2016, with hourly granularity for all Wikipedia editions. More information about the original dataset is available on Wikitech. Note that these are not unique visits.
Description of the data
The CSV uses a single space as the delimiter, without any form of escaping (it is not needed). It has 5 columns:
- `project`: the project name
- `page`: the page requested, URL-escaped
- `timestamp`: the timestamp of the hour (format: `%Y%m%d-%H%M%S`)
- `count`: the number of times the page has been requested (in that hour)
- `bytes`: the number of bytes transferred (in that hour)
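As an illustration of this format, a record can be parsed with the Python standard library alone. A minimal sketch (the record shown here is made up for the example, not taken from the files):

```python
from datetime import datetime

# Minimal sketch: parse one space-delimited record of this dataset.
# The record below is illustrative, not a real line from the files.
record = "en Main_Page 20071210-000000 1234 5678"

project, page, timestamp, count, nbytes = record.split(" ")
ts = datetime.strptime(timestamp, "%Y%m%d-%H%M%S")

print(project)           # 'en'
print(page)              # 'Main_Page' (still URL-escaped)
print(ts.isoformat())    # '2007-12-10T00:00:00'
print(int(count), int(nbytes))
```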
Processing
The original dataset has been normalized in the following ways:
- the `project` column has been converted to lowercase
- the `page` column has been unquoted and then re-quoted according to RFC 1308, using Python 3 code equivalent to the sketch shown after this list
- if two lines are now equal because of this normalization, their `count` and `bytes` columns are summed up.
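The exact script used to build the dataset is linked in the Code section below; here is only a minimal sketch of the unquote/re-quote and summing steps, assuming `urllib.parse` and an illustrative choice of characters left unescaped:

```python
from collections import defaultdict
from urllib.parse import quote, unquote

def normalize(project, page):
    """Lowercase the project and re-quote the page name.

    The characters left unescaped ("safe") are an assumption for this
    sketch; the dataset may have been built with a different set.
    """
    return project.lower(), quote(unquote(page), safe=":/")

# Sum the count and bytes columns of records that collapse onto the same
# (project, page, timestamp) key after normalization.
totals = defaultdict(lambda: [0, 0])

records = [
    # (project, page, timestamp, count, bytes) -- illustrative values only
    ("EN", "Main%20Page", "20071210-000000", 3, 3000),
    ("en", "Main%20Page", "20071210-000000", 2, 2000),
]

for project, page, timestamp, count, nbytes in records:
    key = (*normalize(project, page), timestamp)
    totals[key][0] += count
    totals[key][1] += nbytes

for (project, page, timestamp), (count, nbytes) in totals.items():
    print(project, page, timestamp, count, nbytes)
```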
This dataset is split into many gzip’d files, each containing 1,000,000 records.
An `index` folder is included: for every month there is a file describing the first record of every part file.
Sample
As an example, lines 379737–379752 of the original file pagecounts-20071210-000000.gz correspond, in the processed dataset, to lines 352686–352695 of the file 2007-12/part-00082.gz.
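To inspect such a range of lines in one of the gzip’d part files, a minimal sketch (it reuses the file name and line range from the example above and assumes the file has already been downloaded):

```python
import gzip
from itertools import islice

# Minimal sketch: print lines 352686-352695 (1-based, inclusive) of a part file.
start, stop = 352686, 352695

with gzip.open("2007-12/part-00082.gz", "rt", encoding="utf-8") as f:
    for line in islice(f, start - 1, stop):
        print(line, end="")
```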
Download
This dataset can be downloaded in two different ways:
HTTP (preferred method)
You can find the dataset at: cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/.
You can use the scripts at pagecounts-download-tools on GitHub.
How to download a month’s worth of data
- clone the repository:
      ~ $ git clone https://github.com/CristianCantoro/pagecounts-download-tools
- go to the `sizes` directory and execute the `download_sizes.sh` script:
      ~/pagecounts-download-tools/sizes $ ./download_sizes.sh http://cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/
- go to the `downloadlists` directory and generate the download lists with `make_lists.sh`:
      ~/pagecounts-download-tools/downloadlists $ ./make_lists.sh ../sizes/2007-12.txt http://cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/
- from the repository base directory, download the files:
      ~/pagecounts-download-tools $ ./download.sh -d 2007 1
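If you only need a single part file, it can also be fetched directly over HTTP without the helper scripts. A minimal sketch (the path under the dataset root is inferred from the monthly directory layout described above, so treat it as an assumption):

```python
import urllib.request

# Minimal sketch: fetch a single part file directly over HTTP.
# The path under the dataset root is an assumption based on the
# monthly directory naming used elsewhere on this page.
BASE = "http://cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/"
path = "2007-12/part-00082.gz"

urllib.request.urlretrieve(BASE + path, "part-00082.gz")
print("downloaded part-00082.gz")
```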
dat (experimental)
You can download the dataset using dat; it is available at datbase.org/CristianCantoro/pagecounts-raw-sorted.
Once you have installed dat, you can download the dataset with:
dat clone dat://ddc54f744855b022df8edaf458e471757513238282a8675e1ac85f2e14a51b90 ~/dat-pagecounts-raw-sorted
Code
- This dataset has been produced using Apache Spark on Microsoft Azure with this script by Alessio Bogon.
- The Python module pagecounts-search provides a command-line utility to query this dataset.
- The repository wikipedia-pageviews-extraction contains a collection of utilities to extract pageview data for groups of articles, taking into account the existence of redirects.
Authors
- Cristian Consonni (cristian.consonni(at)unitn.it), DISI - University of Trento
- Alberto Montresor, DISI - University of Trento
License
The original dataset was published in the Public Domain (Public Domain Mark 1.0).
You can reuse this dataset under the same license.
How to cite
Please cite this dataset as:
Cristian Consonni, Alberto Montresor. Wikipedia pagecounts-raw sorted by article. doi coming soon
This dataset supersedes the previous version, which contained only the data from 2014: doi:10.6084/m9.figshare.2085643.v1; see also datasets/wikipedia-pagecounts-sorted-by-page-year-2014.
FAQs
What is the total size of the dataset, the number of files and the largest file in the dataset?
The total dataset size is 3.5 TB, and it contains ~76,000 files. The average file size is 45 MB and the largest file is 1.4 GB.
How are files organized?
Files are divided into directories, one for each month (named like `2007-12`), each containing that month’s part files.
The average directory size is ~34 GB, and on average each contains 731 part files.
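These figures can be checked against a local copy with a short script. A minimal sketch, assuming the monthly directories sit under a single local root named `pagecounts-raw-sorted`:

```python
import os

# Minimal sketch: report size and number of part files per monthly directory.
# "pagecounts-raw-sorted" is an assumed name for the local dataset root.
root = "pagecounts-raw-sorted"

for month in sorted(os.listdir(root)):
    month_dir = os.path.join(root, month)
    if not os.path.isdir(month_dir):
        continue
    parts = [f for f in os.listdir(month_dir) if f.endswith(".gz")]
    size = sum(os.path.getsize(os.path.join(month_dir, f)) for f in parts)
    print(f"{month}: {len(parts)} part files, {size / 10**9:.1f} GB")
```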
Who produced this dataset and why?
The dataset has been produced by Cristian Consonni and Alberto Montresor, from the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy.
This research has been supported by Microsoft Azure Research Award CRM:0518942 as part of the “Azure for Research Award: Data Science” program.
This dataset has also been utilized in the research related to the ENGINEROOM project, in collaboration with David Laniado of Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
Is this dataset used in papers that are currently in review or forthcoming?
This dataset has been used in multiple papers that are currently in review or in preparation.
This dataset is published as part of the ENGINEROOM project and it will be cited in the related scientific publications.
Questions?
For further info, send me an e-mail.