Datasets: Wikipedia pagecounts-all-sites sorted by page (years 2014 – 2016)
This dataset consists of hourly pagecounts for Wikipedia pages sorted by article, ordered by (project, page, timestamp). It has been created by processing Wikimedia’s pagecounts-all-sites dataset.
The original dataset holds output from September 2014 to August 2016 that mimics the pagecounts-raw files, but is generated from Hadoop data using Hive. It holds the desktop sites’ pageview data with hourly granularity for all Wikipedia editions. More info about the original dataset is available on Wikitech. Note that these are not unique visits. Note also that the files in this dataset are shifted one hour later with respect to any other dataset handled by the Wikimedia Analytics team (in particular webrequest, pageview-hourly, projectview-hourly): for data between 2018-09-27T13:00:00 and 2018-09-27T14:00:00, pagecounts-all-sites uses the timestamp 2018-09-27T14:00:00, while the other datasets use 2018-09-27T13:00:00.
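As an illustration (not part of the original tooling), a minimal Python sketch that shifts a pagecounts-all-sites timestamp back to the start-of-hour convention used by the other datasets:

```python
from datetime import datetime, timedelta

# pagecounts-all-sites labels each hour with the timestamp at the *end*
# of the hour; subtracting one hour yields the start-of-hour label used
# by webrequest, pageview-hourly and projectview-hourly.
def align_timestamp(ts: str) -> str:
    end = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")
    return (end - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%S")

print(align_timestamp("2018-09-27T14:00:00"))  # -> 2018-09-27T13:00:00
```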
The CSV uses a space as the delimiter, without any form of escaping (it is not needed). It has 5 columns:
- `project`: the project name
- `page`: the page requested, url-escaped
- `timestamp`: the timestamp of the hour (format: `%Y%m%d-%H%M%S`)
- `count`: the number of times the page has been requested (in that hour)
- `bytes`: the number of bytes transferred (in that hour)
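As an illustration (not part of the original tooling), a part file can be read back with a few lines of Python; the file name below is the one mentioned later on this page:

```python
import gzip

# Read a gzip'd part file and split each record into its 5 space-separated
# columns: project, page, timestamp, count, bytes.
with gzip.open("2014-09/part-00082.gz", "rt", encoding="utf-8") as f:
    for line in f:
        project, page, timestamp, count, nbytes = line.rstrip("\n").split(" ")
        print(project, page, timestamp, int(count), int(nbytes))
        break  # show only the first record
```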
The original dataset has been normalized in the following ways:
- the `project` column has been converted to lowercase;
- the `page` column has been unquoted and then re-quoted according to RFC 3986, using Python 3 code equivalent to the sketch shown after this list;
- if two lines are now equal because of this normalization, their `count` and `bytes` columns are summed up.
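A minimal sketch of the quoting step, assuming the standard `urllib.parse` functions (the exact set of characters left unescaped by the original script is an assumption):

```python
from urllib.parse import quote, unquote

def normalize_page(page: str) -> str:
    # Unquote the url-escaped page title, then re-quote it with
    # percent-encoding (RFC 3986). Leaving no character "safe" is an
    # assumption; the original script may have used different settings.
    return quote(unquote(page), safe="")
```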
This dataset is split into many gzip’d files, each of them containing 1,000,000 records. An `index` folder is included: for every month there is a file describing the first record of every partial file.
As an example, the records at lines 1612915–1612924 of the original file pagecounts-20140923-010000.gz can be found, after processing, at lines 9079271–9079280 of file 2014-09/part-00082.gz in this dataset.
Download
This dataset can be downloaded in two different ways:
HTTP
You can find the dataset at cricca.disi.unitn.it/datasets/pagecounts-all-sites-sorted/. You can use the scripts at pagecounts-download-tools on GitHub.
How to download a month’s worth of data
- clone the repository:
  ╭─ ~ ╰─$ git clone https://github.com/CristianCantoro/pagecounts-download-tools
- go to the `sizes` directory and download the file sizes:
  ╭─ ~/pagecounts-download-tools/sizes ╰─$ ./download_sizes.sh http://cricca.disi.unitn.it/datasets/pagecounts-all-sites-sorted/
- go to the `downloadlists` directory and create the download lists:
  ╭─ ~/pagecounts-download-tools/downloadlists ╰─$ ./make_lists.sh ../sizes/2014-09.txt http://cricca.disi.unitn.it/datasets/pagecounts-all-sites-sorted/
- from the repository base directory, download the files:
  ╭─ ~/pagecounts-download-tools ╰─$ ./download.sh -d 2014 9
dat (experimental)
You can download the dataset using dat; it is available at datbase.org/CristianCantoro/wikipedia-pagecounts-all-sites-sorted.
Once you have installed dat, you can download the dataset with:
dat clone dat://d4ac75cda06e991b3181abb7365a1761581c2d54e962f14015f52ed5c8e9f6b2 ~/dat-wikipedia-pagecounts-all-sites-sorted
Code
- This dataset has been produced using Apache Spark on Microsoft Azure with this script by Alessio Bogon.
- The Python module pagecounts-search provides a command-line utility to query this dataset.
- The repository wikipedia-pageviews-extraction contains a collection of utilities to extract pageview data for groups of articles, taking into account the existence of redirects.
Authors
- Cristian Consonni (cristian.consonni(at)unitn.it), DISI - University of Trento
- Alberto Montresor, DISI - University of Trento
License
The original dataset was published in the Public Domain (Public Domain Mark 1.0).
You can reuse this dataset under the same license.
How to cite
Please cite this dataset as:
Cristian Consonni, Alberto Montresor. Wikipedia pagecounts-raw sorted by article. doi coming soon
This dataset supersedes the previous version with just the data from 2014: doi:10.6084/m9.figshare.2085643.v1, see also datasets/wikipedia-pagecounts-sorted-by-page-year-2014.
FAQs
What is the total size of the dataset, the number of files and the largest file in the dataset?
The total dataset size is 1.1 TB, and it contains ~15,800 files. The average file size is 35 MB and the largest file is 1.22 GB.
How are files organized?
Files are divided into directories, one for each month, like this:
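(Illustrative sketch, based on the part-file naming seen above, e.g. 2014-09/part-00082.gz; not an exact listing.)

```
2014-09/
    part-00000.gz
    part-00001.gz
    ...
    part-00082.gz
    ...
2014-10/
    part-00000.gz
    ...
...
2016-08/
    ...
```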
The average directory size is ~45 GB, and on average each contains 657 part files.
Who produced this dataset and why?
The dataset has been produced by Cristian Consonni and Alberto Montresor, from the Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy.
This research has been supported by Microsoft Azure Research Award CRM:0518942 as part of the “Azure for Research Award: Data Science” program.
This dataset has also been utilized in the research related to the ENGINEROOM project, in collaboration with David Laniado of Eurecat - Centre Tecnològic de Catalunya, Barcelona, Spain. EU ENGINEROOM has received funding from the European Union’s Horizon 2020 research and innovation programme under the Grant Agreement no 780643.
Is this dataset used in papers that are currently under review or in preparation?
This dataset has been used in multiple papers that are currently under review or in preparation.
This dataset is published as part of the ENGINEROOM project and it will be cited in the related scientific publications.
Questions?
For further info send me an e-mail.