This repo contains code that takes hourly page view count data from Wikipedia and aggregates it into page counts for longer periods using PostgreSQL.
Note: This requires PostgreSQL 9.5 to take advantage of the new UPSERT feature (`INSERT ... ON CONFLICT`); without it the import is simply too slow.
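As a rough illustration of why UPSERT matters here: each hourly batch can either insert a new row or add to an existing count in a single statement, instead of a slow select-then-update round trip. The database, table, and column names below are assumptions for the sketch, not the repo's actual schema:

```sh
# Sketch only: the table and column names are hypothetical, and a unique
# constraint on (domain, title) is assumed for ON CONFLICT to target.
psql -d wikipedia_pagecounts <<'SQL'
INSERT INTO page_counts (domain, title, views)
VALUES ('en', 'Main_Page', 12345)
ON CONFLICT (domain, title)
DO UPDATE SET views = page_counts.views + EXCLUDED.views;
SQL
```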
Hourly page view files are downloaded on demand (or you can download them yourself and read directly from the files), a Node script parses the files and imports relevant data into Postgres, and finally fun queries can be run against the resulting data.
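For reference, fetching a single hourly file by hand looks roughly like this; the URL follows the pagecounts-raw archive layout as I recall it and may need adjusting:

```sh
# Fetch one hour of September 2015 (the URL pattern is an assumption and may have moved).
wget https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-09/pagecounts-20150901-000000.gz

# Each line is: project page_title view_count bytes_transferred, for example:
#   en Main_Page 12345 0
zcat pagecounts-20150901-000000.gz | head -n 5
```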
So far this has been used only to aggregate a single month of Wikipedia logs.
- 64GB of disk space to store one month's worth of logs, and/or an internet connection fast enough to download 64GB
- Another 35GB or so of disk space to be used by Postgres
- About a day of processing time on a modern quad-core CPU
- Set up a PostgreSQL 9.5 beta1 or newer instance (with PostGIS); a setup sketch follows this list
- Download the Wikipedia world PostGIS dump from here
- Generate a list of pagecount files to download (currently manual, a sample for Sept 2015 is included); see the sketch after this list
- Read and modify crunch.sh to suit your needs, run it, and wait
- View the automatically generated output of top viewed pages with location data (an example query is sketched after this list)
- Run other interesting queries and report back!
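A minimal sketch of the database setup step, assuming the database is named `wikipedia_pagecounts`; the name actually expected by crunch.sh may differ:

```sh
# Assumes PostgreSQL 9.5+ and the PostGIS packages are already installed.
# The database name is a placeholder.
createdb wikipedia_pagecounts
psql -d wikipedia_pagecounts -c 'CREATE EXTENSION IF NOT EXISTS postgis;'
```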
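The file list is currently written by hand; something along these lines could generate the September 2015 list, assuming the pagecounts-raw URL layout is still as shown:

```sh
# Emit one URL per hour of September 2015. Adjust the base URL if the archive
# layout has changed; the output can be fed to wget or to crunch.sh.
base=https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-09
for day in $(seq -w 1 30); do
  for hour in $(seq -w 0 23); do
    echo "$base/pagecounts-201509${day}-${hour}0000.gz"
  done
done > pagecount-files.txt
```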
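And for the final steps, a hedged sketch of the kind of top-pages query that can be run once the data is loaded. The table and column names (`page_counts`, `geo_points`, `geom`) are assumptions and need to match the schema the import script actually creates:

```sh
# Hypothetical schema: page_counts holds the aggregated views, geo_points holds
# the geometries imported from the Wikipedia world PostGIS dump.
psql -d wikipedia_pagecounts <<'SQL'
SELECT p.title, p.views, g.geom
FROM page_counts p
JOIN geo_points g ON g.title = p.title
ORDER BY p.views DESC
LIMIT 20;
SQL
```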