Calculate pageviews for all articles across all wikipedias #755
Conversation
Hi @benoit74, can you please take a look at this PR? Thanks!
(branch force-pushed from 77b9e1a to 7816048)
Planning on committing with CodeFactor issues. I tried ignoring them in the CodeFactor GUI, and also applying …
(branch force-pushed from e4131fa to 455b45f)
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
+ Coverage   90.99%   91.14%   +0.14%
==========================================
  Files          65       66       +1
  Lines        3411     3546     +135
==========================================
+ Hits         3104     3232     +128
- Misses        307      314       +7

☔ View full report in Codecov by Sentry.
(branch force-pushed from 0ca37c1 to 574bead)
(branch force-pushed from 574bead to bfe4d6b)
LGTM
def wiki_languages():
  r = requests.get(
      'https://wikistats.wmcloud.org/api.php?action=dump&table=wikipedias&format=csv',
      headers={'User-Agent': WP1_USER_AGENT},
Always include a timeout
Done
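(Not from the thread itself; below is a minimal sketch of what that change presumably looks like. The 60-second timeout and the placeholder WP1_USER_AGENT value are assumptions, not taken from the PR.)

import requests

WP1_USER_AGENT = 'WP1/placeholder (example contact)'  # assumed placeholder value

def wiki_languages():
  r = requests.get(
      'https://wikistats.wmcloud.org/api.php?action=dump&table=wikipedias&format=csv',
      headers={'User-Agent': WP1_USER_AGENT},
      # A finite timeout makes the call fail fast instead of hanging
      # indefinitely if the server stops responding.
      timeout=60,
  )
  r.raise_for_status()
  # The real function presumably parses the CSV further; returning the raw
  # lines just keeps this sketch self-contained.
  return r.text.splitlines()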
os.remove(prev_filepath)

cur_filepath = get_cur_file_path()
if os.path.exists(cur_filepath):
When downloading below, you're writing to the final file directly. Should anything go wrong, a partial file will be left on disk with no way to detect or fix it.
Added error handling in the download, with a corresponding test. PTAL.
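(The fix itself isn't visible in this excerpt; below is a minimal sketch of one common shape for it, assuming the get_cur_file_path() and get_pageview_url() helpers quoted in the snippets behave as their names suggest.)

import os
import requests

def download_pageviews():
  cur_filepath = get_cur_file_path()
  if os.path.exists(cur_filepath):
    # File already downloaded
    return

  try:
    with requests.get(get_pageview_url(), stream=True, timeout=60) as r:
      r.raise_for_status()
      with open(cur_filepath, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
          f.write(chunk)
  except Exception:
    # Don't leave a half-written file on disk; the next run can then retry
    # the download cleanly.
    if os.path.exists(cur_filepath):
      os.remove(cur_filepath)
    raise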
wp1/scores.py (outdated)

# File already downloaded
return

with requests.get(get_pageview_url(), stream=True) as r:
timeout
Done
wp1/scores.py (outdated)

r.raise_for_status()
with open(cur_filepath, 'wb') as f:
  # Read data in 8 KB chunks
  for chunk in r.iter_content(chunk_size=8 * 1024):
8 KB is ridiculously small for downloading 3 GB. I think you can afford more RAM than this. It will surely boost download speed.
What would you suggest, maybe 1 MB? 8 MB?
8 MB seems fine.
Thanks for the review! PTAL.
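(For illustration only: a small self-contained helper showing what the agreed-upon chunk size amounts to; the actual change in wp1/scores.py may be structured differently.)

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, the value suggested above

def copy_response_to_file(r, f, chunk_size=CHUNK_SIZE):
  # For a ~3 GB dump, 8 MiB chunks mean a few hundred reads rather than the
  # roughly 400,000 reads that 8 KiB chunks would require.
  for chunk in r.iter_content(chunk_size=chunk_size):
    f.write(chunk)

Here r is a streaming requests response and f an open binary file object.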
@audiodude With the current solution, I keep track of pageviews over many years, in both gathering and score computation. The reason is to mitigate the impact of a local maximum caused by a recent event.
@kelson42 let's merge this as is and we can discuss some kind of backfill, or keeping a running tally.
- Add WP1_USER_AGENT to HTTP calls
- Get pageviews for the previous month instead of a hardcoded time
- Throw an error on a non-successful HTTP status
- Allow a filter_lang when updating the DB from pageviews
- Only commit the DB after every 10000 rows (a batching sketch follows below)
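(A rough illustration of the last bullet; the connection handling and the insert_page_score helper are hypothetical, not the PR's actual code.)

BATCH_SIZE = 10000

def update_db_from_pageviews(conn, rows):
  # Commit every BATCH_SIZE rows: transactions stay bounded in size without
  # paying the cost of a commit per row.
  count = 0
  for lang, article, views in rows:
    insert_page_score(conn, lang, article, views)  # hypothetical helper
    count += 1
    if count % BATCH_SIZE == 0:
      conn.commit()
  # Commit whatever remains in the final partial batch.
  conn.commit()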
(branch force-pushed from 44d5991 to c44be71)
For #171
Here we download the pageviews text file for the previous month and ingest it, storing the view counts for every article, across every wikipedia, that has a view count.
This PR includes the schema for the page_scores table, which will eventually include not just the page views but also the links and lang_links that make up the components of the page score used in classic selection.

Previous attempts at this logic streamed the pageviews bz2 file over HTTP, but that turned out to be unreliable because the remote server would close the connection after ~20 minutes. Instead, we create a tempfile location, which in production is mapped to /srv/, and download the entire file there. The BZ2 decompression and line-by-line processing is still streamed (see the sketch after this paragraph).
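(A minimal sketch of what the streamed decompression can look like; the per-line parsing is omitted because the exact column format handled in wp1/scores.py isn't shown here.)

import bz2

def iter_pageview_lines(filepath):
  # bz2.open in text mode decompresses lazily, so the ~3 GB dump is never
  # held in memory; lines are yielded one at a time as they are decoded.
  with bz2.open(filepath, 'rt', encoding='utf-8', errors='replace') as f:
    for line in f:
      yield line.rstrip('\n')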
EDIT: There is currently no scheduling for this process because we need to run it manually a couple of times first to make sure it works. Scheduling will be added in a future PR.