
Figure out why we can't run the netflix_to_wikidata script on all the movies #12

Open
audiodude opened this issue Sep 13, 2024 · 6 comments
@audiodude
Collaborator

It seems to be crashing part of the way through. Let's post the stack trace in this bug and see if we can figure it out.

@audiodude
Collaborator Author

Looks like the network requests are just a bit flaky, and eventually one gets stuck and times out. Maybe we're getting rate limited (we should look into that).

#13 (comment)
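Not the repo's actual code, but a minimal sketch of what retry-with-exponential-backoff could look like wrapped around the flaky request (the function name and parameters here are hypothetical, for illustration only):

```python
import random
import time


def with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on failure with exponential backoff plus jitter.

    Re-raises the last error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the original error
            # 1s, 2s, 4s, ... capped at max_delay, with a little jitter
            # so retries don't all land at the same instant
            delay = min(base_delay * 2**attempt, max_delay)
            time.sleep(delay + random.uniform(0, base_delay))
```

The caller would then do something like `with_backoff(lambda: session.get(url, params=params))` instead of calling the request directly.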

@audiodude
Collaborator Author

Okay, I ran the code from #17, with the exponential backoff and with tqdm showing a progress bar. It failed 27% of the way through, but with a different error from the one we saw before:

27%|█████████████████▉                                                | 4824/17770 [28:09<1:15:33,  2.86it/s]
Traceback (most recent call last):
  File "/home/tmoney/code/MediaBridge/src/data_processing/wiki_to_netflix.py", line 152, in <module>
    process_data(False)
  File "/home/tmoney/code/MediaBridge/src/data_processing/wiki_to_netflix.py", line 137, in process_data
    wiki_movie_ids_list, wiki_genres_list, wiki_directors_list = wiki_query(netflix_file, user_agent)
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tmoney/code/MediaBridge/src/data_processing/wiki_to_netflix.py", line 118, in wiki_query
    response.raise_for_status()
  File "/home/tmoney/.local/share/virtualenvs/MediaBridge-QSS_14Zx/lib/python3.12/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://query.wikidata.org/sparql

A 400 error means there was something wrong with the request itself, i.e. with what we sent to Wikidata. My guess is that some movie title contains a quote or slash (like "Face/Off") that isn't being properly escaped, which would make the generated SPARQL invalid. Picture a movie like Joe's "Magical" Adventure. The query template is:

                                mwapi:search "%(Title)s" ;

which would turn into

                                mwapi:search "Joe's "Magical" Adventure" ;

The embedded quotes would terminate the string literal early and corrupt the query. We know the error happens at or around item 4824 in the data, so we should be able to look at the movie titles near that line and figure out what's going on.
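To make that check quick, here's a rough helper for listing titles near a given index that contain a double quote. The file path, encoding, and slicing are assumptions for illustration, not the script's actual loading code:

```python
import csv


def titles_with_quotes(path, start, stop):
    """Return (movie_id, title) pairs in rows[start:stop] whose title
    contains a double quote, which would break a SPARQL string literal."""
    with open(path, newline="", encoding="latin-1") as f:
        rows = list(csv.reader(f))
    hits = []
    for row in rows[start:stop]:
        movie_id = row[0]
        # titles can themselves contain commas, so rejoin the tail
        title = ",".join(row[2:])
        if '"' in title:
            hits.append((movie_id, title))
    return hits
```

Running something like `titles_with_quotes("movie_titles.csv", 4820, 4830)` should surface the offending row.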

Please check out #17, run the code, and try to figure out what's going on.

@audiodude
Collaborator Author

Okay I was curious. I changed the iteration to:

    for row in tqdm(data_csv[4820:]):

and added error handling around raise_for_status():

        try:
            response.raise_for_status()
        except requests.exceptions.HTTPError:
            print(repr(row))
            raise

And got this:

['4825', '1985', 'Brazil: The "Love Conquers All" Version']

Putting aside that this title is unlikely to match any movies anyway, we should either:

  1. Skip trying to match any movies with quotes
  2. Properly escape the quotes (will need to lookup how SPARQL does that)
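For option 2, SPARQL string literals use backslash escapes (the ECHAR production in the SPARQL 1.1 grammar). A sketch of what the escaping could look like, assuming the title is interpolated into a double-quoted literal like the mwapi:search line above (function name is hypothetical):

```python
def escape_sparql_literal(title: str) -> str:
    """Backslash-escape characters that would terminate or corrupt a
    double-quoted SPARQL string literal. Single quotes are fine inside
    a double-quoted literal and need no escaping."""
    return (
        title.replace("\\", "\\\\")  # escape backslashes first
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
```

Then the template substitution would use `escape_sparql_literal(title)` instead of the raw title, so Brazil: The "Love Conquers All" Version becomes Brazil: The \"Love Conquers All\" Version inside the query.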

@audiodude
Collaborator Author

audiodude commented Oct 1, 2024

Here's a few more that didn't work from my handling of the 400 error:

['4825', '1985', 'Brazil: The "Love Conquers All" Version']
['5913', '1994', 'Snowy River: The McGregor Saga "The Race"']
['6241', '1965', 'Operation "Y" and Other Shurik\'s Adventures']
['6280', '2003', 'Sting: Inside the Songs of "Sacred Love"']
['8748', '2004', 'Morrissey: Who Put the "M" in Manchester']

@audiodude
Collaborator Author

audiodude commented Dec 13, 2024

We should just run the entire thing on all movies (it should take about 90 minutes), make sure this is no longer an issue, then close.

To be clear, if we can download query data for all movies, this isn't an issue and can be closed.
