error running roy harvest -wikidata #183

Open · EG-tech opened this issue Apr 27, 2022 · 13 comments

@EG-tech commented Apr 27, 2022

I'm trying out the instructions here and am getting the following error/output when trying to run $ roy harvest -wikidata to start off:

2022/04/27 09:23:21 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2022/04/27 09:24:55 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

I'm on Ubuntu 20.04 with the latest siegfried release (1.9.2). Is there something obvious I'm doing wrong? (@ross-spencer?)

@EG-tech (Author) commented Apr 27, 2022

seeing/learning that error code 429 has to do with too many requests hitting the server from the client. Is there a way to rate limit the requests from roy? Or another way to get a Wikidata signature file to start with?

@ross-spencer (Collaborator) commented Apr 27, 2022

that's exactly it @EG-tech. Tyler had the same issue a while back (via an email request, so not on GitHub). We might need to put it into the FAQ.

Some notes on what I wrote to Tyler:

The Wikidata documentation on the query service (WDQS) is linked below, but it's not very clear, i.e. it talks about processing time, not how that translates to larger queries:

https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits

I have known there is a risk that this might happen, though I can't quantify for how many users or when. There have been times, for example when I have been testing, where I have run the query upwards of 30 times in a day.

We set a custom header on the request which should be recognized by WDQS and mitigate this issue somewhat; WDQS is friendlier to known user-agents than to unknown ones, for example.

Long-term, something approaching rate limiting may work. Right now it's just a single request asking for a lot of data.
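
For illustration only, one way to make each individual request smaller would be to page the harvest query with LIMIT/OFFSET instead of asking for everything at once. A minimal sketch against WDQS (not what roy currently does), reusing the properties the harvest cares about:

# Sketch: fetch format records in pages rather than one large request.
# ORDER BY gives a stable ordering so LIMIT/OFFSET paging is consistent.
SELECT DISTINCT ?uri ?puid ?extension ?mimetype
WHERE
{
  ?uri wdt:P31/wdt:P279* wd:Q235557 .       # instance of (a subclass of) file format
  OPTIONAL { ?uri wdt:P2748 ?puid.      }   # PRONOM PUID
  OPTIONAL { ?uri wdt:P1195 ?extension. }   # file extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype.  }   # IANA media type
}
ORDER BY ?uri
LIMIT 5000
OFFSET 0   # then 5000, 10000, ... on subsequent requests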

In the short-to-medium term, this pull request should mean you can grab an identifier from Richard's itforarchivists server, and it will let you get up and running: #178 (the PR just needs review (and fixes) and merging).

EDIT: NB. In Tyler's case, he just tried again later in the day (or the next morning) and it worked.

@EG-tech (Author) commented Apr 27, 2022

thanks @ross-spencer!! that all makes sense, thanks for confirming and I'll play with your suggestion when I get the chance. amazing work!

@ross-spencer (Collaborator)

ah, thanks @EG-tech 🙂

@ross-spencer (Collaborator)

The -update flag now supports Wikidata, which should provide a workaround for most people facing this issue. There's an underlying reliability issue that might still be solved here, as per the above.

@ross-spencer (Collaborator)

NB. Just to report, we are still seeing this issue in places. I haven't been able to determine when a harvest call is likely to be successful, other than that it seems to work better in Europe than on the US West Coast.

@ross-spencer (Collaborator)

cc. @thorsted

Someone reached out at the last talk I gave about the Wikidata integration, specifically about long-running queries. I discovered this was possible because they run a mirror without timeouts, at a cost per query. Their service and another example are linked below:

(I don't think this is the way to go but it's useful to know about)

@ross-spencer (Collaborator)

@anjackson just updated our SPARQL query on the digipres format explorer. The magic is in the FILTER expression, which cuts results from roughly 70,000 to 17,000. Worth a try to see if it improves performance?

SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
WHERE
{
  # Return records of type File Format or File Format Family (via instance or subclass chain):
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
      
  # Only return records that have at least one useful format identifier
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.       
  
  OPTIONAL { ?uri wdt:P2748 ?puid.      }          # PUID is used to map to PRONOM signatures
  OPTIONAL { ?uri wdt:P1195 ?extension. }          # File extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype.  }          # IANA Media Type
  OPTIONAL { ?uri p:P4152 ?object;                 # Format identification pattern statement
    OPTIONAL { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding
    OPTIONAL { ?object ps:P4152 ?sig.        }     # We always have a signature
    OPTIONAL { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file
    OPTIONAL { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity
    OPTIONAL { ?object prov:wasDerivedFrom ?provenance;
       OPTIONAL { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY ?uri
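
To sanity-check the size reduction before relying on a query like this, a count-only variant of the same pattern can be run with and without the FILTER line. This is just an illustrative check, not part of the query above:

SELECT (COUNT(DISTINCT ?uri) AS ?formats)
WHERE
{
  # Same class pattern as the full query:
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.

  # Comment this line out to compare against the unfiltered total:
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.
}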

@thorsted have you tried the custom SPARQL technique? https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#using-the-custom-wikibase-functionality-for-wikidata Any chance you could try the SPARQL above to see if it returns more reliably?

(I can create a test binary too)

via digipres/digipres.github.io#48 (comment)

@ross-spencer (Collaborator) commented Jul 11, 2024

NB. although, this query requires a PUID, or MIME type, or extension, and there might be Wikidata records without any of these that still have a signature, so maybe we need to add sig in too, e.g. FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|p:P4152 [] }. (Note: p:P4152 rather than ps:P4152 here, since the filter tests the item, and ps:P4152 hangs off the statement node rather than the item.)
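
For clarity, here's that amended filter dropped into a minimal standalone query (illustrative only; the class pattern is the same as in the full query above):

# Keep any record that has a PUID, extension, MIME type, OR a
# signature statement (p:P4152 tests the item for the statement):
SELECT DISTINCT ?uri
WHERE
{
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|p:P4152 [] }.
}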

@anjackson (Contributor)

Thanks @ross-spencer - FWIW I'm in the midst of writing things up and I'm not at all sure I'm quite there yet. The wdt:P31/wdt:P279* wd:Q235557 path seems to be missing some records (e.g. no *.psd!), and I'm seeing different variations in different places (wdt:P31*/wdt:P279*, p:P31/ps:P31/wdt:P279*) which I can't say I fully understand at this point.

But the FILTER thing seems to help with the overall size/performance.

@ross-spencer (Collaborator)

@anjackson there was some explanation of these patterns in ffdev-info/wikidp-issues#24 (comment) via @BertrandCaron that may be helpful?

re: the PSD issue, is this why you included the UNION with file format family? Did it work?

@anjackson (Contributor)

@ross-spencer Yes, adding that UNION brought in PSD, which is declared as an instance of File Format Family but not of File Format (other families appear to be explicitly declared as instances of both). But so did using P31* instead of UNION, as a File Format Family is an instance of a File Format. At the time of writing, the UNION matches 69,961 records (un-FILTERed) and P31* matches 70,363, so something else is going on too. This is what I'm attempting to write up. The two variants are sketched below.
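
Roughly, the two variants being compared look like this. Note that wd:QXXXXXXX below is a placeholder for the file format family item, not its real QID:

# Variant 1: explicit UNION of file format and file format family
SELECT DISTINCT ?uri
WHERE
{
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }
  UNION
  { ?uri wdt:P31/wdt:P279* wd:QXXXXXXX }
}

# Variant 2: widen the instance-of step to zero-or-more hops, so the
# classes themselves, and instances of instances (e.g. a format that is
# an instance of a family that is in turn an instance of file format),
# are also matched
SELECT DISTINCT ?uri
WHERE
{
  ?uri wdt:P31*/wdt:P279* wd:Q235557 .
}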

@anjackson (Contributor)

FWIW, here's what I've written up so far: https://anjackson.net/2024/07/12/finding-formats-in-wikidata/
