error running roy harvest -wikidata
#183
Comments
seeing/learning that error code 429 has to do with too many requests hitting the server from the client - is there a way to rate limit the request from roy? or another way to get a Wikidata signature file to start with? |
that's exactly it @EG-tech. Tyler had the same issue a while back. (Via email request so not on GitHub). We might need to put it into the FAQ. Some notes on what I wrote to Tyler:
Long-term, something approaching rate limiting may work. Right now it's just a single request asking for a lot of data. In the short-to-medium term, this pull request should mean you can grab an identifier from Richard's itforarchivists server, which will let you get up and running: #178 (the PR just needs review (and fixes) and merging). EDIT: NB. Tyler just tried it again later in the day or the next morning and it worked. |
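For anyone scripting around the 429 in the meantime, the "try again later" approach can be automated on the client side. The following is a minimal sketch under stated assumptions, not something roy does: `fetch_with_backoff` and the `send` callable are hypothetical names, and it assumes the server may send a `Retry-After` header with its 429 response (looked up here with that exact capitalisation).

```python
import random
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, body) as returned by the caller's own HTTP code.
Response = Tuple[int, Mapping[str, str], bytes]


def fetch_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call send() until it stops answering 429, waiting between attempts."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"unexpected HTTP status {status}")
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            # Respect the server's hint when it gives one in seconds.
            delay = float(retry_after)
        else:
            # Otherwise back off exponentially, with jitter of up to one second.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
        sleep(delay)
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Here `send` would wrap whatever performs the real request. Since the harvest is a single large request rather than many small ones, this only spaces out retries; it does not make the query itself any lighter.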
thanks @ross-spencer!! that all makes sense, thanks for confirming and I'll play with your suggestion when I get the chance. amazing work! |
ah, thanks @EG-tech 🙂 |
NB. Just to report, we are still seeing this issue in places. I haven't been able to determine when a harvest call is likely to succeed, other than that it seems to work better in Europe than on the US West Coast. |
cc. @thorsted Someone reached out at the last talk I gave about the Wikidata integration - specifically about long-running queries. I discovered this was because they run a mirror without timeouts for a cost per query. Their service and another example are linked to below: (I don't think this is the way to go but it's useful to know about) |
@anjackson just updated our SPARQL query on the digipres format explorer, the magic is in the FILTER expression, and cuts results from 70,000 to 17,000 (approx.) worth a try to see if it improves performance?
SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
WHERE
{
  # Return records of type File Format or File Format Family (via instance or subclass chain):
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
  # Only return records that have at least one useful format identifier
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.
  OPTIONAL { ?uri wdt:P2748 ?puid. } # PUID is used to map to PRONOM signatures
  OPTIONAL { ?uri wdt:P1195 ?extension. } # File extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype. } # IANA Media Type
  OPTIONAL { ?uri p:P4152 ?object; # Format identification pattern statement
    OPTIONAL { ?object pq:P3294 ?encoding. } # We don't always have an encoding
    OPTIONAL { ?object ps:P4152 ?sig. } # We always have a signature
    OPTIONAL { ?object pq:P2210 ?relativity. } # Relativity to beginning or end of file
    OPTIONAL { ?object pq:P4153 ?offset. } # Offset relative to the relativity
    OPTIONAL { ?object prov:wasDerivedFrom ?provenance;
      OPTIONAL { ?provenance pr:P248 ?reference;
                             pr:P813 ?date.
      }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY ?uri
@thorsted have you tried the custom sparql technique? https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#using-the-custom-wikibase-functionality-for-wikidata <-- any chance you could try this sparql above to see if it returns more reliably? (I can create a test binary too) |
nb. although, this query needs a PUID, or MIMEType, or Extension, and there might be wikidata records without these, so maybe we need to add in sig... e.g. |
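One possible reading of "add in sig", offered as an assumption rather than as the snippet originally posted here: extend the FILTER's property path with `p:P4152`, the format identification pattern statement the query above already uses, so that records carrying only a signature are kept too. The Python helper below simply assembles that clause as text. `build_identifier_filter` is a hypothetical name, and the resulting query has not been run against the Wikidata endpoint.

```python
from typing import Sequence


def build_identifier_filter(predicates: Sequence[str], subject: str = "?uri") -> str:
    """Return a FILTER EXISTS clause matching records that have any of the predicates."""
    if not predicates:
        raise ValueError("at least one predicate is required")
    # "|" is SPARQL property-path alternation: any one of the predicates may match.
    path = "|".join(predicates)
    return f"FILTER EXISTS {{ {subject} {path} [] }}."


# The three predicates from the query above, plus p:P4152 (assumed addition for "sig").
print(build_identifier_filter(["wdt:P2748", "wdt:P1195", "wdt:P1163", "p:P4152"]))
```

The printed line would replace the existing FILTER EXISTS line in the query. Whether it keeps the result set small enough to avoid the timeouts would need checking.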
Thanks @ross-spencer - FWIW I'm in the midst of writing things up and I'm not at all sure I'm quite there yet. The But the |
@anjackson there was some explanation of these patterns here ffdev-info/wikidp-issues#24 (comment) via @BertrandCaron that may be helpful? re: the PSD issue, this is why you included the UNION of file format family? did it work? |
@ross-spencer Yes, adding that UNION brought in PSD, which is declared as an instance of |
FWIW, here's what I've written up so far: https://anjackson.net/2024/07/12/finding-formats-in-wikidata/ |
I'm trying out the instructions here and am getting the following error/output when trying to run
$ roy harvest -wikidata
to start off:
I'm on Ubuntu 20.04 with the latest siegfried release (1.9.2), is there something obvious I'm doing wrong? (@ross-spencer?)