error running roy harvest -wikidata #183

Open · EG-tech opened this issue Apr 27, 2022 · 13 comments

@EG-tech commented Apr 27, 2022

I'm trying out the instructions here and am getting the following error/output when trying to run $ roy harvest -wikidata to start off:

2022/04/27 09:23:21 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2022/04/27 09:23:21 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2022/04/27 09:24:55 Error trying to retrieve SPARQL with revision history: warning: there were errors retrieving provenance from Wikibase API: wikiprov: unexpected response from server: 429

I'm on Ubuntu 20.04 with the latest siegfried release (1.9.2). Is there something obvious I'm doing wrong? (@ross-spencer?)

@EG-tech (Author) commented Apr 27, 2022

seeing/learning that error code 429 has to do with too many requests hitting the server from the client. Is there a way to rate limit the requests from roy? Or another way to get a Wikidata signature file to start with?

@ross-spencer (Collaborator) commented Apr 27, 2022

that's exactly it @EG-tech. Tyler had the same issue a while back (via an email request, so not on GitHub). We might need to put it into the FAQ.

Some notes on what I wrote to Tyler:

The Wikidata documentation on the query service (WDQS) is linked below, but it's not very clear, i.e. it talks about processing time, not how that translates to larger queries:

https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits

I have known there is a risk that this might happen, though I can't quantify for how many users or when. There have been times, for example when I have been testing, where I have run the query upwards of 30 times in a day.

We set a custom header on the request which should be recognized by WDQS and mitigate this issue somewhat; WDQS is friendlier to known user-agents than to unknown ones, for example.

Long-term, something approaching rate limiting may work. Right now it's just a single request asking for a lot of data.
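
For illustration only, one way to make each individual request smaller would be to page the harvest query with LIMIT/OFFSET instead of asking for everything at once. A minimal sketch against WDQS (not what roy currently does), reusing the properties the harvest cares about:

# Sketch: fetch format records in pages rather than one large request.
# ORDER BY gives a stable ordering so LIMIT/OFFSET paging is consistent.
SELECT DISTINCT ?uri ?puid ?extension ?mimetype
WHERE
{
  ?uri wdt:P31/wdt:P279* wd:Q235557 .       # instance of (a subclass of) file format
  OPTIONAL { ?uri wdt:P2748 ?puid.      }   # PRONOM PUID
  OPTIONAL { ?uri wdt:P1195 ?extension. }   # file extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype.  }   # IANA media type
}
ORDER BY ?uri
LIMIT 5000
OFFSET 0   # then 5000, 10000, ... on subsequent requests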

In the short-to-medium term, this pull request should mean you can grab an identifier from Richard's itforarchivists server, and it will let you get up and running: #178 (the PR just needs review (and fixes) and merging).

EDIT: NB. In Tyler's case, he just tried again later in the day (or the next morning) and it worked.

@EG-tech (Author) commented Apr 27, 2022

thanks @ross-spencer!! that all makes sense, thanks for confirming and I'll play with your suggestion when I get the chance. amazing work!

@ross-spencer (Collaborator)

ah, thanks @EG-tech 🙂

@ross-spencer (Collaborator)

The -update flag now supports Wikidata, which should provide a workaround for most people facing this issue. There's an underlying reliability issue that might still be solved here, as per the above.

@ross-spencer (Collaborator)

NB. Just to report, we are still seeing this issue in places. I haven't been able to determine when a harvest call is likely to be successful, other than that it seems to work better in Europe than on the US West Coast.

@ross-spencer (Collaborator)

cc. @thorsted

Someone reached out at the last talk I gave about the Wikidata integration, specifically about long-running queries. I discovered this was possible because they run a mirror without timeouts, at a cost per query. Their service and another example are linked below:

(I don't think this is the way to go but it's useful to know about)

@ross-spencer (Collaborator)

@anjackson just updated our SPARQL query on the digipres format explorer. The magic is in the FILTER expression, which cuts results from roughly 70,000 to 17,000. Worth a try to see if it improves performance?

SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
WHERE
{
  # Return records of type File Format or File Format Family (via instance or subclass chain):
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
      
  # Only return records that have at least one useful format identifier
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.       
  
  OPTIONAL { ?uri wdt:P2748 ?puid.      }          # PUID is used to map to PRONOM signatures
  OPTIONAL { ?uri wdt:P1195 ?extension. }          # File extension
  OPTIONAL { ?uri wdt:P1163 ?mimetype.  }          # IANA Media Type
  OPTIONAL { ?uri p:P4152 ?object;                 # Format identification pattern statement
    OPTIONAL { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding
    OPTIONAL { ?object ps:P4152 ?sig.        }     # We always have a signature
    OPTIONAL { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file
    OPTIONAL { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity
    OPTIONAL { ?object prov:wasDerivedFrom ?provenance;
       OPTIONAL { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY ?uri
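
To sanity-check the size reduction before relying on a query like this, a count-only variant of the same pattern can be run with and without the FILTER line. This is just an illustrative check, not part of the query above:

SELECT (COUNT(DISTINCT ?uri) AS ?formats)
WHERE
{
  # Same class pattern as the full query:
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.

  # Comment this line out to compare against the unfiltered total:
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163 [] }.
}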

@thorsted have you tried the custom SPARQL technique? https://github.com/richardlehane/siegfried/wiki/Wikidata-identifier#using-the-custom-wikibase-functionality-for-wikidata Any chance you could try the SPARQL above to see if it returns more reliably?

(I can create a test binary too)

via digipres/digipres.github.io#48 (comment)

@ross-spencer (Collaborator) commented Jul 11, 2024

NB. although, this query requires a PUID, or MIME type, or extension, and there might be Wikidata records without any of these that still have a signature, so maybe we need to add sig in too, e.g. FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|p:P4152 [] }. (Note: p:P4152 rather than ps:P4152 here, since the filter tests the item, and ps:P4152 hangs off the statement node rather than the item.)
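
For clarity, here's that amended filter dropped into a minimal standalone query (illustrative only; the class pattern is the same as in the full query above):

# Keep any record that has a PUID, extension, MIME type, OR a
# signature statement (p:P4152 tests the item for the statement):
SELECT DISTINCT ?uri
WHERE
{
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }.
  FILTER EXISTS { ?uri wdt:P2748|wdt:P1195|wdt:P1163|p:P4152 [] }.
}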

@anjackson (Contributor)

Thanks @ross-spencer - FWIW I'm in the midst of writing things up and I'm not at all sure I'm quite there yet. The wdt:P31/wdt:P279* wd:Q235557 path seems to be missing some records (e.g. no *.psd!), and I'm seeing different variations in different places (wdt:P31*/wdt:P279*, p:P31/ps:P31/wdt:P279*) which I can't say I fully understand at this point.

But the FILTER thing seems to help with the overall size/performance.

@ross-spencer (Collaborator)

@anjackson there was some explanation of these patterns in ffdev-info/wikidp-issues#24 (comment) via @BertrandCaron that may be helpful?

re: the PSD issue, is this why you included the UNION with file format family? Did it work?

@anjackson (Contributor)

@ross-spencer Yes, adding that UNION brought in PSD, which is declared as an instance of File Format Family but not of File Format (other families appear to be explicitly declared as instances of both). But so did using P31* instead of UNION, as a File Format Family is an instance of a File Format. At the time of writing, the UNION matches 69,961 records (un-FILTERed) and P31* matches 70,363, so something else is going on too. This is what I'm attempting to write up. The two variants are sketched below.
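
Roughly, the two variants being compared look like this. Note that wd:QXXXXXXX below is a placeholder for the file format family item, not its real QID:

# Variant 1: explicit UNION of file format and file format family
SELECT DISTINCT ?uri
WHERE
{
  { ?uri wdt:P31/wdt:P279* wd:Q235557 }
  UNION
  { ?uri wdt:P31/wdt:P279* wd:QXXXXXXX }
}

# Variant 2: widen the instance-of step to zero-or-more hops, so the
# classes themselves, and instances of instances (e.g. a format that is
# an instance of a family that is in turn an instance of file format),
# are also matched
SELECT DISTINCT ?uri
WHERE
{
  ?uri wdt:P31*/wdt:P279* wd:Q235557 .
}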

@anjackson (Contributor)

FWIW, here's what I've written up so far: https://anjackson.net/2024/07/12/finding-formats-in-wikidata/
