Implement digippres.org changes and min sig length #253

Merged

ross-spencer commented Jul 13, 2024

A filter has been added to the default query to restrict the results to those relevant to us. This reduces the time the Wikidata Query Service takes to answer the query, as well as the time required to generate provenance.

Additionally, because of TrID issues, we implement a minimum signature length (default 6, i.e. 3 bytes), which reduces the time even further.
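
To illustrate the minimum signature length, here is a minimal sketch in Go. It is not roy's implementation: `minSigLen` and `longEnough` are hypothetical names, and the sketch assumes signatures are hex-encoded strings, so that a length of 6 characters corresponds to 3 bytes, as described in the commit message.

```go
package main

import "fmt"

// minSigLen is a hypothetical default: 6 hex characters, i.e. 3 bytes.
const minSigLen = 6

// longEnough reports whether a signature string has at least minLen
// characters. Assumption: sig is hex-encoded, two characters per byte.
func longEnough(sig string, minLen int) bool {
	return len(sig) >= minLen
}

func main() {
	// Example signatures, chosen for illustration only.
	for _, sig := range []string{"FFD8", "FFD8FF", "89504E470D0A1A0A"} {
		fmt.Printf("%-18s bytes=%d keep=%v\n", sig, len(sig)/2, longEnough(sig, minSigLen))
	}
}
```

Under these assumptions a 2-byte signature such as `FFD8` would be dropped, while anything of 3 bytes or more would be kept.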

Connected to: ffdev-info/wikidp-issues#32
Connected to: #183
Connected to: ffdev-info/wikidp-issues#38
Co-authored-by: @anjackson

NB: default roy can no longer harvest from Wikidata on my computer, at least not within 10 minutes. It may succeed if left running longer.

With filter:

time ./roy harvest --wikidata
2024/07/13 14:10:48 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2024/07/13 14:10:48 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2024/07/13 14:10:48 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2024/07/13 14:15:37 Roy (Wikidata): Harvesting Wikidata definitions '/home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0' complete

real	4m48.907s
user	0m10.730s
sys	0m2.906s
./roy build -wikidata
2024/07/13 15:21:55 Roy (Wikidata): Congratulations: doing something with the Wikidata identifier package!
2024/07/13 15:21:55 Roy (Wikidata): Opening Wikidata definitions: /home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0
2024/07/13 15:22:40 {
  "AllSparqlResults": 16749,
  "CondensedSparqlResults": 13720,
  "SparqlRowsWithSigs": 10609,
  "RecordsWithPotentialSignatures": 9133,
  "FormatsWithBadHeuristics": 60,
  "RecordsWithSignatures": 9073,
  "MultipleSequences": 12,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 253,
  "RecordCountWithLintingMessages": 199
}
2024/07/13 15:22:40 Roy (Wikidata): Building identifiers set from PRONOM
2024/07/13 15:22:50 Roy (Wikidata): In Infos()... length formats: '13720' no-pronom: 'false'
2024/07/13 15:22:50 Roy (Wikidata): Adding Glob signatures to identifier...
2024/07/13 15:22:50 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:22:51 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:22:52 Roy (Wikidata): Adding Wikidata Byte signatures to identifier...
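
For anyone who wants to compare runs programmatically, the JSON summary printed by the build can be decoded with Go's standard library. This is only a sketch: the `summary` struct and `parseSummary` are hypothetical names that mirror a subset of the keys in the output above, not types taken from roy.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// summary mirrors a subset of the keys in the build output.
// Hypothetical type for illustration; keys not listed here are ignored.
type summary struct {
	AllSparqlResults               int
	CondensedSparqlResults         int
	SparqlRowsWithSigs             int
	RecordsWithPotentialSignatures int
	FormatsWithBadHeuristics       int
	RecordsWithSignatures          int
}

// parseSummary decodes one JSON summary block.
func parseSummary(raw string) (summary, error) {
	var s summary
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// withFilter holds the numeric counts from the "With filter" run above.
const withFilter = `{
  "AllSparqlResults": 16749,
  "CondensedSparqlResults": 13720,
  "SparqlRowsWithSigs": 10609,
  "RecordsWithPotentialSignatures": 9133,
  "FormatsWithBadHeuristics": 60,
  "RecordsWithSignatures": 9073
}`

func main() {
	s, err := parseSummary(withFilter)
	if err != nil {
		panic(err)
	}
	// Share of SPARQL rows that carry a signature at all.
	share := float64(s.SparqlRowsWithSigs) / float64(s.AllSparqlResults) * 100
	fmt.Printf("rows with signatures: %d of %d (%.1f%%)\n",
		s.SparqlRowsWithSigs, s.AllSparqlResults, share)
}
```

In this run, RecordsWithPotentialSignatures minus FormatsWithBadHeuristics equals RecordsWithSignatures (9133 - 60 = 9073); whether that relationship always holds is an assumption drawn from the numbers, not from roy's documentation.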

With a minimum signature length of 6 (3 bytes):

time ./roy harvest -wikidata
2024/07/13 15:54:37 Roy (Wikidata): Harvesting Wikidata definitions: lang 'en'
2024/07/13 15:54:37 Roy (Wikidata): Harvesting definitions from: 'https://query.wikidata.org/sparql'
2024/07/13 15:54:37 Roy (Wikidata): Harvesting revision history from: 'https://www.wikidata.org/'
2024/07/13 15:57:21 Roy (Wikidata): Harvesting Wikidata definitions '/home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0' complete

real	2m43.534s
user	0m6.724s
sys	0m1.961s
./roy build -wikidata
2024/07/13 15:57:54 Roy (Wikidata): Congratulations: doing something with the Wikidata identifier package!
2024/07/13 15:57:54 Roy (Wikidata): Opening Wikidata definitions: /home/r0ss/siegfried/wikidata/wikidata-definitions-3.0.0
2024/07/13 15:58:09 {
  "AllSparqlResults": 9284,
  "CondensedSparqlResults": 8118,
  "SparqlRowsWithSigs": 9284,
  "RecordsWithPotentialSignatures": 8118,
  "FormatsWithBadHeuristics": 46,
  "RecordsWithSignatures": 8072,
  "MultipleSequences": 12,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 195,
  "RecordCountWithLintingMessages": 156
}
2024/07/13 15:58:09 Roy (Wikidata): Building identifiers set from PRONOM
2024/07/13 15:58:19 Roy (Wikidata): In Infos()... length formats: '8118' no-pronom: 'false'
2024/07/13 15:58:19 Roy (Wikidata): Adding Glob signatures to identifier...
2024/07/13 15:58:19 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:58:19 Roy (Wikidata): Adding container signatures to identifier...
2024/07/13 15:58:20 Roy (Wikidata): Adding Wikidata Byte signatures to identifier...
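
The two runs above can be compared with a little arithmetic. The sketch below uses only numbers copied from the logs; `pctReduction` and `seconds` are helper names introduced here for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// pctReduction returns how much smaller after is than before, in percent.
func pctReduction(before, after float64) float64 {
	return (before - after) / before * 100
}

// seconds parses a duration such as "4m48.907s", the format printed by `time`.
func seconds(s string) float64 {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d.Seconds()
}

func main() {
	// "real" harvest time: filter only vs. filter plus minimum length.
	fmt.Printf("harvest wall-clock time: %.1f%% lower\n",
		pctReduction(seconds("4m48.907s"), seconds("2m43.534s")))
	// Counts taken from the two JSON summaries.
	fmt.Printf("AllSparqlResults: %.1f%% lower\n", pctReduction(16749, 9284))
	fmt.Printf("RecordsWithSignatures: %.1f%% lower\n", pctReduction(9073, 8072))
}
```

On these figures, the minimum signature length cuts harvest time by roughly 43% and the raw SPARQL results by roughly 45%, while the number of records with signatures falls by about 11%. Each configuration was timed once here, and the timings depend on the query service, so treat the percentages as indicative.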

ross-spencer and others added 3 commits July 13, 2024 15:24
A filter is introduced to return more relevant results, rather than
nearly everything on Wikidata. The query also introduces file format
family signatures, which Siegfried can handle in the same way and which
fill some gaps in identification.

Co-authored-by: Andrew Jackson <[email protected]>
A minimum signature length reduces the query time further and tackles
some of the issues in: ffdev-info/wikidp-issues#32

The signature length can be configured, but the default while we
calibrate is '6', i.e. 3 bytes.
@ross-spencer

@richardlehane cc. @thorsted -- is there any chance we can build a release candidate with these changes to trial them?

richardlehane changed the base branch from main to develop July 15, 2024 02:27
richardlehane merged commit 09ca73a into richardlehane:develop Jul 15, 2024
12 checks passed
@richardlehane

@ross-spencer @thorsted these changes are now on the develop branch and have been built as a release candidate (Version 1.11.2-rc0).
