Implementation of basic Wikidata identifier #138

ross-spencer · 2020-02-25T03:25:59Z

Implements an identifier based on the information recorded about file formats in Wikidata. (At least, a good first iteration of an identifier).

NB. will drop you an email later tomorrow!

Wikidata contains a host of file format information. Wikidata records provide format names, identifiers (PUID, LoC FDD), file extensions, MIME types, and format identification patterns, so it could be quite handy as an extension to Siegfried's capabilities. This is the first cut of that work. We are able to harvest information from the Wikidata Query Service, create an identifier, and consume the information in that identifier to match file formats using format identification patterns taken directly from the service. We also attempt to return information not yet present in other identifiers, for example signature provenance. Wikidata records the provenance of a format identification pattern as well as the date it was added, and we try to replay this to the user to enrich what is available to them. This work is graciously supported by Yale University Library (Euan Cochran, Kat Thornton) and Richard Lehane. It has been fun trying to put this out there for folk.
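To illustrate the harvesting step described above, here is a minimal sketch of turning a SPARQL JSON response (the standard results layout a query service returns) into simple format records. It is not the code in this PR: the `formatRecord` type, the `parseFormats` function, and the query variable names (`format`, `formatLabel`, `puid`, `extension`, `sig`) are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"strings"
)

// binding is one variable value in a SPARQL JSON result row.
type binding struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

// sparqlResults mirrors the standard SPARQL JSON results layout.
type sparqlResults struct {
	Results struct {
		Bindings []map[string]binding `json:"bindings"`
	} `json:"results"`
}

// formatRecord is a hypothetical, simplified record for this sketch;
// it is not a Siegfried type.
type formatRecord struct {
	QID       string // Wikidata entity ID, taken from the entity URI
	Name      string
	PUID      string
	Extension string
	Signature string // format identification pattern, if any
}

// parseFormats converts a SPARQL JSON response into format records.
// The variable names looked up here are assumptions about the query.
func parseFormats(data []byte) ([]formatRecord, error) {
	var res sparqlResults
	if err := json.Unmarshal(data, &res); err != nil {
		return nil, err
	}
	records := make([]formatRecord, 0, len(res.Results.Bindings))
	for _, row := range res.Results.Bindings {
		uri := row["format"].Value
		records = append(records, formatRecord{
			// Keep only the trailing path segment of the entity URI.
			QID:       uri[strings.LastIndex(uri, "/")+1:],
			Name:      row["formatLabel"].Value,
			PUID:      row["puid"].Value,
			Extension: row["extension"].Value,
			Signature: row["sig"].Value,
		})
	}
	return records, nil
}
```

Variables missing from a row simply come back as empty strings, which suits optional properties such as a signature that many records will not carry.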

Add archive formats to the Wikidata configuration so that Siegfried can attempt to decompress archive formats during processing. Skeleton tests have been added to the Wikidata work. IsArchive() handling has been updated to be a little more dynamic and, in theory, easier to extend for other identifiers. Tests have also been added for legacy compatibility.
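One way to picture the "more dynamic" archive handling mentioned above is a lookup table that each identifier can register its own IDs into. This is only a sketch of the idea, not the IsArchive() implementation in this PR: the `Archive` constants, the `archiveRegistry` type, and its methods are hypothetical names, and the Wikidata ID in the usage note is a placeholder.

```go
package main

// Archive is a hypothetical enumeration of archive kinds for this sketch.
type Archive int

const (
	None Archive = iota // not an archive
	Zip
	Gzip
	Tar
)

// archiveRegistry maps a format ID from any identifier (for example a PUID
// or a Wikidata ID) to the archive kind it represents.
type archiveRegistry map[string]Archive

// register associates one or more format IDs with an archive kind, so a new
// identifier only needs to add its own IDs rather than change the lookup.
func (r archiveRegistry) register(kind Archive, ids ...string) {
	for _, id := range ids {
		r[id] = kind
	}
}

// isArchive returns the archive kind for id, or None if id is unknown.
func (r archiveRegistry) isArchive(id string) Archive {
	return r[id]
}
```

With this shape, a call such as `reg.register(Zip, "x-fmt/263", "Q-ZIP-PLACEHOLDER")` would let the PRONOM ID and a Wikidata ID resolve to the same archive kind, and unknown IDs fall through to `None`.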

We provide some additional tests to make sure results come back from the identifier as expected. With this addition we add checks for extension mismatches and container pattern matching when the Wikidata signature is combined with PRONOM. The Wikidata definitions file has been minimized in an attempt to improve test execution time.
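As a rough illustration of what an extension mismatch check involves, the sketch below compares a file's extension against the extensions recorded for the matched format. The `extensionMismatch` function is a hypothetical helper written for this example and is not taken from the PR or from Siegfried.

```go
package main

import (
	"path/filepath"
	"strings"
)

// extensionMismatch reports whether filename's extension is absent from the
// extensions expected for the matched format. The comparison ignores case
// and any leading dot. With no expected extensions there is nothing to
// compare against, so no mismatch is reported.
func extensionMismatch(filename string, expected []string) bool {
	if len(expected) == 0 {
		return false
	}
	ext := strings.TrimPrefix(strings.ToLower(filepath.Ext(filename)), ".")
	for _, e := range expected {
		if strings.TrimPrefix(strings.ToLower(e), ".") == ext {
			return false
		}
	}
	return true
}
```

A test along these lines would assert that a file matched by signature as one format but named with another format's extension is flagged, while a correctly named file is not.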

richardlehane

Hi Ross - have read thru now and that's a huge amount of work you did!
It all looks pretty good to me.

My one reservation is the "sourceinline" flag: for me, the fewer the flags the better & it would be nicer just to select one preferred way of displaying this field. Happy for this flag to stay in for now (so Kat and Euan can try both ways) - but can we drop it in the following release?

Q.: this PR seems to have a bunch of changes for the arc selection too - does this PR wrap that other one too, or should I merge both?

cheers
Richard

ross-spencer · 2020-09-21T13:51:01Z

Awesome, thanks @richardlehane, and yeah, I share the same thoughts about the sourceinline flag. It will be great to commit to removing that in the next release. I also feel, based on the previous discussions, that we're on the right path for the default view there, so I think that will be possible.

RE: Arc work - the code here lays the foundation for #141. I cherry-picked most of #141 for Wikidata as it just made sense to me that it was all there and easy to incorporate. That being said, to provide arc selector capability now, #141 does need a quick rebase, which I was hoping to do against develop, so perhaps I can do that later today? And then folks with the new version can select their archives! Or we can hold off. I think it's just an hour or so needed to bring #141 in line with a develop branch which incorporates this one.

richardlehane · 2020-09-21T14:45:37Z

Merged! If you could prep #141 for merging too that'd be great

ross-spencer force-pushed the dev/yul-wikidata-integration branch 2 times, most recently from 66ffbaf to 08bd82a Compare February 25, 2020 03:43

ross-spencer force-pushed the dev/yul-wikidata-integration branch from 7cb9001 to a619799 Compare April 15, 2020 02:12

ross-spencer force-pushed the dev/yul-wikidata-integration branch from e91027d to f537660 Compare April 23, 2020 04:56

ross-spencer force-pushed the dev/yul-wikidata-integration branch from 7755778 to f071fd7 Compare June 22, 2020 00:38

ross-spencer force-pushed the dev/yul-wikidata-integration branch 3 times, most recently from 0520a3f to 708337c Compare July 16, 2020 04:35

ross-spencer force-pushed the dev/yul-wikidata-integration branch from 708337c to db30320 Compare August 9, 2020 05:32

ross-spencer force-pushed the dev/yul-wikidata-integration branch 8 times, most recently from 5f00812 to bc6923b Compare September 13, 2020 05:58

ross-spencer added 3 commits September 13, 2020 02:10

ross-spencer changed the title ~~WIP: Implementation of basic Wikidata identifier~~ Implementation of basic Wikidata identifier Sep 13, 2020

ross-spencer force-pushed the dev/yul-wikidata-integration branch from bc6923b to edc14bb Compare September 13, 2020 06:14

ross-spencer marked this pull request as ready for review September 13, 2020 06:16

ross-spencer requested a review from richardlehane September 13, 2020 06:17

ross-spencer self-assigned this Sep 13, 2020

richardlehane approved these changes Sep 21, 2020

richardlehane merged commit 205ba0b into develop Sep 21, 2020

ross-spencer deleted the dev/yul-wikidata-integration branch September 21, 2020 23:31

richardlehane added a commit that referenced this pull request Mar 20, 2023

Merge pull request #138 from richardlehane/dev/yul-wikidata-integration

a2144b3
