Implementation of basic Wikidata identifier #138

Merged
richardlehane merged 3 commits into develop from dev/yul-wikidata-integration
Sep 21, 2020

ross-spencer
Collaborator

@ross-spencer ross-spencer commented Feb 25, 2020

Implements an identifier based on the information recorded about file formats in Wikidata. (At least, a good first iteration of an identifier).

NB. will drop you an email later tomorrow!

@ross-spencer ross-spencer force-pushed the dev/yul-wikidata-integration branch 2 times, most recently from 66ffbaf to 08bd82a Compare February 25, 2020 03:43
@ross-spencer ross-spencer force-pushed the dev/yul-wikidata-integration branch 3 times, most recently from 0520a3f to 708337c Compare July 16, 2020 04:35
@ross-spencer ross-spencer force-pushed the dev/yul-wikidata-integration branch 8 times, most recently from 5f00812 to bc6923b Compare September 13, 2020 05:58
Wikidata contains a host of file format information. Wikidata records
provide information about format names, identifiers (PUID, LoC FDD),
file extensions, MIME types, and format identification patterns. This
means it might be quite handy as an extension to Siegfried's
capabilities.

This is the first cut of that work. We are able to harvest
information from the Wikidata Query Service, create an identifier,
and consume the information in the identifier to match file formats
using format identification patterns taken directly from the
service.
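
As a rough illustration of the harvesting step, the sketch below parses a response in the standard SPARQL JSON results layout into flat records. It is not the code from this PR: the `formatRecord` type, the `parseResults` helper, the variable names (`format`, `formatLabel`, `extension`, `signature`) and the sample row, including the entity `Q000`, are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// sample is an invented response in the SPARQL JSON results layout.
// The entity "Q000" and the variable names are placeholders, not real
// Wikidata content.
const sample = `{
  "head": {"vars": ["format", "formatLabel", "extension", "signature"]},
  "results": {"bindings": [{
    "format":      {"type": "uri", "value": "http://www.wikidata.org/entity/Q000"},
    "formatLabel": {"type": "literal", "value": "Example format"},
    "extension":   {"type": "literal", "value": "exa"},
    "signature":   {"type": "literal", "value": "89504E47"}
  }]}
}`

// binding is a single variable binding in a SPARQL JSON result row.
type binding struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

// sparqlResults covers only the parts of the response this sketch reads.
type sparqlResults struct {
	Results struct {
		Bindings []map[string]binding `json:"bindings"`
	} `json:"results"`
}

// formatRecord is a hypothetical flattened view of one harvested row.
type formatRecord struct {
	QID       string
	Label     string
	Extension string
	Signature string
}

// parseResults converts raw SPARQL JSON into records. Variables missing
// from a row are left as empty strings.
func parseResults(data []byte) ([]formatRecord, error) {
	var res sparqlResults
	if err := json.Unmarshal(data, &res); err != nil {
		return nil, err
	}
	records := make([]formatRecord, 0, len(res.Results.Bindings))
	for _, row := range res.Results.Bindings {
		uri := row["format"].Value
		records = append(records, formatRecord{
			// Keep only the trailing identifier of the entity URI.
			QID:       uri[strings.LastIndex(uri, "/")+1:],
			Label:     row["formatLabel"].Value,
			Extension: row["extension"].Value,
			Signature: row["signature"].Value,
		})
	}
	return records, nil
}

func main() {
	records, err := parseResults([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, rec := range records {
		fmt.Printf("%+v\n", rec)
	}
}
```

Keeping the parsing separate from the network call means a harvested response can be stored on disk and replayed when building the identifier, which is one plausible way to keep tests offline.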

We attempt to return information not yet present in other
identifiers, for example, signature provenance. Wikidata records
provenance as well as the date a format identification pattern
was added. We try to relay this to the user to enrich what is
available to them.

This work is graciously supported by Yale University Library
(Euan Cochrane, Kat Thornton) and Richard Lehane. It has been fun
trying to put this out there for folk.

Add archive formats to Wikidata configuration so that Siegfried can
attempt to decompress archive formats during processing.

Skeleton tests have been added to the Wikidata work.

IsArchive() handling has been updated to be a little more dynamic,
and theoretically easier to extend for other identifiers. Tests
have also been added for legacy compatibility.
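
One way to picture a more dynamic archive check is a registry keyed by identifier namespace, so a new identifier only has to register its own format IDs. This is a minimal sketch under that assumption and not Siegfried's actual IsArchive() implementation: `archiveKind`, `archiveRegistry`, `register` and `isArchive` are hypothetical names, and the Wikidata IDs are placeholders. The PRONOM PUIDs shown are the usual ones for ZIP, GZIP and TAR.

```go
package main

import "fmt"

// archiveKind is a hypothetical enumeration of archive types that a
// tool might know how to decompress.
type archiveKind int

const (
	notArchive archiveKind = iota // zero value: not a known archive
	zipArchive
	gzipArchive
	tarArchive
)

// archiveRegistry maps identifier namespace -> format ID -> archive kind.
type archiveRegistry map[string]map[string]archiveKind

// register records that a format ID in a namespace is an archive.
func (r archiveRegistry) register(namespace, id string, kind archiveKind) {
	if r[namespace] == nil {
		r[namespace] = make(map[string]archiveKind)
	}
	r[namespace][id] = kind
}

// isArchive returns the archive kind for an ID, or notArchive when the
// namespace or the ID is unknown.
func (r archiveRegistry) isArchive(namespace, id string) archiveKind {
	return r[namespace][id]
}

func main() {
	reg := archiveRegistry{}

	// PRONOM PUIDs for ZIP, GZIP and TAR.
	reg.register("pronom", "x-fmt/263", zipArchive)
	reg.register("pronom", "x-fmt/266", gzipArchive)
	reg.register("pronom", "x-fmt/265", tarArchive)

	// Placeholder Wikidata ID: a real configuration would use the
	// actual QID of the ZIP format.
	reg.register("wikidata", "Q-ZIP-PLACEHOLDER", zipArchive)

	fmt.Println(reg.isArchive("pronom", "x-fmt/263") == zipArchive)
	fmt.Println(reg.isArchive("wikidata", "Q-ZIP-PLACEHOLDER") == zipArchive)
	fmt.Println(reg.isArchive("pronom", "fmt/999999") == notArchive)
}
```

With a table like this, supporting another identifier is a matter of adding entries rather than another branch in a switch statement.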
We provide some additional tests to make sure results come back
from the identifier as expected. With this addition we add checks
for extension mismatches and container pattern matching when the
Wikidata signature is combined with PRONOM.
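
To make the extension mismatch check concrete, here is a minimal sketch of the idea: compare a file's extension against the extensions recorded for the matched format and flag a mismatch when none agree. The `extensionMismatch` helper is hypothetical and is not taken from the tests in this PR. It assumes a case-insensitive comparison and that a format with no recorded extensions never produces a mismatch.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extensionMismatch reports whether filename's extension is absent from
// the extensions recorded for the matched format. Formats with no
// recorded extensions never report a mismatch.
func extensionMismatch(filename string, known []string) bool {
	if len(known) == 0 {
		return false
	}
	// filepath.Ext includes the leading dot, so strip it before comparing.
	ext := strings.TrimPrefix(filepath.Ext(filename), ".")
	for _, k := range known {
		if strings.EqualFold(ext, k) {
			return false
		}
	}
	return true
}

func main() {
	pdfExtensions := []string{"pdf"}
	fmt.Println(extensionMismatch("report.PDF", pdfExtensions))
	fmt.Println(extensionMismatch("report.txt", pdfExtensions))
}
```

Here the first call reports no mismatch because the comparison ignores case, while the second reports one because `txt` is not among the recorded extensions.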

The Wikidata definitions file has been minimized in an attempt to
improve test execution time.
@ross-spencer ross-spencer changed the title WIP: Implementation of basic Wikidata identifier Implementation of basic Wikidata identifier Sep 13, 2020
@ross-spencer ross-spencer marked this pull request as ready for review September 13, 2020 06:16
@ross-spencer ross-spencer self-assigned this Sep 13, 2020
Owner

@richardlehane richardlehane left a comment

Hi Ross - have read thru now and that's a huge amount of work you did!
It all looks pretty good to me.

My one reservation is the "sourceinline" flag: for me, the fewer flags the better, and it would be nicer just to select one preferred way of displaying this field. Happy for this flag to stay in for now (so Kat and Euan can try both ways), but can we drop it in the following release?

Q.: this PR seems to have a bunch of changes for the arc selection too - does this PR wrap that other one too, or should I merge both?

cheers
Richard

@ross-spencer
Collaborator Author

Awesome, thanks @richardlehane, and yeah, I share the same thoughts about the sourceinline flag. It will be great to commit to removing that in the next release. Based on the previous discussions, I also feel we're on the right path for the default view there, so I think that will be possible.

RE: Arc work - the code here lays the foundation for #141. I cherry-picked most of #141 for Wikidata as it just made sense to me that it was all there and easy to incorporate. That said, to provide arc selector capability now, #141 does need a quick rebase, which I was hoping to do against develop, so I can perhaps do that later today? Then folks with the new version can select their archives! Or we can hold off. I think it takes just an hour or so to bring #141 in line with a develop branch that incorporates this one.

@richardlehane richardlehane merged commit 205ba0b into develop Sep 21, 2020
@richardlehane
Owner

Merged! If you could prep #141 for merging too that'd be great

@ross-spencer ross-spencer deleted the dev/yul-wikidata-integration branch September 21, 2020 23:31
richardlehane added a commit that referenced this pull request Mar 20, 2023
Implementation of basic Wikidata identifier