Implementation of basic Wikidata identifier #138
Wikidata contains a host of file format information. Wikidata records provide format names, identifiers (PUID, LoC FDD), file extensions, MIME types, and format identification patterns, which makes it a handy extension to Siegfried's capabilities. This is the first cut of that work. We are able to harvest information from the Wikidata Query Service, create an identifier, and consume the information in that identifier to match file formats using format identification patterns taken directly from the service. We also attempt to return information not yet present in other identifiers, for example signature provenance. Wikidata records the provenance of a format identification pattern as well as the date it was added, and we try to replay this to the user to enrich what is available to them. This work is graciously supported by Yale University Library (Euan Cochran, Kat Thornton) and Richard Lehane. It has been fun trying to put this out there for folk.
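For readers unfamiliar with the harvest step, here is a minimal sketch of what querying the Wikidata Query Service and decoding its reply can look like. It is not the code in this PR. The query text, the `Format` struct, and the `parse` and `queryURL` helpers are hypothetical, the property IDs (P2748 for PUID, P1195 for file extension) are assumed, and the sample response is hand-written placeholder data in the standard SPARQL JSON results shape rather than real harvested output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// endpoint is the public SPARQL endpoint of the Wikidata Query Service.
const endpoint = "https://query.wikidata.org/sparql"

// query is an illustrative query only (assumption: P2748 is the PUID
// property and P1195 the file extension property). The PR's real query
// is not shown in this conversation.
const query = `SELECT ?format ?formatLabel ?puid ?ext WHERE {
  ?format wdt:P2748 ?puid .
  OPTIONAL { ?format wdt:P1195 ?ext . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} LIMIT 10`

// binding is one value in the SPARQL JSON results format.
type binding struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

// results mirrors the parts of the response envelope we need.
type results struct {
	Results struct {
		Bindings []map[string]binding `json:"bindings"`
	} `json:"results"`
}

// Format is a hypothetical flattened record for one result row.
type Format struct {
	URI, Name, PUID, Ext string
}

// queryURL builds the GET URL for the query, asking for JSON output.
func queryURL() string {
	v := url.Values{}
	v.Set("query", query)
	v.Set("format", "json")
	return endpoint + "?" + v.Encode()
}

// parse decodes a SPARQL JSON response into Format records.
// Variables missing from a row (for example an absent ?ext) decode as "".
func parse(data []byte) ([]Format, error) {
	var r results
	if err := json.Unmarshal(data, &r); err != nil {
		return nil, err
	}
	out := make([]Format, 0, len(r.Results.Bindings))
	for _, b := range r.Results.Bindings {
		out = append(out, Format{
			URI:  b["format"].Value,
			Name: b["formatLabel"].Value,
			PUID: b["puid"].Value,
			Ext:  b["ext"].Value,
		})
	}
	return out, nil
}

// sample is placeholder data, not a real Wikidata record.
const sample = `{"head":{"vars":["format","formatLabel","puid","ext"]},
"results":{"bindings":[{
  "format":{"type":"uri","value":"http://www.wikidata.org/entity/Q0"},
  "formatLabel":{"type":"literal","value":"Example Format"},
  "puid":{"type":"literal","value":"fmt/0"},
  "ext":{"type":"literal","value":"example"}}]}}`

func main() {
	fmt.Println(queryURL())
	formats, err := parse([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, f := range formats {
		fmt.Printf("%s (%s) .%s\n", f.Name, f.PUID, f.Ext)
	}
}
```

The sketch parses a canned response so it runs offline; a real harvest would fetch `queryURL()` over HTTP and pass the body to `parse`.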
Add archive formats to the Wikidata configuration so that Siegfried can attempt to decompress archive formats during processing. Skeleton tests have been added to the Wikidata work. IsArchive() handling has been updated to be a little more dynamic and, in theory, easier to extend for other identifiers. Tests have also been added for legacy compatibility.
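One way to picture a more dynamic archive lookup is a registry keyed by identifier namespace, so a new identifier only has to register its own IDs. The sketch below is an assumption about the general shape, not the IsArchive() implementation in this PR. `Register`, `registry`, and the two-argument `IsArchive` are hypothetical names, and the Wikidata ID is a placeholder. The PRONOM PUIDs are the usual ones for ZIP, TAR, and GZIP.

```go
package main

import "fmt"

// Archive enumerates the container types a scanner might unpack.
type Archive int

const (
	None Archive = iota
	Zip
	Gzip
	Tar
)

func (a Archive) String() string {
	switch a {
	case Zip:
		return "zip"
	case Gzip:
		return "gzip"
	case Tar:
		return "tar"
	}
	return "none"
}

// registry maps identifier namespace -> format ID -> archive type.
// Hypothetical structure for illustration only.
var registry = map[string]map[string]Archive{}

// Register lets any identifier declare which of its IDs are archives.
func Register(namespace, id string, a Archive) {
	if registry[namespace] == nil {
		registry[namespace] = map[string]Archive{}
	}
	registry[namespace][id] = a
}

// IsArchive reports the archive type for an ID, or None if the
// namespace or ID is unknown (lookups on a nil inner map are safe).
func IsArchive(namespace, id string) Archive {
	return registry[namespace][id]
}

func main() {
	Register("pronom", "x-fmt/263", Zip)
	Register("pronom", "x-fmt/265", Tar)
	Register("pronom", "x-fmt/266", Gzip)
	// Placeholder ID: the real Wikidata entity for ZIP is not given here.
	Register("wikidata", "Q-ZIP-PLACEHOLDER", Zip)

	fmt.Println(IsArchive("pronom", "x-fmt/263"))
	fmt.Println(IsArchive("wikidata", "Q-ZIP-PLACEHOLDER"))
	fmt.Println(IsArchive("loc", "fdd000354"))
}
```

Keeping the table as data means supporting another identifier is a matter of extra `Register` calls, with no new branch in the lookup itself.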
We provide some additional tests to make sure results come back from the identifier as expected. With this addition we add checks for extension mismatches and for container pattern matching when the Wikidata signature is combined with PRONOM. The Wikidata definitions file has been minimized in an attempt to improve test execution time.
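As a rough illustration of what an extension mismatch check involves, the sketch below compares a file's extension against the extensions recorded for the matched format. The function name `extensionMismatch` and its behaviour for formats with no recorded extensions are assumptions for the example, not the logic tested in this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// extensionMismatch reports whether filename's extension is absent from
// the extensions known for the matched format. Comparison ignores case
// and a leading dot. With no known extensions there is nothing to
// compare against, so no mismatch is reported (an assumption).
func extensionMismatch(filename string, known []string) bool {
	if len(known) == 0 {
		return false
	}
	ext := strings.ToLower(strings.TrimPrefix(filepath.Ext(filename), "."))
	for _, k := range known {
		if strings.ToLower(strings.TrimPrefix(k, ".")) == ext {
			return false
		}
	}
	return true
}

func main() {
	known := []string{"png"}
	fmt.Println(extensionMismatch("photo.PNG", known))
	fmt.Println(extensionMismatch("photo.jpg", known))
}
```

A test along these lines would identify a file by signature first and then assert that the mismatch warning appears only when the name disagrees with the format.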
Hi Ross - have read thru now and that's a huge amount of work you did!
It all looks pretty good to me.
My one reservation is the "sourceinline" flag: for me, the fewer the flags the better & it would be nicer just to select one preferred way of displaying this field. Happy for this flag to stay in for now (so Kat and Euan can try both ways) - but can we drop it in the following release?
Q.: this PR seems to have a bunch of changes for the arc selection too - does this PR wrap that other one too, or should I merge both?
cheers
Richard
Awesome thanks @richardlehane and yeah, I share the same thoughts about the sourceinline flag. It will be great to commit to removing that in the next release. I feel too based on the previous discussions we're on the right path for the default view there so I think that will be possible. RE: Arc work - the code here lays the foundation for #141. I cherry-picked most of 141 for Wikidata as it just made sense to me that it was all there and easy to incorporate. That being said to provide arc selector capability now 141 does need a quick rebase which I was hoping to do against |
Merged! If you could prep #141 for merging too that'd be great
Implementation of basic Wikidata identifier
Implements an identifier based on the information recorded about file formats in Wikidata. (At least, a good first iteration of an identifier).
NB. will drop you an email later tomorrow!