for text matches, return more positive PRONOM IDs where these are available #114
I'm looking at this now and it will be a little bit untidy. My text detection routine (which is based on the file tool's algorithm) returns these text types:
These don't map cleanly to PRONOM IDs. PRONOM has these x-fmt IDs for various text types:
These are all outline records and their meaning is a bit ambiguous. E.g. the ASCII that my text routine returns is ASCII proper (i.e. in the 0-127 byte range), so we could link it to x-fmt/22. But what is 8-bit ASCII? Does it map to Extended (i.e. extended Mac and IBM PC ASCII)? What is ANSI? Wikipedia suggests ANSI has no well-defined meaning. Does it map to Latin1? Complicating things further, the PRONOM database has unique IDs for UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE. But these IDs aren't traditional fmt or x-fmt IDs, they are "chr" IDs: chr/1, chr/2 etc. I suggest I could ignore these and just map to x-fmt/16 for the Unicode family? So it won't be a clean mapping, but I suggest we could do this:
Thoughts, anyone?
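To make the shape of such a mapping concrete, here is a minimal sketch. It uses only the PUIDs quoted above (x-fmt/22 for 7-bit ASCII, x-fmt/16 for the Unicode family, x-fmt/111 as the generic fallback). The encoding labels and the `puid_for` helper are hypothetical names for illustration, not siegfried's actual identifiers or the mapping proposed in this comment.

```python
# Hypothetical encoding labels; only the PUIDs come from this thread.
PLAIN_TEXT = "x-fmt/111"  # generic plain-text fallback

PRECISE_PUIDS = {
    "ascii": "x-fmt/22",     # ASCII proper, bytes in the 0-127 range
    "utf-8": "x-fmt/16",     # Unicode family collapsed onto one ID,
    "utf-16le": "x-fmt/16",  # ignoring the separate chr/ records
    "utf-16be": "x-fmt/16",
    "utf-32le": "x-fmt/16",
    "utf-32be": "x-fmt/16",
}


def puid_for(encoding: str) -> str:
    """Return the more precise PUID if one is known, else plain text."""
    return PRECISE_PUIDS.get(encoding.lower(), PLAIN_TEXT)
```

Any encoding without an unambiguous PRONOM record (Latin1, for example) simply falls through to x-fmt/111 in this sketch.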
x-fmt/111 is a plain-text file and I believe this is still a catch-all in Rosetta. I would suggest that this is ASCII, but I understand there are specific labels for that in PRONOM. That's really just a first observation. I think making x-fmt/111 EBCDIC would be like overriding one de facto standard with another.
Does SF have any other signatures that exist and map to PRONOM where DROID doesn’t also have an ability to identify the same information?
Worth mentioning that there's fmt/159 "EBCDIC-US". However with all the different EBCDIC codepages one PUID alone won't be enough.
Thanks both for the comments. So, with EBCDIC, we could do:
But, yes, I agree with Ross that it's a bit weird that EBCDICINT would now be the only encoding that returns x-fmt/111 plain text, which is the PUID that most users have come to expect text to default to. I should note that sf's current behaviour is to map all these text encodings to x-fmt/111. I copied this approach from Archivematica's fido plug-in, which runs the file tool on unknowns and marks them as x-fmt/111 if file says text, so it isn't just Rosetta that has adopted x-fmt/111 as the text fall-back ID. Options:
Such an option could be a simple boolean flag if we're agreed on the list above:
Or a roy option could allow users to give their own text-encoding/PUID map like so:
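As a rough illustration of the second idea, the sketch below parses a user-supplied encoding/PUID map and falls back to x-fmt/111 when no map is given or an encoding is missing from it. The `encoding=PUID` comma-separated syntax and the names `parse_text_map` and `resolve` are assumptions made for this example; they are not roy's actual option format.

```python
DEFAULT_PUID = "x-fmt/111"  # the fall-back ID users currently expect


def parse_text_map(spec: str) -> dict:
    """Parse a hypothetical 'encoding=PUID,encoding=PUID' string."""
    mapping = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing commas
        encoding, sep, puid = pair.partition("=")
        if not sep or not encoding.strip() or not puid.strip():
            raise ValueError(f"expected encoding=PUID, got {pair!r}")
        mapping[encoding.strip().lower()] = puid.strip()
    return mapping


def resolve(encoding: str, user_map: dict = None) -> str:
    """Without a map, keep the current behaviour: everything is x-fmt/111."""
    if user_map is None:
        return DEFAULT_PUID
    return user_map.get(encoding.lower(), DEFAULT_PUID)
```

For instance, `resolve("ASCII", parse_text_map("ascii=x-fmt/22, ebcdic=fmt/159"))` would give x-fmt/22, while calling `resolve("ASCII")` with no map keeps the existing x-fmt/111 default.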
Wouldn't it be wise to sort out the format vs. encoding thing in PRONOM first? I wonder what was intended with the chr/ entries. Maybe @Dclipsham can help?
Where there are more precise format IDs for particular text encodings e.g. x-fmt/282 for ANSI, return these instead of generic x-fmt/111 plain text match.
(suggested by Greg Lepore)