for text matches, return more positive PRONOM IDs where these are available #114

Open
richardlehane opened this issue Apr 17, 2018 · 5 comments
@richardlehane
Owner

Where there are more precise format IDs for particular text encodings, e.g. x-fmt/282 for ANSI, return these instead of the generic x-fmt/111 plain text match.

(suggested by Greg Lepore)

@richardlehane
Owner Author

I'm looking at this now and it will be a little bit untidy.

My text detection routine (which is based on the file tool's algo) returns these text types:

ASCII      // ASCII text
UTF7       // UTF-7 Unicode
UTF8BOM    // UTF-8 Unicode (with BOM)
UTF8       // UTF-8 Unicode
UTF16LE    // Little-endian UTF-16 Unicode
UTF16BE    // Big-endian UTF-16 Unicode
LATIN1     // ISO-8859
EXTENDED   // Non-ISO extended-ASCII
EBCDIC     // EBCDIC
EBCDICINT  // International EBCDIC

These don't map cleanly to PRONOM IDs.

PRONOM has these x-fmt IDs for various text types:

"x-fmt/14 (Macintosh Text File)"
"x-fmt/16 (Unicode Text File)"
"x-fmt/21 (7-bit ANSI Text)"
"x-fmt/22 (7-bit ASCII Text)"
"x-fmt/282 (8-bit ANSI Text)"
"x-fmt/283 (8-bit ASCII Text)"

These are all outline records and their meaning is a bit ambiguous. E.g. the ASCII that my text routine returns is ASCII proper (i.e. in the 0-127 byte range), so we could link it to x-fmt/22. But what is 8-bit ASCII? Does it map to Extended (i.e. extended Mac and IBM PC ASCII)? What is ANSI? Wikipedia suggests ANSI has no well-defined meaning. Does it map to Latin1?
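
To make the three single-byte buckets concrete, here is a minimal Python sketch of how bytes might be sorted into ASCII, LATIN1 and EXTENDED. It is a simplification, not the actual routine: it assumes the only distinction is the byte range (0x00-0x7F, 0xA0-0xFF, 0x80-0x9F), ignores control characters, and skips the Unicode checks that would normally run first. The function name `classify_8bit` is hypothetical.

```python
def classify_8bit(data: bytes) -> str:
    """Rough bucket for single-byte text (assumption: byte ranges only)."""
    # ASCII proper: every byte is in the 0-127 range.
    if all(b < 0x80 for b in data):
        return "ASCII"
    # ISO-8859 leaves 0x80-0x9F unused for printable characters,
    # so text with high bytes but none in that range is treated as Latin1.
    if not any(0x80 <= b <= 0x9F for b in data):
        return "LATIN1"
    # Anything using 0x80-0x9F is non-ISO extended ASCII (Mac, IBM PC, etc.).
    return "EXTENDED"
```

For example, `classify_8bit("café".encode("latin-1"))` falls into the LATIN1 bucket, while bytes such as 0x93 and 0x94 (curly quotes in some Windows code pages) push the result to EXTENDED.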

Complicating things further, the PRONOM database has unique IDs for UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE. But the IDs aren't traditional fmt or x-fmt IDs, they are "chr" IDs: chr/1, chr/2 etc. I suggest I could ignore these and just map the Unicode family to x-fmt/16?

So it won't be a clean mapping, but I suggest we could do this:

ASCII => x-fmt/22
UTF7 => x-fmt/16
UTF8BOM => x-fmt/16
UTF8 => x-fmt/16
UTF16LE => x-fmt/16
UTF16BE => x-fmt/16
LATIN1 => x-fmt/282 // 8-bit ANSI
EXTENDED => x-fmt/283 // 8-bit ASCII
EBCDIC => x-fmt/111
EBCDICINT => x-fmt/111
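
As a sketch only, the proposed mapping could be held in a lookup table with x-fmt/111 as the fall-back. The names `TEXT_PUIDS`, `PLAIN_TEXT` and `puid_for` are hypothetical and not taken from the siegfried code base; the PUIDs are the ones listed above.

```python
# Generic plain text PUID, used when no more specific ID is proposed.
PLAIN_TEXT = "x-fmt/111"

# Proposed text type => PUID mapping (hypothetical table name).
TEXT_PUIDS = {
    "ASCII": "x-fmt/22",
    "UTF7": "x-fmt/16",
    "UTF8BOM": "x-fmt/16",
    "UTF8": "x-fmt/16",
    "UTF16LE": "x-fmt/16",
    "UTF16BE": "x-fmt/16",
    "LATIN1": "x-fmt/282",    # 8-bit ANSI
    "EXTENDED": "x-fmt/283",  # 8-bit ASCII
}


def puid_for(text_type: str) -> str:
    """Return the proposed PUID, falling back to generic plain text."""
    # EBCDIC and EBCDICINT are not in the table, so they keep x-fmt/111.
    return TEXT_PUIDS.get(text_type, PLAIN_TEXT)
```

Keeping the EBCDIC types out of the table means they inherit the default rather than needing explicit entries.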

Thoughts anyone?

@ross-spencer
Collaborator

ross-spencer commented Sep 6, 2018 via email

@marhop

marhop commented Sep 6, 2018

Worth mentioning that there's fmt/159 "EBCDIC-US". However, with all the different EBCDIC codepages, one PUID alone won't be enough.

I wonder if PRONOM has (or should have) another concept for encodings, like x-fmt/111 "plain text" + enc/123 "ASCII" or something similar ... Ignore that, I should have read your first post more thoroughly regarding the chr prefix.

@richardlehane
Owner Author

Thanks both for the comments.

So, with EBCDIC, could do:

ASCII => x-fmt/22
UTF7 => x-fmt/16
UTF8BOM => x-fmt/16
UTF8 => x-fmt/16
UTF16LE => x-fmt/16
UTF16BE => x-fmt/16
LATIN1 => x-fmt/282 // 8-bit ANSI
EXTENDED => x-fmt/283 // 8-bit ASCII
EBCDIC => fmt/159
EBCDICINT => x-fmt/111

But yes, I agree with Ross that it's a bit weird that EBCDICINT would now be the only encoding that returns x-fmt/111 plain text, which is the PUID that most users have come to expect text to default to.

I should note that sf's current behaviour is to map all these text encodings to x-fmt/111 (and I copied this approach from Archivematica's fido plug-in which runs the file tool on unknowns and marks them as x-fmt/111 if file says text... so it isn't just Rosetta that has adopted x-fmt/111 as the text fall-back ID).

Options:

  1. I could just leave it as is. Currently the "basis" field has the encoding information when there is a text match. I could make that basis field a bit richer by including these specific PRONOM IDs. Then the info is there if a power user wants to parse it out of that field.

  2. I could leave the default as is but make the mapping configurable through roy.

Such an option could be a simple boolean flag if we're agreed on the list above:

roy build -textenc

Or a roy option could allow users to give their own text-encoding/PUID map like so:

roy build -textenc=ASCII,x-fmt/22,UTF7,x-fmt/16,UTF8BOM,x-fmt/16,UTF8,x-fmt/16,UTF16LE,x-fmt/16,UTF16BE,x-fmt/16,LATIN1,x-fmt/282,EXTENDED,x-fmt/283,EBCDIC,fmt/159,EBCDICINT,x-fmt/111
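
To illustrate the second form, here is a rough Python sketch of how such a comma-separated value might be turned into an encoding-to-PUID map. The `-textenc` flag is only a proposal in this thread, and `parse_textenc` is a hypothetical helper, not part of roy.

```python
def parse_textenc(value: str) -> dict:
    """Parse 'ENC,PUID,ENC,PUID,...' into an {encoding: PUID} dict."""
    parts = [p.strip() for p in value.split(",")]
    # The list alternates encoding name and PUID, so it must be even in length.
    if len(parts) % 2 != 0:
        raise ValueError("expected alternating encoding,PUID pairs")
    # Even positions are encoding names, odd positions are their PUIDs.
    return dict(zip(parts[0::2], parts[1::2]))
```

With this shape, `parse_textenc("ASCII,x-fmt/22,EBCDIC,fmt/159")` gives a two-entry map, and any encoding the user leaves out could simply keep the x-fmt/111 default.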

@marhop

marhop commented Sep 7, 2018

Wouldn't it be wise to sort out the format vs. encoding thing in PRONOM first? I wonder what was intended with the chr/ entries. Maybe @Dclipsham can help?
