[advice] Siegfried fails to identify a txt file when the filename extension is "wrong" #257
Comments
Hey @amayita
What is siegfried doing here?
Siegfried does do text identification (ASCII and other text encodings) but it gives priority to PRONOM signatures (including external signatures like file extensions). Generally speaking, this means that you will see the results of the text identification...
However, if the file does have a file extension that's in the PRONOM database, siegfried gives priority to that signal, even if it means returning ...
Consider how many formats are text at their fundamental layer but in a structured format that may have a genuine PRONOM signature, e.g. XML formats with PRONOM signatures matching against namespaces or tags. If you had one of those XML formats that failed matching (because it was corrupt or differed somehow from the PRONOM signature), would you want siegfried to give an ...
You can change the defaults
Siegfried's defaults are generally conservative (it prefers to give ...
Can this behaviour be improved upstream?
You ask if there is any possibility of an upstream fix here. This would be to change the PRONOM database. I think it was fairly common for users to give ".doc" extensions to their plain text files and it might be reasonable to add the "doc" extension to PRONOM's text ID, x-fmt/111. You could ask the PRONOM team at TNA to do this but this change would affect downstream DROID users so might not be accepted.
If the TNA doesn't want this, you could make a custom signature of your own to do it. E.g. if you make a file like this, and put it in a "custom" folder within your siegfried home folder, then build with ...
What improvements can be made to siegfried here?
One improvement might be to add the results of siegfried's text matching to the warning message for an unknown file. E.g. your result might still be unknown with that long list of possible formats based on extensions, but I might be able to add information about the text encoding to that warning message.
Another possible improvement could be a new build flag or build mode for ...
Anyway, apologies again for the length of this message, I'd be very interested in feedback here.
cheers
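The priority order described in this comment can be sketched as a toy model. This is not siegfried's code or API: the function names, the result dictionary, and the extension table with its placeholder labels are all hypothetical, and the text check is deliberately crude. It only illustrates why a plain text file with a known extension can come back as unknown while the same bytes under an unrecognised name are reported as text.

```python
from typing import Optional

# Hypothetical extension table: placeholder labels, not real PRONOM entries.
EXTENSION_HINTS = {
    "doc": ["word-format-a", "word-format-b"],
}


def looks_like_text(data: bytes) -> bool:
    """Crude text check (assumption): non-empty and every byte is ASCII."""
    return bool(data) and data.isascii()


def identify(filename: str, data: bytes, byte_match: Optional[str] = None) -> dict:
    """Toy model of the priority order: byte match, then extension, then text.

    byte_match stands in for a successful byte-signature match, if there was one.
    """
    if byte_match is not None:
        return {"id": byte_match, "basis": "byte match"}

    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    hints = EXTENSION_HINTS.get(ext)
    if hints:
        # The extension signal takes priority, so the text result is not reported.
        return {
            "id": "UNKNOWN",
            "warning": "possible formats by extension: " + ", ".join(hints),
        }

    if looks_like_text(data):
        return {"id": "text", "basis": "text match"}
    return {"id": "UNKNOWN", "warning": "no match"}
```

In this model, identify("ASCII_text.doc", b"hello\n") yields an unknown result with an extension-based warning, while the same bytes under a name with no known extension yield the text result.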
@amayita sorry I should have checked your bio before posting rather than after!!! ... wherever I said "ask the archivematica team to do X", please just replace with, "do X" :)
Hello back, Richard! You lost me at PRONOM, humble sysadmin here 😄 LOL
Still, thank you so much for your quick and detailed answer. I think I get the picture and am very thankful that you got into different approaches to solve this "issue". I think you worded eloquently that this is a conversation about defaults instead of an issue itself?
Regarding the "b0rked" XML file, I'd rather have XML or TXT than unknown, but I know nothing about your realm, and whenever I think I've learned a lot I am quickly humbled ;)
In my use case, the issues with the file should have been detected way before it reaches archivematica, for digital preservation, but I also see the benefit of preserving something that might not work "perfectly". I guess this needs a case by case evaluation.
I'm also interested in seeing feedback from others (that actually know what they are talking about here, unlike me)! 😄
Thanks again!
Hello there!
I am using Siegfried within archivematica to identify files, and came across one issue that is maybe "minor", but easy to fix?
When a plain ASCII text file has a .doc extension to its file name, like ASCII_text.doc, Siegfried fails, as it assumes it is a Word doc and does not even attempt to identify what's actually in there.
The file command does identify the ASCII_text.doc file as ASCII text 😄
Is there any way we could improve this behavior upstream? Is this too small to waste time on this?
I really think an ascii txt file should not fail to be identified, no matter the filename.
More info:
But siegfried output for the same file:
Thanks for any input on this!
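For files like ASCII_text.doc, a content-only check run before ingest could flag the mismatch regardless of the filename. The sketch below is a rough stand-in for that kind of check, not a reimplementation of the file command or of siegfried: the helper names, the 64 KiB sample size, and the default list of suspect extensions are all assumptions made for illustration.

```python
from pathlib import Path

# Printable ASCII plus tab, line feed and carriage return.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def is_plain_ascii(path: Path, sample_size: int = 65536) -> bool:
    """Return True if the first sample_size bytes are non-empty and all plain ASCII."""
    with path.open("rb") as fh:
        sample = fh.read(sample_size)
    return bool(sample) and all(b in TEXT_BYTES for b in sample)


def find_mislabelled_text(root: Path, suspect_exts=(".doc",)) -> list:
    """List files under root with a suspect extension whose content is plain ASCII."""
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suspect_exts and is_plain_ascii(p)
    )
```

Calling find_mislabelled_text(Path("transfer_dir")) on a hypothetical transfer directory would return the .doc files that are really plain text, so they can be renamed or reviewed case by case before identification.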