Quicktime identification very slow with recent version #99
Hi Thomas The immediate cause of this (massive) slowdown seems to be a change to PRONOM's signatures for x-fmt/384. Looking at the PRONOM release notes (http://www.nationalarchives.gov.uk/aboutapps/pronom/release-notes.xml) there was a change in v86 (with the note "simplified signature"). We can isolate this change by looking at historic versions of the PRONOM database in Ross Spencer's PRONOM archive on github (https://github.com/exponential-decay/pronom-archive-and-skeleton-test-suite/releases). From the basis field in your results, the first match was against the eighth of eleven signatures for the v84 release, which was this one: The second match was against the fourth of eight signatures (so three likely removed in that "simplification" of the signatures): You can see that the first signature can be satisfied very quickly, just by looking at the first few bytes of the file. The second signature will take potentially much longer as that wildcard (the "*" in the signature) means that the fragment beginning "moov" can occur anywhere in the file after that first fragment (and is defined at an offset from the beginning of the file, not the end of the file). In your example file this second fragment occurs right near the end of the file which means > 200 GB of reading and scanning to discover it. In terms of your question about sf using seek: this depends on the type of input source. For most files, One possible solution for you is to customise your signature file using the
I'm going to label this as PRONOM as I think the issue predominantly lies there. But I will also have a bit of a think to see if any improvements can be made to siegfried to better deal with your use case without having to mod signatures. Apologies for the length of this response; I hope it makes sense.
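To make the cost difference concrete, here is a minimal Python sketch of the two kinds of check described above. It is not siegfried's matcher; the function names and the chunk size are hypothetical. An anchored fragment is settled by one seek and one small read, while a wildcard fragment has to be scanned for, chunk by chunk, until it is found or the file ends.

```python
import io


def anchored_match(f, offset, pattern):
    """Check a fixed-offset fragment: one seek and one small read."""
    f.seek(offset)
    return f.read(len(pattern)) == pattern


def wildcard_scan(f, pattern, start=0, chunk_size=1 << 20):
    """Scan forward from `start` for `pattern`.

    Returns (offset, bytes_read); offset is -1 if the pattern is absent.
    """
    f.seek(start)
    overlap = len(pattern) - 1   # bytes kept so a match can span two chunks
    tail = b""
    pos = start                  # file offset of the next chunk's first byte
    bytes_read = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return -1, bytes_read
        bytes_read += len(chunk)
        window = tail + chunk
        idx = window.find(pattern)
        if idx != -1:
            return pos - len(tail) + idx, bytes_read
        tail = window[-overlap:] if overlap else b""
        pos += len(chunk)
```

The second return value of `wildcard_scan` is the point of the sketch: when the fragment sits near the end of the file, or is missing altogether, the number of bytes read approaches the file size.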
Thanks Richard, Hi Thomas. An explanation for the change can be found here: https://groups.google.com/forum/#!topic/pronom/Q2mXbNmNTbU Basically the previous set of Quicktime signatures seemed to have built up over time based primarily on observation each time we hit a variant QT file that wouldn't ID. We were still hitting false negatives in our own collections, and receiving reports of the same from elsewhere, so I took the decision to restart based upon a stricter interpretation of the QT specification, which requires a moov atom. The moov atom contains the movie metadata and current advised practice (vital for streaming) is to place it at the beginning of the file, but as we've seen it can appear anywhere in the file, not simply either at the start or the end. For identification I favour accuracy over speed, so I think I'd prefer to leave the signatures as they are now (but I'm happy to hear arguments against). That said, in this specific scenario, with the lead brand being described from offset 4 as 'ftypqt  ' - could this be the one variant where it makes sense to drop the necessity for subsequently finding the moov atom? Perhaps there are also byte seek optimisations that could assist? Note that MP4, plus all of the Broadcast Wave variants (among others) also contain a full wildcard pattern so could also be subject to lengthy ID times (see also Ross Spencer's investigation into the WAV variants: https://github.com/exponential-decay/digital-preservation-stage-boss-one/blob/master/final-report/digital-preservation-stage-boss-one.pdf , and full list of wildcards from v86 DROID signature file: https://github.com/exponential-decay/digital-preservation-stage-boss-one/blob/master/wildcard-signature-information/PRONOM-wildcard-signatures-v86.csv). David
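As an illustration of what a seek-based approach could look like, the sketch below walks the top-level atoms of a file and seeks past each payload instead of reading it. It is not something PRONOM signatures can express, and it is not how siegfried or DROID work. It assumes the atom layout from the QuickTime File Format specification (a 4-byte big-endian size followed by a 4-byte type, where size 1 means a 64-bit extended size follows and size 0 means the atom runs to the end of the file), and it assumes the moov atom is a top-level atom. The function name is hypothetical.

```python
import io
import struct


def find_top_level_atom(f, wanted):
    """Return the offset of the first top-level atom of type `wanted`, or -1.

    Only atom headers are read; payloads are skipped with seek().
    """
    file_size = f.seek(0, io.SEEK_END)
    pos = 0
    while pos + 8 <= file_size:
        f.seek(pos)
        header = f.read(8)
        if len(header) < 8:
            break
        size, kind = struct.unpack(">I4s", header)
        if kind == wanted:
            return pos
        if size == 1:
            # A 64-bit extended size follows the type field.
            ext = f.read(8)
            if len(ext) < 8:
                break
            (size,) = struct.unpack(">Q", ext)
            if size < 16:
                break  # malformed extended size
        elif size == 0:
            break      # atom extends to end of file; nothing follows it
        elif size < 8:
            break      # malformed size
        pos += size
    return -1
```

On a well-formed file this touches a few bytes per top-level atom regardless of how large the mdat payload is. It gives up, rather than guessing, on files whose atom sizes are malformed, which is one reason a byte-pattern scan remains the more general check.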
Guys, I've been away for some days and have only now discovered your answers. Thanks a lot for such informative and complete answers, both of you! I had noticed that I could try to use roy to get a default.sig that better suits my needs, but I had failed to do so. Richard clearly explained my misunderstanding: it's not a sig "at the end" but a wildcard that can occur anywhere. I did a quick test with
Hi Thomas This makes it quite a hard problem as we can't just do a simple modification to the buffer sizes to speed this up. In order to satisfy the signature as it is now written, we really have to do a wildcard search of at least the first 2 GB of your file, which is always going to take a long time. Ultimately the best solution would be a PRONOM change along the lines @Dclipsham suggests: add a new signature that would just match on the "ftypqt" brand, and drop the requirement to also find a "moov" fragment. Rather than wait for a PRONOM update, you can use the You can use @ross-spencer's Signature Development Utility http://exponentialdecay.co.uk/sd/index.htm to build custom signatures. They should go in a /custom folder within your ~/siegfried directory. Creating the signature looks like this: Note I deliberately didn't put an extension in as the extensions for this format are already defined in the main DROID file and we don't want to override those. Running this command and then testing looks like this: For your convenience, here is the extension file that I used: I hope this works for your use case.
Hi,
I tested adding your quicktime-ext.xml, and the resulting default.sig only takes 15 s to identify a QT file of ~148 GB. This looks promising, thanks a lot! I understand perfectly and fully agree with your statement of "accuracy over speed". I don't understand (yet) much of the signature XML format, but I was wondering if it would make sense to use both the previous QT signatures (fast/empirical?) and the new, stricter ones, so that the faster ones are tried first? If I understood correctly, you have some non-regression tests with a lot of samples, so this would mitigate the risk, wouldn't it?
That Re. adding all the old signatures into the mix, you could certainly do that & in fact that's pretty much what I did by using The Here's a revised extension file for you that has all the v84 QT sigs: By including the old QT signatures alongside the new ones, you certainly do re-introduce the risk of some non-QT files matching as QT, if those old signatures are too permissive. But I think, given the size of the files you are dealing with, it is definitely reasonable to customise your signatures to fit your use case.
Works perfectly, thanks again.
I'm working on a couple of performance improvements but they're pretty subtle & to be honest it is very hard to speed up this kind of case where you are forced to scan > 2 GB of bytes to find a wildcard pattern that might not even exist (& I/O being the real time killer). I.e. I may be able to shave some secs, but not minutes. I think the real fix to this issue would be a PRONOM update to include non-wildcard based signatures for this format - @Dclipsham seemed partly open to this & I'd suggest continuing the discussion with him and the TNA team. For the default.sig signature file, I think it is important that this represent the PRONOM database exactly, so I don't propose altering it by default with the "quicktime-ext.xml" file. But I will include that custom signature file with the set of custom signatures included with sf releases (you may have seen some Archivematica extension signatures in the custom folder if you install with a package manager). This would mean you would still need to do a Suggest leaving this ticket open until it is fixed at PRONOM end & sorry I can't do more to assist
Yes, I'll adjust the PRONOM signature entry for this specific scenario (applies to the signature 'QuickTime variant4'): where the first atom at offset 4 is FTYP and the major brand is 'qt  ' then we won't seek the MOOV atom. This represents a byte sequence of 0x6674797071742020 found from offset 4 only. Where the first atom at offset 4 is any from MDAT, CMOV, PNOT, SKIP, FREE, WIDE, then we will continue to seek the MOOV atom within the file. There'll likely be a release in late-May/early-June. Thanks for raising this. I hope this solution is satisfactory for all. David
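The rule described above can be sketched in a few lines of Python. This is only an illustration of the decision, not the PRONOM signature itself: the function name and return labels are hypothetical, and the atom types are written in lowercase on the assumption that this is how they appear as four-character codes in the file.

```python
# 0x6674797071742020 decodes to the ASCII bytes "ftypqt" plus two spaces.
QT_BRAND = bytes.fromhex("6674797071742020")

# Leading atoms that still require a search for the moov atom.
LEADING_ATOMS = (b"mdat", b"cmov", b"pnot", b"skip", b"free", b"wide")


def classify_header(header):
    """Classify the first 12 bytes of a file under the adjusted rule.

    Returns "match" when the ftyp atom with brand 'qt  ' sits at offset 4,
    "seek-moov" when another known leading atom does, else "no-match".
    """
    if header[4:12] == QT_BRAND:
        return "match"
    if header[4:8] in LEADING_ATOMS:
        return "seek-moov"
    return "no-match"
```

Only the "seek-moov" branch can lead to a scan of the rest of the file; the "match" branch is decided from the first 12 bytes alone.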
This change has now been made in the PRONOM v93 release, so I will close this issue. Thanks @Dclipsham!
Hi everybody, If I still have this problem in 2020, with the latest version of Archivematica, should I open a new ticket? I have a 170 GB MOV file. It takes almost 2 hours to be identified as an x-fmt/384 / Quicktime file. siegfried : 1.8.0
Arrgh, ghosts of the past! Are you able to post the output showing the id of the file (it would be helpful to see what the "basis" field says)?
Here's the full output:
That makes sense, it's matching against this pattern: mdat*moov{0-4096}(mvhd|cmov|rmra) for x-fmt/384. @Dclipsham fixed this issue last time by adjusting one of the patterns for this format. Perhaps a similar solution is possible this time round? Another possible solution is a tailored signature file along the lines suggested above e.g. if you do I do have some ideas for performance improvements but realistically any scenario in which you need to scan all ~180 GB of this file for a match will be slow.
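For readers unfamiliar with the pattern syntax, here is a rough Python equivalent using a byte regex. It is a sketch under two assumptions: that `mdat` is anchored at offset 4 (as in the description of leading atoms earlier in the thread), and that `{0-4096}` denotes a gap of 0 to 4096 arbitrary bytes. It works on an in-memory buffer, which is obviously unrealistic for a 180 GB file; the function name is hypothetical.

```python
import re

# "moov", then a gap of 0 to 4096 arbitrary bytes, then one of three atom types.
MOOV_TAIL = re.compile(rb"moov.{0,4096}?(?:mvhd|cmov|rmra)", re.DOTALL)


def match_mdat_signature(buf):
    """Return the offset of the matching "moov" fragment, or -1.

    The "*" wildcard means the fragment may start anywhere after "mdat",
    so the search covers the whole remainder of the buffer.
    """
    if buf[4:8] != b"mdat":
        return -1
    m = MOOV_TAIL.search(buf, 8)
    return m.start() if m else -1
```

The anchored `mdat` test is cheap; the cost lies in `MOOV_TAIL.search`, which has no upper bound on how far it must look.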
I understand. The largest MOV file we have has a size of 1.5 TB, so it could be even worse... :-) Just as a reference point (I haven't looked at the code or anything), the whole transfer/ingest phases in Archivematica took about 3 hours:
So my feeling is that even if Siegfried has to look at the whole file, it could be way faster (e.g. the hash computation phase also had to read the whole file and do something with it). But to be honest I don't know the history/complexity of the project, so sorry if I'm way off.
Thanks for those numbers, they're an interesting comparison and something to aim for! I have some optimisations in mind that I hope will improve performance for cases like this, but format ID is an expensive task (consider that PRONOM contains 1000s of regex-like patterns that all need to be searched) that will always be relatively slow if you need to do full file scans. If you are routinely dealing with big Quicktime files like these, and if you need a PRONOM-based identification, I'd suggest:
Tricky one. The moov atom is vital for ID of Quicktime - things won't play without it. We try to anchor it near the beginning of the file with one of the other atoms (mdat, cmov, free etc) so we'll only search the rest of the file if we find one of those atoms first, but unfortunately the moov can appear anywhere in the file (although I understand best practice is near the beginning, particularly for streaming). A while back we were only seeking certain atoms (not necessarily including moov) but that gave us false positives so I'm not sure what we could do to make the signature any more efficient...
Thanks. I'll try roy. For now, I'm using the "File Extension" tool in Archivematica as a workaround. Fido seems to misidentify our files (as Quicktime + Apple ProRes, which they are not...). I don't know why. |
If modifying your signatures with roy, there are a few different approaches you could take (such as setting fixed -bof or -eof or limiting your scan to a set of fmts), but the cleanest may be just to extend your signature file with the old PRONOM sigs, as described here: #99 (comment) If you're running within Archivematica, your siegfried home path won't be ~/siegfried; it will be something else, which you should be able to find by doing
Hello,
When using siegfried on a QuickTime (x-fmt/384) file of typical size ~200 GB, the identification time exploded from a few seconds (siegfried 1.5) to dozens of minutes (siegfried 1.7).
First test (extract from the json output):
"siegfried":"1.5.0"
"signature":"default.sig"
"details":"DROID_SignatureFile_V84.xml; container-signature-20160121.xml"
"basis":"extension match mov; byte match at 0, 12 (signature 8/11)"
Second test :
"siegfried":"1.7.0"
"signature":"default.sig"
"details":"DROID_SignatureFile_V88.xml; container-signature-20160927.xml"
"basis":"extension match mov; byte match at [[[4 8]] [[2046628976 12]]] (signature 4/8)"
Our current guess is the following: according to the output, the signature has changed to now also use one (or some) bytes at the very end of the file. And for some unknown reason, siegfried reads the whole file to check those bytes (instead of seeking there as I would expect).
Do you have any more information about this? Can you confirm this analysis? Is there any reason blocking the use of 'seek' to reach the end of the file?