Deleted page not "really" deleted #141
Your observations are correct. The situation is difficult. You can remove a page, add some new pages, and then re-add the removed page at a different position. The only point in time when an object clean-up makes sense is immediately before saving the document. PDFsharp starts with the catalog table and calculates the transitive closure of all referenced objects. Objects that are not referenced are removed. This approach works, but it is too simple. If, for example, a font and an image are used only in the content stream of a particular page and you delete that page, both the font and the image remain in the resource tables of the document. As you said, it would be possible to analyze a document at a detailed level and remove all resources that are not used in at least one content stream, or to remove all pages that cannot be reached from the Pages dictionary and then remove all references to these pages from annotations and similar objects. But what should we do with, for example, an outline entry that points to a removed page? The benefit of the clean-up is (only) a smaller PDF file; the visual result stays the same. Because there are many more important features we can implement, we do not plan to change the current behavior in the near future.
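The clean-up described above can be illustrated with a toy model. This is only a sketch in Python, not PDFsharp code: the object names, the `reachable` and `compact` helpers, and the assumption that the resource table is shared and reachable independently of the deleted page are all hypothetical. It shows why the deleted page disappears while its font and image survive.

```python
from collections import deque


def reachable(objects, root):
    """Return the ids of all objects reachable from `root` (breadth-first)."""
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for target in objects[current]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def compact(objects, root):
    """Keep only the objects in the transitive closure of `root`."""
    keep = reachable(objects, root)
    return {obj_id: refs for obj_id, refs in objects.items() if obj_id in keep}


# Hypothetical document: object id -> ids of the objects it references.
objects = {
    "catalog": ["pages"],
    "pages": ["page1", "resources"],  # "page2" was already removed from the kids
    "resources": ["font", "image"],   # assumed shared resource table
    "page1": ["content1"],
    "page2": ["content2"],
    # Content streams name resources indirectly, so they hold no object references.
    "content1": [],
    "content2": [],                   # the only stream that used font and image
    "font": [],
    "image": [],
}

compacted = compact(objects, "catalog")
removed = sorted(set(objects) - set(compacted))
print(removed)  # ['content2', 'page2']
```

In this model the font and the image stay in the compacted document because the resource table still references them, even though no remaining content stream uses them.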
I agree, thanks for the answer!
While working on incremental updates (see #112) and adding support for deleted objects, I encountered a behavior that may or may not be intended.
When I delete a page from a document, it gets removed from the pages array as expected.
When checking the output file, I observed that only the page reference was removed from the pages array; the page itself and all objects it references (such as content streams) are still present in the file.
If I understand correctly, the method `PdfCrossReferenceTable.Compact()` is intended to clean up these objects. Is that true?
At least it would clean up (i.e. remove) the page and the objects referenced by that page if the pages array were the only place where the page is referenced.
But a page could be referenced from multiple locations, some places that come to mind:
In my case, the page that was not deleted was referenced by (at least) 3 different outlines.
Simple test case (add it to `PdfSharp.Tests.IO.WriterTests`):
Open the file `AA-Deleted.pdf` and observe that the page and its contents are still present.
Question:
Is this the intended behavior?
Are there other clean-up methods I'm not aware of?
IMHO the methods to remove pages are "high level" methods and the library should take care of the "low level" stuff, including cleaning up after itself to maintain the integrity of the document.
I do understand however, that this might not be an easy issue to solve.
In theory, the library has to scan the whole document to find references to deleted pages and then has to decide based on the context (where the reference is found), how to deal with it.
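The scan described here could look roughly like the following sketch. It is a toy model in Python, not PDFsharp code: the `Ref` class, the object numbers, and the `find_references` and `referrers` helpers are hypothetical. It walks every object, reports where a reference to the removed page is found, and leaves the context-dependent decision to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference to another object."""
    obj_id: int


def find_references(value, target, path=""):
    """Yield a path for every place inside `value` that refers to `target`."""
    if isinstance(value, Ref):
        if value.obj_id == target:
            yield path
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_references(item, target, f"{path}/{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from find_references(item, target, f"{path}[{index}]")


def referrers(document, target):
    """Return (object id, path) pairs for all references to `target`."""
    return [
        (obj_id, path)
        for obj_id, body in document.items()
        for path in find_references(body, target)
    ]


# Hypothetical document: page 4 was removed from Kids, but an outline
# entry still points to it.
document = {
    1: {"Type": "Catalog", "Pages": Ref(2), "Outlines": Ref(5)},
    2: {"Type": "Pages", "Kids": [Ref(3)]},
    3: {"Type": "Page", "Parent": Ref(2)},
    4: {"Type": "Page", "Parent": Ref(2)},
    5: {"Type": "Outlines", "First": Ref(6)},
    6: {"Title": "Chapter 2", "Parent": Ref(5), "Dest": [Ref(4), "Fit"]},
}

hits = referrers(document, 4)
print(hits)  # [(6, '/Dest[0]')]
```

The reported path (`/Dest[0]` inside an outline entry) is the kind of context the library would need in order to decide whether to drop the entry, retarget it, or keep the page alive.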