Deleted page not "really" deleted #141

Closed
packdat opened this issue Jul 23, 2024 · 2 comments

packdat commented Jul 23, 2024

While working on incremental updates (see #112) and adding support for deleted objects, I encountered a behavior that may or may not be intended.

When I delete a page from a document, it gets removed from the pages-array as expected.
When checking the output file, I observed that only the page reference was removed from the pages-array; the page itself and all objects it references (e.g. content streams) are still present in the file.

If I understand correctly, the method PdfCrossReferenceTable.Compact() is intended to clean up these objects. Is that correct?
At least it would clean up (i.e. remove) the page and the objects referenced by that page, if the pages-array were the only place where the page is referenced.
But a page could be referenced from multiple locations, some places that come to mind:

  • Outlines
  • Named Destinations
  • Link-Annotations (and Annotations in general, /P entry)
  • GoTo Actions

In my case, the page that was not deleted was referenced by at least three different outlines.

Simple test-case (add it to PdfSharp.Tests.IO.WriterTests):

[Fact]
public void Deleted_Page_Not_Really_Deleted()
{
    // Work on a copy so the original asset stays untouched.
    var sourceFile = IOUtility.GetAssetsPath("archives/grammar-by-example/GBE/ReferencePDFs/WPF 1.31/Table-Layout.pdf")!;
    var targetFile = Path.Combine(Path.GetTempPath(), "AA-Original.pdf");
    File.Copy(sourceFile, targetFile, true);

    // Open the copy for modification and remove the first page.
    using var fs = File.Open(targetFile, FileMode.Open, FileAccess.Read);
    using var doc = PdfReader.Open(fs, PdfDocumentOpenMode.Modify);
    doc.Pages.RemoveAt(0);

    // Save the result under a different name for manual inspection.
    targetFile = Path.Combine(Path.GetTempPath(), "AA-Deleted.pdf");
    doc.Save(targetFile);
}

Open the file AA-Deleted.pdf and observe that the page object and its contents are still present in the file.

Question:
Is this the intended behavior?
Are there other clean-up methods I'm not aware of?

IMHO the methods to remove pages are "high level" methods and the library should take care of the "low level" stuff, including cleaning up after itself to maintain the integrity of the document.

I do understand however, that this might not be an easy issue to solve.
In theory, the library has to scan the whole document to find references to deleted pages and then has to decide based on the context (where the reference is found), how to deal with it.

  • delete Outlines and re-link the remaining ones
  • delete Annotations
  • etc...
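
The scan described above can be illustrated with a minimal sketch. It does not use PDFsharp's object model; `Reference` and `ReferrerScan` are hypothetical names, and the document is assumed to be flattened into (from, key, to) triples. The keys stand for dictionary entries such as /Dest (outlines, link annotations), /P (annotations) and /D (GoTo actions), so the caller can decide per context how to handle each hit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified model: one entry per indirect reference,
// recording which object holds it and under which dictionary key.
public sealed record Reference(int FromId, string Key, int ToId);

public static class ReferrerScan
{
    // Returns the IDs of all objects that still point at the deleted page,
    // grouped by the key under which the reference was found.
    public static Dictionary<string, List<int>> FindReferrers(
        IEnumerable<Reference> allReferences, int deletedPageId)
    {
        return allReferences
            .Where(r => r.ToId == deletedPageId)
            .GroupBy(r => r.Key)
            .ToDictionary(g => g.Key, g => g.Select(r => r.FromId).ToList());
    }
}
```

Grouping by key is only one possible design; it keeps the "find" step separate from the context-dependent "what to do about it" step.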
Member

StLange commented Jul 24, 2024

Your observations are correct. The situation is difficult. You can remove a page, add some new pages, and then re-add the removed page at a different position. The only point in time when an object clean-up makes sense is immediately before saving the document. PDFsharp starts with the catalog table and calculates the transitive closure of all referenced objects. Unreferenced objects are removed. This approach works, but it is too simple. If, for example, a font and an image are used only in the content stream of a particular page and you delete the page, both the font and the image remain in the resource tables of the document.
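
The reachability pass described above can be sketched roughly as follows. This is not PDFsharp's actual code: `Reachability` is a hypothetical helper, and the object graph is assumed to be a plain map from object ID to the IDs it references. It shows why a page removed from the pages-array survives as long as anything else (here, an outline item) still points to it.

```csharp
using System.Collections.Generic;

public static class Reachability
{
    // Computes the transitive closure of references starting at rootId.
    // 'graph' maps an object ID to the IDs of the objects it references.
    public static HashSet<int> Reachable(
        IReadOnlyDictionary<int, int[]> graph, int rootId)
    {
        var seen = new HashSet<int>();
        var pending = new Stack<int>();
        pending.Push(rootId);

        while (pending.Count > 0)
        {
            int id = pending.Pop();
            if (!seen.Add(id))
                continue; // already visited; also guards against cycles such as /Parent

            if (graph.TryGetValue(id, out var targets))
                foreach (int target in targets)
                    pending.Push(target);
        }
        return seen; // everything NOT in this set would be dropped on save
    }
}
```

For example, with catalog 1 → {pages 2, outlines 5}, pages 2 → {page 4}, outlines 5 → {outline item 6}, outline item 6 → {page 3} and page 3 → {content stream 7}, page 3 and its content stream 7 remain reachable even though page 3 is no longer listed under the pages node.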

As you said, it would be possible to analyze a document on a detailed level and remove all resources that are not used in at least one content stream. Or remove all pages that cannot be reached from the Pages dictionary, and then remove all references to these pages from annotations and similar objects. But what do we do, for example, with an outline entry that points to a removed page?

The benefit of the clean-up is (only) a smaller PDF file; the visual result stays the same.
If a developer deletes or reorders pages of an existing document with outlines, they do so with a specific intention. Therefore, they must also write code that fixes the outline tree.
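
As an illustration of what such a fix-up might look like, here is a minimal sketch on a hypothetical in-memory outline model (`OutlineNode` and `OutlineFixup` are made-up names, not PDFsharp types). It mirrors the PDF outline structure of /Parent, /Prev, /Next, /First and /Last links. The policy is only one possible choice: leaf items pointing to the removed page are unlinked, while items that still have children keep their place and just lose their destination. Recomputing /Count is left out.

```csharp
#nullable enable

public sealed class OutlineNode
{
    public string Title = "";
    public int? DestinationPageId;              // null = no destination
    public OutlineNode? Parent, Prev, Next, First, Last;
}

public static class OutlineFixup
{
    // Appends a new child at the end of parent's sibling chain.
    public static OutlineNode Append(OutlineNode parent, string title, int? pageId)
    {
        var node = new OutlineNode
        {
            Title = title, DestinationPageId = pageId,
            Parent = parent, Prev = parent.Last
        };
        if (parent.Last != null) parent.Last.Next = node; else parent.First = node;
        parent.Last = node;
        return node;
    }

    // Removes item from its sibling chain and re-links the neighbours.
    public static void Unlink(OutlineNode item)
    {
        var parent = item.Parent;
        if (item.Prev != null) item.Prev.Next = item.Next;
        else if (parent != null) parent.First = item.Next;
        if (item.Next != null) item.Next.Prev = item.Prev;
        else if (parent != null) parent.Last = item.Prev;
        item.Parent = item.Prev = item.Next = null;
    }

    // Walks the tree below root (children first) and applies the policy.
    public static void RemoveItemsTargeting(OutlineNode root, int pageId)
    {
        for (var node = root.First; node != null;)
        {
            var next = node.Next;               // remember before unlinking
            RemoveItemsTargeting(node, pageId);
            if (node.DestinationPageId == pageId)
            {
                if (node.First == null) Unlink(node);
                else node.DestinationPageId = null;
            }
            node = next;
        }
    }
}
```

Other policies are equally defensible, for example retargeting the item to the next remaining page; which one is right depends on why the page was removed.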

Because there are far more important features we could implement, we do not plan to change the current behavior in the near future.

Author

packdat commented Jul 24, 2024

I agree, thanks for the answer!
Glad I wasn't overlooking something obvious.

packdat closed this as not planned Jul 24, 2024