Save some more of the intermediate results produced in dump-tjs #8

Open
ujjvlh opened this issue Aug 17, 2021 · 1 comment
@ujjvlh
Contributor

ujjvlh commented Aug 17, 2021

To save the compute time spent extracting operators etc. from the PDF, save the results in a convenient format (perhaps plain text, or via 'serde'). That would allow text-only analysis, for quick improvements and testing of things like glyph maps and regexes.
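
As a rough illustration of the "perhaps text" option, here is a minimal sketch of a line-oriented dump format using only the standard library. The `TextOp` struct, its field names, the tab-separated layout, and the assumption that glyph ids fit in a `u16` are all hypothetical and not taken from dump-tjs; a 'serde'-based format would replace the hand-written `dump_ops`/`load_ops` pair.

```rust
use std::fmt::Write as _;

/// One text-showing operation, reduced to what text-only analysis needs.
/// Hypothetical type: the field names are not taken from dump-tjs.
#[derive(Debug, Clone, PartialEq)]
pub struct TextOp {
    pub page: u32,
    pub font: String,
    /// Assumes glyph ids fit in 16 bits.
    pub glyph_ids: Vec<u16>,
}

/// Serialize as one tab-separated line per operation:
/// page, font name, space-separated hex glyph ids.
/// Assumes font names contain no tabs or newlines.
pub fn dump_ops(ops: &[TextOp]) -> String {
    let mut out = String::new();
    for op in ops {
        let glyphs: Vec<String> = op.glyph_ids.iter().map(|g| format!("{:04X}", g)).collect();
        // Writing to a String cannot fail, so unwrap is safe here.
        writeln!(out, "{}\t{}\t{}", op.page, op.font, glyphs.join(" ")).unwrap();
    }
    out
}

/// Parse the format written by `dump_ops`. Returns None on any malformed line.
pub fn load_ops(text: &str) -> Option<Vec<TextOp>> {
    text.lines()
        .map(|line| {
            let mut fields = line.split('\t');
            let page = fields.next()?.parse().ok()?;
            let font = fields.next()?.to_string();
            let glyph_ids = fields
                .next()?
                .split_whitespace()
                .map(|h| u16::from_str_radix(h, 16).ok())
                .collect::<Option<Vec<u16>>>()?;
            Some(TextOp { page, font, glyph_ids })
        })
        .collect()
}
```

With a dump like this on disk, experiments on glyph maps or regexes would only need `load_ops` and never touch the PDF again.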

@shreevatsa
Owner

I think that's what the "phase1" was supposed to do (it just dumps the glyph ids used for each text operation):

# ORIG.pdf ---[dump-tjs]---> font-usage/font-N.{Tjs,toml} (but for now, Tjs-N)
font-usage/: ${ORIG_PDF}
	RUST_BACKTRACE=1 cargo +nightly run --release --bin dump-tjs -- ${ORIG_PDF} font-usage/ --phase phase1

But I guess we could make it better by splitting text per page (and maybe all fonts on that page together…), and replace the second run

# ORIG.pdf and maps/valid/font-N.toml ---[dump-tjs]---> ORIG.fixed.pdf
$(ORIG).fixed.pdf: ${ORIG_PDF} maps/valid/
	RUST_BACKTRACE=1 cargo +nightly run --release --bin dump-tjs -- ${ORIG_PDF} maps/valid/ ${ORIG}.fixed.pdf --phase phase2

with something that just works on the dumped sequences and generates corresponding text directly, so that we don't have to generate the PDF (which is very slow) and run pdftotext on it.

Then when the text seems satisfactory, generating the new PDF can be the last step.
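
The "generate text directly from the dumped sequences" step could look roughly like the sketch below. It assumes the glyph map for a font has already been loaded into a `HashMap<u16, String>` (the real maps live in `maps/valid/font-N.toml`, and parsing them is not shown); the function name `glyphs_to_text` and the `<XXXX>` placeholder for unmapped glyphs are hypothetical choices, not something dump-tjs does today.

```rust
use std::collections::HashMap;

/// Map a dumped glyph-id sequence to text using a per-font glyph map.
/// Glyphs missing from the map are rendered as a visible placeholder
/// such as `<01F3>`, so gaps in the map are easy to search for.
pub fn glyphs_to_text(glyph_ids: &[u16], map: &HashMap<u16, String>) -> String {
    let mut out = String::new();
    for g in glyph_ids {
        match map.get(g) {
            // A glyph may map to more than one character (e.g. a ligature).
            Some(s) => out.push_str(s),
            None => out.push_str(&format!("<{:04X}>", g)),
        }
    }
    out
}
```

Because this runs only on the dumped sequences, a change to a glyph map can be checked by rerunning this step, with the slow PDF generation left until the text looks right.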
