Save some more of the intermediate results produced in dump-tjs #8

ujjvlh · 2021-08-17T11:11:08Z

To save the compute time spent extracting operators etc. from the PDF, save the results in a convenient format (perhaps plain text, or via 'serde'). That would allow text-only analysis, for quickly improving and testing things like glyph maps and regexes.
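As a rough illustration of the plain-text option, here is a minimal sketch using only the standard library. The `TextOp` record, its fields, and the tab-separated line format are all assumptions made for this example; none of them are taken from dump-tjs itself.

```rust
/// One text-showing operation: the page it appeared on, the font it used,
/// and the glyph ids it drew. (Hypothetical record type, not from dump-tjs.)
#[derive(Debug, Clone, PartialEq)]
pub struct TextOp {
    pub page: u32,
    pub font: String,
    pub glyph_ids: Vec<u16>,
}

/// Serialize as one line per operation: "page<TAB>font<TAB>hex glyph ids".
/// Assumes font names contain no tab or newline characters.
pub fn to_lines(ops: &[TextOp]) -> String {
    let mut out = String::new();
    for op in ops {
        let ids: Vec<String> = op.glyph_ids.iter().map(|g| format!("{:04X}", g)).collect();
        out.push_str(&format!("{}\t{}\t{}\n", op.page, op.font, ids.join(" ")));
    }
    out
}

/// Parse the format written by `to_lines`. Blank lines are skipped;
/// a malformed line yields an error message naming the line number.
pub fn from_lines(text: &str) -> Result<Vec<TextOp>, String> {
    let mut ops = Vec::new();
    for (n, line) in text.lines().enumerate() {
        if line.is_empty() {
            continue;
        }
        let mut fields = line.splitn(3, '\t');
        let page = fields
            .next()
            .unwrap_or("")
            .parse::<u32>()
            .map_err(|e| format!("line {}: bad page: {}", n + 1, e))?;
        let font = fields
            .next()
            .ok_or_else(|| format!("line {}: missing font", n + 1))?
            .to_string();
        let glyph_ids = fields
            .next()
            .unwrap_or("")
            .split_whitespace()
            .map(|h| {
                u16::from_str_radix(h, 16)
                    .map_err(|e| format!("line {}: bad glyph id {:?}: {}", n + 1, h, e))
            })
            .collect::<Result<Vec<u16>, String>>()?;
        ops.push(TextOp { page, font, glyph_ids });
    }
    Ok(ops)
}
```

A dump written with `to_lines` can be read back with `from_lines` without touching the PDF again, and the line-oriented layout also works with grep and diff. Deriving serde traits on the record would be the other route mentioned above.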

shreevatsa · 2021-08-17T14:26:11Z

I think that's what the "phase1" was supposed to do (it just dumps the glyph ids used for each text operation):

pdf-glyph-mapping/work/Makefile

Lines 39 to 41 in af4cf8b

    # ORIG.pdf  ---[dump-tjs]--->  font-usage/font-N.{Tjs,toml} (but for now, Tjs-N)
    font-usage/: ${ORIG_PDF}
    	RUST_BACKTRACE=1 cargo +nightly run --release --bin dump-tjs -- ${ORIG_PDF} font-usage/ --phase phase1

But I guess we could improve it by splitting the text per page (and maybe keeping all fonts on that page together…), and by replacing the second run

pdf-glyph-mapping/work/Makefile

Lines 64 to 66 in af4cf8b

    # ORIG.pdf and maps/valid/font-N.toml ---[dump-tjs]---> ORIG.fixed.pdf
    $(ORIG).fixed.pdf: ${ORIG_PDF} maps/valid/
    	RUST_BACKTRACE=1 cargo +nightly run --release --bin dump-tjs -- ${ORIG_PDF} maps/valid/ ${ORIG}.fixed.pdf --phase phase2

with something that just works on the dumped sequences and generates corresponding text directly, so that we don't have to generate the PDF (which is very slow) and run pdftotext on it.

Then when the text seems satisfactory, generating the new PDF can be the last step.
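To make the idea concrete, a text-only step over the dumped sequences could look roughly like the sketch below. The `GlyphMap` type, the function names, and the `[XXXX]` placeholder for unmapped glyphs are assumptions made for illustration. How the maps would be loaded from the `maps/valid/font-N.toml` files is left out.

```rust
use std::collections::HashMap;

/// Hypothetical glyph map for one font: glyph id -> the text it stands for.
pub type GlyphMap = HashMap<u16, String>;

/// Turn one dumped glyph-id sequence into text. Unmapped ids are written as
/// a visible "[XXXX]" placeholder so that gaps in the map are easy to spot.
pub fn glyphs_to_text(glyph_ids: &[u16], map: &GlyphMap) -> String {
    let mut out = String::new();
    for id in glyph_ids {
        match map.get(id) {
            Some(s) => out.push_str(s),
            None => out.push_str(&format!("[{:04X}]", id)),
        }
    }
    out
}

/// Convert every dumped (font name, glyph ids) pair on a page, producing one
/// output line per text operation. A font with no map gets only placeholders.
pub fn page_to_text(ops: &[(String, Vec<u16>)], maps: &HashMap<String, GlyphMap>) -> String {
    let empty = GlyphMap::new();
    ops.iter()
        .map(|(font, ids)| glyphs_to_text(ids, maps.get(font).unwrap_or(&empty)))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Because this only reads the dumped sequences and the maps, it can be rerun after every map or regex tweak without regenerating the PDF or running pdftotext.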
