I made a MathType and Office OMML rendering pipeline, and a more comprehensive dataset #383
gzz2000
started this conversation in
Show and tell
The out-of-the-box model relies on LaTeX renderers and generalizes poorly to images rendered by other backends. I frequently encounter equations rendered by the classic MathType editor and by modern Office equations (OMML).
I managed to create a data generation pipeline for these two backends that automatically renders LaTeX equations as MathType or OMML images. I share some key ingredients here.
For OMML, the pipeline looks like: latex -> pypandoc (pandoc installed) -> docx -> docx2pdf (winword installed) -> pdf -> pymupdf -> png -> pillow.
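As a rough illustration of that chain, here is a minimal sketch, not my exact script. The function names, the `eq.*` file names, the choice of Markdown `$$...$$` as the pandoc input format, the 300 dpi setting and the whitespace crop are all assumptions made for the example. It needs a pandoc binary and Microsoft Word on Windows; the `--fail-if-warnings` flag makes pandoc exit with an error when it emits a warning, which pypandoc should surface as an exception.

```python
from pathlib import Path


def as_display_math(latex):
    """Wrap a bare LaTeX equation in pandoc-Markdown display-math delimiters."""
    return f"$${latex.strip()}$$"


def render_omml_png(latex, workdir, dpi=300):
    """latex -> docx (pandoc) -> pdf (Word) -> png (PyMuPDF) -> cropped png (Pillow)."""
    # Third-party imports are kept local so the helper above works without them.
    import pypandoc                   # requires a pandoc installation
    from docx2pdf import convert      # requires Microsoft Word
    import fitz                       # PyMuPDF
    from PIL import Image, ImageChops

    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    docx, pdf, png = (workdir / f"eq.{ext}" for ext in ("docx", "pdf", "png"))

    # pandoc writes TeX math into the docx as OMML; warnings become errors.
    pypandoc.convert_text(
        as_display_math(latex),
        to="docx",
        format="markdown",
        outputfile=str(docx),
        extra_args=["--fail-if-warnings"],
    )

    # Word renders the OMML equation into a PDF page.
    convert(str(docx), str(pdf))

    # Rasterize the first page.
    with fitz.open(str(pdf)) as doc:
        doc[0].get_pixmap(dpi=dpi).save(str(png))

    # Crop the page to the equation by diffing against a white background.
    with Image.open(png) as im:
        img = im.convert("RGB")
    white = Image.new("RGB", img.size, (255, 255, 255))
    bbox = ImageChops.difference(img, white).getbbox()
    if bbox is not None:
        img = img.crop(bbox)
    img.save(png)
    return png
```

Calling `render_omml_png(r"\frac{a}{b}", "out")` would leave `out/eq.png` behind; wrapping the call in `try`/`except` is one way to skip equations that pandoc rejects.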
Equations that use unsupported LaTeX can be filtered out by passing `--fail-if-warnings` to pandoc.
For MathType, things are much trickier. Since I could not find an official Python API for it, I built a pipeline on pure UI automation (pywinauto, pymouse, etc.). The script opens the standalone MathType window, pastes the LaTeX equation there (MathType recognizes it and translates it automatically), moves keyboard focus away, and takes a screenshot of the MathType window, which is then cropped and postprocessed. TeX translation errors and unsupported commands are filtered out with OpenCV, by thresholding on red pixels and by pattern matching. It is clumsy, but it works.
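To give an idea of the red-pixel check, here is a dependency-light sketch. It is a stand-in, not the OpenCV code I actually run: it uses plain Python plus Pillow for loading, it assumes translation errors show up as saturated red marks in the screenshot, and the thresholds (`min_red`, `max_other`, `max_fraction`) and function names are hypothetical values picked for illustration. The pattern-matching part is not covered.

```python
def is_red(r, g, b, min_red=180, max_other=90):
    """True for a strongly red pixel: high R, low G and B (assumed thresholds)."""
    return r >= min_red and g <= max_other and b <= max_other


def red_fraction(pixels):
    """Fraction of (r, g, b) pixels that count as red; 0.0 for an empty image."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(is_red(r, g, b) for r, g, b in pixels) / len(pixels)


def looks_like_error(screenshot_path, max_fraction=0.001):
    """Flag a screenshot whose share of red pixels exceeds max_fraction."""
    from PIL import Image  # imported here so the helpers above stay stdlib-only

    with Image.open(screenshot_path) as im:
        return red_fraction(im.convert("RGB").getdata()) > max_fraction
```

Samples for which `looks_like_error` returns `True` would simply be dropped from the dataset. The right thresholds depend on the MathType color settings and on antialiasing, so they need tuning on real screenshots.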
Original data (xelatex) test/0000000.png:
MathType rendered:
Office OMML rendered:
I am now training the model on 300k+ pairs from all 3 backends. I hope it works :)