Convert scientific publications in PDF to structured Markdown via only lightweight ONNX OCR models

yuanjua/PaperStructure

PaperStructure

PaperStructure is a lightweight CLI tool that converts academic papers in PDF into clean, structured Markdown. All inference runs on ONNX models and is optimized for standard laptops. It works best on formula-heavy papers; table recognition is currently less accurate.

Features

  • Layout Detection -- YOLOX detects titles, sections, paragraphs, formulas, tables, and figures
  • Text Recognition -- PP-OCRv5 ONNX pipeline
  • Formula Recognition -- encoder-decoder LaTeX OCR
  • Markdown Export -- clean, readable Markdown output
  • Parallel Processing -- multi-threaded PDF page processing
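
To illustrate the parallel-processing feature above, here is a minimal sketch of page-level threading with the standard library. It is not PaperStructure's actual implementation: `process_page` is a hypothetical stand-in for the per-page layout, OCR, and formula steps, and `max_workers=4` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_number: int) -> str:
    """Hypothetical stand-in for per-page layout detection, OCR, and formula recognition."""
    return f"## Page {page_number}\n"


def process_pages(page_numbers, max_workers: int = 4) -> str:
    """Process pages concurrently and join the results in the original page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of which page finishes first.
        chunks = list(pool.map(process_page, page_numbers))
    return "\n".join(chunks)


if __name__ == "__main__":
    print(process_pages(range(1, 4)))
```

Threads tend to help only when the heavy work happens in native code that releases the interpreter lock. Whether that holds for a given inference backend is an assumption to verify.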

Demo

[Screenshots: a page of the source PDF alongside the generated Markdown]

Installation

pip install paper-structure

This registers the paper-structure CLI and installs the Python package.

CLI Usage

# Process a PDF (full pipeline: layout + OCR + formula)
paper-structure process paper.pdf -o output.md

# Shorthand:
paper-structure paper.pdf -o output.md

# OCR an image (text recognition, no layout detection)
paper-structure process photo.png -o output.txt

# Recognize a formula image as LaTeX
paper-structure process formula.png --formula

# PDF options
paper-structure process paper.pdf --max-pages 5 -v --save-images

# Generate annotated preview PDF with bounding boxes
paper-structure preview paper.pdf -o preview.pdf

# Manage models
paper-structure models status
paper-structure models download

Python API

PDF processing (full pipeline)

from paper_structure import PaperStructurePipeline

pipeline = PaperStructurePipeline()
result = pipeline.process_pdf("paper.pdf")
print(result["markdown"])
pipeline.save_markdown(result, "output.md")
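
When converting many papers, one pipeline object can be reused across files. The sketch below relies only on the `process_pdf` and `save_markdown` calls shown above. The `convert_folder` helper, the folder layout, and the assumption that both methods accept string paths are illustrative and not part of the package.

```python
from pathlib import Path


def convert_folder(pipeline, src_dir, out_dir):
    """Convert every PDF in src_dir with one shared pipeline; return the written paths."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        # Assumption: process_pdf and save_markdown accept plain string paths.
        result = pipeline.process_pdf(str(pdf))
        target = out_dir / f"{pdf.stem}.md"
        pipeline.save_markdown(result, str(target))
        written.append(target)
    return written
```

A call would look like `convert_folder(PaperStructurePipeline(), "papers", "markdown")`. Taking the pipeline as a parameter keeps the helper testable with a stub object.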

Image OCR

from paper_structure import OCR

ocr = OCR()

# Text recognition (default)
print(ocr("table.png"))

# LaTeX formula recognition
print(ocr("formula.png", formula=True))
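
To drop a recognized formula into a Markdown file, the LaTeX usually needs math delimiters. The helper below is a small sketch that assumes `ocr(..., formula=True)` returns a plain LaTeX string, which the README does not state. `as_display_math` is a hypothetical name, not part of the package.

```python
def as_display_math(latex: str) -> str:
    """Wrap a LaTeX string in $$ delimiters so Markdown renderers show display math."""
    body = latex.strip()
    # Avoid double-wrapping if the input already carries $$ delimiters.
    if len(body) >= 4 and body.startswith("$$") and body.endswith("$$"):
        return body
    return f"$$\n{body}\n$$"
```

Under that assumption, `as_display_math(ocr("formula.png", formula=True))` would produce a block ready to paste into the exported Markdown.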

Model Management

from paper_structure.models import registry

registry.ensure_all()       # pre-download everything
print(registry.status())    # show cache status

Models

Models are downloaded automatically the first time they are needed. All model weights are hosted at hpllduck/PaperStructure (~399 MB total) and cached locally via huggingface_hub.

| Group | Files | Description |
| --- | --- | --- |
| latex_ocr | encoder, decoder, image_resizer, tokenizer | RapidLaTeXOCR formula recognition |
| yolox | yolox_l0.05.onnx | YOLOX-L document layout detection |
| paddle_ocr | det, cls, rec, dictionary | PP-OCRv5 text detection/recognition |

License

Apache License 2.0. Individual model weights retain their original licenses (MIT for LaTeX OCR, Apache-2.0 for YOLOX and PaddleOCR).

Acknowledgments
