Wrong position of cell in table while parsing table in PDF #707

Open
duongkstn opened this issue Jan 8, 2025 · 1 comment
Labels
bug Something isn't working

duongkstn commented Jan 8, 2025

Bug

...
Docling incorrectly parses a PDF that contains only one table. Here is the PDF (in Vietnamese):
mountain_table.pdf

Steps to reproduce

...

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.backend.docling_parse_backend import DoclingParseDocumentBackend

import markdown2

source = "mountain_table.pdf"
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = "accurate"
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["vie"])  # only relevant if do_ocr is enabled

# Try the v2 parse backend first; fall back to the v1 backend if conversion raises.
try:
    dl_doc = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=DoclingParseV2DocumentBackend
        )}).convert(source).document
    print('1')  # converted with the v2 backend
except Exception:
    dl_doc = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,  # pipeline options go here.
            backend=DoclingParseDocumentBackend
        )}).convert(source).document
    print('2')  # converted with the v1 fallback backend


text = dl_doc.export_to_markdown()
html = markdown2.markdown(text, extras=["tables"])
with open("mountain_table.html", "w", encoding="utf-8") as f:
    f.write(html)

Here is the result:
[Screenshot from 2025-01-08 15-46-40]

The cell "Mount Radenor" ends up in the wrong position in the parsed table, and I do not know how to fix this case.
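
One way to narrow this down is to check whether the cell is already misplaced in Docling's parsed table structure, or only after the Markdown is rendered to HTML by markdown2. The sketch below is a hypothetical helper (`find_cells` is not part of Docling) that reports where a cell's text sits in each parsed table. It assumes the document object exposes `tables`, and that each table's `data.table_cells` entries carry `text`, `start_row_offset_idx`, `start_col_offset_idx`, `row_span` and `col_span`, which matches the docling-core table model as far as I know; please verify against the installed version.

```python
def find_cells(doc, needle):
    """Return the grid position of every table cell whose text contains `needle`.

    `doc` is assumed to be the document returned by
    DocumentConverter(...).convert(source).document. Only `doc.tables` and
    each table's `data.table_cells` are read (attribute names are assumptions).
    """
    hits = []
    for table_idx, table in enumerate(doc.tables):
        for cell in table.data.table_cells:
            # Case-insensitive substring match on the cell text
            if needle.lower() in cell.text.lower():
                hits.append({
                    "table": table_idx,
                    "text": cell.text,
                    "row": cell.start_row_offset_idx,
                    "col": cell.start_col_offset_idx,
                    "row_span": cell.row_span,
                    "col_span": cell.col_span,
                })
    return hits


# Possible usage with the dl_doc from the reproduction script:
# for hit in find_cells(dl_doc, "Mount Radenor"):
#     print(hit)
```

If the reported row or column is already wrong here, or the spans are larger than expected, the problem would lie in the table structure step rather than in the Markdown or HTML export.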

Docling version

docling==2.7.0
docling-parse==2.1.2

Python version

python 3.10

duongkstn added the bug label Jan 8, 2025
duongkstn changed the title from "wrong position of cell if table" to "Wrong position of cell in table while parsing table in PDF" Jan 8, 2025
@gauravmindzk
We're also facing the same issue: in some tables in our PDFs, some of the rows get merged like in yours, which leads to incorrect parsing.
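
For anyone trying to pin down which rows are affected, here is a minimal sketch that works on the string returned by `export_to_markdown()`. It assumes the tables are exported as pipe tables with leading and trailing `|` characters. Both function names are hypothetical, and the check is only a heuristic: it flags rows with empty cells or an unexpected cell count as candidates to compare against the PDF, and a legitimately empty cell will be flagged too.

```python
def markdown_table_rows(markdown_text):
    """Split the pipe tables in a Markdown string into lists of cell strings.

    Returns a list of tables; each table is a list of rows, and the header
    separator row (|---|---|) is dropped. Escaped pipes inside cells are not
    handled, and a data row made up only of dashes would be dropped as well.
    """
    tables, current = [], []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            cells = [c.strip() for c in stripped[1:-1].split("|")]
            # Skip the separator row, e.g. |---|:---:|
            if all(c and set(c) <= set("-:") for c in cells):
                continue
            current.append(cells)
        elif current:
            # A non-table line ends the current table
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


def suspicious_rows(rows):
    """Return indices of rows with an empty cell or a cell count unlike the header's."""
    width = len(rows[0])
    return [
        i for i, row in enumerate(rows)
        if len(row) != width or any(cell == "" for cell in row)
    ]


# Possible usage with the Markdown text from the reproduction script:
# for table in markdown_table_rows(text):
#     print(len(table), "rows; check rows", suspicious_rows(table))
```

Comparing the row count per table with the number of rows visible in the PDF is a quick way to confirm that rows were merged.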
