Wrong position of cell in table while parsing table in PDF #707

duongkstn · 2025-01-08T08:47:36Z

Bug

...
Docling parses this PDF incorrectly, even though it contains only a single table. Here is the PDF (in Vietnamese):
mountain_table.pdf

Steps to reproduce

...

from docling.backend.docling_parse_backend import DoclingParseDocumentBackend
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

import markdown2

source = "mountain_table.pdf"

# The PDF has a text layer, so OCR is off; only table structure recognition runs.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = "accurate"
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["vie"])  # unused while do_ocr is False


def convert_with(backend):
    """Convert `source` with the given PDF backend and return the document."""
    converter = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=backend,
        )})
    return converter.convert(source).document


try:
    dl_doc = convert_with(DoclingParseV2DocumentBackend)
    print('1')  # converted with the v2 backend
except Exception:
    # Fall back to the v1 backend if the v2 backend fails.
    dl_doc = convert_with(DoclingParseDocumentBackend)
    print('2')  # converted with the v1 backend

# Export to Markdown, then render the Markdown table as HTML for inspection.
text = dl_doc.export_to_markdown()
html = markdown2.markdown(text, extras=["tables"])
with open("mountain_table.html", 'w', encoding="utf-8") as f:
    f.write(html)

Here is the result:

The cell "Mount Radenor" ends up in the wrong position, and I do not know how to fix this case.
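
To narrow down which rows are affected before comparing against the PDF by eye, one option is to scan the Markdown export for table rows that contain an empty cell, since a misplaced cell often leaves a gap in its original row. This is only a rough sketch: the helper name `find_rows_with_empty_cells` is hypothetical (it is not part of Docling), and it assumes the export uses plain pipe-delimited rows without escaped `|` characters inside cells.

```python
def find_rows_with_empty_cells(markdown: str) -> list[tuple[int, list[str]]]:
    """Return (line_number, cells) for each Markdown table row with an empty cell.

    Hypothetical diagnostic helper; assumes pipe-delimited rows and no escaped pipes.
    """
    suspicious = []
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        # Only consider table rows, which start and end with a pipe.
        if len(stripped) < 2 or not (stripped.startswith("|") and stripped.endswith("|")):
            continue
        cells = [cell.strip() for cell in stripped[1:-1].split("|")]
        # Skip the header separator row, e.g. |---|:---:|.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        if any(cell == "" for cell in cells):
            suspicious.append((line_number, cells))
    return suspicious


# Placeholder table for illustration only; it is not taken from mountain_table.pdf.
sample = "\n".join([
    "| Name | Height (m) |",
    "|------|------------|",
    "| Peak A | 3143 |",
    "| | 2500 |",
])
flagged = find_rows_with_empty_cells(sample)
```

With the script above, the same check could be run as `find_rows_with_empty_cells(text)` right after `export_to_markdown()`. An empty result does not prove the table is correct, because a cell can also be merged into a neighbour without leaving a gap.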

Docling version

docling==2.7.0
docling-parse==2.1.2

Python version

Python 3.10

gauravmindzk · 2025-01-08T11:11:33Z

We are facing the same issue: in some tables in our PDFs, some of the rows get merged, as in yours, which leads to incorrect parsing.

duongkstn added the bug label Jan 8, 2025

duongkstn changed the title ~~wrong position of cell if table~~ Wrong position of cell in table while parsing table in PDF Jan 8, 2025
