Text Extraction Improvements #2038
Replies: 6 comments 15 replies
-
What about just providing some guidance in the documentation pages on using the replace function? I'm not sure that substitution will be correct in all cases.
I have not yet found a case where it fails: can you provide one?
Help would be welcome in finding a reliable rule for it.
|
-
@MartinThoma
-
Sounds good, but we also have to keep in mind to remove the duplicates we have now, to get better visibility on the work to be done. For 5), check my latest PR; for 3) and 4), it may already be quite tough 🤔
-
This isn't anywhere close to being universal enough for implementation, but I thought I'd share an approach that's working for me in a narrow use case, in case it spurs some ideas. PDFs generated by Epic (the biggest EMR in the US) have an oddball internal structure. The OOTB pypdf extract_text() function was returning the text more or less "raw": elements were distributed vertically according to the order in which the Text Show operators appeared, with virtually no spacing between horizontally distributed elements. Fidelity to the rendered layout (especially for tabular data) was critical to my use case, so I created the routines below to address the issue.

Details of First Implementation

Caveats:
In any case, the basic idea is to collect the full set of text render operations with corresponding "effective transform" matrices before "putting pen to paper", as it were. Data is collected on a per-BT basis, and the effective transform data is used to sort and horizontally distribute the collected text once all the facts are in. A simple sort by the effective y and x coordinates then yields the rendered order. All of this may be useless given the narrow scope, but some of the concepts may be helpful, e.g.:
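The collect-then-sort idea can be illustrated with a toy, pypdf-free sketch: fragments carry the x/y translation taken from their effective transforms, are sorted top-to-bottom then left-to-right, and are grouped into lines by y coordinate. The coordinates and strings below are made up for illustration only:

```python
from itertools import groupby

# hypothetical (x, y, text) fragments, one per Tj operator, with x/y taken
# from the e/f entries of each operator's effective transform matrix
frags = [
    (200.0, 700.0, "Dose"),
    (50.0, 700.0, "Drug"),
    (50.0, 686.0, "aspirin"),
    (200.0, 686.0, "81 mg"),
]

# sort top-to-bottom (descending y), then left-to-right (ascending x)
frags.sort(key=lambda f: (-f[1], f[0]))

# group fragments sharing a y coordinate into rendered lines
lines = [
    " ".join(text for _x, _y, text in grp)
    for _y, grp in groupby(frags, key=lambda f: f[1])
]
```

A real implementation must also tolerate y values that differ by less than a line height, which is exactly what the merging logic further down handles.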
Let me know if this seems a worthy pursuit. I'd be happy to create a branch and extend the existing pypdf text extraction logic with this new approach.

"""new pdf text extraction algorithm
Usage:
import io
from pathlib import Path
from pypdf import PdfReader
fname = "FB01219A86F94518818875AB0828B31D_pg1.PDF"
byt = Path(fname).read_bytes()
tpdf = PdfReader(io.BytesIO(byt), False)
Path(f"{fname}.txt").write_text("\n".join(extract_structured_text(pg) for pg in tpdf.pages))
"""
# pylint: disable=invalid-name
import json
import math
from collections import ChainMap, Counter
from collections.abc import Iterator
from copy import copy
from itertools import groupby, pairwise
from pathlib import Path
from typing import Any, NamedTuple, cast
from pypdf import PageObject, _cmap
from pypdf import _text_extraction as tex
from pypdf.constants import PageAttributes as PG
from pypdf.generic import ContentStream, DictionaryObject, NameObject
class _Font(NamedTuple):
    space_width: int | float
    encoding: str | dict[int, str]
    char_map: dict
class _XfrmStack:
    """cm/tm transformation matrix manager"""

    def __init__(self) -> None:
        self.xfrm_stack = ChainMap(self.new_xform())
        self.q_queue: Counter[int] = Counter()
        self.q_depth = [0]

    @staticmethod
    def raw_xform(_a=1.0, _b=0.0, _c=0.0, _d=1.0, _e=0.0, _f=0.0):
        """only a/b/c/d/e/f matrix params"""
        return dict(zip(range(6), map(float, (_a, _b, _c, _d, _e, _f))))

    @staticmethod
    def new_xform(_a=1.0, _b=0.0, _c=0.0, _d=1.0, _e=0.0, _f=0.0, /, is_text=False):
        """a/b/c/d/e/f matrix params + 'is_text' key"""
        return _XfrmStack.raw_xform(_a, _b, _c, _d, _e, _f) | {"is_text": is_text}

    def reset_tm(self) -> ChainMap[int | str, float | bool]:
        """clear all xforms from chainmap having is_text==True"""
        while self.xfrm_stack.maps[0]["is_text"]:
            self.xfrm_stack = self.xfrm_stack.parents
        return self.xfrm_stack

    def remove_q(self):
        """rewind to the stack's prior state after closing a 'q' with internal 'cm' ops"""
        self.xfrm_stack = self.reset_tm()
        self.xfrm_stack.maps = self.xfrm_stack.maps[self.q_queue.pop(self.q_depth.pop(), 0) :]
        return self.xfrm_stack

    def add_q(self):
        """add another level to q_queue"""
        self.q_depth.append(len(self.q_depth))

    def add_cm(self, *args):
        """concatenate an additional transform matrix"""
        self.xfrm_stack = self.reset_tm()
        self.q_queue.update(self.q_depth[-1:])
        self.xfrm_stack = self.xfrm_stack.new_child(self.new_xform(*args))
        return self.xfrm_stack

    def add_tm(self, operands: list[float | int]):
        """append a text transform matrix"""
        if len(operands) == 2:
            operands = [1.0, 0.0, 0.0, 1.0, *operands]
        self.xfrm_stack = self.xfrm_stack.new_child(
            self.new_xform(*operands, is_text=True)  # type: ignore # mypy issue??
        )
        return self.xfrm_stack

    @property
    def effective_xform(self) -> list[float]:
        """the current effective transform accounting for both cm and text xforms"""
        eff_xform = [*self.xfrm_stack.maps[0].values()]
        for xform in self.xfrm_stack.maps[1:]:
            eff_xform = tex.mult(eff_xform, xform)  # type: ignore
        return eff_xform

    @property
    def scale_factors(self) -> tuple[float, float]:
        """x/y scale factors"""
        _a, _b, _c, _d, *_ = self.effective_xform
        return math.sqrt(_a**2 + _c**2), math.sqrt(_b**2 + _d**2)

    @property
    def xmaps(self):
        """internal ChainMap 'maps' property"""
        return self.xfrm_stack.maps
def _decode_tj(font: _Font, _b: bytes, xform_stack: _XfrmStack, debug=False) -> dict:
    try:
        if isinstance(font.encoding, str):
            _text = _b.decode(font.encoding, "surrogatepass")
        else:
            _text = "".join(
                font.encoding[x] if x in font.encoding else bytes((x,)).decode() for x in _b
            )
    except (UnicodeEncodeError, UnicodeDecodeError):
        _text = _b.decode("utf-8", "replace")
    _text = "".join(font.char_map[x] if x in font.char_map else x for x in _text)
    return {
        "text": _text,
        "effective_xform": xform_stack.effective_xform,
        "xform_stack": copy(xform_stack.xmaps) if debug else None,
    }
def _recurs_to_target_op(
    ops: Iterator[tuple[list, bytes]],
    xform_stack: _XfrmStack,
    end_target: bytes,
    fonts: dict[str, _Font],
    debug=False,
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """recurse operators between BT/ET and/or q/Q operators managing the xform
    stack and capturing text positioning and rendering data.

    Args:
        ops (Iterator[tuple[list, bytes]]): iterator of operators in content stream
        xform_stack (_XfrmStack): stack of cm/tm transformations to be applied
        end_target (bytes): Allowed values are b"Q" and b"ET"
        fonts (dict[str, _Font]): font dictionary
        debug (bool, optional): Captures all text operator data. Defaults to False.
            NOTE: performance penalty when debug=True.

    Raises:
        ValueError: if multiple fonts appear in a single BT

    Returns:
        tuple[list[dict[str, Any]], list[dict[str, Any]]]: list of dicts of text
        rendered by each BT operator + list of dicts of text rendered by individual
        Tj/TJ operators.
    """
    # 1 dict entry per BT operator. keys: tx, ty, space_width, font_height, font_width, text
    bt_groups: list[dict[str, Any]] = []
    # 1 dict entry per Tj operator. keys: text, effective_xform
    # extra key "xform_stack" added when debug=True
    tj_ops: list[dict[str, Any]] = []
    bt_grp: dict[str, Any] = {}  # current bt_groups dict entry
    if end_target == b"Q":
        # add new q level. cm's added at this level will be popped at next b'Q'
        xform_stack.add_q()
    while True:
        try:
            opands, op = next(ops)
        except StopIteration:
            return bt_groups, tj_ops
        if op == end_target:
            if op == b"Q":
                xform_stack.remove_q()
            if op == b"ET":
                if "tx" not in bt_grp:  # no Td or Tm operators in this BT, use base tx/ty
                    *_, tx, ty = xform_stack.effective_xform
                    bt_grp["tx"] = tx
                    bt_grp["ty"] = ty
                _text = ""
                last_tx = bt_grp["tx"]
                last_len = 0
                font_width = bt_grp["font_width"]
                for _tj in tj_ops:  # ... build text from new Tj operators
                    tx = _tj["effective_xform"][4]
                    distance = tx - last_tx
                    excess = distance - (last_len * font_width)
                    new_text = f'{" " * int(excess // font_width)}{_tj["text"]}'
                    _text = f"{_text}{new_text}"
                    last_tx = tx
                    last_len = len(new_text)
                bt_grp["text"] = _text
                bt_groups.append(bt_grp)
                xform_stack.reset_tm()
            return bt_groups, tj_ops
        if op == b"q":
            bts, tjs = _recurs_to_target_op(ops, xform_stack, b"Q", fonts, debug)
            bt_groups.extend(bts)
            tj_ops.extend(tjs)
        if op == b"cm":
            xform_stack.add_cm(*opands)
        if op == b"BT":
            bts, tjs = _recurs_to_target_op(ops, xform_stack, b"ET", fonts, debug)
            bt_groups.extend(bts)
            tj_ops.extend(tjs)
        if op == b"Tj":
            tj_ops.append(_decode_tj(fonts[bt_grp["font_name"]], opands[0], xform_stack, debug))
        if op == b"TJ":
            _ft = fonts[bt_grp["font_name"]]
            for tj_op in opands[0]:
                if isinstance(tj_op, bytes):
                    tj_ops.append(_decode_tj(_ft, tj_op, xform_stack, debug))
                elif str(tj_op).isnumeric():
                    xform_stack.add_tm([tj_op, 0])
        if op in (b"Td", b"Tm"):
            if op == b"Tm":
                xform_stack.reset_tm()
            xform_stack.add_tm(opands)
            *_, tx, ty = xform_stack.effective_xform
            bt_grp["tx"] = bt_grp.get("tx", tx)
            bt_grp["ty"] = bt_grp.get("ty", ty)
        if op == b"Tf":
            if "space_width" in bt_grp:
                raise ValueError("multiple fonts")
            xform_stack.reset_tm()
            x_scale, y_scale = xform_stack.scale_factors
            bt_grp["space_width"] = fonts[opands[0]].space_width * x_scale
            bt_grp["font_height"] = opands[1] * y_scale
            bt_grp["font_width"] = opands[1] * x_scale
            bt_grp["font_name"] = opands[0]
def extract_structured_text(pg: PageObject, space_expansion_factor=0.1, debug_file=None) -> str:
    """Get text from pypdf page preserving fidelity to rendered position

    Args:
        pg (PageObject): a pypdf PdfReader page
        space_expansion_factor (float, optional): higher values result in more spacing
            between text rendered in different BT operators. Defaults to 0.1.
        debug_file (str, optional): full path + filename prefix for debug output.
            Defaults to None. NOTE: significantly higher memory and processor usage.

    Returns:
        str: multiline string containing page text structured as it appeared in the
        source pdf.
    """
    # Font retrieval logic adapted from pypdf.PageObject._extract_text()
    objr = pg
    while NameObject(PG.RESOURCES) not in objr:
        objr = objr["/Parent"].get_object()  # type: ignore
    resources_dict = cast(DictionaryObject, objr[PG.RESOURCES])
    fonts: dict[str, _Font] = {}
    if "/Font" in resources_dict:
        for font_name in resources_dict["/Font"]:  # type: ignore
            cmap = _cmap.build_char_map(font_name, 200.0, pg)
            fonts[font_name] = _Font(*cmap[1:-1])
    if debug_file:
        Path(f"{debug_file}.fonts.json").write_text(
            json.dumps(fonts, indent=2, default=lambda x: getattr(x, "_asdict", str)(x)), "utf-8"
        )
    x_stack = _XfrmStack()  # transformation stack manager
    ops = iter(ContentStream(pg["/Contents"].get_object(), pg.pdf, "bytes").operations)
    bt_groups: list[dict] = []  # BT operator dict
    # keys: tx, ty, space_width, font_height, text
    tj_debug: list[dict] = []  # Tj/TJ operator data (debug only)
    try:
        debug = bool(debug_file)
        while True:
            _, op = next(ops)
            if op in (b"BT", b"q"):
                end_op = b"ET" if op == b"BT" else b"Q"
                bts, tjs = _recurs_to_target_op(ops, x_stack, end_op, fonts, debug)
                bt_groups.extend(bts)
                if debug:
                    tj_debug.extend(tjs)
    except StopIteration:
        pass
    # left align the data, i.e. decrement all tx values by min(tx)
    min_x = min(x["tx"] for x in bt_groups)
    meets_len = any(len(x["text"]) > 10 for x in bt_groups)
    bt_groups = [
        ogrp | {"tx": ogrp["tx"] - min_x} | ({"rawtx": ogrp["tx"]} if debug_file else {})
        for ogrp in sorted(bt_groups, key=lambda x: (x["ty"], -x["tx"]), reverse=True)
    ]
    if debug_file:
        Path(f"{debug_file}.bt.json").write_text(
            json.dumps(bt_groups, indent=2, default=str), "utf-8"
        )
        Path(f"{debug_file}.tj.json").write_text(
            json.dumps(tj_debug, indent=2, default=str), "utf-8"
        )
    # group the text operations by rendered y coordinate, i.e. the line number
    ty_groups = {
        ty: sorted(grp, key=lambda x: x["tx"])
        for ty, grp in groupby(bt_groups, key=lambda bt_grp: int(bt_grp["ty"]))
    }
    # combine groups whose y coordinates differ by less than the effective font height
    # (accounts for mixed fonts and other minor oddities)
    last_ty = list(ty_groups)[0]
    for ty in list(ty_groups)[1:]:
        text_groups = ty_groups[ty]
        this_fsz = text_groups[0]["font_height"]
        fsz = min((ty_groups[last_ty][0]["font_height"], this_fsz))
        txs = set(int(_t["tx"]) for _t in text_groups if _t["text"].strip())
        last_txs = set(int(_t["tx"]) for _t in ty_groups[last_ty] if _t["text"].strip())
        # prevent merge if both groups are rendering in the same x position.
        no_text_overlap = not any(chk in last_txs for chk in txs)
        offset_less_than_font_height = abs(ty - last_ty) < fsz
        if no_text_overlap and offset_less_than_font_height:
            ty_groups[last_ty] = sorted(
                ty_groups.pop(ty) + ty_groups[last_ty], key=lambda x: x["tx"]
            )
        else:
            last_ty = ty
    if debug_file:
        Path(f"{debug_file}.bt_line_groups.json").write_text(
            json.dumps(ty_groups, indent=2, default=str), "utf-8"
        )
    sp_width = min(
        (
            (p2["tx"] - p1["tx"]) / (len(p1["text"]) + space_expansion_factor)
            for tdictlist in ty_groups.values()
            for p1, p2 in pairwise(tdictlist)
            if p1["text"].strip()
            and (len(p1["text"].strip()) > 10 or not meets_len)
            and int(p2["tx"] - p1["tx"]) > 0
        ),
        default=0.1,
    )
    lines: list[str] = []
    for line_data in ty_groups.values():
        line = ""
        for bt_op in line_data:
            offset = int(bt_op["tx"] // sp_width)
            spaces = offset - len(line)
            line = f"{line}{' ' * spaces}{bt_op['text']}"
        lines.append(line)
    return "\n".join(ln.rstrip() for ln in lines if ln.strip())

Example Epic PDF page stripped of PHI:

NOTE: pypdf splitting operations also perform poorly on documents generated by the Epic engine. Epic stores ALL image data in a single shared Resources dictionary, and pypdf does NOT provide support for the removal of unrendered images in such a resource. I've hacked together another routine that replaces all unrendered images with an empty bytes object, yielding an enormous reduction in required storage for large, multi-split pdfs, but if you examine the internal structure of this sample, you'll notice that all of the /img# named image references (along with their corresponding indirect object references) are still present with empty streams. Should I create a separate discussion thread for this?

Edit: I tried out a few non-Epic PDFs to see what'd happen and found a few bugs. Also made some small performance and text encoding improvements. That's it for now. My codebase is working for me, so stopping there until it's not... ;)
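The horizontal distribution step used between Tj fragments above can also be sketched in isolation. This is a simplification invented for illustration (a fixed font_width and a hypothetical join_fragments helper), not the actual routine: the gap between fragments is padded with spaces whenever it exceeds the width already covered by the previous fragment.

```python
def join_fragments(frags: list[tuple[float, str]], font_width: float) -> str:
    """Pad space between text fragments based on their x positions.

    frags: (tx, text) pairs, already sorted by tx; font_width: assumed
    uniform glyph width in the same units as tx (a simplification).
    """
    out = ""
    last_tx, last_len = frags[0][0], 0
    for tx, text in frags:
        # excess horizontal distance not covered by the previous fragment
        excess = (tx - last_tx) - last_len * font_width
        out += " " * max(int(excess // font_width), 0) + text
        last_tx, last_len = tx, len(text)
    return out
```

With 6-unit glyphs, fragments at x=0 and x=60 end up separated by eight spaces: `join_fragments([(0.0, "ab"), (60.0, "cd")], 6.0)` covers 12 units with "ab" and pads the remaining 48 units.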
-
I have what I believe to be a fully functional "layout mode" implementation based on the concepts above, with none of the caveats. As implemented, structural fidelity is on par with what I consider to be SOTA in open source tools.

pip install pypdf
pip install -i https://test.pypi.org/simple/ pypdftotext

Once installed, usage in python is as follows:

from pathlib import Path
import pypdftotext
pdf = Path("some_pdf.pdf")  # can be bytes, Path, PdfReader, or io.BytesIO; used Path for convenience
pdf_text = pypdftotext.pdf_text(pdf)
print(pdf_text)

Top level functions

Performance Comparisons

Example 1

Source file: Claim Maker Alerts Guide_pg2.PDF

pypdf PageObject.extract_text() output:
pypdftotext output:
Example 2

Source file:

pypdf PageObject.extract_text() output:
pypdftotext output:
If there's any interest in trying to pull this into pypdf itself, I'd be happy to work toward that goal. As implemented, pypdftotext requires python 3.10+, but I don't think it'd be that difficult to adapt for earlier python3 versions. The biggest obstacles are likely to be 3.10+ typing features and usage of the walrus (:=) operator.
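As a minimal illustration of the typing obstacle, 3.10-style union annotations such as int | float can be written with typing.Union so they evaluate on older interpreters (SpaceWidth and scale are hypothetical names invented for this sketch, not part of pypdftotext):

```python
from typing import Union

# equivalent to "int | float" on Python 3.10+, but valid on earlier 3.x
SpaceWidth = Union[int, float]

def scale(width: SpaceWidth, factor: float) -> float:
    """Scale a font space width by a transform-derived factor."""
    return float(width) * factor
```

Alternatively, `from __future__ import annotations` defers annotation evaluation, which covers annotations but not unions used at runtime (e.g. in cast() or isinstance-style checks).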
-
Was this code merged into pypdf?
-
In the context of the LaTeX / math mode improvements which @pubpub-zz recently made, I had a look at what we could do to further improve text extraction: #2016 (comment)
I mainly don't want to lose this list. But maybe somebody has some ideas on how to implement them :-)
The ligature replacement seems to be rather straightforward.
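One possible sketch of the ligature replacement, assuming Unicode NFKC normalization is acceptable: compatibility normalization expands ligature code points such as U+FB01 (fi) and U+FB02 (fl) into their component letters. Note that NFKC also rewrites other compatibility characters (superscripts, fullwidth forms), so a targeted translation table may be safer if only ligatures should change.

```python
import unicodedata

def replace_ligatures(text: str) -> str:
    """Expand Unicode ligatures (ff, fi, fl, ffi, ffl, ...) via NFKC."""
    return unicodedata.normalize("NFKC", text)
```

For a narrower rule, `str.translate` with an explicit map of the Alphabetic Presentation Forms block (U+FB00 through U+FB06) would avoid NFKC's other rewrites.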