MAINT: Change the positions of the calls of the visitor-function #2364

Open. Wants to merge 1 commit into base: main.

MartinThoma
Member

Previously, the text visitor function was called each time the output changed.
This can produce wrong coordinates, because the output may be sent only after the text matrix has already been changed for the next text.
For an example, see resources/Sample_Td-matrix.pdf: the text_matrix is computed correctly at the Td operations, but the text was sent after the next transformation had been applied.
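
To make the timing problem concrete, here is a minimal toy model. It is not pypdf's implementation: the `run` function, the `flush_inside_tj` flag, and the operand values are all made up for illustration, and the text matrix is reduced to a plain (x, y) translation.

```python
# Toy model only: NOT pypdf's code. Ops are (operator, operands) pairs.
def run(ops, visitor, flush_inside_tj):
    x, y = 0.0, 0.0   # translation part of a simplified text matrix
    pending = ""      # text waiting to be reported to the visitor
    for op, args in ops:
        if op == "Td":
            x, y = x + args[0], y + args[1]
            if pending and not flush_inside_tj:
                # Late flush: the matrix already belongs to the NEXT text.
                visitor(pending, x, y)
                pending = ""
        elif op == "Tj":
            if flush_inside_tj:
                # Flush inside Tj: the matrix still matches this text.
                visitor(args[0], x, y)
            else:
                pending += args[0]
    if pending:
        visitor(pending, x, y)


# Hypothetical content stream: two lines of text, 20 units apart.
ops = [
    ("Td", (10, 100)), ("Tj", ("first",)),
    ("Td", (0, -20)), ("Tj", ("second",)),
]
late, eager = [], []
run(ops, lambda t, x, y: late.append((t, x, y)), flush_inside_tj=False)
run(ops, lambda t, x, y: eager.append((t, x, y)), flush_inside_tj=True)
```

With the late flush, "first" is reported with the y value of the second line; flushing inside Tj reports it with its own position.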

In this pull request, the text is sent inside the TJ and Tj operations.
This may lead to sending single letters or word fragments instead of whole words:

```
    x=264.53, y=403.13, text='M'
    x=264.53, y=403.13, text='etad'
    x=264.53, y=403.13, text='ata'
    x=307.85, y=403.13, text=' '
```
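
For reference, lines like the ones above could be produced by a visitor along these lines. This is a sketch, not necessarily the script that was used: `make_visitor` and `dump_first_page` are hypothetical helper names, and it assumes the five-argument `visitor_text` callback of `extract_text` (text, current transformation matrix, text matrix, font dictionary, font size), with the translation read from `tm[4]` and `tm[5]`.

```python
def make_visitor(lines):
    """Return a visitor that records one formatted line per text chunk."""
    def visitor(text, cm, tm, font_dict, font_size):
        # tm[4] and tm[5] hold the translation (x, y) of the text matrix.
        # The position on the page may additionally depend on cm.
        lines.append(f"x={tm[4]:.2f}, y={tm[5]:.2f}, text={text!r}")
    return visitor


def dump_first_page(path):
    """Hypothetical helper: collect visitor lines for the first page."""
    from pypdf import PdfReader  # imported lazily; requires pypdf

    lines = []
    PdfReader(path).pages[0].extract_text(visitor_text=make_visitor(lines))
    return lines


# The visitor can be exercised without a PDF, using made-up matrices:
lines = []
make_visitor(lines)(
    "etad", [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 264.53, 403.13], None, 12
)
```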

Therefore, there is a second commit which introduces a temporary visitor inside the processing of TJ.
The temporary visitor is used to collect the letters of a TJ operation, which are then sent once the TJ has been processed.
When the temporary visitor is set, the original parameter is overwritten. I don't know if this is bad style in Python.
If it is, a local variable current_text_visitor could be introduced instead.
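
A minimal sketch of the local-variable variant, to show the idea rather than the actual patch: `process_tj_array` is a hypothetical function, the kerning numbers in the example array are invented, and the visitor takes a simplified (text, x, y) signature instead of the real one.

```python
def process_tj_array(elements, x, y, text_visitor):
    """Report one TJ array as a single visitor call (toy model)."""
    fragments = []

    # Local temporary visitor; the caller's text_visitor stays untouched.
    def current_text_visitor(text, fx, fy):
        fragments.append(text)

    for element in elements:
        # Strings are text to show; numbers are kerning adjustments.
        if isinstance(element, str):
            current_text_visitor(element, x, y)

    # Send the collected fragments once, after the whole TJ is processed.
    if fragments and text_visitor is not None:
        text_visitor("".join(fragments), x, y)


seen = []
process_tj_array(
    ["M", -15, "etad", 20, "ata"],  # kerning values are made up
    264.53,
    403.13,
    lambda text, x, y: seen.append((text, x, y)),
)
```

The three fragments from the example output arrive as a single 'Metadata' call, and the `text_visitor` parameter is never reassigned.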

See also issue #1377. I haven't checked whether #1377 is about the Td-matrix problem or about the one solved by this PR.

--

This PR is a copy of #1389.
PR #1389 was made a long time ago (before we renamed to pypdf),
but it still seems valuable.

This PR migrates the changes to the new codebase. Full credit
to rogmann for all of the changes.

Co-authored-by: rogmann <[email protected]>
MartinThoma added the `workflow-advanced-text-extraction` label (Getting coordinates, font weight, font type, ...) on Dec 24, 2023.
stefan6419846 added the `needs-rebase` label (This PR cannot be merged as the main branch is too different. You need to rebase or merge main.) on Feb 23, 2024.
@stefan6419846
Collaborator

@MartinThoma Have there been any further plans regarding a merge?
