Leading spaces on new lines ignored in <pre><code> tags #1063

dmail00 · 2023-12-13T00:17:27Z

Describe the bug
This problem seems to be related to issue 547 and the mininal example is the test case for it with a slight modification.

class HTML2FPDF creates a self.pdf.text_columns and stores it in self._column this sets the region to ignore leading spaces, which is sensible. The self._paragraph instance is set to the return value of self._column.paragraph().
self._column.paragraph() has a default for skip_leading_spaces as False.
When a pre tag is found in the method handle_starttag a new paragraph is created, no value is provided for skip_leading_spaces , however even if False was provided (which is the defeault) this would be ignored and it would default to the region value, which as I noted about is set sensibibly to False.
I did attempt to set the paragraphs value to False after calling _new_paragraph in the "pre" tag handling but this also has no effect. This is because when collect_lines calls paragraph.build_lines the paragraphs value is set to False but it will then default to the regions value.

Error details
N/A

Minimal code
Please include some minimal Python code reproducing your issue:

from fpdf import FPDF

def test_leading_whitespace():
    pdf = FPDF()
    pdf.add_page()
    pdf.write_html(
        """
<pre><code>
Testing pre-code blocks
    that span multiple lines
and have tabs    and    spaces.
</code></pre>
"""
    )
    pdf.output("test_leading_whitespace.pdf")

If you don't know how to build a minimal reproducible example, please check this tutorial: https://stackoverflow.com/help/minimal-reproducible-example

Environment
Please provide the following information:

Operating System: Windows
Python version: 3.11.3
fpdf2 version used: 2.7.7

The text was updated successfully, but these errors were encountered:

dmail00 · 2023-12-13T00:53:43Z

Here is my hack to fix this and boy is it a hack :)

in html._new_paragraph add a parameter with a default of None. In the method if the param is not None and not the same as the current region value, then created a new region and cached the current. Then when creating the paragraph after ending the current check this value to supply for the constructor

    def _new_paragraph(
        self, align=None, line_height=1.0, top_margin=0, bottom_margin=0,skip_leading_spaces=None
    ):
        self._end_paragraph()
        self.align = align or ""
        if not top_margin and not self.follows_heading:
            top_margin = self.font_size / self.pdf.k
        if skip_leading_spaces is not None and skip_leading_spaces !=self._column.skip_leading_spaces:
            self._cached_column = self._column
            self._column = self.pdf.text_columns(skip_leading_spaces=skip_leading_spaces)
        self._paragraph = self._column.paragraph(
            text_align=align,
            line_height=line_height,
            skip_leading_spaces=skip_leading_spaces if skip_leading_spaces is not None else True,
            top_margin=top_margin,
            bottom_margin=bottom_margin,
        )
        self.follows_trailing_space = True
        self.follows_heading = False

When handling the start pre tag supply the new parameter to _new_paragraph

def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.font_stack.append((self.font_face, self.font_size, self.font_color))
            self.set_font(self.pre_code_font, self.font_size)
            self._pre_formatted = True
            self._new_paragraph(skip_leading_spaces=False)

In endtag when handling the pre tag assert that there is a cached region and reset it to the default and clear out the cached member instance

def handle_endtag(self, tag):
        ...
        if tag == "pre":
            self._end_paragraph()
            assert self._cached_column is not None
            self._column = self._cached_column
            self._cached_column = None

gmischler · 2023-12-13T08:59:56Z

Thanks for reporting this, @dmail00.

It looks like you're right, and the space skipping feature conflicts with <pre> sections.
Your attempted solution looks way too complicated though.
Since a <pre> tag sets its own state flag before triggerring a new paragraph, a rather small change should do the trick:

        self._paragraph = self._column.paragraph(
            text_align=align,
            line_height=line_height,
            skip_leading_spaces=not self._pre_formatted,
            top_margin=top_margin,
            bottom_margin=bottom_margin,
        )

We could also use some more tests in this area to cover the various possible combinations of tags.

Would you be interested to submit a PR?

dmail00 · 2023-12-13T10:00:49Z

Complicated yes, because there seems to be a conflict between the default value False and a False value being an unset value.
I tried something similar earlier however it does not work because firstly the paragraphs value will use the regions value

https://github.com/py-pdf/fpdf2/blob/master/fpdf/text_region.py#L328

And then even if it did, ._end_paragraph() calls self._column.render() which calls self.collect_lines() which calls paragraph.build_lines which calls MultiLineBreak which still defaults to the regions value if the paragraphs value is False.

https://github.com/py-pdf/fpdf2/blob/master/fpdf/text_region.py#L121

Hence why I swapped out the region and later swapped it back in. As I said it is a hack as I am really not sure the correct method to apply in a clean manner without messing other things up that rely on the False value in paragraph defaulting to the regions value.

When a pre tag is found in the method handle_starttag a new paragraph is created, no value is provided for skip_leading_spaces , however even if False was provided (which is the defeault) this would be ignored and it would default to the region value, which as I noted about is set sensibibly to False.
I did attempt to set the paragraphs value to False after calling _new_paragraph in the "pre" tag handling but this also has no effect. This is because when collect_lines calls paragraph.build_lines the paragraphs value is set to False but it will then default to the regions value.

dmail00 · 2023-12-13T11:00:29Z

Here is the same hack yet more confined to the _end_paragraph method and using @gmischler suggestion.
Still ugly as sin, brittle and doesn't resolve the underlying issue.

 self._paragraph = self._column.paragraph(
            text_align=align,
            line_height=line_height,
            skip_leading_spaces=not self._pre_formatted,
            top_margin=top_margin,
            bottom_margin=bottom_margin,
        )

    def _end_paragraph(self):
        self.align = ""
        if self._paragraph:
            self._column.end_paragraph()
            our_context = (
                self.pdf._pop_local_stack()  # pylint: disable=protected-access
            )
            self._column.render()
            self.pdf._push_local_stack(our_context)  # pylint: disable=protected-access
            self._paragraph = None
            self.follows_trailing_space = True
        if self._pre_formatted:
            if self._pre_started:
                #exited the pre block reset the column
                assert self._cached_column #type: ignore
                self._column = self._cached_column #type: ignore
                self._cached_column = None
            else:
                #next paragraph will be pre contents
                self._cached_column = self._column
                self._column = self.pdf.text_columns(skip_leading_spaces=False)

gmischler · 2023-12-13T11:53:43Z

Complicated yes, because there seems to be a conflict between the default value False and a False value being an unset value.

Ah I see, but then this is the underlying problem that we should really fix, instead of trying to work around it.

The default value for the skip_leading_spaces parameter of ParagraphCollectorMixin.paragraph() should not be False, but rather None. Then we test for skip_leading_spaces is None before applying the regions default. This will make sure that an argument explicitly given to the paragraph will always take precedence, whether it is True or False.

dmail00 · 2023-12-13T12:26:39Z

@gmischler Have you actually tried that code?
As I have mentioned now three times that only handles one part of the problem, what about in build_lines where even if the variable is false (as is the default for the class) it defaults to the region value?

gmischler · 2023-12-13T13:15:13Z

Have you actually tried that code?

No, I just tried to explain the concept.
If you find other places where things are not handled correctly, then of course those should also be fixed.
After the correct argument values or defaults have been picked when creating it, a paragraph should indeed just use its own value in build_lines(), and not refer to the regions settings anymore.

Skipping leading spaces was a relatively late and quick addition to the text region and line wrapping code, for the benefit of the HTML parser. At that time I simply made sure that the existing tests kept giving the correct results (in most cases: more correct than before). Unfortunately the existing tests didn't contain any leading whitespace in a <pre> segment, so I didn't catch this special case. Now that you have identified the problem, if you manage to clean it up for good, more power to you!

Lucas-C · 2024-05-24T09:18:09Z

Just to give a quick update on this topic: this is a confirmed bug, and we would welcome a PR solving this problem! 🙂

@dmail00 provided a good minimal reproduction case that could be easily converted into a unit test in test/html/,
and @gmischler provided a good starting plan to fix this in comment #1063 (comment).

dmail00 added the bug label Dec 13, 2023

Lucas-C assigned gmischler Dec 13, 2023

Lucas-C added html up-for-grabs labels Jan 2, 2024

Lucas-C added the hacktoberfest label May 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Leading spaces on new lines ignored in <pre><code> tags #1063

Leading spaces on new lines ignored in <pre><code> tags #1063

dmail00 commented Dec 13, 2023 •

edited

Loading

dmail00 commented Dec 13, 2023

gmischler commented Dec 13, 2023

dmail00 commented Dec 13, 2023 •

edited

Loading

dmail00 commented Dec 13, 2023 •

edited

Loading

gmischler commented Dec 13, 2023

dmail00 commented Dec 13, 2023

gmischler commented Dec 13, 2023

Lucas-C commented May 24, 2024 •

edited

Loading

Leading spaces on new lines ignored in <pre><code> tags #1063

Leading spaces on new lines ignored in <pre><code> tags #1063

Comments

dmail00 commented Dec 13, 2023 • edited Loading

dmail00 commented Dec 13, 2023

gmischler commented Dec 13, 2023

dmail00 commented Dec 13, 2023 • edited Loading

dmail00 commented Dec 13, 2023 • edited Loading

gmischler commented Dec 13, 2023

dmail00 commented Dec 13, 2023

gmischler commented Dec 13, 2023

Lucas-C commented May 24, 2024 • edited Loading

dmail00 commented Dec 13, 2023 •

edited

Loading

dmail00 commented Dec 13, 2023 •

edited

Loading

dmail00 commented Dec 13, 2023 •

edited

Loading

Lucas-C commented May 24, 2024 •

edited

Loading