Faulty extraction for very short documents #660

@Psynbiotik

This example shows that extracted text is duplicated and that words are run together even though they are separate in the HTML.

python:

from trafilatura import extract

html_string = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
    <section>
        <p>First</p>
        This gets Squished
        <div>
            <h4>There should be a space</h4>
            <p>Another sentence</p>
            This also gets Squished
        </div>
        <div>
            <h4>Where is the space</h4>
            <p>This sentence has to be long enough.</p>
        </div>
    </section>
</main>
</body>
</html>
"""

print(extract(html_string))

This produces the following output:
'First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough.'

You can see that "First" appears twice even though it occurs only once in the HTML, and the same is true of several other sentences. In addition, several pieces of text are run together, with the whitespace between them removed.
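To make the two symptoms easier to verify, here is a minimal sketch that uses only the standard library and does not call trafilatura. It collects the text segments from the HTML above and checks them against the output pasted in this report. The names `TextCollector`, `collect_segments`, and `find_problems` are hypothetical helpers written for this illustration, and `REPORTED_OUTPUT` is copied from the result shown above rather than regenerated.

```python
from html.parser import HTMLParser

HTML_STRING = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
    <section>
        <p>First</p>
        This gets Squished
        <div>
            <h4>There should be a space</h4>
            <p>Another sentence</p>
            This also gets Squished
        </div>
        <div>
            <h4>Where is the space</h4>
            <p>This sentence has to be long enough.</p>
        </div>
    </section>
</main>
</body>
</html>
"""

# Output copied from this report (not regenerated here).
REPORTED_OUTPUT = """First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough."""


class TextCollector(HTMLParser):
    """Hypothetical helper: records every non-empty text node in order."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(text)


def collect_segments(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.segments


def find_problems(segments, extracted):
    # Segments that occur more than once in the extracted text.
    duplicated = [s for s in segments if extracted.count(s) > 1]
    # Pairs of distinct segments that appear back to back with no separator.
    fused = [(a, b) for a in segments for b in segments
             if a != b and (a + b) in extracted]
    return duplicated, fused


segments = collect_segments(HTML_STRING)
duplicated, fused = find_problems(segments, REPORTED_OUTPUT)
print("Duplicated:", duplicated)
print("Fused pairs:", fused)
```

Each segment occurs exactly once in the HTML, so anything listed as duplicated or fused comes from the extraction step rather than from the input.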

Labels: enhancement
