Faulty extraction for very short documents

This example shows two problems: text is duplicated in the output, and words are squished together even though they are separate in the HTML.

```python
from trafilatura import extract

html_string = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
    <section>
        <p>First</p>
        This gets Squished
        <div>
            <h4>There should be a space</h4>
            <p>Another sentence</p>
            This also gets Squished
        </div>
        <div>
            <h4>Where is the space</h4>
            <p>This sentence has to be long enough.</p>
        </div>
    </section>
</main>
</body>
</html>
"""

print(extract(html_string))
```


This produces the following output:
'First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough.'


`First` appears twice in the output even though it occurs only once in the HTML, and the same is true of several other sentences. In addition, several words are squished together, with the whitespace between them removed.
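To make both symptoms easy to check, here is a minimal sketch that inspects the output reported above using only the standard library. The helper names `duplicated_fragments` and `squished_pairs` are hypothetical and not part of trafilatura; the `extracted` string is copied from the output in this report (without the surrounding quotes), and `expected_fragments` lists the text nodes from the HTML input.

```python
# Output reported above, copied from this issue (surrounding quotes dropped).
extracted = """First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough."""

# Text nodes as they appear in the HTML input, in document order.
expected_fragments = [
    "First",
    "This gets Squished",
    "There should be a space",
    "Another sentence",
    "This also gets Squished",
    "Where is the space",
    "This sentence has to be long enough.",
]


def duplicated_fragments(text, fragments):
    """Return the fragments that occur more than once in the text."""
    return [f for f in fragments if text.count(f) > 1]


def squished_pairs(text, fragments):
    """Return (left, right) fragment pairs that are joined with no separator."""
    return sorted(
        (a, b)
        for a in fragments
        for b in fragments
        if a != b and (a + b) in text
    )


print("Duplicated:", duplicated_fragments(extracted, expected_fragments))
for left, right in squished_pairs(extracted, expected_fragments):
    print(f"Squished: {left!r} + {right!r}")
```

The same two helpers can be pointed at the return value of `extract(html_string)` to see whether a given trafilatura version or setting still shows the problem.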


Faulty extraction for very short documents #660