Open
Labels
enhancement (New feature or request)
Description
This example shows that extracted text is duplicated and that words are squished together even though they are distinct in the HTML.
python:
from trafilatura import extract
html_string = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
<section>
<p>First</p>
This gets Squished
<div>
<h4>There should be a space</h4>
<p>Another sentence</p>
This also gets Squished
</div>
<div>
<h4>Where is the space</h4>
<p>This sentence has to be long enough.</p>
</div>
</section>
</main>
</body>
</html>
"""
print(extract(html_string))
This produces the following output:
'First
This gets SquishedThere should be a space
Another sentence
This also gets SquishedWhere is the space
This sentence has to be long enough.
First
This gets SquishedAnother sentence
This also gets SquishedThis sentence has to be long enough.'
"First" appears twice even though it occurs only once in the HTML, and the same is true of several other sentences. In addition, several words are squished together, with the whitespace between them removed.
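To make both symptoms easier to check, here is a minimal sketch that inspects the output reported above using only the standard library. It does not call trafilatura; `reported_output` is the output string copied from this issue, and `source_segments`, `find_duplicates`, and `find_fused_pairs` are hypothetical names introduced only for illustration.

```python
# Output reported in this issue, copied verbatim (trafilatura is not called here).
reported_output = (
    "First\n"
    "This gets SquishedThere should be a space\n"
    "Another sentence\n"
    "This also gets SquishedWhere is the space\n"
    "This sentence has to be long enough.\n"
    "First\n"
    "This gets SquishedAnother sentence\n"
    "This also gets SquishedThis sentence has to be long enough."
)

# Text segments that each occur exactly once in the HTML input, in document order.
source_segments = [
    "First",
    "This gets Squished",
    "There should be a space",
    "Another sentence",
    "This also gets Squished",
    "Where is the space",
    "This sentence has to be long enough.",
]


def find_duplicates(output, segments):
    """Map each segment that occurs more than once in the output to its count."""
    return {s: output.count(s) for s in segments if output.count(s) > 1}


def find_fused_pairs(output, segments):
    """List (left, right) pairs of distinct segments joined with no separator."""
    return [
        (a, b)
        for a in segments
        for b in segments
        if a != b and (a + b) in output
    ]


duplicates = find_duplicates(reported_output, source_segments)
fused = find_fused_pairs(reported_output, source_segments)

print("Duplicated segments:", duplicates)
print("Fused pairs:", fused)
```

The same two helpers could be pointed at the return value of `extract(html_string)` to confirm whether a given trafilatura version still shows the behavior.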