Also note that a follow-up block with a similar class name but different content works fine:
<div class="FulltextWrapper">
<div xmlns="" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:meta="http://www.springer.com/app/meta" class="MainTitleSection">
<h1 xmlns="http://www.w3.org/1999/xhtml" class="ArticleTitle" lang="en">The linguistic validation of the gut feelings questionnaire in three European languages</h1>
</div>
...
I do not have an exact explanation of which combination of XML/HTML blocks interferes with the XPath processing, but ultimately it comes down to the Fulltext naming.
Here is sample HTML code that Trafilatura fails to process correctly.
The root issue appears to be in FulltextWrapper, in particular in the Fulltext pattern recognition of an overly general XPath rule, such as the one in trafilatura/trafilatura/xpaths.py, line 45 in 42ada5a.
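To illustrate what I mean by a too-general rule, here is a minimal sketch using only the Python standard library. It is not Trafilatura's code, and I have not copied the actual expression from xpaths.py; it assumes the rule behaves like a case-insensitive substring test on the class attribute. The SAMPLE markup and the helper names substring_match and token_match are hypothetical, simplified stand-ins for the real page.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the markup in this report
# (namespaces dropped, title shortened). Not the real page.
SAMPLE = """
<html><body>
  <div class="FulltextWrapper">
    <div class="MainTitleSection">
      <h1 class="ArticleTitle">Some title</h1>
    </div>
  </div>
  <div class="Fulltext"><p>Body text</p></div>
</body></html>
"""

def substring_match(root, needle):
    """Assumed behaviour of a broad rule: the class value merely contains the needle."""
    return [el.get("class") for el in root.iter()
            if needle.lower() in (el.get("class") or "").lower()]

def token_match(root, token):
    """Stricter alternative: the needle must be a whole class token."""
    return [el.get("class") for el in root.iter()
            if token.lower() in (el.get("class") or "").lower().split()]

root = ET.fromstring(SAMPLE)
broad = substring_match(root, "fulltext")
strict = token_match(root, "fulltext")
print(broad)   # the wrapper block is caught as well
print(strict)  # only the block whose class token is exactly "Fulltext"
```

Under that assumption, the broad test also catches the FulltextWrapper block, while matching on whole class tokens does not. Whether token matching is the right fix for the real rule is a separate question; the sketch only shows why the naming alone can trigger it.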
I am not sure what percentage of websites is affected, but it seems to me this could be handled in a more fail-safe way.
The particular call that results in the error is:
trafilatura.bare_extraction(html_content)
Would be great to see those cases handled. Thanks for all the great work!