Issues with xpath processing along the "FullText" path template recognition. #780

krstp · 2025-01-29T17:45:57Z

Here is a sample HTML snippet:

<div class="FulltextWrapper">
    ...
</div>

Trafilatura fails to process it correctly.

The root issue appears to be the FulltextWrapper class name, and in particular the way the Fulltext pattern is recognized: the XPath rule that matches it is too general, for example:

trafilatura/trafilatura/xpaths.py

Line 45 in 42ada5a

contains(translate(@class, "FULTEX","fultex"), "fulltext")

I am not sure what percentage of websites is affected, but it seems to me this could be done in a more fail-safe way.
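To illustrate why the rule quoted above is so broad, here is a minimal sketch that emulates the XPath predicate in plain Python. It is not Trafilatura's code: `loose_match` only mirrors the quoted `contains(translate(...))` expression, and `token_match` is a hypothetical stricter variant shown for comparison. A stricter rule may also drop wrappers that hold real article text, so this is an assumption to test rather than a proposed fix.

```python
# Emulation of the XPath predicate
#   contains(translate(@class, "FULTEX", "fultex"), "fulltext")
# in plain Python. Illustration only; not Trafilatura's own code.

# XPath translate() maps each character of the second argument to the
# character at the same position in the third argument.
_LOWER = str.maketrans("FULTEX", "fultex")


def loose_match(class_attr: str) -> bool:
    """Substring test, mirroring the quoted XPath rule."""
    return "fulltext" in class_attr.translate(_LOWER)


def token_match(class_attr: str) -> bool:
    """Hypothetical stricter test: a whole class token must equal 'fulltext'."""
    return any(tok.translate(_LOWER) == "fulltext" for tok in class_attr.split())


if __name__ == "__main__":
    samples = ("FulltextWrapper", "fulltext", "article FullText", "layout__main--wide")
    for cls in samples:
        print(f"{cls!r}: loose={loose_match(cls)} token={token_match(cls)}")
```

With the substring test, any class attribute that merely contains "fulltext" in some casing is selected, including "FulltextWrapper"; the token-based variant only accepts a class token that is exactly "fulltext".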

The particular call that results in the error is: trafilatura.bare_extraction(html_content)

Would be great to see those cases handled. Thanks for all the great work!

krstp · 2025-01-29T17:57:15Z

Additional notes:

It seems the issue is directly related to the surrounding markup structure, such as:

            <div class="container">
                <div id="main" class="layout">
                    <div class="layout__main--wide" id="main-content">
                        <div class="block" id="Test-ImgSrc">

<!--- WRONG ----->
                            <div class="FulltextWrapper">
                               ...
                            </div>
<!--- WRONG ----->

Also note that the follow-up block, with the same class name but different content, works fine:

                            <div class="FulltextWrapper">
                                <div xmlns="" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:meta="http://www.springer.com/app/meta" class="MainTitleSection">
                                    <h1 xmlns="http://www.w3.org/1999/xhtml" class="ArticleTitle" lang="en">The linguistic validation of the gut feelings questionnaire in three European languages</h1>
                                </div>
...

So I do not have an exact explanation of which combination of XML/HTML blocks interferes with the XPath processing, but ultimately it comes down to the fulltext class naming.

adbar added the bug label on Jan 31, 2025
