Issues with xpath processing along the "FullText" path template recognition. #780

Open
krstp opened this issue Jan 29, 2025 · 1 comment
Labels
bug Something isn't working


krstp commented Jan 29, 2025

Here is a sample HTML code:

<div class="FulltextWrapper">
    ...
</div>    

Trafilatura fails to process it correctly.

The root cause appears to be the class name FulltextWrapper, which is caught by an overly general XPath rule for "fulltext" pattern recognition, such as:

contains(translate(@class, "FULTEX","fultex"), "fulltext")
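To make the breadth of that rule concrete, here is a minimal sketch that emulates the expression in plain Python. It is not Trafilatura's code: `matches_fulltext_rule` is a hypothetical helper, and `str.translate` stands in for XPath's `translate()`, which is equivalent here because both argument strings have the same length.

```python
# Emulate the XPath 1.0 expression
#   contains(translate(@class, "FULTEX", "fultex"), "fulltext")
# in plain Python to see which class values it accepts.
_LOWER_FULTEX = str.maketrans("FULTEX", "fultex")


def matches_fulltext_rule(class_attr: str) -> bool:
    """Return True if the class attribute would satisfy the rule above."""
    # translate() only lowercases the letters F, U, L, T, E, X;
    # contains() is then a plain substring test.
    return "fulltext" in class_attr.translate(_LOWER_FULTEX)


for value in ("FulltextWrapper", "FullText", "fulltext-link", "ArticleTitle"):
    print(value, matches_fulltext_rule(value))
```

Because the test is a substring match, any class value that merely contains "fulltext" in any casing is accepted, including FulltextWrapper.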

I am not sure what percentage of websites are affected, but it seems to me this could be done in a more fail-safe way.
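As one possible reading of "more fail-safe", the sketch below compares whole class tokens instead of substrings. This is only an illustration under my own assumptions, not a proposed patch: `has_fulltext_token` is a hypothetical name, and a stricter rule could also stop matching sites that the current rule is meant to catch.

```python
def has_fulltext_token(class_attr: str) -> bool:
    """Hypothetical stricter check: accept only a whole class token
    equal to "fulltext" (case-insensitive), not any substring."""
    return any(token.lower() == "fulltext" for token in class_attr.split())


# Rough XPath 1.0 counterpart of the same idea (the usual token idiom):
#   contains(concat(" ", normalize-space(translate(@class, "FULTEX", "fultex")), " "),
#            " fulltext ")

print(has_fulltext_token("FulltextWrapper"))   # substring only, so rejected
print(has_fulltext_token("article FullText"))  # whole token, so accepted
```

The trade-off is that compound names such as `fulltext-body` would no longer match, so whether this is acceptable depends on which pages the rule was written for.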

The particular call that results in the error is: trafilatura.bare_extraction(html_content)

Would be great to see those cases handled. Thanks for all the great work!


krstp commented Jan 29, 2025

Additional notes:

The issue seems to be directly related to the surrounding markup structure, such as:

            <div class="container">
                <div id="main" class="layout">
                    <div class="layout__main--wide" id="main-content">
                        <div class="block" id="Test-ImgSrc">

<!--- WRONG ----->
                            <div class="FulltextWrapper">
                               ...
                            </div>
<!--- WRONG ----->

Also note that the follow-up block, with a similar class name but different content, works fine:

                            <div class="FulltextWrapper">
                                <div xmlns="" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:meta="http://www.springer.com/app/meta" class="MainTitleSection">
                                    <h1 xmlns="http://www.w3.org/1999/xhtml" class="ArticleTitle" lang="en">The linguistic validation of the gut feelings questionnaire in three European languages</h1>
                                </div>
...

So I do not have an exact explanation of which combination of XML/HTML blocks interacts badly during XPath processing, but ultimately it comes down to the "fulltext" naming.

adbar added the bug label on Jan 31, 2025