Unchecked input language field can cause IOError #104

Open
uncommoncode opened this issue May 13, 2014 · 2 comments
@uncommoncode

There is a repeatable error: some malformed HTML language meta tags cause an IOError within goose. This happens because the OutputFormatter.get_language method in outputformatters.py trusts the meta tag input:

    def get_language(self, article):
...
                return article.meta_lang[:2]
...

This is the error message:

Traceback (most recent call last):
  File "bug.py", line 19, in test_language
    content = g.extract(raw_html=html)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/__init__.py", line 63, in crawl
    article = crawler.crawl(crawl_candiate)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/crawler.py", line 131, in crawl
    self.article.cleaned_text = self.formatter.get_formatted_text()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/outputformatters.py", line 66, in get_formatted_text
    self.remove_fewwords_paragraphs()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/outputformatters.py", line 123, in remove_fewwords_paragraphs
    stop_words = self.stopwords_class(language=self.get_language()).get_stopword_count(text)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/text.py", line 98, in __init__
    self._cached_stop_words[language] = set(FileHelper.loadResourceFile(path).splitlines())
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/utils/__init__.py", line 79, in loadResourceFile
    raise IOError("Couldn't open file %s" % path)
IOError: Couldn't open file /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/resources/text/stopwords-un.txt

This is an example that replicates this failure:

import unittest

import goose


class UncheckedInput(unittest.TestCase):
    def test_language(self):
        html = """
        <html>
            <head>
                <meta name="dcterms.language" content="und" />
            </head>
            <body>
                <div class="body">
                    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. In fermentum mollis tortor a placerat. Donec pretium, ipsum in vestibulum mollis, lorem tortor suscipit metus, ac tristique ipsum neque id metus. Suspendisse potenti. Integer id neque lorem. Aliquam in purus felis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque viverra, tortor eu aliquet facilisis, leo enim laoreet urna, a pulvinar nisl est id risus. Proin vel nibh fringilla, rhoncus nunc vel, aliquam felis.</p>
                </div>
            </body>
        </html>
        """
        g = goose.Goose()
        content = g.extract(raw_html=html)

if __name__ == '__main__':
    unittest.main()

This can be fixed by checking the input against the set of allowed languages, i.e. the languages that have a stop-word list.
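A minimal sketch of such a check, not goose's actual code. `KNOWN_LANGUAGES` and `resolve_language` are hypothetical names, the language set is copied from the stop-word languages listed further down this thread, and the `"en"` fallback and lower-casing are assumptions:

```python
# Hypothetical helper, not part of goose: validate the meta language
# before it is used to build a stop-word file name.
KNOWN_LANGUAGES = frozenset(
    "ar da de en es fi fr hu id it ko nb nl no pl pt ru sv zh".split()
)


def resolve_language(meta_lang, default="en"):
    """Return a two-letter code that has a stop-word list, else `default`."""
    if not meta_lang:
        return default
    # get_language truncates with [:2]; lower-casing is an added assumption.
    code = meta_lang[:2].lower()
    return code if code in KNOWN_LANGUAGES else default
```

With this in place, a value such as `"und"` is truncated to `"un"`, fails the membership test, and falls back to the default instead of reaching the file loader.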

@bmuller

bmuller commented Jun 11, 2014

+1 for getting this fixed. Seems like an easy fix - but a nasty bug. This exception gets thrown all the time w/ some basic crawling.

@reynhout

Another similar failure:

http://www.nasa.gov/press/2014/september/nasa-s-mars-curiosity-rover-arrives-at-martian-mountain/

contains this tag:

  <meta name="dc.language" content="und" />

Goose truncates the value to "un" and tries to load stopwords-un.txt, which raises IOError because the file does not exist.

This tag/attribute/key-value is Dublin Core metadata (http://dublincore.org/), not really HTML per se, and probably not a good place to try to determine page language. "und" is the ISO 639-2 code for "undetermined", so it does not identify a language at all.

It's getting picked up as a language definition by the parser, which uses any <meta> tag with an attribute matching a regex of "lang".

This might be a second bug: the stopwords file should definitely be checked for existence before trying to read it, but the regex also seems overly broad. But possibly this is intentional -- I haven't run it against a huge corpus and had to deal with the fallout.
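The existence check could look roughly like the sketch below. It is an illustration only: `load_stopwords` is a hypothetical function, not goose's `FileHelper.loadResourceFile`, the file-name pattern is taken from the traceback above, and falling back to `"en"` is an assumption.

```python
import os


def load_stopwords(resource_dir, language, default="en"):
    """Read stopwords-<language>.txt, falling back to `default` if missing."""
    path = os.path.join(resource_dir, "stopwords-%s.txt" % language)
    if not os.path.isfile(path):
        # Unknown language code: use the default list instead of raising.
        path = os.path.join(resource_dir, "stopwords-%s.txt" % default)
    with open(path) as handle:
        return set(handle.read().splitlines())
```

If the default file is also missing, `open` still raises, which seems reasonable because that would indicate a broken install rather than bad input.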

For now, I'm getting around it by setting RE_LANG in goose/extractors.py to a regex matching the current list of stopword languages, case-insensitively:

-RE_LANG = r'^[A-Za-z]{2}$'
+RE_LANG = r'(?i)^(ar|da|de|en|es|fi|fr|hu|id|it|ko|nb|nl|no|pl|pt|ru|sv|zh)$'

get_meta_lang() checks its determination of page language against RE_LANG before returning it. If there's no match, it returns None instead. The extractor then defaults to en, which fixes the failures in my corpus.

Obvious drawback to this solution: RE_LANG would need to be kept in sync with the list of stopword dictionaries.
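One way to avoid the sync problem might be to build the pattern from the stop-word files that actually exist. This is a sketch under assumptions: `build_re_lang` is a hypothetical helper, and it assumes the files follow the `stopwords-XX.txt` naming seen in the traceback and live in a single directory.

```python
import os
import re


def build_re_lang(resource_dir):
    """Build a RE_LANG-style pattern from the stopwords-XX.txt files present."""
    codes = []
    for name in sorted(os.listdir(resource_dir)):
        match = re.match(r'^stopwords-([A-Za-z]{2})\.txt$', name)
        if match:
            codes.append(match.group(1).lower())
    if not codes:
        raise ValueError("no stop-word files found in %s" % resource_dir)
    return r'(?i)^(%s)$' % '|'.join(codes)
```

The result could be computed once at import time and assigned to `RE_LANG`, so adding a new stop-word file would extend the accepted languages without a code change.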

reynhout added a commit to reynhout/python-goose that referenced this issue Sep 18, 2014