Unchecked input language field can cause IOError #104

uncommoncode · 2014-05-13T01:32:57Z

Some malformed HTML language meta tags reproducibly cause an IOError within goose. The cause is that the OutputFormatter.get_language method in outputformatters.py trusts the meta tag input:

    def get_language(self, article):
...
                return article.meta_lang[:2]
...

This is the error message:

Traceback (most recent call last):
  File "bug.py", line 19, in test_language
    content = g.extract(raw_html=html)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/__init__.py", line 63, in crawl
    article = crawler.crawl(crawl_candiate)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/crawler.py", line 131, in crawl
    self.article.cleaned_text = self.formatter.get_formatted_text()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/outputformatters.py", line 66, in get_formatted_text
    self.remove_fewwords_paragraphs()
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/outputformatters.py", line 123, in remove_fewwords_paragraphs
    stop_words = self.stopwords_class(language=self.get_language()).get_stopword_count(text)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/text.py", line 98, in __init__
    self._cached_stop_words[language] = set(FileHelper.loadResourceFile(path).splitlines())
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/utils/__init__.py", line 79, in loadResourceFile
    raise IOError("Couldn't open file %s" % path)
IOError: Couldn't open file /usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/goose/resources/text/stopwords-un.txt

This is an example that replicates this failure:

import unittest

import goose


class UncheckedInput(unittest.TestCase):
    def test_language(self):
        html = """
        <html>
            <head>
                <meta name="dcterms.language" content="und" />
            </head>
            <body>
                <div class="body">
                    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. In fermentum mollis tortor a placerat. Donec pretium, ipsum in vestibulum mollis, lorem tortor suscipit metus, ac tristique ipsum neque id metus. Suspendisse potenti. Integer id neque lorem. Aliquam in purus felis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque viverra, tortor eu aliquet facilisis, leo enim laoreet urna, a pulvinar nisl est id risus. Proin vel nibh fringilla, rhoncus nunc vel, aliquam felis.</p>
                </div>
            </body>
        </html>
        """
        g = goose.Goose()
        content = g.extract(raw_html=html)

if __name__ == '__main__':
    unittest.main()

This can be fixed by checking the input against the set of languages for which a stopword list exists.
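
A minimal sketch of such a check, not goose's actual code. `ALLOWED_LANGUAGES` and `safe_language` are hypothetical names, the contents of the set are illustrative only, and the lowercasing step is an assumption:

```python
# Hypothetical helper; the names and the allowed set are assumptions for illustration.
ALLOWED_LANGUAGES = frozenset(["en", "es", "fr", "de"])


def safe_language(meta_lang, allowed=ALLOWED_LANGUAGES, default="en"):
    """Return the two-letter code from meta_lang if it is allowed, else default."""
    if not meta_lang:
        return default
    # Same [:2] truncation as get_language, so "und" becomes "un".
    code = meta_lang[:2].lower()
    return code if code in allowed else default
```

With this in place, a value such as "und" would fall back to the default instead of producing a path to a stopword file that does not exist.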


bmuller · 2014-06-11T22:11:56Z

+1 for getting this fixed. Seems like an easy fix - but a nasty bug. This exception gets thrown all the time w/ some basic crawling.

reynhout · 2014-09-12T23:08:41Z

Another similar failure:

http://www.nasa.gov/press/2014/september/nasa-s-mars-curiosity-rover-arrives-at-martian-mountain/

contains this tag:

  <meta name="dc.language" content="und" />

Goose then tries to load stopwords-un.txt (named after the first two characters of "und"), which raises IOError because the file does not exist.

This tag/attribute/key-value is Dublin Core metadata: http://dublincore.org/, but not really HTML, per se, and probably not a good place to try to determine page language. I can't find a reference for the meaning of "und" ... possibly "undetermined" as defined by NASA's CMS software?

It's getting picked up as a language definition by the parser, which uses any <meta> tag with an attribute matching a regex of "lang".

This might be a second bug: the stopwords file should definitely be checked for existence before trying to read it, but the regex also seems overly broad. But possibly this is intentional -- I haven't run it against a huge corpus and had to deal with the fallout.
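
An existence check could look roughly like the sketch below. `load_stopwords` and `resource_dir` are hypothetical names rather than goose's API, and the `stopwords-<code>.txt` naming is inferred from the file name in the traceback:

```python
import os


def load_stopwords(resource_dir, language, default="en"):
    """Load the stopword set for language, using default if its file is missing."""
    path = os.path.join(resource_dir, "stopwords-%s.txt" % language)
    if not os.path.isfile(path):
        # No dictionary for this language: fall back instead of raising IOError.
        path = os.path.join(resource_dir, "stopwords-%s.txt" % default)
    with open(path) as handle:
        return set(handle.read().splitlines())
```

If the default file is missing too, `open` still raises, which is probably the right behaviour for a broken install.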

For now, I'm getting around it by setting RE_LANG in goose/extractors.py to a regex matching the current list of stopword languages, case-insensitively:

-RE_LANG = r'^[A-Za-z]{2}$'
+RE_LANG = r'(?i)^(ar|da|de|en|es|fi|fr|hu|id|it|ko|nb|nl|no|pl|pt|ru|sv|zh)$'

get_meta_lang() checks its determination of page language against RE_LANG before returning it. If there's no match, it returns None instead. The extractor then defaults to en, which fixes the failures in my corpus.
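
A rough sketch of that check-then-default flow. `validate_lang` is a hypothetical stand-in, not the real get_meta_lang(), and truncating the candidate to two characters before matching is an assumption based on the [:2] slice quoted in the original report:

```python
import re

# Pattern copied from the change above.
RE_LANG = r'(?i)^(ar|da|de|en|es|fi|fr|hu|id|it|ko|nb|nl|no|pl|pt|ru|sv|zh)$'


def validate_lang(candidate):
    """Return a lowercase two-letter code if it matches RE_LANG, else None."""
    if not candidate:
        return None
    code = candidate[:2]  # assumed truncation, e.g. "und" -> "un"
    if re.search(RE_LANG, code):
        return code.lower()
    return None


# The caller supplies the default when validation fails.
language = validate_lang("und") or "en"
```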

Obvious drawback to this solution: RE_LANG would need to be kept in sync with the list of stopword dictionaries.
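
One way to avoid the sync problem might be to derive the pattern from the stopword files themselves. This is only a sketch: `build_lang_pattern` and `resource_dir` are hypothetical, and it assumes every dictionary is named `stopwords-<two-letter code>.txt`, which the traceback shows for one file only:

```python
import os
import re


def build_lang_pattern(resource_dir):
    """Build a case-insensitive regex from the stopwords-<code>.txt file names."""
    codes = []
    for name in sorted(os.listdir(resource_dir)):
        match = re.match(r'^stopwords-([A-Za-z]{2})\.txt$', name)
        if match:
            codes.append(match.group(1).lower())
    if not codes:
        raise ValueError("no stopword files found in %s" % resource_dir)
    return r'(?i)^(%s)$' % "|".join(codes)
```

The resulting string has the same shape as the hand-written RE_LANG, so it could be computed once at import time and used the same way.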


reynhout added a commit to reynhout/python-goose that referenced this issue Sep 18, 2014

fix grangier#104: default to "en" when we detect a language with no s…

393d99d

…topword dictionary. See grangier#104
