Unchecked input language field can cause IOError #104
Comments
+1 for getting this fixed. Seems like an easy fix - but a nasty bug. This exception gets thrown all the time w/ some basic crawling.
Another similar failure: http://www.nasa.gov/press/2014/september/nasa-s-mars-curiosity-rover-arrives-at-martian-mountain/ contains this tag:
<meta name="dc.language" content="und" />
And tries to load
This tag/attribute/key-value is Dublin Core metadata (http://dublincore.org/), but not really HTML, per se, and probably not a good place to try to determine page language. I can't find a reference for the meaning of "und" ... possibly "undetermined" as defined by NASA's CMS software?
It's getting picked up as a language definition by the parser, which uses any
This might be a second bug: the stopwords file should definitely be checked for existence before trying to read it, but the regex also seems overly broad. This may be intentional, though; I haven't run it against a huge corpus and had to deal with the fallout.
For now, I'm getting around it by setting
-RE_LANG = r'^[A-Za-z]{2}$'
+RE_LANG = r'(?i)^(ar|da|de|en|es|fi|fr|hu|id|it|ko|nb|nl|no|pl|pt|ru|sv|zh)$'
Obvious drawback to this solution:
…topword dictionary. See grangier#104
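To make the difference between the two RE_LANG patterns quoted in the comment above concrete, here is a minimal sketch that runs both against a few sample values. The helper name `matches_lang` and the sample inputs are hypothetical and only for illustration; the two patterns are copied from the comment.

```python
import re

# Patterns copied from the comment above.
RE_LANG_ORIGINAL = r'^[A-Za-z]{2}$'
RE_LANG_WHITELIST = r'(?i)^(ar|da|de|en|es|fi|fr|hu|id|it|ko|nb|nl|no|pl|pt|ru|sv|zh)$'


def matches_lang(pattern, value):
    """Hypothetical helper: True if value is accepted as a language code."""
    return re.search(pattern, value) is not None


if __name__ == "__main__":
    # "xx" is two letters but not in the whitelist; "EN" checks case handling.
    for sample in ("en", "EN", "xx", "und"):
        print(sample,
              matches_lang(RE_LANG_ORIGINAL, sample),
              matches_lang(RE_LANG_WHITELIST, sample))
```

The original pattern accepts any two letters, so a value such as "xx" passes even though it names no language in the list, while the whitelist pattern rejects it. The whitelist has to be kept in sync by hand with whatever languages are actually supported.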
There is a repeatable error with some malformed HTML language meta tags that causes an IOError within goose. This is due to trusting the meta tag input in the OutputFormatter.get_language method in outputformatters.py:
This is the error message:
This is an example that replicates this failure:
This can be fixed by checking input against a set of allowed languages/stop wordlists.
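One way the suggested check might look, as a rough sketch rather than goose's actual code: build the set of allowed languages from the stopword files that exist on disk, and fall back to a default when the meta tag value is not in that set. The function names, the `stopwords-<lang>.txt` file naming, and the `'en'` default are all assumptions made for illustration.

```python
import os
import re

# Assumed file naming for stopword lists, e.g. "stopwords-en.txt".
STOPWORDS_FILE_RE = re.compile(r'^stopwords-([a-z]{2})\.txt$')


def available_languages(stopwords_dir):
    """Hypothetical helper: language codes that have a stopword file."""
    langs = set()
    for name in os.listdir(stopwords_dir):
        match = STOPWORDS_FILE_RE.match(name)
        if match:
            langs.add(match.group(1))
    return langs


def safe_language(meta_lang, allowed, default='en'):
    """Hypothetical helper: return a usable language code.

    The meta tag value is untrusted input, so it is only accepted when
    a stopword list exists for it; otherwise the default is returned.
    """
    if not meta_lang:
        return default
    candidate = meta_lang.strip()[:2].lower()
    return candidate if candidate in allowed else default
```

Deriving the allowed set from the files themselves means a missing stopword list can never be requested, so the read that raised the IOError is not attempted for unknown codes.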