Unicode encoding problems while check stop words #136

Open
vladimir-shmidt opened this issue Aug 10, 2014 · 3 comments

@vladimir-shmidt

I tried to extract a Russian article, but goose produced an empty result. I debugged it and found that the extracted content (text from the p tag) cannot be found in the loaded stop-word list, even though it is 100% in the list. So I suppose it is a string equality problem in Python or something similar. In the bottom-right corner of the screenshot I've added watch items: the current word, the result of checking it against the stop-word set, and the position of the current word in the stop-word list.
image
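
The symptom described above (a word that is visibly in the list but is never matched) is what a text-versus-bytes mismatch looks like in Python. The following is a minimal sketch, not taken from goose itself: the one-word stop list and the variable names are made up for illustration, and whether goose actually holds the two sides in different types here is an assumption.

```python
# -*- coding: utf-8 -*-
# Hypothetical data: a one-word stop list held as text (unicode) strings.
stop_words = {u"\u0438"}  # the Russian word "и"

# The same word as raw UTF-8 bytes, as it might arrive from an extractor.
word_bytes = u"\u0438".encode("utf-8")

# A non-ASCII byte string does not compare equal to the matching text
# string, so the membership test misses even though the word "is there".
found_as_bytes = word_bytes in stop_words

# Decoding to text first puts both sides in the same type.
found_as_text = word_bytes.decode("utf-8") in stop_words

print(found_as_bytes, found_as_text)
```

If the two flags differ like this in the debugger, the stop list and the extracted words are probably stored in different string types.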

@vladimir-shmidt
Author

I suppose the following change in class StopWords(object):
self._cached_stop_words[language] = set(FileHelper.loadResourceFile(path).encode('utf-8').splitlines())
will solve the issue.
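
The change above makes the cached stop words byte strings so they match byte-string input. An alternative that is often easier to reason about is to keep everything as text and decode at the boundaries. The sketch below illustrates that idea only; `load_stop_words` and `to_text` are hypothetical helpers, not part of goose, and the temporary file stands in for a real stop-word resource.

```python
import io
import os
import tempfile


def load_stop_words(path, encoding="utf-8"):
    """Read one stop word per line and return them as a set of text strings."""
    with io.open(path, encoding=encoding) as handle:
        return {line.strip() for line in handle if line.strip()}


def to_text(word, encoding="utf-8"):
    """Return word as text, decoding it first if it is a byte string."""
    return word.decode(encoding) if isinstance(word, bytes) else word


# Demo with a temporary stop-word file containing two Russian words.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with io.open(fd, "w", encoding="utf-8") as handle:
    handle.write(u"\u0438\n\u043d\u0435\n")  # "и", "не"

stop_words = load_stop_words(demo_path)
os.remove(demo_path)

# Both byte-string and text input are matched once normalized to text.
is_stop_from_bytes = to_text(u"\u043d\u0435".encode("utf-8")) in stop_words
is_stop_from_text = to_text(u"\u0438") in stop_words
```

Either direction works as long as the set and the words being looked up end up in the same type; mixing the two is what makes lookups fail silently.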

@grangier
Owner

I suppose your stopword file is not correctly encoded.
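
One way to test that hypothesis is to check whether the file's raw bytes decode cleanly as UTF-8. This is a rough sketch under assumptions: `is_valid_utf8` is a hypothetical helper, and the sample bytes below stand in for the contents of a stop-word file (in practice they would come from reading the file in binary mode). Windows-1251 is used only as an example of a common non-UTF-8 encoding for Russian text.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The Russian word "и" encoded two ways.
utf8_bytes = u"\u0438".encode("utf-8")
cp1251_bytes = u"\u0438".encode("cp1251")

print(is_valid_utf8(utf8_bytes), is_valid_utf8(cp1251_bytes))
```

A clean decode does not prove the file is UTF-8, but a failed decode does show it is not.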

@vladimir-shmidt
Author

I haven't changed anything in it.
