Unicode encoding problems while checking stop words #136

vladimir-shmidt · 2014-08-10T15:44:19Z

I tried to extract a Russian article, but goose produced an empty result. While debugging, I found that the extracted content (text from a p tag) is not found in the loaded stop list, even though the word is definitely in that list. So I suspect a string equality problem in Python or something similar. In the bottom-right corner I've added watch items: the current word, the result of the membership check against the set, and the position of the current word in the stop list.

vladimir-shmidt · 2014-08-10T15:57:19Z

I suppose the following change in class StopWords(object):
self._cached_stop_words[language] = set(FileHelper.loadResourceFile(path).encode('utf-8').splitlines())
will solve the issue.
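
For illustration, here is a minimal sketch of the kind of mismatch described above, written for Python 3. It assumes the cause is that the stop-word set holds text strings while the extracted word is a UTF-8 byte string (or the reverse). The helpers `load_stop_words` and `is_stop_word` are hypothetical names for this example only and are not part of goose.

```python
def load_stop_words(raw_bytes, encoding="utf-8"):
    """Decode the raw bytes of a stop-word file into a set of text strings."""
    return set(raw_bytes.decode(encoding).splitlines())


def is_stop_word(word, stop_words, encoding="utf-8"):
    """Check membership after making sure the word is a text string."""
    if isinstance(word, bytes):
        word = word.decode(encoding)
    return word in stop_words


# Stand-in for the contents of a Russian stop-word file (assumed sample data).
raw = "и\nв\nне\n".encode("utf-8")
stop_words = load_stop_words(raw)

# The same word as UTF-8 bytes, as it might arrive from another code path.
byte_word = "не".encode("utf-8")

# Bytes never compare equal to text, so a direct lookup misses.
mismatch = byte_word in stop_words

# Normalizing the word to text before the lookup finds it.
match = is_stop_word(byte_word, stop_words)
```

The line proposed above normalizes in the other direction, by encoding the loaded list to bytes. Either direction should work, as long as the set and the words being looked up end up as the same type.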

grangier · 2014-08-11T11:25:56Z

I suppose your stopword file is not correctly encoded.
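
One way to test that hypothesis is to check whether the file's raw bytes decode as UTF-8 and whether they start with a byte order mark. The sketch below assumes the file has been read in binary mode; `check_utf8` is a hypothetical helper, not something goose provides.

```python
import codecs


def check_utf8(data):
    """Return (is_valid_utf8, has_bom) for the raw bytes of a file."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False, has_bom
    return True, has_bom
```

It could be called as `check_utf8(open(path, "rb").read())`, where `path` points at the stop-word file in question.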

vladimir-shmidt · 2014-08-11T14:50:54Z

I haven't changed anything in it.

vladimir-shmidt mentioned this issue Aug 11, 2014

Switching to beautifulsoup4 for Python 3 support? #71
