bug about remove_markup #3520
Comments
Thank you for your report & code-to-reproduce! In looking over the code, the more basic problem might be that [...] It looks like the [...] I think it'd be better to tune those regexes to not assume the absence of all nested tags, but that might risk other side-effects, or require other re-ordering of steps. I'm not sure why the existing regexes work the way they do, and processing HTML or Wikipedia's weird [...] It might be most robust to move some form of [...]
Yes, my suggestion is not perfect. Do you have a better method for processing HTML or Wikipedia's weird wikitext format without regexes?
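The nested-tag concern raised in the comments above can be illustrated with a small, self-contained sketch that uses only Python's `re` module. Both patterns and the sample string are assumptions made up for illustration: `NAIVE_REF` is only a guess at the general shape of a non-greedy footnote regex (check the actual `RE_P1` definition in gensim's source before drawing conclusions), and `SAFER_REF` is one possible variant, not a proposed patch.

```python
import re

# Illustrative pattern (an assumption, see the actual RE_P1 in gensim's source):
# a <ref ...> is considered closed by whichever of "</ref>" or "/>" comes first.
NAIVE_REF = re.compile(r"<ref([> ].*?)(</ref>|/>)", re.DOTALL)

# Hypothetical variant: try the self-closing form <ref ... /> first, and
# otherwise require a real </ref> to close the footnote.
SAFER_REF = re.compile(r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL)

# Made-up sample: a footnote that contains a nested self-closing tag.
sample = 'population<ref name="a">source<br/>http://example.org/x</ref>, urban'

# The naive pattern stops at the "/>" of <br/>, so the rest of the footnote
# (including the URL) is left behind in the text.
naive_out = NAIVE_REF.sub("", sample)

# The variant removes the whole footnote.
safer_out = SAFER_REF.sub("", sample)

print(naive_out)  # populationhttp://example.org/x</ref>, urban
print(safer_out)  # population, urban
```

This variant is still a regex and would not cope with a `<ref>` nested inside another `<ref>`, so it only narrows the problem; it may also interact with the other cleanup steps in ways not explored here.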
Problem description
After calling gensim.corpora.wikicorpus.filter_wiki, some characters are still not stripped.
Before RE_P1 is applied, characters such as the following should be stripped.
Steps/code/corpus to reproduce
output:
2022年末,全省总人口为2347.69万人https://www.hongheiku.com/sjrk/1059.html,其中城镇常住人口1496.18万人,占总人口比重(常住人口城镇化率)为63.73%,比上年末提高0.37个百分点。户籍人口城镇化率为49.08%。全年出生人口10.23万人,出生率为4.33‰;死亡人口19.84万人,死亡率为8.40‰;自然增长率为-4.07‰。人口性别比为99.83(以女性为100)。
羅賓斯認為,此定義注重的不是以經濟學「研究某些行為」,而是要以分析的角度去「研究行為是如何被資源有限的條件所改變」。一些人批評此定義過度廣泛,而且無法將分析範疇侷限在對於市場的研究上。然而,自從1960年代起,由於理性選擇理論和其引發的賽局理論不斷將經濟學的研究領域擴張,這個定義已為世所認 Stigler, George J. (1984). "Economics—The Imperial Science?" ''Scandinavian Journal of Economics'', 86(3), pp. 301-313.,但仍有對此定義的批評。
=============
2022年末,全省总人口为2347.69万人,其中城镇常住人口1496.18万人,占总人口比重(常住人口城镇化率)为63.73%,比上年末提高0.37个百分点。户籍人口城镇化率为49.08%。全年出生人口10.23万人,出生率为4.33‰;死亡人口19.84万人,死亡率为8.40‰;自然增长率为-4.07‰。人口性别比为99.83(以女性为100)。
羅賓斯認為,此定義注重的不是以經濟學「研究某些行為」,而是要以分析的角度去「研究行為是如何被資源有限的條件所改變」。一些人批評此定義過度廣泛,而且無法將分析範疇侷限在對於市場的研究上。然而,自從1960年代起,由於理性選擇理論和其引發的賽局理論不斷將經濟學的研究領域擴張,這個定義已為世所認,但仍有對此定義的批評。
Versions
Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
Bits 64
NumPy 1.26.4
SciPy 1.12.0
gensim 4.3.2
FAST_VERSION 0
wiki text from zhwiki-20231201-pages-articles-multistream1.xml-p1p187712.bz2