Not extracting UL LI text #103
Comments
@raistie please provide the URLs that fail.
Duplicate of #42.
http://psychcentral.com/blog/archives/2014/05/08/5-ways-to-stop-a-worry-filled-what-if-cycle/
The problem is that the ul and li tags are cleaned before the main node is detected (to remove menus and all the other useless nodes). If we keep the ul and li tags, we may add a lot of unwanted content to the extracted text. I'm not sure how we can deal with it.
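One possible way to tell article lists from menus before the cleaning step is a link-density check: a menu is mostly anchor text, while a "5 ways to..." list is mostly plain text. The sketch below is only an illustration of that idea using the standard library; it is not how the extractor currently works, and the names `ListLinkDensity`, `keep_list_flags` and the `max_link_density` threshold are hypothetical.

```python
from html.parser import HTMLParser


class ListLinkDensity(HTMLParser):
    """Collect [total_chars, link_chars] for each outermost <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0  # nesting depth inside <ul>/<ol>
        self.link_depth = 0  # nesting depth inside <a>, counted only within a list
        self.lists = []      # one [total_chars, link_chars] pair per outermost list

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            if self.list_depth == 0:
                self.lists.append([0, 0])
            self.list_depth += 1
        elif tag == "a" and self.list_depth:
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_depth:
            self.list_depth -= 1
            if self.list_depth == 0:
                # Guard against an unclosed <a> leaking into the next list.
                self.link_depth = 0
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.list_depth:
            n = len(data.strip())
            self.lists[-1][0] += n
            if self.link_depth:
                self.lists[-1][1] += n


def keep_list_flags(html, max_link_density=0.5):
    """Return one bool per outermost list: True if it looks like article content.

    The 0.5 threshold is an arbitrary assumption, not a tuned value.
    """
    parser = ListLinkDensity()
    parser.feed(html)
    parser.close()
    return [
        total > 0 and link / total <= max_link_density
        for total, link in parser.lists
    ]


# Made-up sample markup, not taken from the page linked above.
menu = '<ul><li><a href="/">Home</a></li><li><a href="/about">About us</a></li></ul>'
points = '<ul><li>First point of the article.</li><li>Second point. <a href="#">More</a></li></ul>'
flags = keep_list_flags(menu + points)
```

With a check like this, the cleaner could drop only the lists flagged `False` and leave the others for main-node detection. Whether a single threshold holds up on real pages is an open question.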
I think the thing is that these days a lot of articles have titles like "10 ways to..." or "5 steps to..." and present their content in list form. As more and more articles are written that way, the extractor will miss many of the bullet points in the article, and with them important information.
Just tried boilerpipe - it extracts the bulleted points OK.
Bulleted points in articles are not extracted and are missing entirely from the extracted text.