Not extracting UL LI text #103

raistie · 2014-05-08T14:59:01Z

Bulleted points in articles are not extracted; they are missing entirely from the extracted text.

grangier · 2014-05-08T20:43:54Z

@raistie please provide URLs that fail.

grangier · 2014-05-08T20:48:01Z

Duplicate of #42

raistie · 2014-05-09T00:10:39Z

http://psychcentral.com/blog/archives/2014/05/08/5-ways-to-stop-a-worry-filled-what-if-cycle/
is one of the many I have tried.

grangier · 2014-05-09T10:14:41Z

The problem is that the ul and li tags are cleaned before the main node is detected (to remove menus and other useless nodes). If we keep the ul and li tags, we may add a lot of unwanted content to the extracted text. I'm not sure how we can deal with it.
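
One possible way to approach the trade-off described above, sketched here as an assumption and not as anything the extractor currently does: instead of dropping every ul before main-node detection, keep only the lists whose text is mostly plain text and drop the ones that are mostly links (menus, navigation). The class `ListLinkDensity`, the function `content_lists`, and the `max_link_density` threshold of 0.5 are hypothetical names and values chosen for illustration; the sketch uses only the Python standard library.

```python
from html.parser import HTMLParser


class ListLinkDensity(HTMLParser):
    """For each <ul>, collect its text and the share of that text inside <a> tags."""

    def __init__(self):
        super().__init__()
        self._stack = []    # one [text_parts, link_chars] entry per open <ul>
        self._a_depth = 0   # greater than 0 while inside an <a> element
        self.lists = []     # (text, link_density) per closed <ul>, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self._stack.append([[], 0])
        elif tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth:
            self._a_depth -= 1
        elif tag == "ul" and self._stack:
            parts, link_chars = self._stack.pop()
            total = sum(len(p) for p in parts)
            # An empty list is treated as pure navigation (density 1.0).
            density = link_chars / total if total else 1.0
            self.lists.append((" ".join(parts), density))
            if self._stack:
                # Nested list: fold its counts into the enclosing <ul>.
                self._stack[-1][0].extend(parts)
                self._stack[-1][1] += link_chars

    def handle_data(self, data):
        chunk = data.strip()
        if chunk and self._stack:
            self._stack[-1][0].append(chunk)
            if self._a_depth:
                self._stack[-1][1] += len(chunk)


def content_lists(html, max_link_density=0.5):
    """Return the text of every <ul> whose link density is at or below the threshold."""
    parser = ListLinkDensity()
    parser.feed(html)
    parser.close()
    return [text for text, density in parser.lists if density <= max_link_density]
```

With a heuristic like this, a menu made only of links would be discarded while a "5 ways to..." style list of plain sentences would survive. The threshold would need tuning on real pages, and lists with a few inline links per item sit in a grey area.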

raistie · 2014-05-09T17:41:52Z

I think the thing is that these days a lot of articles have titles like "10 ways to...", "5 steps to..." and present their content in list form.

As more and more articles are written that way, the extractor will miss a lot of the bullet points in those articles, and with them important information.

raistie · 2014-05-10T04:37:35Z

Just tried boilerpipe - it extracts the bulleted points ok.

grangier closed this as completed May 10, 2014

codelucas mentioned this issue May 10, 2014

Not extracting UL LI text codelucas/newspaper#50

Closed
