Adding more tools to the benchmark? #3

adbar · 2020-06-26T15:46:06Z

Hi,

Thanks for your contribution, it's really useful to see evaluations on real-world data! There are further extraction tools for Python which this repository doesn't feature yet and which could be more efficient than some of the ones you're mentioning. You might have a look at

goose3
jusText (especially with a custom configuration)
inscriptis (html-to-txt conversion)
trafilatura (disclaimer: I'm the author).

Or is there a reason why you didn't use them in the first place? I'd be curious to hear about it.

For more details please refer to the evaluation I've performed. The code including baselines is available here.

lopuhin · 2020-06-29T07:51:14Z

Hi @adbar, thanks for the pointers to the tools and the evaluation. Another tool, referenced elsewhere by @saippuakauppias, is https://github.com/go-shiori/go-readability. It would be great to add them; we only need to write a script which outputs results in JSON. PRs are welcome, and I hope to have time to add more tools soon as well.
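
For anyone picking this up, here is a rough sketch of what such a script might look like. The JSON layout (item id mapped to an object with an `articleBody` field), the `html_dir` layout of one `.html` file per item, and the helper names `extract_text` and `run_extractor` are all assumptions made for illustration and should be checked against the repository before use. The placeholder extractor uses only the standard library; the call to the real tool would go in its place.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> str:
    """Hypothetical placeholder: swap in the call to the tool under test."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)


def run_extractor(html_dir: str, output_path: str) -> dict:
    """Run the extractor on every *.html file and write one JSON file.

    Assumed output layout: {"<item id>": {"articleBody": "<text>"}}.
    """
    results = {}
    for path in sorted(Path(html_dir).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        results[path.stem] = {"articleBody": extract_text(html)}
    Path(output_path).write_text(
        json.dumps(results, indent=2, sort_keys=True, ensure_ascii=False),
        encoding="utf-8",
    )
    return results
```

Calling `run_extractor("html", "output/mytool.json")` would then produce one results file for the tool; both paths here are made up.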

adbar · 2020-07-07T17:44:20Z

Thanks for your answer, I've added JSON to trafilatura and will check if I can write a straightforward PR.

adbar · 2021-09-14T12:36:44Z

Hi @lopuhin, here is another tool that could be added: Mercury Parser.
(source: adbar/trafilatura#114)

adbar · 2022-01-05T13:43:22Z

Hi @lopuhin, just a quick follow-up: the benchmark could also be updated using the latest versions of the tools, see for instance the issue adbar/trafilatura#156.

Seirdy · 2022-04-01T02:38:46Z

Another tool to consider is Azure Immersive Reader, used in Microsoft Edge.

BradKML · 2023-04-02T07:33:28Z

Seconding this, but I would also like to see:

which ones are better (F1/precision/accuracy/recall) relative to speed, in the same vein as the Squash Benchmark or Matt Mahoney's benchmarks for compression algorithms (since there will always be a tradeoff between extraction quality and speed)
bigger datasets for re-evaluating the benchmark, since having a larger diversity of articles from blogs may show a stronger use case
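
To make the quality-versus-speed idea concrete, here is a minimal sketch of how the two could be recorded together. It uses a simple bag-of-words overlap on whitespace tokens, which is an assumption for illustration and not necessarily the metric this benchmark computes; `token_prf` and `score_with_timing` are hypothetical names.

```python
import time
from collections import Counter


def token_prf(predicted: str, reference: str):
    """Precision, recall and F1 over whitespace tokens (multiset overlap)."""
    pred = Counter(predicted.split())
    ref = Counter(reference.split())
    overlap = sum((pred & ref).values())  # tokens shared by both texts
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def score_with_timing(extractor, html: str, reference: str) -> dict:
    """Run one extractor on one page, recording quality and wall-clock time."""
    start = time.perf_counter()
    text = extractor(html)
    elapsed = time.perf_counter() - start
    precision, recall, f1 = token_prf(text, reference)
    return {"precision": precision, "recall": recall, "f1": f1, "seconds": elapsed}
```

Averaging `f1` and summing `seconds` per tool over the whole dataset would give the two axes for a Squash-style scatter plot.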

BradKML · 2024-05-07T03:21:38Z

With the current advances in RAG with LLMs, I think these benchmarks would be paramount for information gathering, and they are really due for an update.
P.S. DragNet has a new fork now: https://github.com/currentslab/extractnet
