Adding more tools to the benchmark? #3
Comments
Hi @adbar, thanks for the pointers to the tools and the evaluation. Another tool, referenced elsewhere by @saippuakauppias, is https://github.com/go-shiori/go-readability. It would be great to add them; all that is needed is a script which outputs results in JSON. PRs are welcome, and I hope to have time to add more tools soon as well.
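For anyone picking this up, here is a minimal sketch of what such a script could look like. It is an illustration under assumptions, not the repository's actual interface: the `*.html.gz` input layout, the `articleBody` key, and the names `run_extractor` and `baseline_extract` are all hypothetical, so check them against the benchmark's README before opening a PR. The built-in baseline only uses the standard library; a real tool would be passed in as the `extract` callable.

```python
import gzip
import json
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Dict


class _TextCollector(HTMLParser):
    """Naive baseline: keep every text node outside <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def baseline_extract(html: str) -> str:
    """Return all visible text, one text node per line (no boilerplate removal)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def run_extractor(
    html_dir: Path,
    output_path: Path,
    extract: Callable[[str], str] = baseline_extract,
) -> Dict[str, dict]:
    """Run `extract` on every gzipped page and write one JSON file of results.

    Assumed (hypothetical) layout: html_dir/<item_id>.html.gz as input and
    {"<item_id>": {"articleBody": "<text>"}} as output.
    """
    results = {}
    for path in sorted(html_dir.glob("*.html.gz")):
        item_id = path.name[: -len(".html.gz")]
        with gzip.open(path, "rt", encoding="utf-8") as f:
            html = f.read()
        # Some extractors return None when nothing is found; store "" instead.
        results[item_id] = {"articleBody": extract(html) or ""}
    output_path.write_text(
        json.dumps(results, ensure_ascii=False, indent=2, sort_keys=True),
        encoding="utf-8",
    )
    return results
```

To evaluate an actual library, pass its extraction function instead of the baseline, for example `run_extractor(Path("html"), Path("output/trafilatura.json"), trafilatura.extract)` with trafilatura installed; the paths here are placeholders.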
Thanks for your answer. I've added JSON output to trafilatura and will check whether I can write a straightforward PR.
Hi @lopuhin, here is another tool that could be added: Mercury Parser.
Hi @lopuhin, just a quick follow-up: the benchmark could also be updated using the latest versions of the tools, see for instance the issue adbar/trafilatura#156.
Another tool to consider is Azure Immersive Reader, used in Microsoft Edge.
Seconded this, but also would like to see:
With the current advances in RAG with LLMs, I think these benchmarks are paramount for gathering information, and they are really due for an update.
Hi,
Thanks for your contribution, it's really useful to see evaluations on real-world data! There are further extraction tools for Python which this repository doesn't feature yet and which could be more efficient than some of the ones you're mentioning. You might have a look at:
- goose3
- jusText (especially with a custom configuration)
- inscriptis (html-to-txt conversion)
- trafilatura (disclaimer: I'm the author).
Or is there a reason why you didn't use them in the first place? I'd be curious to hear about it.
For more details please refer to the evaluation I've performed. The code including baselines is available here.