html #17
Comments
Hi @firmai, I'll be implementing table parsing in a future advanced parser. It's a bit frustrating: I've known how to extract nested hierarchy and tables from HTML, PDF, and TXT filings since September, and have prototyped it, but I can't justify the time expenditure (~2 months) without a revenue or funding source.
Why don't you put out a call for sponsorship on LinkedIn? I will share it as widely as I can, and I can also contribute by January. If you need some help on this, I'd be happy to set up an MS Teams call to collaborate. I think this would be a really good addition to your open source package.
Hey @firmai - that's a great suggestion! Let me think about it and get back to you.
I was wondering whether there is functionality to keep some of the HTML during extraction instead of wiping all of it. For 10-Ks, for example, it would be nice to know which pieces are tables, lists, headings, etc. This would preserve HTML tag information and probably some information about hierarchical relationships.
There is probably also some benefit in getting the row_id. If the text ever ends up in a vector database, which is what most of the use cases are for, one would like to point back to where in the filing the text came from.
It would be awesome to somehow get more hierarchy out of the dataset, à la these guys: https://github.com/alphanome-ai/sec-parser
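To make the request concrete, here is a minimal sketch of the kind of output meant above: each extracted block keeps its HTML tag, a sequential `row_id`, and the path of headings it sits under. It uses only Python's standard-library `html.parser` and is not how this package or sec-parser works. The names `BlockExtractor`, `extract_blocks`, `BLOCK_TAGS`, and the record fields are hypothetical, and the sketch assumes reasonably well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser

# Tags treated as block-level units (an illustrative choice, not the package's).
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"p", "li", "table"}


class BlockExtractor(HTMLParser):
    """Collect text blocks with their tag, a running row_id, and heading path."""

    def __init__(self):
        super().__init__()
        self.rows = []        # extracted records, in document order
        self._open = None     # tag of the block currently being collected
        self._depth = 0       # nesting depth of same-named tags inside the block
        self._buf = []        # text fragments of the current block
        self._headings = []   # stack of (level, text) for the current section

    def handle_starttag(self, tag, attrs):
        if self._open is None and tag in BLOCK_TAGS:
            self._open, self._depth, self._buf = tag, 1, []
        elif tag == self._open:
            self._depth += 1  # e.g. a table nested inside a table

    def handle_data(self, data):
        if self._open is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        self._depth -= 1
        if self._depth:
            return
        # Join fragments and collapse runs of whitespace.
        text = " ".join(" ".join(self._buf).split())
        self._open = None
        if not text:
            return
        if tag in HEADING_TAGS:
            level = int(tag[1])
            # Drop headings at the same or a deeper level, then push this one.
            while self._headings and self._headings[-1][0] >= level:
                self._headings.pop()
            path = [h for _, h in self._headings]
            self._headings.append((level, text))
        else:
            path = [h for _, h in self._headings]
        self.rows.append(
            {"row_id": len(self.rows), "tag": tag, "path": path, "text": text}
        )


def extract_blocks(html):
    """Return a list of block records for an HTML string."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows
```

With records like these, the `row_id` could be stored as metadata next to each embedding so that a retrieved chunk points back to its position in the filing, and `path` gives a rough section hierarchy. Real filings often encode headings as styled `<div>` or `<span>` elements rather than `<h1>`-`<h6>`, so a real parser would need heuristics well beyond this sketch.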