html #17

Open
firmai opened this issue Nov 19, 2024 · 3 comments
Comments

@firmai

firmai commented Nov 19, 2024

I was wondering whether there is a way to avoid stripping all the HTML during extraction. For 10-Ks, for example, it would be nice to know which parts are tables, lists, headings, etc. That would preserve HTML tag information and probably some information about hierarchical relationships.

There is probably also some benefit in getting a row_id: if the text ends up in a vector database, which is what most of the use cases are for, one would like to point back to where in the filing the text came from.

It would be awesome to get more hierarchy out of the dataset somehow, à la these guys: https://github.com/alphanome-ai/sec-parser
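To make the request concrete, here is a minimal sketch of the kind of output I have in mind. It is not this package's API: `BlockExtractor`, `extract_blocks`, and the `row_id` / `tag` / `path` / `text` field names are all hypothetical, and it only uses Python's standard-library `html.parser`. It assumes well-nested markup, which real filings often are not.

```python
from html.parser import HTMLParser

# Tags treated as text-bearing blocks (an assumption; real filings need more).
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th"}


class BlockExtractor(HTMLParser):
    """Collect text blocks with their tag, a sequential row_id, and heading path."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._open = []      # stack of (tag, text_parts) for open block tags
        self._headings = {}  # heading level -> most recent heading text

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._open.append((tag, []))

    def handle_data(self, data):
        # Text is attributed to the innermost open block tag only.
        if self._open:
            self._open[-1][1].append(data)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: only close the innermost open block.
        if not self._open or self._open[-1][0] != tag:
            return
        tag, parts = self._open.pop()
        text = " ".join("".join(parts).split())
        if not text:
            return
        if tag[0] == "h" and tag[1:].isdigit():
            level = int(tag[1:])
            # A new heading replaces any heading at the same or deeper level.
            self._headings = {k: v for k, v in self._headings.items() if k < level}
            path = [v for _, v in sorted(self._headings.items())]
            self._headings[level] = text
        else:
            path = [v for _, v in sorted(self._headings.items())]
        self.rows.append(
            {"row_id": len(self.rows), "tag": tag, "path": path, "text": text}
        )


def extract_blocks(html):
    """Return a list of block dicts in document order."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = (
    "<h1>Item 1. Business</h1><p>We sell widgets.</p>"
    "<h2>Products</h2><ul><li>Widget A</li></ul>"
    "<table><tr><td>Revenue</td><td>100</td></tr></table>"
)

for row in extract_blocks(SAMPLE):
    print(row)
```

Each row keeps the tag it came from and the headings above it, so a chunk stored in a vector database could carry `row_id` as metadata and be traced back to its place in the filing.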

@john-friedman
Owner

Hi @firmai, I'll be implementing table parsing in a future advanced parser.

It's a bit frustrating because I've known, and prototyped, how to extract nested hierarchy + tables from html, pdf, and txt filings since September, but I can't justify the time expenditure (~2 months) without a revenue/funding source.

@firmai
Author

firmai commented Nov 20, 2024

> Hi @firmai, I'll be implementing table parsing in a future advanced parser.
>
> It's a bit frustrating because I've known and prototyped how to extracted nested hierarchy + tables from html, pdf, and txt filings since September, but I can't justify the time expenditure (~ 2 months) without revenue/funding source.

Why don't you put out a call for sponsorship on LinkedIn? I will share it as widely as I can, and I can also contribute by January. Also, if you need some help with this, I'd be happy to set up an MS Teams call to collaborate. I think this would be a really good addition to your open source package.

@john-friedman
Owner

Hey @firmai - that's a great suggestion! Let me think about it and get back to you.
