html #17

firmai · 2024-11-19T20:18:29Z

Is there a way to keep some of the HTML during extraction instead of wiping all of it? For the 10-Ks, for example, it would be nice to know which text comes from tables, lists, headings, etc. That would preserve the HTML tag information and probably some information about hierarchical relationships.

There is probably also some benefit in exposing a row_id. If the text ends up in a vector database, which is what most of the use cases are for, one would like to point back to where in the filing the text came from.
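To make the request concrete, here is a minimal sketch of the kind of output meant above, using only Python's standard-library `html.parser`. It is an illustration, not how this package works: the names `TaggedTextExtractor`, `extract_rows`, `BLOCK_TAGS`, and `VOID_TAGS`, and the record fields `row_id`, `tag`, `path`, and `text`, are all hypothetical.

```python
from html.parser import HTMLParser

# Tags used to label text; an illustrative choice, not a fixed schema.
BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li", "td", "th"}
# Void elements never get an end tag, so they are kept off the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TaggedTextExtractor(HTMLParser):
    """Collect text chunks with their tag, ancestor path, and a row_id."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.rows = []   # extracted records (hypothetical schema)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        # Label the text with the nearest enclosing block-level tag.
        block = next((t for t in reversed(self.stack) if t in BLOCK_TAGS), None)
        if block is None:
            return
        self.rows.append({
            "row_id": len(self.rows),        # position in extraction order
            "tag": block,                    # e.g. "h2", "li", "td"
            "path": "/".join(self.stack),    # hierarchy, e.g. "table/tr/td"
            "text": text,
        })


def extract_rows(html):
    parser = TaggedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = """
<h2>Item 1A. Risk Factors</h2>
<ul><li>Supply chain risk</li><li>Currency risk</li></ul>
<table><tr><th>Year</th><td>2023</td></tr></table>
"""

for row in extract_rows(sample):
    print(row)
```

Each record keeps the tag and the ancestor path, so a chunk stored in a vector database could be traced back through its `row_id`. Text split by inline tags such as `<b>` comes out as separate rows in this sketch; a real parser would likely merge them.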

It would be awesome to get more hierarchy out of the dataset somehow, à la these guys: https://github.com/alphanome-ai/sec-parser

john-friedman · 2024-11-19T20:42:56Z

Hi @firmai, I'll be implementing table parsing in a future advanced parser.

It's a bit frustrating because I've known and prototyped how to extract nested hierarchy + tables from html, pdf, and txt filings since September, but I can't justify the time expenditure (~2 months) without a revenue/funding source.

firmai · 2024-11-20T12:55:45Z

> Hi @firmai, I'll be implementing table parsing in a future advanced parser.
>
> It's a bit frustrating because I've known and prototyped how to extract nested hierarchy + tables from html, pdf, and txt filings since September, but I can't justify the time expenditure (~2 months) without a revenue/funding source.

Why don't you put out a call for sponsorship on LinkedIn? I will share it as far as I can, and can also contribute by January. Also, if you need some help on this, I'd be happy to set up an MS Teams call to collaborate. I think this would be a really good addition to your open source package.

john-friedman · 2024-11-20T22:33:12Z

Hey @firmai - that's a great suggestion! Let me think about it and get back to you.
