fix: relax filtering of heading elements with classnames that include the word "header" #868
base: main
Hi @inhumantsar! Thanks for investigating this. Did you mean to mark this as a work in progress, and/or would you like feedback on this at this point?
It should be complete, but I wanted to get the other PRs in before calling it ready. I'll rebase it and make sure it's all good tonight or tomorrow, then mark it for review.
ok so i had a chance to refresh my memory. i put this into draft until the PR with all of the unambiguously positive impacts:
ambiguously positive impacts:
negative impacts:
another issue is that
i can probably deal with these less-than-ideal captures with some simple heuristics, but not sure if that should get its own PR. i don't know where to start with the
0bbbf9f to a2ef447
a2ef447 to 30211ad
let's get this merged!
This removes "header" from unlikely and adds it to positive in an attempt to avoid filtering legitimate heading elements.
It does seem to improve parsing generally, even capturing some previously ignored metadata, but it does introduce a few unwanted artifacts.
Closes #855 and will likely have merge conflicts with #867 and #866
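
For readers unfamiliar with classname-based filtering, here is a rough sketch of the kind of change described above. It is not the project's code: the keyword lists are hypothetical and heavily shortened, and `class_score` and the score value of 25 are assumptions made only for illustration. It shows a heading whose class contains "header" being dropped while the word is in the unlikely list, then kept once the word moves to the positive list.

```python
import re

# Hypothetical, shortened keyword lists (assumptions, not the project's real patterns).
UNLIKELY_BEFORE = re.compile(r"banner|comment|footer|header|sidebar", re.I)
POSITIVE_BEFORE = re.compile(r"article|body|content|entry", re.I)

# After the change: "header" leaves the unlikely list and joins the positive list.
UNLIKELY_AFTER = re.compile(r"banner|comment|footer|sidebar", re.I)
POSITIVE_AFTER = re.compile(r"article|body|content|entry|header", re.I)


def class_score(class_name, unlikely, positive):
    """Return None if the element would be filtered out, else a score bonus.

    Hypothetical helper: an element is dropped when its class matches the
    unlikely list and nothing in the positive list rescues it.
    """
    if unlikely.search(class_name) and not positive.search(class_name):
        return None
    return 25 if positive.search(class_name) else 0


heading_class = "post-header-title"
before = class_score(heading_class, UNLIKELY_BEFORE, POSITIVE_BEFORE)
after = class_score(heading_class, UNLIKELY_AFTER, POSITIVE_AFTER)

# A site-wide header now receives the same bonus as a legitimate heading.
site_header = class_score("site-header", UNLIKELY_AFTER, POSITIVE_AFTER)
```

In this sketch the same relaxation that rescues "post-header-title" also rewards "site-header", which may be one way the unwanted artifacts mentioned above arise.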