Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: relax filtering of heading elements with classnames that include the word "header" #868

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

inhumantsar
Copy link
Contributor

This removes header from unlikely and adds it to positive in an attempt to avoid filtering legitimate heading elements.

It does seem to improve parsing generally, even capturing some previously ignored metadata, but it does introduce a few unwanted artifacts.

Closes #855 and will likely have merge conflicts with #867 and #866

@gijsk
Copy link
Contributor

gijsk commented May 17, 2024

Hi @inhumantsar ! Thanks for investigating this. Did you mean to mark this as a work in progress and/or would you like feedback on this at this point?

@inhumantsar
Copy link
Contributor Author

Did you mean to mark this as a work in progress and/or would you like feedback on this at this point?

It should be complete but wanted to get the other PRs in before calling it ready.

I'll rebase it and make sure it's all good tonight or tomorrow, then mark it for review

@inhumantsar
Copy link
Contributor Author

inhumantsar commented May 19, 2024

ok so i had a chance to refresh my memory. i put this into draft until the PR with all of the publishedTime test case changes were in so that this PR wouldn't also include them.

unambiguously positive impacts:

  • removes a suggested stories section from cnet
  • adds a headings and review score information to engadget
  • removes "Supported by" from nytimes-2
  • adds dt elements to google-sre-book-1 (definitions seem a bit pointless without the terms they're defining!)
  • adds section headings to nytimes-4
  • captures the byline for wordpress

ambiguously positive impacts:

  • captures the subheading and author information for buzzfeed-1 but adds images (the URLs for which appear to be broken) and caption "Facebook" (as in the source of the image) to them.
  • replaces "Contents" in the excerpt field for mercurial with the page title.
  • adds the page title to mercurial (good thing) twice (not so good).
  • now identifies an author for simplyfound-1, but it captures this along with the date and several newlines: Joe Wee \n \n \n Monday, February 29, 2016 @ 11:10 PM UTC
  • correctly identifies the author for heraldsun-1, which was previously capturing the author from a related stories block but it captures it as by:\n\t\t\t\t\t\t\t\t\t\t Laurie Oakes
  • adds <meta> elements in the body of yahoo-3

negative impacts:

  • adds related coverage blocks to yahoo-3 as well as nytimes-1 and -2

another issue is that jsdom and JSDOMParser yield different results in testing for simplyfound-1:

  1) Test pages
       simplyfound-1
         jsdom
           should extract expected content:

      AssertionError: #readability-page-1 > div:nth-child(2) > p:nth-child(2) > i:nth-child(1)(in: ``<i></i> <span>
                <a href="http://fakehost/latest?author=joewee" rel="index,follow"><span>Joe Wee</span></a>&nbsp; <i></i> </span> &nbsp;&nbsp;<i></i> <span> Monday, February 29, 2016 @ 11:10 PM UTC </span> ``): expected 'i' to deeply equal '#text(The Raspberry Pi Foundation sta…'
      + expected - actual

      -i
      +#text(The Raspberry Pi Foundation started by a handful of volunteers in 2012 when they released the original Raspberry Pi 256MB Model B without knowing what to expect. In a short four-year period they have grown to over sixty full-time employees and have shipped over )


  2) Test pages
       simplyfound-1
         jsdom
           should extract expected byline:

      AssertionError: expected 'Joe Wee       \n          \n         …' to deeply equal 'Joe Wee       \n          \n         …'
      + expected - actual

      -Joe Wee       
      +Joe Wee       
                 
                   
                     Monday, February 29, 2016 @ 11:10 PM UTC
      

  3) Test pages
       simplyfound-1
         JSDOMParser
           should extract expected byline:
     AssertionError: expected null to deeply equal 'Joe Wee       \n          \n         …'

i can probably deal with these less-than-ideal captures with some simple heuristics but not sure if that should get its own PR. i don't know where to start with the jsdom vs JSDOMParser issue though.

@inhumantsar inhumantsar force-pushed the substack-headers-855 branch from 0bbbf9f to a2ef447 Compare May 19, 2024 23:39
@inhumantsar inhumantsar marked this pull request as ready for review May 19, 2024 23:59
@inhumantsar inhumantsar force-pushed the substack-headers-855 branch from a2ef447 to 30211ad Compare June 11, 2024 15:20
@juliomuhlbauer
Copy link

let's get this merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

H1 Headers ignored/skipped on https://www.astralcodexten.com/p/practically-a-book-review-rootclaim
3 participants