fix: relax filtering of heading elements with classnames that include the word "header" #868

inhumantsar · 2024-05-04T22:46:52Z

This removes header from unlikely and adds it to positive in an attempt to avoid filtering legitimate heading elements.

It does seem to improve parsing generally, even capturing some previously ignored metadata, but it does introduce a few unwanted artifacts.

Closes #855 and will likely have merge conflicts with #867 and #866

gijsk · 2024-05-17T16:23:09Z

Hi @inhumantsar ! Thanks for investigating this. Did you mean to mark this as a work in progress and/or would you like feedback on this at this point?

inhumantsar · 2024-05-17T22:07:14Z

Did you mean to mark this as a work in progress and/or would you like feedback on this at this point?

It should be complete but wanted to get the other PRs in before calling it ready.

I'll rebase it and make sure it's all good tonight or tomorrow, then mark it for review

inhumantsar · 2024-05-19T23:34:30Z

ok so i had a chance to refresh my memory. i put this into draft until the PR with all of the publishedTime test case changes were in so that this PR wouldn't also include them.

unambiguously positive impacts:

removes a suggested stories section from cnet
adds a headings and review score information to engadget
removes "Supported by" from nytimes-2
adds dt elements to google-sre-book-1 (definitions seem a bit pointless without the terms they're defining!)
adds section headings to nytimes-4
captures the byline for wordpress

ambiguously positive impacts:

captures the subheading and author information for buzzfeed-1 but adds images (the URLs for which appear to be broken) and caption "Facebook" (as in the source of the image) to them.
replaces "Contents" in the excerpt field for mercurial with the page title.
adds the page title to mercurial (good thing) twice (not so good).
now identifies an author for simplyfound-1, but it captures this along with the date and several newlines: Joe Wee \n \n \n Monday, February 29, 2016 @ 11:10 PM UTC
correctly identifies the author for heraldsun-1, which was previously capturing the author from a related stories block but it captures it as by:\n\t\t\t\t\t\t\t\t\t\t Laurie Oakes
adds <meta> elements in the body of yahoo-3

negative impacts:

adds related coverage blocks to yahoo-3 as well as nytimes-1 and -2

another issue is that jsdom and JSDOMParser yield different results in testing for simplyfound-1:

  1) Test pages
       simplyfound-1
         jsdom
           should extract expected content:

      AssertionError: #readability-page-1 > div:nth-child(2) > p:nth-child(2) > i:nth-child(1)(in: ``<i></i> <span>
                <a href="http://fakehost/latest?author=joewee" rel="index,follow"><span>Joe Wee</span></a>&nbsp; <i></i> </span> &nbsp;&nbsp;<i></i> <span> Monday, February 29, 2016 @ 11:10 PM UTC </span> ``): expected 'i' to deeply equal '#text(The Raspberry Pi Foundation sta…'
      + expected - actual

      -i
      +#text(The Raspberry Pi Foundation started by a handful of volunteers in 2012 when they released the original Raspberry Pi 256MB Model B without knowing what to expect. In a short four-year period they have grown to over sixty full-time employees and have shipped over )


  2) Test pages
       simplyfound-1
         jsdom
           should extract expected byline:

      AssertionError: expected 'Joe Wee       \n          \n         …' to deeply equal 'Joe Wee       \n          \n         …'
      + expected - actual

      -Joe Wee       
      +Joe Wee       
                 
                   
                     Monday, February 29, 2016 @ 11:10 PM UTC
      

  3) Test pages
       simplyfound-1
         JSDOMParser
           should extract expected byline:
     AssertionError: expected null to deeply equal 'Joe Wee       \n          \n         …'

i can probably deal with these less-than-ideal captures with some simple heuristics but not sure if that should get its own PR. i don't know where to start with the jsdom vs JSDOMParser issue though.

juliomuhlbauer · 2024-12-19T11:44:59Z

let's get this merged!

inhumantsar force-pushed the substack-headers-855 branch from 0bbbf9f to a2ef447 Compare May 19, 2024 23:39

inhumantsar marked this pull request as ready for review May 19, 2024 23:59

inhumantsar added 3 commits June 11, 2024 10:17

fix: less strict filtering of headers

b8c640b

derp

ba0b0ee

update test cases, add a couple of logs

30211ad

inhumantsar force-pushed the substack-headers-855 branch from a2ef447 to 30211ad Compare June 11, 2024 15:20

inhumantsar added 2 commits June 11, 2024 10:31

dangling comma

6c5b0cc

fix linting

b259c95

danielnixon mentioned this pull request Dec 19, 2024

Article on substack skips the headings #928

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: relax filtering of heading elements with classnames that include the word "header" #868

fix: relax filtering of heading elements with classnames that include the word "header" #868

inhumantsar commented May 4, 2024

gijsk commented May 17, 2024

inhumantsar commented May 17, 2024

inhumantsar commented May 19, 2024 •

edited

Loading

juliomuhlbauer commented Dec 19, 2024

fix: relax filtering of heading elements with classnames that include the word "header" #868

Are you sure you want to change the base?

fix: relax filtering of heading elements with classnames that include the word "header" #868

Conversation

inhumantsar commented May 4, 2024

gijsk commented May 17, 2024

inhumantsar commented May 17, 2024

inhumantsar commented May 19, 2024 • edited Loading

juliomuhlbauer commented Dec 19, 2024

inhumantsar commented May 19, 2024 •

edited

Loading