fix: relax filtering of heading elements with classnames that include the word "header" #868

Open
wants to merge 5 commits into
base: main
2 changes: 1 addition & 1 deletion .eslintrc.js
@@ -3,7 +3,7 @@

module.exports = {
"parserOptions": {
- "ecmaVersion": 6,
+ "ecmaVersion": 2017,
},
"env": {
"es6": true,
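
The bump from `ecmaVersion` 6 to 2017 lines up with the trailing commas this PR adds after the last call argument in `Readability.js`: that syntax was standardized in ES2017, so a parser configured for ES6 would presumably reject it. A minimal sketch of the syntax, assuming a hypothetical helper `concatLists` that stands in for `_concatNodeLists`:

```javascript
// Hypothetical stand-in for _concatNodeLists: flattens array-likes into one array.
function concatLists(...lists) {
  return lists.reduce((all, list) => all.concat(Array.from(list)), []);
}

// The trailing comma after the last argument is ES2017 syntax.
const headings = concatLists(
  ["h1-title"],
  ["h2-intro", "h2-summary"],
);

console.log(headings.length); // 3
```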
15 changes: 10 additions & 5 deletions Readability.js
@@ -122,10 +122,10 @@ Readability.prototype = {
REGEXPS: {
// NOTE: These two regular expressions are duplicated in
// Readability-readerable.js. Please keep both copies in sync.
- unlikelyCandidates: /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i,
+ unlikelyCandidates: /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i,
okMaybeItsACandidate: /and|article|body|column|content|main|shadow/i,

- positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i,
+ positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story|header/i,
negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i,
extraneous: /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility/i,
byline: /byline|author|dateline|writtenby|p-author/i,
@@ -495,7 +495,7 @@ Readability.prototype = {
// could assume it's the full title.
var headings = this._concatNodeLists(
doc.getElementsByTagName("h1"),
- doc.getElementsByTagName("h2")
+ doc.getElementsByTagName("h2"),
);
var trimmedTitle = curTitle.trim();
var match = this._someNode(headings, function(heading) {
@@ -1393,7 +1393,7 @@ Readability.prototype = {
if (!parsed["@type"] && Array.isArray(parsed["@graph"])) {
parsed = parsed["@graph"].find(function(it) {
return (it["@type"] || "").match(
- this.REGEXPS.jsonLdArticleTypes
+ this.REGEXPS.jsonLdArticleTypes,
);
});
}
@@ -1563,6 +1563,8 @@ Readability.prototype = {
metadata.siteName = this._unescapeHtmlEntities(metadata.siteName);
metadata.publishedTime = this._unescapeHtmlEntities(metadata.publishedTime);

+ this.log("getArticleMetadata complete", metadata);
+
return metadata;
},

@@ -2352,7 +2354,7 @@ Readability.prototype = {
}

var textContent = articleContent.textContent;
- return {
+ var parsedArticle = {
title: this._articleTitle,
byline: metadata.byline || this._articleByline,
dir: this._articleDir,
@@ -2364,6 +2366,9 @@ Readability.prototype = {
siteName: metadata.siteName || this._articleSiteName,
publishedTime: metadata.publishedTime,
};
+
+ this.log("parse complete", parsedArticle);
+ return parsedArticle;
},
};
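
The effect of the `REGEXPS` change in `Readability.js` can be illustrated with a small standalone sketch. The patterns are copied from the diff; everything else is an assumption for illustration: `looksUnlikely` is a simplified stand-in for the candidate-stripping check (the real check also looks at tag names and ancestors), and the class name and id are hypothetical.

```javascript
// Patterns copied from the diff: "old" is before this PR, "new" is after.
const oldUnlikely = /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|header|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i;
const newUnlikely = /-ad-|ai2html|banner|breadcrumbs|combx|comment|community|cover-wrap|disqus|extra|footer|gdpr|legends|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|yom-remote/i;
const okMaybeItsACandidate = /and|article|body|column|content|main|shadow/i;
const oldPositive = /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i;
const newPositive = /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story|header/i;

// Assumption: simplified model of the "unlikely candidate" test, applied to
// a string built from an element's class name and id.
function looksUnlikely(matchString, unlikelyCandidates) {
  return unlikelyCandidates.test(matchString) &&
    !okMaybeItsACandidate.test(matchString);
}

// Hypothetical class name and id for a wrapper around a heading.
const matchString = "section-header" + " " + "intro";

const result = {
  strippedBefore: looksUnlikely(matchString, oldUnlikely),
  strippedAfter: looksUnlikely(matchString, newUnlikely),
  positiveBefore: oldPositive.test(matchString),
  positiveAfter: newPositive.test(matchString),
};

console.log(result);
```

Under these assumptions, the element is flagged as unlikely only with the old pattern, and it matches `positive` only with the new one. That is consistent with the updated expectations in the test pages, where headings and header blocks now survive extraction.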

2 changes: 1 addition & 1 deletion test/test-pages/buzzfeed-1/expected-metadata.json
@@ -1,6 +1,6 @@
{
"title": "Student Dies After Diet Pills She Bought Online \"Burned Her Up From Within\"",
- "byline": null,
+ "byline": "Mark Di Stefano\n BuzzFeed News Reporter",
"dir": null,
"lang": "en",
"excerpt": "An inquest into Eloise Parry's death has been adjourned until July.",
85 changes: 53 additions & 32 deletions test/test-pages/buzzfeed-1/expected.html
@@ -1,40 +1,61 @@
<div id="readability-page-1" class="page">
<div id="buzz_sub_buzz">
<div id="superlist_3758406_5547137" rel:buzz_num="1">
<h2>The mother of a woman who took suspected diet pills bought online has described how her daughter was “literally burning up from within” moments before her death.</h2>
<p> <span>West Merica Police</span></p>
</div>
<div id="superlist_3758406_5547213" rel:buzz_num="2">
<p>Eloise Parry, 21, was taken to Royal Shrewsbury hospital on 12 April after taking a lethal dose of highly toxic “slimming tablets”. </p>
<p>“The drug was in her system, there was no anti-dote, two tablets was a lethal dose – and she had taken eight,” her mother, Fiona, <a href="https://www.westmercia.police.uk/article/9501/A-tribute-to-Eloise-Aimee-Parry-written-by-her-mother-Fiona-Parry">said in a statement</a> yesterday.</p>
<p>“As Eloise deteriorated, the staff in A&amp;E did all they could to stabilise her. As the drug kicked in and started to make her metabolism soar, they attempted to cool her down, but they were fighting an uphill battle.</p>
<p>“She was literally burning up from within.”</p>
<p>She added: “They never stood a chance of saving her. She burned and crashed.”</p>
</div>
<div id="superlist_3758406_5547140" rel:buzz_num="3">
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" height="412" width="203" /></p>
</div>
<p>Facebook</p>
<div rel:bf_bucket="track" track="{&quot;c&quot;:&quot;7FNW2J7&quot;,&quot;u&quot;:&quot;7717MJ7&quot;,&quot;buzz&quot;:&quot;diet-pills-burns-up&quot;,&quot;user&quot;:&quot;markdistefano&quot;,&quot;types&quot;:[100],&quot;queries&quot;:[]}">
<header id="post-3758406" rel:bf_bucket="track" track="{&quot;c&quot;:&quot;7FNW2J7&quot;,&quot;u&quot;:&quot;7717MJ7&quot;,&quot;buzz&quot;:&quot;diet-pills-burns-up&quot;,&quot;user&quot;:&quot;markdistefano&quot;,&quot;types&quot;:[100],&quot;queries&quot;:[]}" rel:ptool="true" rel:ptool_code="0.0.1.2.0.0" rel:owner="markdistefano" rel:advertiser="0" rel:partner="0" rel:data="{&quot;buzz_id&quot;:&quot;3758406&quot;,&quot;type&quot;:&quot;super&quot;,&quot;uri&quot;:&quot;diet-pills-burns-up&quot;,&quot;form_id&quot;:&quot;20&quot;,&quot;category&quot;:&quot;UKNews&quot;}" rel:ptool_stats="{&quot;impressions&quot;:&quot;653,817&quot;,&quot;email_shares&quot;:&quot;81&quot;,&quot;pinterest_count&quot;:&quot;&quot;,&quot;twitter_count&quot;:&quot;251&quot;,&quot;viral_lift&quot;:&quot;1.7X&quot;,&quot;facebook_count&quot;:&quot;665&quot;}">
<div id="buzz_header" rel:gt_cat="[ttp]:header">
<hgroup>
<a name="post-title"></a>
<p>
<b>An inquest into Eloise Parry’s death has been adjourned until July.</b>
</p>
<span>
<span id="update_posted_time_3758406">posted on April 21, 2015, at 11:29 a.m.</span>
</span>
</hgroup>
</div>
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" height="412" width="412" /></p>
</header>
<div data-print="body" rel:gt_cat="[ttp]:content">
<div id="buzz_sub_buzz">
<div id="superlist_3758406_5547137" rel:buzz_num="1">
<h2>The mother of a woman who took suspected diet pills bought online has described how her daughter was “literally burning up from within” moments before her death.</h2>
<p> <span>West Merica Police</span></p>
</div>
<div id="superlist_3758406_5547213" rel:buzz_num="2">
<p>Eloise Parry, 21, was taken to Royal Shrewsbury hospital on 12 April after taking a lethal dose of highly toxic “slimming tablets”. </p>
<p>“The drug was in her system, there was no anti-dote, two tablets was a lethal dose – and she had taken eight,” her mother, Fiona, <a href="https://www.westmercia.police.uk/article/9501/A-tribute-to-Eloise-Aimee-Parry-written-by-her-mother-Fiona-Parry">said in a statement</a> yesterday.</p>
<p>“As Eloise deteriorated, the staff in A&amp;E did all they could to stabilise her. As the drug kicked in and started to make her metabolism soar, they attempted to cool her down, but they were fighting an uphill battle.</p>
<p>“She was literally burning up from within.”</p>
<p>She added: “They never stood a chance of saving her. She burned and crashed.”</p>
</div>
<div id="superlist_3758406_5547140" rel:buzz_num="3">
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608056-15.jpg" height="412" width="203" /></p>
</div>
<p>Facebook</p>
</div>
<div>
<div>
<p><img src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" rel:bf_image_src="http://ak-hdl.buzzfed.com/static/2015-04/21/5/enhanced/webdr12/grid-cell-2501-1429608057-18.jpg" height="412" width="412" /></p>
</div>
<p>Facebook</p>
</div>
</div>
<div id="superlist_3758406_5547284" rel:buzz_num="4">
<p>West Mercia police <a href="https://www.westmercia.police.uk/article/9500/Warning-Issued-As-Shrewsbury-Woman-Dies-After-Taking-Suspected-Diet-Pills">said the tablets were believed to contain dinitrophenol</a>, known as DNP, which is a highly toxic industrial chemical. </p>
<p>“We are undoubtedly concerned over the origin and sale of these pills and are working with partner agencies to establish where they were bought from and how they were advertised,” said chief inspector Jennifer Mattinson from the West Mercia police.</p>
<p>The Food Standards Agency warned people to stay away from slimming products that contained DNP.</p>
<p>“We advise the public not to take any tablets or powders containing DNP, as it is an industrial chemical and not fit for human consumption,” it said in a statement.</p>
</div>
<div id="superlist_3758406_5547219" rel:buzz_num="5">
<h2>Fiona Parry issued a plea for people to stay away from pills containing the chemical.</h2>
<p>“[Eloise] just never really understood how dangerous the tablets that she took were,” she said. “Most of us don’t believe that a slimming tablet could possibly kill us.</p>
<p>“DNP is not a miracle slimming pill. It is a deadly toxin.”</p>
</div>
<p>Facebook</p>
</div>
<p><a href="http://buzzfeed.com/"><b>Check out more articles on BuzzFeed.com!</b></a></p>
</div>
<div id="superlist_3758406_5547284" rel:buzz_num="4">
<p>West Mercia police <a href="https://www.westmercia.police.uk/article/9500/Warning-Issued-As-Shrewsbury-Woman-Dies-After-Taking-Suspected-Diet-Pills">said the tablets were believed to contain dinitrophenol</a>, known as DNP, which is a highly toxic industrial chemical. </p>
<p>“We are undoubtedly concerned over the origin and sale of these pills and are working with partner agencies to establish where they were bought from and how they were advertised,” said chief inspector Jennifer Mattinson from the West Mercia police.</p>
<p>The Food Standards Agency warned people to stay away from slimming products that contained DNP.</p>
<p>“We advise the public not to take any tablets or powders containing DNP, as it is an industrial chemical and not fit for human consumption,” it said in a statement.</p>
</div>
<div id="superlist_3758406_5547219" rel:buzz_num="5">
<h2>Fiona Parry issued a plea for people to stay away from pills containing the chemical.</h2>
<p>“[Eloise] just never really understood how dangerous the tablets that she took were,” she said. “Most of us don’t believe that a slimming tablet could possibly kill us.</p>
<p>“DNP is not a miracle slimming pill. It is a deadly toxin.”</p>
<div>
<p>Mark di Stefano is a breaking news reporter for BuzzFeed News and is based in Sydney, Australia. </p>
</div>
</div>
</div>
21 changes: 21 additions & 0 deletions test/test-pages/engadget/expected.html
@@ -11,6 +11,24 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
</div>
</section>
<div>
+ <h4>Engadget Score <div>
+ <figure>
+ <div data-rating-from="1" data-rating-to="55">
+ <p><span>Poor</span></p>
+ </div>
+ <div data-rating-from="55" data-rating-to="70">
+ <p><span>Uninspiring</span></p>
+ </div>
+ <div data-rating-from="70" data-rating-to="85">
+ <p><span>Good</span></p>
+ </div>
+ <div data-rating-from="85" data-rating-to="100">
+ <p><span>Excellent</span></p>
+ </div>
+ <figcaption>Key</figcaption>
+ </figure>
+ </div>
+ </h4>
<div>
<div>
<p><span>from</span>&nbsp;<span>$610.00</span>
@@ -22,13 +40,15 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
</div>
<div>
<div>
+ <h5>Pros</h5>
<ul>
<li>Most powerful hardware ever in a home console </li>
<li>Solid selection of enhanced titles </li>
<li>4K Blu-ray drive is great for movie fans </li>
</ul>
</div>
<div>
+ <h5>Cons</h5>
<ul>
<li>Expensive </li>
<li>Not worth it if you don’t have a 4K TV </li>
@@ -37,6 +57,7 @@ <h4> Gallery: Xbox One X | 14 Photos </h4>
</div>
</div>
<div>
+ <h4>Summary</h4>
<p>As promised, the Xbox One X is the most powerful game console ever. In practice, though, it really just puts Microsoft on equal footing with Sony’s PlayStation 4 Pro. 4K/HDR enhanced games look great, but it’s lack of VR is disappointing in 2017.</p>
</div>
</div>