-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore invalid jsonld elements on the page source. #189
base: master
Are you sure you want to change the base?
Changes from 4 commits
38a1c9f
0b449e1
edcaa8b
8bd2b16
346f6b9
1625503
400dfae
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -28,14 +28,23 @@ def extract_items(self, document, base_url=None): | |||||
if items for item in items if item | ||||||
] | ||||||
|
||||||
def _may_be_get_json(self, script): | ||||||
try: | ||||||
return json.loads(script, strict=False) | ||||||
except Exception: | ||||||
return False | ||||||
|
||||||
def _extract_items(self, node): | ||||||
script = node.xpath('string()') | ||||||
try: | ||||||
# TODO: `strict=False` can be configurable if needed | ||||||
data = json.loads(script, strict=False) | ||||||
except ValueError: | ||||||
# sometimes JSON-decoding errors are due to leading HTML or JavaScript comments | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd rather not remove this comment, I think it's useful There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added the comment back to the code. |
||||||
data = jstyleson.loads(HTML_OR_JS_COMMENTLINE.sub('', script), strict=False) | ||||||
data = self._may_be_get_json(script) | ||||||
# check if valid json. | ||||||
if not data: | ||||||
script = jstyleson.dispose( HTML_OR_JS_COMMENTLINE.sub('', script)) | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nitpick - let's remove a an extra space here
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||||||
data = self._may_be_get_json(script) | ||||||
# After processing check if json is still valid. | ||||||
if not data: | ||||||
return False | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nitpick - we don't use return value, as this is a generator, so
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed. |
||||||
|
||||||
if isinstance(data, list): | ||||||
for item in data: | ||||||
yield item | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
<!DOCTYPE html> | ||
<html lang="en"> | ||
|
||
<head> | ||
<script type="application/ld+json"> | ||
{ foo: bar} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for adding the test! From what I understand, it would fail previously, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes, it should ignore this jsonld element since it is not valid. |
||
</script> | ||
<!-- only the below tag should be parsed --> | ||
<script type="application/ld+json"> | ||
{"foo" : "bar"} | ||
</script> | ||
</head> | ||
|
||
<body></body> | ||
|
||
</html> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
[ {"foo" : "bar"} ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, but to me it would make more sense to return
None
here, what do you think? BothNone
andFalse
can be valid results of JSON parsing, butNone
looks like a more "neutral" value to use for error case.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.