How to correct "nasty" jsonl+ld #53
Comments
Here's an example: |
Hey @maugch, thanks for the report.
Checking what chars are around the offset |
I've had similar issues with other characters, but I'm not sure exactly which ones: every time I evaluate result[XX], where XX is the offset reported in the exception, I get either a blank space or a letter. I suppose the only possible solution is to look for the next square bracket, treat the ellipsis before it as the closing one, and escape all the others. A further check is whether there are "," sequences, since the value might be a list of strings. Actually, it might be enough to check every " that is not followed by , (apart from the last one, which is followed by ]). |
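The quote heuristic described above could be sketched roughly like this. It is only an illustration and not something extruct does: `escape_inner_quotes` is a hypothetical helper, the sample data is made up, and the rule (a `"` inside a string only closes it when the next non-blank character is `,`, `:`, `}` or `]`) will misfire on text such as `26", 650b`, where an inner quote happens to be followed by a comma.

```python
import json


def escape_inner_quotes(text):
    """Escape double quotes that look like they sit inside a string value.

    Heuristic (an assumption, not a real JSON rule): once inside a string,
    a quote closes it only if the next non-blank character is a structural
    one (',', ':', '}' or ']') or the end of the input.
    """
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string and ch == "\\" and i + 1 < len(text):
            # Keep existing escape sequences untouched.
            out.append(text[i:i + 2])
            i += 2
            continue
        if ch == '"':
            if not in_string:
                in_string = True
                out.append(ch)
            else:
                rest = text[i + 1:].lstrip()
                if rest == "" or rest[0] in ",:}]":
                    in_string = False
                    out.append(ch)
                else:
                    out.append('\\"')  # treat as a literal quote
        else:
            out.append(ch)
        i += 1
    return "".join(out)


# Made-up input with unescaped quotes inside a string value.
broken = '{"name": "The "best" cookies", "tags": ["a", "b"]}'
fixed = json.loads(escape_inner_quotes(broken))
```

Input that is already valid passes through unchanged, so this could be tried only after a first `json.loads` attempt fails.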
Might be unrelated, but extruct's JSON parser also chokes on this example case: there are some tabs in the JSON that break extruct. This can be solved by replacing them:
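A minimal sketch of that workaround with the standard `json` module, assuming the raw script text is already in hand. The sample string is made up. Literal tabs inside a string are rejected by the default strict parser; replacing them, or passing `strict=False` (a documented `json.loads` parameter that allows control characters inside strings), both get past the error.

```python
import json

# Made-up JSON-LD fragment with a literal tab inside a string value.
raw = '{"name": "Nutter\tButter"}'

try:
    json.loads(raw)
    strict_failed = False
except json.JSONDecodeError:
    # The default (strict) parser rejects raw control characters in strings.
    strict_failed = True

# Workaround 1: replace the tabs before parsing.
replaced = json.loads(raw.replace("\t", " "))

# Workaround 2: keep the tabs and relax the parser instead.
lenient = json.loads(raw, strict=False)
```

Replacing tabs globally is safe for JSON structure, because a tab outside a string is just whitespace, but it does change the text inside values; `strict=False` keeps the original text.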
I think extruct should either:
|
Btw @maugch I can't replicate your issue on www.browneyedbaker.com/nutter-butter-snowmen/
It works correctly here |
Hi @Granitosaurus - I downloaded the URL you supplied above and I was able to decode the JSON. My code, approximately:
>>> import user_agent, requests, json, extruct
>>> from scrapy.http import HtmlResponse
>>> r = requests.get('https://www.alltricks.fr/F-41493-pieces-roues/P-81593-fond_de_jante_notubes_yellow_tape_25_mm_pour_5_jantes', headers={'User-Agent': user_agent.generate_user_agent()})
>>> response = HtmlResponse('https://www.alltricks.fr/F-41493-pieces-roues/P-81593-fond_de_jante_notubes_yellow_tape_25_mm_pour_5_jantes', body=r.content)
>>> data = response.css('script[type="application/ld+json"]::text').extract_first()
>>> json.loads(data)
{'@context': 'http://schema.org/',
'@type': 'Product',
'aggregateRating': {'@type': 'AggregateRating',
'ratingValue': '4.1053',
'reviewCount': '19'},
'brand': {'@type': 'Thing', 'name': 'NoTubes'},
'description': 'Scotch jaune spécial pour rendre étanche les jantes tubeless NoTubes. Détails : Largeur : 25 mm. Longueur : 9.144 m (10 Yards). Un rouleau convient pour 5 jantes 26'' ou 4 jantes 29''. Compatibilités : ZTR 355 (26", 650b, 29"). ZTR Crest. ZTR Arch EX. ZTR Flow EX. #shortcode_video .row { display:block; } #shortcode_video .col { padding:15px; }',
'image': 'https://media.alltricks.com/medium/56bdff3278142.jpg',
'name': 'Fond de Jante NOTUBES YELLOW TAPE 25 mm Pour 5 Jantes',
'offers': {'@type': 'Offer',
'availability': 'http://schema.org/InStock',
'price': '14.99',
'priceCurrency': 'EUR',
'seller': {'@type': 'Organization', 'name': 'Alltricks'}}}
>>> extruct.jsonld.JsonLdExtractor().extract(r.content)
[{'@context': 'http://schema.org/',
'@type': 'Product',
'aggregateRating': {'@type': 'AggregateRating',
'ratingValue': '4.1053',
'reviewCount': '19'},
'brand': {'@type': 'Thing', 'name': 'NoTubes'},
'description': 'Scotch jaune spécial pour rendre étanche les jantes tubeless NoTubes. Détails : Largeur : 25 mm. Longueur : 9.144 m (10 Yards). Un rouleau convient pour 5 jantes 26'' ou 4 jantes 29''. Compatibilités : ZTR 355 (26", 650b, 29"). ZTR Crest. ZTR Arch EX. ZTR Flow EX. #shortcode_video .row { display:block; } #shortcode_video .col { padding:15px; }',
'image': 'https://media.alltricks.com/medium/56bdff3278142.jpg',
'name': 'Fond de Jante NOTUBES YELLOW TAPE 25 mm Pour 5 Jantes',
'offers': {'@type': 'Offer',
'availability': 'http://schema.org/InStock',
'price': '14.99',
'priceCurrency': 'EUR',
'seller': {'@type': 'Organization', 'name': 'Alltricks'}}}] |
I tried again just now and I no longer get an exception, so I suppose they corrected it. According to my previous comment, there was a text "buttons" that I don't see anymore. I see this in Firefox: My code is simple (now even simplified for this comment):
|
Hey @maugch - Glad to hear your problem has resolved. Pity we couldn't capture test cases before it disappeared, though. :) @Granitosaurus - Any chance you can replicate, and if so can you capture failing HTML so we can use it to build a test case? |
Hey folks, I'll close this for now, but if anyone can find us a failure case we can work with, we'll reopen. :) |
Here's a tragic example: http://montalvoarts.org/events/summernights18_salsa/ They omit a closing brace in their "location" field in their ld+json in every event on their site. When parsing manually, I'm able to correct this and extract the events. I'm looking at moving to extruct and it would be great if this site kept working. |
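One possible way to cope with a missing closing brace, sketched under assumptions: `close_unbalanced` is a hypothetical helper, not part of extruct, and the sample below is made up rather than taken from the site. It tracks open `{` and `[` outside of strings and appends whatever closers are still missing at the end of the text.

```python
import json

CLOSER = {"{": "}", "[": "]"}


def close_unbalanced(text):
    """Append closers for any '{' or '[' left open at the end of the text."""
    stack = []
    in_string = escaped = False
    for ch in text:
        if in_string:
            # Ignore brackets inside strings; honour backslash escapes.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in CLOSER:
            stack.append(ch)
        elif ch in "}]" and stack:
            stack.pop()
    return text + "".join(CLOSER[ch] for ch in reversed(stack))


# Made-up event with one closing brace missing.
broken = '{"@type": "Event", "location": {"@type": "Place", "name": "Venue"}'
event = json.loads(close_unbalanced(broken))
```

This only makes the text parseable. If the brace is missing in the middle of the document, the keys that follow end up nested inside the unclosed object, so the result still needs a sanity check.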
For reference, this is json-ld from the site:
|
Some (but not all) issues raised in this thread were fixed in #85 |
Yet another JSON-LD block with wrong data, again on a recipe site. I suppose there is a WordPress plugin that isn't working correctly. There is a ] at the end that shouldn't be there. |
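For a stray closing bracket like this, a rough sketch of a cleanup pass. `drop_stray_closers` is a hypothetical helper and the sample is made up. It drops any `}` or `]` outside a string that has no matching opener.

```python
import json

OPENER = {"}": "{", "]": "["}


def drop_stray_closers(text):
    """Remove '}' or ']' characters that do not match an open bracket."""
    out, stack = [], []
    in_string = escaped = False
    for ch in text:
        if in_string:
            # Copy string contents verbatim; honour backslash escapes.
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in OPENER:
            if not stack or stack[-1] != OPENER[ch]:
                continue  # unmatched closer: skip it
            stack.pop()
        out.append(ch)
    return "".join(out)


# Made-up recipe with a trailing ']' that should not be there.
broken = '{"@type": "Recipe", "name": "Cake"}]'
recipe = json.loads(drop_stray_closers(broken))
```

As with the other repairs in this thread, it seems safest to attempt this only after a plain `json.loads` has already failed.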
I've found at least a couple of bad JSON-LD blocks that extruct can't read.
The reason is ellipsis inside the text. For example:
HTML allows this, but it's not possible to read it. Is there an easy way to correct similar issues automatically?