Handle badly formatted JSON-LD data. #87

Open
shiquanwang opened this issue Aug 20, 2018 · 3 comments · May be fixed by gaurav19063/extruct#1

Comments

@shiquanwang
Contributor

Some web pages contain badly formatted JSON-LD data. The JSON-LD in one such page is:


{
  "@context": "http://schema.org",
        "@type": "Product",
                "name": "Black 'Clint' FT0511 cat eye sunglasses",
                "image": "https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001",
		"brand": {
                  "@type": "Thing",
                  "name": "Tom Ford"
                },
                "offers": {
                	"@type": "Offer",
                	"priceCurrency": "GBP",
                	"price": "285.00",
                	"itemCondition": "http://schema.org/NewCondition",
                	"availability": "http://schema.org/InStock"
                }
    }
}

In the JSON-LD above, the final } is extra, and neither extruct nor json.loads handles it gracefully.

In Python 3.5 and later, json.loads raises json.JSONDecodeError, which carries detailed error information, here "Extra data: line 19 column 1 (char 624)":

In [7]: try:
   ...:     data = json.loads(json_ld_string)
   ...: except json.JSONDecodeError as err:
   ...:     print(err)
   ...:     print(err.msg)
   ...:     print(err.pos)
   ...:
Extra data: line 19 column 1 (char 624)
Extra data
624

err.msg and err.pos give hints for repairing the JSON-LD data. In this case we can remove the character at position 624 and parse the string again, which correctly yields:

{'@context': 'http://schema.org',
 '@type': 'Product',
 'brand': {'@type': 'Thing', 'name': 'Tom Ford'},
 'image': 'https://debenhams.scene7.com/is/image/Debenhams/60742_1515029001',
 'name': "Black 'Clint' FT0511 cat eye sunglasses",
 'offers': {'@type': 'Offer',
            'availability': 'http://schema.org/InStock',
            'itemCondition': 'http://schema.org/NewCondition',
            'price': '285.00',
            'priceCurrency': 'GBP'}}
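
The repair described above can be sketched roughly as follows. This is only an illustration: `loads_dropping_extra_data` and its `max_fixes` parameter are hypothetical names, not part of extruct, and only the "Extra data" case is handled. The sample string is a shortened stand-in for the page's JSON-LD.

```python
import json


def loads_dropping_extra_data(text, max_fixes=10):
    """Parse JSON, deleting the character at err.pos on 'Extra data' errors.

    Hypothetical helper, not part of extruct. Any other decode error is
    re-raised unchanged.
    """
    for _ in range(max_fixes):
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            if err.msg != "Extra data":
                raise
            # Drop the single offending character and try again.
            text = text[:err.pos] + text[err.pos + 1:]
    return json.loads(text)  # Raises if the input is still broken.


# Shortened stand-in for the JSON-LD above, with the same stray closing brace.
broken = '{"name": "Tom Ford", "offers": {"price": "285.00"}}\n}'
print(loads_dropping_extra_data(broken))
```

Note that this approach silently discards whatever follows the first complete JSON value, so it is only reasonable when the trailing characters are known to be junk.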

There are many possible formatting errors. Some can be fixed easily, while others are harder or even impossible to fix.

I propose three ways to improve the situation:

  • extruct tries to fix the JSON-LD data case by case, which requires Python >= 3.5 to get detailed error information
  • extruct lets the user pass in a function to parse the JSON data, so that users can handle their own error types
  • extruct can output the extracted JSON-LD as an unparsed string and let the user parse it and handle their own error types

I personally recommend the last two options.
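
A minimal sketch of what the last two options could look like from the user's side. Everything here is an assumption for illustration: `extract_json_ld`, its `parse` argument, and `lenient_parse` are hypothetical names and do not reflect extruct's actual API, and the regular expression is a simplification of real HTML parsing.

```python
import json
import re

# Simplified pattern for illustration; a real extractor would use an HTML parser.
_JSON_LD_SCRIPT = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)


def extract_json_ld(html, parse=json.loads):
    """Return the JSON-LD blocks found in html.

    Hypothetical function, not extruct's API. `parse` is any callable that
    takes the raw script text; pass parse=None to get unparsed strings back.
    """
    raw_blocks = [m.group(1).strip() for m in _JSON_LD_SCRIPT.finditer(html)]
    if parse is None:
        return raw_blocks
    return [parse(block) for block in raw_blocks]


def lenient_parse(text):
    """Example user-supplied parser: return None instead of raising."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


html = '<script type="application/ld+json">{"@type": "Product"}}</script>'
print(extract_json_ld(html, parse=None))           # raw strings, option 3
print(extract_json_ld(html, parse=lenient_parse))  # custom parser, option 2
```

With either option, the decision about how to treat malformed input stays with the caller rather than being hard-coded in the library.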

Thanks.

@kmike
Member

kmike commented Aug 22, 2018

I guess this provides more motivation for #69, though I'd prefer the JSON decoding function to be an argument, not a global option.

Providing something that handles more cases by default makes sense to me, though we could start with just a good example in the README.

@kmike
Member

kmike commented Aug 22, 2018

Maybe other libraries like demjson or yajl can handle it (see http://deron.meranda.us/python/demjson/demjson-2.2.4/docs/demjson.html#-decode - it seems there is an option to return data after the error).
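
For comparison, the standard library offers something similar to "return data after the error" for the trailing-junk case: `json.JSONDecoder.raw_decode` parses one JSON value from the start of a string and reports where it stopped. The sketch below does not use demjson or yajl, and the helper name `loads_first_value` is hypothetical.

```python
import json


def loads_first_value(text):
    """Return (value, trailing_text) using only the standard library.

    raw_decode parses one JSON value from the start of the string and
    returns the index where it stopped, instead of raising 'Extra data'.
    """
    stripped = text.lstrip()  # raw_decode does not skip leading whitespace
    value, end = json.JSONDecoder().raw_decode(stripped)
    return value, stripped[end:]


value, rest = loads_first_value('{"@type": "Offer", "price": "285.00"}\n}')
print(value)
print(repr(rest))  # the leftover text can be logged or inspected
```

This only covers input that starts with a valid value; other kinds of malformed JSON still raise json.JSONDecodeError.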

@gaurav19063

The updated JSON-LD handling can autocorrect badly formatted JSON.
