Doesn't work on huge or improperly formatted json files #10
@davidawad - I have some good news regarding
@tidwall - Apart from the good news mentioned above, there is quite a lot of what might be regarded as "bad news" if jj is supposed to handle the same file in a uniform manner. In the following:
GOOD NEWS: rjson produces valid JSON (an array of length 46) in about 3s.
GOOD NEWS: rjson can be applied to the output of
GOOD NEWS: BAD NEWS: BAD NEWS: "jj -p" produces invalid JSON that cannot be repaired by rjson:
BAD NEWS: jsonlint -S fails
So I actually resorted to a cheap trick that solved my problem. I was able to modify the program creating this lossy JSON, and now I just include an extra empty {}. This wastes space (and is generally terrible), but saves massive time since I no longer have to parse the files to fix the messiness after the fact. I still contend that this is an issue. Thanks for your help.
@davidawad - Glad to hear you have an easy way out of the messiness. Having an extra {} is a small price to pay for conforming with a standard, which is the whole point about JSON -- it's a simple but expressive standard with a small price to pay. I've come to the conclusion that although in some cases JJ handles wonky JSON the way you'd want, there's usually some kind of "gotcha" -- which isn't surprising -- see above about having a worthwhile standard.
While JJ can handle broken JSON in many cases, that is not its intended purpose. Under the hood it uses the gjson Get function and the README states
There are some known consistencies with bad JSON that JJ can handle, such as missing or trailing tokens. But I doubt that every scenario is recoverable at the moment, unterminated elements being one. Perhaps in the future I'll build a more defined approach to handling undefined JSON. 🤔
@tidwall - Thanks for the clarification. The surprising and unfortunate thing, I think, is that the -u and -p options do sometimes give inconsistent results when presented with quasi-JSON, thus inevitably raising the question of whether they always give consistent results when presented with strictly valid JSON. The good news (at least for me) is that the testing I've done on some very large (valid) JSON data files (up to 1,271,577,470 bytes when formatted) yields no surprises.
Valid JSON will not give inconsistent results. If you find that it does, then please let me know.
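One way to check that claim on your own data is to decode the pretty and the ugly output of the same input and compare the resulting values. The following is only a minimal sketch: `same_document` is a hypothetical helper, and the two strings are produced with Python's `json` module as stand-ins for saved `jj -p` and `jj -u` output.

```python
import json


def same_document(text_a: str, text_b: str) -> bool:
    """Return True if both texts are strictly valid JSON and decode to equal values."""
    try:
        return json.loads(text_a) == json.loads(text_b)
    except json.JSONDecodeError:
        # Either side failing strict parsing counts as "not the same document".
        return False


# Stand-ins for the two formatted outputs of one valid input (assumption:
# in practice these would be read from the files written by the two runs).
doc = {"name": {"first": "Tom"}, "tags": [1, 2, 3]}
pretty = json.dumps(doc, indent=2)
ugly = json.dumps(doc, separators=(",", ":"))

print(same_document(pretty, ugly))
```

Note that this loads both documents fully into memory, so for files in the gigabyte range a streaming comparison would be a better fit.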
I've been having a lot of specific problems that your tool would be PERFECT for, but I might be running against some bugs that are unique to my problem.
So normally I'd use jj to take a file that might not be perfect JSON (meaning that it may contain trailing slashes) and use jj to read this json file and parse it out, and then print it in the proper format in order for something else to consume that. I'm finding that this isn't working with larger json files and jj is blowing up their size (taking files from 44MB -> 5GB)
To be more specific, here's what I'm doing.
This works perfectly fine. I input a correct JSON object and both jq and jj are able to handle that. JJ is better for me because I have what I'll call lossy JSON. My lossy JSON has things like trailing commas in the file that cause it to fail typical validation.
I need to read these JSON files in other programs after parsing them and making sure that they don't have these commas.
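To illustrate what "fails typical validation" means here, a small sketch using Python's standard `json` module as the strict validator (the thread itself uses jq; `is_strict_json` is a hypothetical helper name):

```python
import json


def is_strict_json(text: str) -> bool:
    """Return True only if the text parses under Python's strict JSON decoder."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_strict_json('{"ids": [1, 2, 3]}'))   # True: well-formed
print(is_strict_json('{"ids": [1, 2, 3,]}'))  # False: trailing comma in an array
print(is_strict_json('{"ids": [1, 2, 3],}'))  # False: trailing comma in an object
```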
So I've been passing them through JJ and then sending them to jq to validate them.
Now I've noticed that it's only when I use jj -p that JJ does this convenient functionality that I'm leveraging to clear the trailing commas.
I want the file to be small but I also want it to be valid.
So I'm now doing something like this:
Now I'm getting the best of both worlds. It's expensive, but I don't care since I'm more interested in making sure the data is formatted correctly and don't care about my cpu grinding this out for a few more minutes.
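The original command was lost from this page, but the workflow described (pretty-print to drop the trailing commas, minify, then validate) could be sketched roughly as follows. This is an illustration under assumptions, not the author's script: it assumes `jj` is on the PATH and reads from stdin when no input file is given, it swaps jq for Python's `json.loads` as the validator, and `run_stages` and `clean_and_validate` are hypothetical names.

```python
import json
import subprocess
import sys
from typing import List, Sequence

# The two stages named in the thread: pretty-print, then minify ("ugly").
JJ_STAGES = [["jj", "-p"], ["jj", "-u"]]


def run_stages(data: bytes, stages: Sequence[List[str]]) -> bytes:
    """Feed data through each command in turn, stdin to stdout."""
    for cmd in stages:
        result = subprocess.run(
            cmd, input=data, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        if result.returncode != 0:
            sys.stderr.write(result.stderr.decode(errors="replace"))
            raise RuntimeError(f"stage {cmd!r} exited with {result.returncode}")
        data = result.stdout
    return data


def clean_and_validate(data: bytes, stages: Sequence[List[str]] = JJ_STAGES) -> bytes:
    """Run the stages, then insist that the result is strictly valid JSON."""
    cleaned = run_stages(data, stages)
    json.loads(cleaned)  # raises json.JSONDecodeError (a ValueError) if invalid
    return cleaned


# Example call (the file name is a placeholder):
# cleaned = clean_and_validate(open("input.json", "rb").read())
```

Everything is held in memory between stages, which is tolerable for a 43M input but not for multi-gigabyte output; a shell pipe streams instead.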
The problem is that passing larger files through jj -p is now blowing up their file size like crazy. (43M -> 732M)
Here's a screenshot of what I'm seeing that I think is causing these huge size increases.
Size of course, is no issue if the data is usable, however it doesn't seem to be so.
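For context, some growth from pretty-printing is expected, because indentation is repeated on every line and gets wider with nesting depth. The sketch below measures that ordinary effect with Python's `json` module. It does not reproduce jj's formatter and is not claimed to explain the specific blow-up reported here; `nested` and `blowup_ratio` are hypothetical helpers.

```python
import json


def nested(depth: int):
    """Build a list nested `depth` levels deep around a single value."""
    value = 0
    for _ in range(depth):
        value = [value]
    return value


def blowup_ratio(doc, indent: int = 2) -> float:
    """Size of the indented serialization divided by the compact one."""
    compact = json.dumps(doc, separators=(",", ":"))
    pretty = json.dumps(doc, indent=indent)
    return len(pretty) / len(compact)


flat = list(range(200))  # many values, one level deep
deep = nested(100)       # one value, one hundred levels deep

print(round(blowup_ratio(flat), 1), round(blowup_ratio(deep), 1))
```

Comparing the ratio for a real file against figures like these is one way to tell normal formatting overhead from output that has actually been corrupted.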
When attempting to parse the output file from just jj -p I get unusual problems.
I've even tried expanding the Python recursion depth to 10,000 and it still doesn't work.
I've tried going from jj -p | jj -u and then passing that to jq to validate it. But when I validate it I get problems like this.
Here's an example of one of these huge JSON files I'm working with; it's 43M.
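On the recursion-depth attempt mentioned above: raising the limit only helps if the failure really comes from deep nesting. A quick diagnostic is to measure how deeply the brackets in a file nest without using recursion at all. This is a minimal sketch with a hypothetical helper, `max_nesting_depth`; it only counts brackets outside string literals and does not validate the JSON.

```python
import sys


def max_nesting_depth(text: str) -> int:
    """Return the deepest level of [ and { nesting, ignoring brackets inside strings."""
    depth = 0
    deepest = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # the character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True         # opening quote
        elif ch in "[{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch in "]}":
            depth -= 1
    return deepest


sample = '{"a": [1, {"b": "]]}}[["}]}'
print(max_nesting_depth(sample), "levels; recursion limit is", sys.getrecursionlimit())
```

If the reported depth is far above what the data should contain, that points at the formatter's output rather than at the parser's recursion limit.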
TL;DR: jj -p is blowing up file sizes, and passing jj -p | jj -u is making files that aren't valid.
My question for you: do you have any idea what could be going on here? I can't seem to get a proper version of this file saved the way I need it to be.