Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typed JSON API #87

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

kabirkhan
Copy link

Context

Currently most loader functions use the JSONOutput return type. This type is pretty non specific and a bit hard to reason about downstream. I find myself having to cast or validate the resulting type all the time.

The current loaders return JSONOutput because this file e.g. is valid JSON and would be parsed properly by ujson

test_file.json

hello

However, if we get this JSON file as an input in most of our code paths, we would want to raise an error as this is almost certainly invalid for what we want to do next.

This PR adds validation on top of the existing JSON API to ensure you get the expected type from a loader function.

If we like this approach, I can add to the rest of the API, just implementing the most commonly used functions from the JSON API for now.

Summary of Changes

Add read_json_dict - read JSON file and validate the resulting object is a dict
Add read_json_list - read JSON file and validate the resulting object is a list of dicts
Add read_jsonl_dicts - read JSONL file and validate each line is a valid dict in the resulting generator

@kabirkhan kabirkhan changed the title Feature/typed json api Typed JSON API Mar 10, 2023
return obj


def json_loads_list(data: Union[str, bytes]) -> List[Dict[str, Any]]:
Copy link
Contributor

@adrianeboyd adrianeboyd Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and read_json_list only check List not List[Dict[str, Any]]? I wouldn't expect a function that's called read_json_list to have the additional dict behavior/type?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right this part is a little weird. Again, what I want 99% of the time when calling read_json_list is to have a list of dicts. But the naming does leave something to be desired. There is some validation in place for the read_json_list function, just not for the json_loads_list. I can copy the validation over there though.

Is read_json_dicts a better name here? similar to the read_jsonl_dicts.

return data


def read_json_list(path: FilePath, validate_inner: bool = False, skip_invalid: bool = False) -> List[Dict[str, Any]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still a bit of a typing problem because the return type depends on the value of validate_inner.

I think this method with these options and types may be getting too specific for inclusion in the srsly API.

If it is going to be in srsly, then I think it would be better as read_json_list() -> List and read_json_list_of_dicts() -> List[Dict[str, JSONOutput]] or something along those lines?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that was weird. I fixed these types and separated this logic into 2 functions.

This parsing as a list idea is kinda extra honestly. I think the main functions I really think should live in srsly are:

  • read_json_dict -> We parse a normal JSON file expecting a JSON object (e.g. for configs and stuff) all the time
  • read_jsonl_dicts -> We use this to load data all the time and expect each line to be a valid JSON object


data = read_json(path)
if not isinstance(data, list):
raise ValueError("Invalid JSON, data could not be parsed to a list of dicts.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think describing any of the input as "invalid JSON" is going to be confusing for users. Something like "invalid value" instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point, I changed this language so it's reads that it can't be parsed to the expected type

return obj


def json_loads_list(data: Union[str, bytes]) -> List[Dict[str, JSONOutput]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This return type is too specific for the method?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants