Rust-only JSON schema construction #12

Closed
brandonwillard opened this issue Aug 21, 2024 · 6 comments · Fixed by #15
Labels
enhancement New feature or request question Further information is requested TGI Related to the integration with `text-generation-inference`

Comments

@brandonwillard
Member

brandonwillard commented Aug 21, 2024

This issue is about converting the outlines JSON schema-to-regex logic to Rust.

First, let's determine how much needs to be converted and what the expected entry points are.

Second, we need to move to CFG-based schemas eventually. This conversion could serve as the beginning of that (open-source) effort. One way that could manifest: create CFGs for schemas and convert them to regex form. We could also consider an intermediate form that works for both targets.
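To make the "intermediate form that works for both targets" idea concrete, here is a minimal sketch, assuming a small tree-shaped representation that can be rendered either as a regex or as grammar-like text. The `Ir` type, `to_regex`, `to_grammar`, and the EBNF-like output notation are all hypothetical and not taken from outlines. It only covers the regular fragment; a CFG target could additionally allow recursive rule references, which a regex cannot express.

```rust
/// Hypothetical intermediate form for the regular part of a schema language.
#[derive(Clone)]
enum Ir {
    Lit(String),
    Seq(Vec<Ir>),
    Alt(Vec<Ir>),
    Star(Box<Ir>),
}

/// Backslash-escape regex metacharacters in a literal.
fn escape_regex(lit: &str) -> String {
    let mut out = String::new();
    for c in lit.chars() {
        if r"\.+*?()|[]{}^$".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

/// Target 1: render the tree as a regex string.
fn to_regex(ir: &Ir) -> String {
    match ir {
        Ir::Lit(s) => escape_regex(s),
        Ir::Seq(items) => items.iter().map(to_regex).collect(),
        Ir::Alt(items) => {
            let alts: Vec<String> = items.iter().map(to_regex).collect();
            format!("(?:{})", alts.join("|"))
        }
        Ir::Star(inner) => format!("(?:{})*", to_regex(inner)),
    }
}

/// Target 2: render the same tree in an (assumed) EBNF-like notation.
fn to_grammar(ir: &Ir) -> String {
    match ir {
        Ir::Lit(s) => format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\"")),
        Ir::Seq(items) => items.iter().map(to_grammar).collect::<Vec<_>>().join(" "),
        Ir::Alt(items) => {
            let alts: Vec<String> = items.iter().map(to_grammar).collect();
            format!("({})", alts.join(" | "))
        }
        Ir::Star(inner) => format!("({})*", to_grammar(inner)),
    }
}

/// Example tree: a non-empty JSON array of booleans, ignoring whitespace.
fn bool_array() -> Ir {
    let boolean = Ir::Alt(vec![Ir::Lit("true".into()), Ir::Lit("false".into())]);
    Ir::Seq(vec![
        Ir::Lit("[".into()),
        boolean.clone(),
        Ir::Star(Box::new(Ir::Seq(vec![Ir::Lit(",".into()), boolean]))),
        Ir::Lit("]".into()),
    ])
}
```

Calling `to_regex(&bool_array())` and `to_grammar(&bool_array())` renders the same tree for the two targets, so schema handling would only need to be written once against `Ir`.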

@brandonwillard brandonwillard added enhancement New feature or request question Further information is requested TGI Related to the integration with `text-generation-inference` labels Aug 21, 2024
@rlouf
Member

rlouf commented Aug 21, 2024

Rust's regex crate does not support advanced constructs like negative lookahead, which limits our ability to fully implement certain JSON Schema features (e.g., oneOf, anyOf). These constructs require more complex logic that would be more easily handled by defining the DFA rather than writing a regex.

I would shift from a regex-based approach to a DFA-based one, and use the data structures already present in the regex-automata crate. So JSON schemas would be translated into DFAs directly.
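As a rough illustration of why working on DFAs can be easier than writing a regex: JSON Schema's oneOf requires that exactly one subschema validates, and on complete DFAs that condition is a plain product construction with XOR acceptance, with no lookahead involved. The sketch below uses a hypothetical hand-rolled `Dfa` type over a two-symbol alphabet, not the regex-automata data structures, and the two example languages merely stand in for subschemas.

```rust
/// Hypothetical complete DFA over a fixed alphabet (not the regex-automata API).
struct Dfa {
    alphabet: Vec<char>,
    start: usize,
    accepting: Vec<bool>,
    /// table[state][symbol_index] is the next state.
    table: Vec<Vec<usize>>,
}

impl Dfa {
    fn accepts(&self, input: &str) -> bool {
        let mut state = self.start;
        for c in input.chars() {
            match self.alphabet.iter().position(|&a| a == c) {
                Some(sym) => state = self.table[state][sym],
                None => return false, // symbol outside the alphabet
            }
        }
        self.accepting[state]
    }
}

/// Product construction: the result accepts when exactly one of `a`, `b` accepts.
/// The pair of states (i, j) is encoded as the index i * nb + j.
fn exactly_one(a: &Dfa, b: &Dfa) -> Dfa {
    assert_eq!(a.alphabet, b.alphabet);
    let (na, nb) = (a.accepting.len(), b.accepting.len());
    let mut accepting = Vec::with_capacity(na * nb);
    let mut table = Vec::with_capacity(na * nb);
    for i in 0..na {
        for j in 0..nb {
            accepting.push(a.accepting[i] != b.accepting[j]);
            table.push(
                (0..a.alphabet.len())
                    .map(|s| a.table[i][s] * nb + b.table[j][s])
                    .collect(),
            );
        }
    }
    Dfa { alphabet: a.alphabet.clone(), start: a.start * nb + b.start, accepting, table }
}

/// Binary strings whose last symbol is '0'.
fn ends_with_zero() -> Dfa {
    Dfa {
        alphabet: vec!['0', '1'],
        start: 0,
        accepting: vec![false, true],
        table: vec![vec![1, 0], vec![1, 0]],
    }
}

/// Binary strings of even length.
fn even_length() -> Dfa {
    Dfa {
        alphabet: vec!['0', '1'],
        start: 0,
        accepting: vec![true, false],
        table: vec![vec![1, 1], vec![0, 0]],
    }
}
```

With `let d = exactly_one(&ends_with_zero(), &even_length());`, the string `"0"` is accepted (only the first language matches) while `"10"` is rejected (both match). Using `||` instead of `!=` for acceptance would give the anyOf-style union; more than two subschemas would need a count rather than a single XOR.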

@brandonwillard
Member Author

brandonwillard commented Aug 21, 2024

> Rust's regex crate does not support advanced constructs like negative lookahead, which limits our ability to fully implement certain JSON Schema features (e.g., oneOf, anyOf). These constructs require more complex logic that would be more easily handled by defining the DFA rather than writing a regex.

Where are these missing constructs employed in outlines? We might need to explicitly list them out as requirements in this issue.

> I would shift from a regex-based approach to a DFA-based one, and use the data structures already present in the regex-automata crate. So JSON schemas would be translated into DFAs directly.

Yeah, we don't want to have to rewrite any of the interegular logic involving regex parsing and DFAs. regex-automata looks like a viable replacement for interegular; we just need to make sure that it actually provides all the necessary regex features and DFA information (e.g. easily enumerable states and transitions). See #10.
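To show what "easily enumerable states and transitions" buys in practice, here is a minimal sketch, assuming the goal is a per-state table of which vocabulary tokens can be consumed and where they lead. `ByteDfa`, `walk`, `token_index`, and the toy vocabulary are hypothetical and are not the regex-automata or interegular APIs; whichever library is chosen would need to expose enough to write something equivalent.

```rust
use std::collections::BTreeMap;

/// Hypothetical sparse DFA over bytes (not the regex-automata API).
struct ByteDfa {
    start: usize,
    accepting: Vec<bool>,
    /// transitions[state] maps an input byte to the next state; a missing
    /// entry means the input is rejected from that state.
    transitions: Vec<BTreeMap<u8, usize>>,
}

impl ByteDfa {
    /// Walk `bytes` starting at `state`; `None` if a transition is missing.
    fn walk(&self, mut state: usize, bytes: &[u8]) -> Option<usize> {
        for b in bytes {
            state = *self.transitions[state].get(b)?;
        }
        Some(state)
    }
}

/// For every state, list `(token_id, end_state)` for each vocabulary token
/// that can be fully consumed from that state.
fn token_index(dfa: &ByteDfa, vocab: &[&str]) -> Vec<Vec<(usize, usize)>> {
    (0..dfa.transitions.len())
        .map(|state| {
            vocab
                .iter()
                .enumerate()
                .filter_map(|(id, tok)| dfa.walk(state, tok.as_bytes()).map(|end| (id, end)))
                .collect()
        })
        .collect()
}

/// DFA for the JSON literal `true`: a chain of five states, the last accepting.
fn true_literal() -> ByteDfa {
    let word = b"true";
    let mut transitions = vec![BTreeMap::new(); word.len() + 1];
    for (i, &b) in word.iter().enumerate() {
        transitions[i].insert(b, i + 1);
    }
    let mut accepting = vec![false; word.len() + 1];
    accepting[word.len()] = true;
    ByteDfa { start: 0, accepting, transitions }
}
```

For the vocabulary `["t", "tr", "ue", "true", "x", "e"]`, the start state allows tokens 0, 1, and 3, and the state reached after `"tr"` allows only token 2 (`"ue"`). Building this table requires iterating over all states and following arbitrary byte sequences from each one, which is the information the DFA library has to expose.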

@brandonwillard
Member Author

We should also move any discussions about Rust regex and DFA features to #10.

@ErikKaum
Collaborator

ErikKaum commented Aug 23, 2024

I added #15, which is a port of the build_regex_from_schema function I wrote a while ago. Back then I essentially wrote it with the intention of being a 1-to-1 port. It's not 100% functional, but I estimate I could get it working without too much work. Currently 18 tests fail and 45 pass.

This PR doesn't take into account the DFA-based approach, so it wouldn't necessarily serve towards that goal.

So I think the concrete question would be:

  • should I continue working on this --> we could replace build_regex_from_schema with a Rust version quite soon, but it wouldn't necessarily serve as a springboard for the DFA-based approach
  • or instead go for an approach where the JSON schemas would be translated into DFAs directly like Rémi mentioned, using most likely the regex-automata crate --> might take longer but most likely more useful long term
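For readers unfamiliar with the first option, the following is a minimal sketch of what a schema-to-regex translation looks like in Rust. It is not the code from #15 and not the outlines implementation: the `Schema` enum and the `build_regex` name are hypothetical, only a handful of types are covered, and whitespace handling is ignored.

```rust
/// Hypothetical, heavily reduced stand-in for a parsed JSON schema.
enum Schema {
    Null,
    Boolean,
    Integer,
    /// An array whose items all follow the same schema.
    Array(Box<Schema>),
}

/// Translate a schema into a regex string matching its JSON serialization
/// (no whitespace between tokens in this sketch).
fn build_regex(schema: &Schema) -> String {
    match schema {
        Schema::Null => "null".to_string(),
        Schema::Boolean => "(?:true|false)".to_string(),
        // Optional minus sign, then either 0 or a number without leading zeros.
        Schema::Integer => "-?(?:0|[1-9][0-9]*)".to_string(),
        Schema::Array(items) => {
            let item = build_regex(items);
            // Either empty, or one item followed by comma-separated items.
            format!(r"\[(?:{0}(?:,{0})*)?\]", item)
        }
    }
}
```

For example, `build_regex(&Schema::Array(Box::new(Schema::Boolean)))` yields `\[(?:(?:true|false)(?:,(?:true|false))*)?\]`. The translation is purely recursive string building, which is why a 1-to-1 port is quick; the second option would instead produce automaton states and transitions at each step.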

@brandonwillard brandonwillard linked a pull request Aug 23, 2024 that will close this issue
6 tasks
@brandonwillard
Member Author

> • should I continue working on this --> we could replace build_regex_from_schema with a Rust version quite soon, but it wouldn't necessarily serve as a springboard for the DFA-based approach

Yes, let's get your code working and move from there.

@ErikKaum
Collaborator

Awesome, I'll get to work 🙌🏽
