Features/HC improvements for zk-regex Noir support #8

Closed
wants to merge 9 commits into from

@ewynx ewynx commented Oct 2, 2024

Description

This PR implements the gen_substrs feature, ^ support and $ support, and includes general bugfixes for the Noir support.

This branch has been tested to the same extent as the circom implementation.
All circom tests from the original zk-regex lib have been added to the test suite. All tests pass with the added features and bugfixes.

  • ^ support is implemented by prefixing the input array with the byte 255. This is the same approach as in circom.
  • $ support is implemented by adding an additional accepting state, to which the previous accepting state transitions on any character. This new state is then added to the accepting states. If $ is at the end of the regex, this extra transition is omitted, so inputs that continue after $ are rejected. This solution increases the lookup table size by 255 rows.
  • gen_substrs lets us extract substrings alongside the regex check. This works in both the decomposed and the raw setting.
    • substring is extracted based on transition information in the DFA
    • the return type per substring is BoundedVec<Field,N>, because we don't know the exact length beforehand
    • the total return type for the regex_match function with substring extraction is Vec<BoundedVec<Field,N>> because the total number of substrings is not always known beforehand
    • a substring is added to the returned output only if it is part of a valid regex match (similar to the consecutive check in circom)
    • the implementation strategy for $ is also needed to extract exactly the right substring (otherwise extraction would continue until the end of the input)
    • gen_substrs is now set by default in the raw setting (this is a change outside of the Noir code, but it seemed to make sense)
  • fix: introduction of a "reset" in the state handling. If the next state is 0, we also need to consider a possible transition from state 0 for the current byte. Example: regex ab and input aab. On the first a the DFA moves into state 1. On the second a it moves into state 0, and previously it would stay there. With the fix, the second a can move the DFA into state 1 again.
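The reset fix from the last bullet can be modelled outside of Noir. The Rust snippet below is only an illustrative sketch, not the generated circuit code: the names `step_ab`, `run_without_reset` and `run_with_reset` are hypothetical, the DFA for the regex ab is written by hand (state 2 accepting), and it assumes a match is recorded as soon as the accepting state is reached anywhere in the input.

```rust
/// Hand-written DFA for the regex `ab`: 0 --a--> 1 --b--> 2 (accepting).
/// Any missing transition falls back to state 0.
fn step_ab(state: u8, byte: u8) -> u8 {
    match (state, byte) {
        (0, b'a') => 1,
        (1, b'b') => 2,
        _ => 0,
    }
}

/// Behaviour before the fix: a failed transition drops to state 0
/// and the current byte is not reconsidered.
fn run_without_reset(input: &[u8]) -> bool {
    let mut state = 0u8;
    let mut matched = false;
    for &byte in input {
        state = step_ab(state, byte);
        matched |= state == 2;
    }
    matched
}

/// Behaviour with the fix: if the next state is 0, the same byte is
/// tried again as a transition out of state 0.
fn run_with_reset(input: &[u8]) -> bool {
    let mut state = 0u8;
    let mut matched = false;
    for &byte in input {
        let mut next = step_ab(state, byte);
        if next == 0 {
            next = step_ab(0, byte);
        }
        state = next;
        matched |= state == 2;
    }
    matched
}
```

In this model, `run_without_reset(b"aab")` misses the match because the second a leaves the DFA stuck in state 0, while `run_with_reset(b"aab")` re-enters state 1 on that byte and then accepts on b.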

Note: multiple accepting states that would occur directly from the regex are not supported, same as in the circom impl. (See README comment of original lib here).

This replaces the previously opened PRs #2 and #1, although the steps for manual verification described there are still valid.

Additional Context

The test suite is built specifically for the Noir zk-regex library. From a database of regex inputs + samples it will generate the required Noir code, create the desired tests and run them. The database has been filled with the equivalents of the tests for circom. Additionally, there are 2 hardcoded test projects for the circom tests that had more complex circuits (combining multiple templates).

PR Checklist*

  • I have tested the changes locally.
  • I have formatted the changes with Prettier and/or cargo fmt on default settings.

olehmisar and others added 9 commits September 19, 2024 14:13
…aw setting.

The substrings are returned as BoundedVec since we don't know their exact length upfront, but we know they're not longer than N.
To support both settings (decomposed and raw) we have to use `substring_ranges` instead of `substring_boundaries`.
…gex and input. This fix makes sure this is supported.

Changes:
- regex_match returns a Vec of substrings instead of an array with known length
- for each state where substrings have to be extracted, add the byte either to a new substring or to one that has already been started

Note that substr_count is used to extract the correct "current" substring from the Vec. This is a workaround: the first implementation used `pop`, but this gave an error.
For caret anchor: Mark the beginning of the input byte array with 255, which makes the check for the caret anchor (^) work. Note that ^ is only taken into consideration in the decomposed mode.
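The caret handling can be sketched as follows. This is a simplified Rust model rather than the Noir code from this PR: `START_MARKER`, `with_start_marker` and `matches_caret_a` are hypothetical names, the DFA for `^a` is written by hand, and the sketch assumes the input itself never contains the byte 255.

```rust
/// Marker byte placed in front of the input so that `^` can be
/// matched like an ordinary byte.
const START_MARKER: u8 = 255;

/// Returns a copy of the input prefixed with the start marker.
fn with_start_marker(input: &[u8]) -> Vec<u8> {
    let mut marked = Vec::with_capacity(input.len() + 1);
    marked.push(START_MARKER);
    marked.extend_from_slice(input);
    marked
}

/// Hand-written DFA for `^a`: 0 --255--> 1 --a--> 2 (accepting).
/// State 2 loops on any byte, so trailing input is allowed.
fn matches_caret_a(marked: &[u8]) -> bool {
    let mut state = 0u8;
    for &byte in marked {
        state = match (state, byte) {
            (0, START_MARKER) => 1,
            (1, b'a') => 2,
            (2, _) => 2,
            _ => 0,
        };
    }
    state == 2
}
```

Because the marker only ever appears at index 0, an a later in the input (as in `ba`) cannot reach the accepting state in this model.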
…states reachable from state 0.

Substrings only get saved when they are part of a path that doesn't reset.
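A rough model of how substring extraction interacts with the reset is given below. It is a Rust sketch under stated assumptions, not the PR's Noir implementation: the regex `a(b+)c`, the functions `step`, `in_substring_range` and `extract`, and the choice to return a single `Vec<u8>` at the first match (instead of a `Vec<BoundedVec<Field,N>>`) are all hypothetical simplifications.

```rust
/// Hand-written DFA for `a(b+)c`:
/// 0 --a--> 1 --b--> 2 --b--> 2 --c--> 3 (accepting).
fn step(state: u8, byte: u8) -> u8 {
    match (state, byte) {
        (0, b'a') => 1,
        (1, b'b') | (2, b'b') => 2,
        (2, b'c') => 3,
        _ => 0,
    }
}

/// Transitions whose byte belongs to the capture group `(b+)`.
fn in_substring_range(from: u8, to: u8) -> bool {
    matches!((from, to), (1, 2) | (2, 2))
}

/// Returns the captured substring of the first match, if any.
fn extract(input: &[u8]) -> Option<Vec<u8>> {
    let mut state = 0u8;
    let mut current: Vec<u8> = Vec::new();
    for &byte in input {
        let mut next = step(state, byte);
        if next == 0 {
            // The path was abandoned: drop the partial substring
            // and retry the same byte from state 0.
            current.clear();
            state = 0;
            next = step(0, byte);
        }
        if in_substring_range(state, next) {
            current.push(byte);
        }
        state = next;
        if state == 3 {
            return Some(current);
        }
    }
    None
}
```

For the input `abxabbc`, the b collected before x is discarded when the path resets, and only `bb` from the completed match is returned.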
@TomAFrench
Member

@ewynx It seems like the plan is to deprecate this repository based on #9. Would you like to make this PR on the upstream repository directly?

@ewynx
Author

ewynx commented Nov 26, 2024

@TomAFrench opened new PR

@ewynx ewynx closed this Nov 26, 2024
3 participants