Compiles regular expressions into DFAs and generate bytecode on the fly (using
JEP 371) for encoding those DFAs inside control flow. The goal is to
support as much of the JDK Pattern
API as possible (notable exceptions being
possesive quantifiers and backreferences, since those can't be encoded).
Supported features include:
- capture groups
- alternation
- quantifiers (greedy and lazy, but not possesive)
- character classes
This approach has several caveats. The main ones are:
-
the powerset construction for NFA to DFA is exponential (this turns out to usually not be the case for most real world regex).
-
encoding DFAs using control flow means the max method body size of 64k instructions will likely be hit (we can work around this a bit by pulling subgraphs of the DFA out into other methods).
You'll need JDK 16 or newer (this relies on JEP 371 and uses newer Java features) installed as well as SBT. Then,
sbt compile # compile the project
sbt test # compile and run the tests
sbt bench/Jmh/run # compile and run all JMH benchmarks
There's also a test driver that consumes .txt
test cases like those used in
the OpenJDK tests for java.util.regex
(in test/jdk/java/util/regex/*.txt
):
sbt 'tester/run TestCases.txt'
This would not have been possible without the following papers and blogs:
-
This paper (Tagged Deterministic Finite Automata with Lookahead) describes how capture groups can be extracted in a single pass tagged DFA
-
This paper (Efficient Submatch Extraction for Practical Regular Expressions) describes how capture groups can be extracted even using simple DFAs
-
This blog series about the design space and tradeoffs in regular expression search