-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Input wanted: design and implement a proper API for code blocks #14
Comments
I'd like a hook which allows code execution when a rule is looked up. let p = peg foo:
onLookUp:
# (...)
onMatch:
# (...) |
Yeah, this is one of my nitpicks as well. I usually solve this with always-match rules in the grammar like so:
I do like your idea, but I'm afraid it will add a lot of noise for the cases where only the match code block is required. Maybe with some macro magic we can make it so that wihout explicit hook the block acts like the onMatch, keeping backwards compatibility, and only when there are hooks defined, they are used. Let me see how that would fit in. |
Suggestions:
Note that I haven't really evaluated the feasibility for the above suggestions. It's been a while since I've looked at Npeg's implementation, and I wanted to get down these thoughts while they were fresh in my mind. |
Quoting Varriount (2019-12-03 15:59:48)
- Rather than expose parser state as multiple variables, expose it as
a single variable. This would be consistent with how the capture data
is presented. Alternately, the parser state could be represented as a
variable, and the capture would be a field in the parser state.
Yeah, I have moved stuff to and from that model a few times. The state
itself is in an object, but everything is moved to local variables
during the match for performance reasons. I'm also not sure how to
expose the subject string in a different variable, as that would require
a copy of the subject, which is potentially large. I shall do some tests
with shallow() to see if that helps.
- Use more descriptive names than `s`, `si`, etc. for variables and members.
Sure, the current variables are bad but short because of the
implementation. the also have a high chance of clashing with code in the
blocks, this already happens to me too often.
- Allow the code block to optionally use return statements to
indicate whether a match succeeded or failed. That being said, such
a change might require that all code blocks have a return statement.
Nope, it will just return the default value for the type then. I did
some tests with making the code blocks procs, but I can not find a way
to keep the compiler happy with captures and closers and gc safety. I
alreday do some nasty tricks with injects and identifiers to pass stuff
around, and i was hoping someone who really understands this all better
would one day point me out how to actually do that proper
Note that I haven't really evaluated the feasibility for the above
suggestions. It's been a while since I've looked at Npeg's
implementation, and I wanted to get down these thoughts while they
were fresh in my mind.
No problem, any input is valuable. It will take me a few tries probably
to get this right anyway, all of the above are good points. thanks!
…--
:wq
^X^Cy^K^X^C^C^C^C
|
Could a "subject" procedure be made for the capture object, allocates the captured string when accessed? Due to UFCS, it would still look like attribute access: let s = capture.subject
What about passing in the parser state and capture object as parameters, and marking the function as both inline and giving it a regular calling convention? proc match(s: string) =
proc inner(parser: Parser, capture: Capture): bool {.inline, nimcall.} =
discard You might be able to get rid of the |
What about passing in the parser state and capture object as
parameters, and marking the function as both inline and giving it a
regular calling convention?
Seems like a sane approach, I'll give it another try. Thanks for
pointing that out!
…--
:wq
^X^Cy^K^X^C^C^C^C
|
I suggest to use the return of the code block to pass a parsed value to the upper rule/capture. This is essential for anyone to construct a custom AST without using hacks to catch the proper tree context. Without this, code blocks are useless when trying to construct an AST. Use a tree structure to cache captures (instead of the list you are using), so you can easily remove entire failed branches and detect the tree context. Execute code blocks when the full match is completed and ok. |
A better alternative to the compiler flag can be: pass a object to the code block where you can set two callback functions.
And, add an option in the parser to disable backtracing for languages that don't require it. You can execute both callbacks in the same pass in this option. |
Do you mean when captures are nested? Because apart from that there is no apparent upper/lower or parent/child relationship between rules at this time.
But what if we want to use a code block to do the validation and pass/fail from it's result?
Let's call it "suboptimal", not useless - see the examples with perfectly functional parsers.
But what would dictate the tree structure? The input is just a linear stream, where do the relationships come from?
See above: code blocks have multiple uses, one of those is using Nim code to do manual parsing and let the parser fail or succeed from there. I feel you are looking for a way to let NPeg have a sense of AST in its parser internally, but at this time the parsing is just sequential: seq-of-chars go in, seq-of-captures comes out. Structure is added by the user of NPeg by building AST in captures. Note that NPeg did have this kind of functionality in the past (see below), but I decided to deprecate this - it proved not flexible enough because the order of nodes in the stream often does not match the order of the nodes as they should end up in the AST tree, thus the user ends up with a generic tree where things are dangling in the wrong place. Documentation of deprecated AST (Abstract Syntax Tree) capturesNote: AST captures is an experimental feature, the implementation or API might NPeg has a simple mechanism for storing captures in a tree data structure, The basic AST node has the following layout: ASTNode* = ref object
id*: string # user assigned AST node ID
val*: string # string capture
kids*: seq[ASTNode] # child nodes To parse a subject and capture strings into a tree of
The following snippet shows an example of creating an AST from arithmetic type Kind* = enum kInt, kAdd, kSub, kMul, kDiv
let s = peg "line":
line <- exp * !1
number <- A(kInt, >+Digit)
add <- A(kAdd, term * '+' * exp)
sub <- A(kSub, term * '-' * exp)
mul <- A(kMul, factor * '*' * term)
divi <- A(kDiv, factor * '/' * term)
exp <- add | sub | term
term <- mul | divi | factor
factor <- number | '(' * exp * ')'
let r = s.match("1+2*(3+4+9)+5*6")
let ast = r.capturesAST()
echo ast This will generate an AST tree with the following layout:
|
That's not a bad idea, and with a bit of syntactic sugar this could be nicely wrapped up in some blocks instead of explicit callbacks. Something like:
This is still a bit of a mismatch with how NPeg works at this time: there is no concept of an "upper rule" - all rules are sequential and there is no tree representation. In the message above you can see this could work by making the tree explicit in the syntax, but in practice this proves of limited use because the generated AST can only mirror the exact order of the input stream. |
I didn't know about the A operator, but I don't believe it's a good solution because it doesn't allow for arbitrary code execution and for custom ASTs. After analyzing the NPeg code I believe I have a more concrete proposal with no impact on the current spec. But you have to fill and check the details.
template fail(msg = "")
template check(expr: bool, msg = "")
template capture[T](`s`: T) Deprecate the old ones, but they should still work.
Note that, the Grabbing the doc example: type MyNode* = object
key*: string
val*: int
let parser = peg("pairs"):
pairs <- >pair * *(',' * >pair) * !1:
# do something with the captured result. Ideally it should be of type seq[MyNode]
word <- +Alpha
number <- +Digit
pair <- >word * '=' * >number:
let node = MyNode(key: %1, val: parseInt(%2))
capture(node) The algorithm execution goes like:
Improvements:
Challenges:
|
Thanks for giving this a good think. I'll have to go through this a few times and properly digest it, I'll probably be back with questions. |
@shumy-tools One thing I would like to point out is that what one might consider the optimal layout/ruleset for a grammar (in terms such as readability, performance, etc.) may not match up with their AST (also, there are uses for NPeg other than AST building). I don't imagine this to be too common an occurrence, but it is possible. |
@zevv Hi, I've been running into issues with limitations around code blocks: specifically because they're executed regardless of whether they're matched or not. A particular example I had trouble with was implementing a URI parser (not just validator), where I had quite a lot of backtracking due to some optional blocks. I don't know if the backtracking could be eliminated by proper use of precedence rules (I've yet to understand those) but figured I'd share what I ended up going with to sort of highlight what I find difficult to do in the current model. Example of a URI PEGNote the gross trick this PEG is pulling: it checks if optional captures were matched by checking the length of let parser = peg("uri", url: Url):
unreserved <- {'a'..'z', '0'..'9', '-', '.', '_', '~'}
gendelims <- {':', '/', '?', '#', '[', ']', '@'}
subdelims <- {'!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='}
reserved <- gendelims | subdelims
percentenc <- "%" * Xdigit[2]
pathchars <- unreserved | percentenc | subdelims
decpiece <- Digit |
("1" * Digit * Digit) |
("2" * {'0'..'4'} * Digit) |
("25" * {'0'..'5'})
ipv4addr <- decpiece * "." * decpiece * "." * decpiece * "." * decpiece
ipv6piece <- {'0'..'9', 'a'..'f'}[1..4]
ipv6addr <- (ipv6piece * ":")[7] * ipv6piece |
(ipv6piece * ":")[0..6] * ":" * (":" * ipv6piece)[1] |
(ipv6piece * ":")[0..5] * ":" * (":" * ipv6piece)[2] |
(ipv6piece * ":")[0..4] * ":" * (":" * ipv6piece)[3] |
(ipv6piece * ":")[0..3] * ":" * (":" * ipv6piece)[4] |
(ipv6piece * ":")[0..2] * ":" * (":" * ipv6piece)[5] |
(ipv6piece * ":")[0..1] * ":" * (":" * ipv6piece)[6] |
":" * (":" * ipv6piece)[7]
scheme <- Alpha * *(Alpha | Digit | '+' | '-' | '.'):
url.scheme = $0
userinfo <- *(pathchars | ":"):
url.userinfo = $0
host <- ipv4addr | "[" * ipv6addr * "]" | *pathchars:
url.host = $0
port <- *Digit:
url.port = parseInt($0)
hostport <- host * ?(":" * >port):
if capture.len != 2:
url.port = 0
authority <- ?(>userinfo * "@") * hostport:
url.authority = $0
if capture.len != 2:
url.userinfo = ""
path <- (?("/") * *(+pathchars * *("/" * *pathchars))):
url.path = $0
relative <- ("//" * >authority * path) | path:
if capture.len != 2:
url.authority = ""
url.host = ""
query <- *(pathchars | ":" | "@" | "/" | "?"):
url.query = $0
fragment <- *(pathchars | ":" | "@" | "/" | "?"):
url.fragment = $0
schemerelative <- ?(>scheme * ":") * relative:
if capture.len != 2: url.scheme = ""
schemerelativequery <- schemerelative * ?("?" * >query):
if capture.len != 2: url.query = ""
uri <- schemerelativequery * ?("#" * >fragment):
if capture.len != 2: url.fragment = "" I've been thinking a little bit about how to solve this: because the simple "don't execute code blocks for stuff that isn't matched" isn't a great solution: since one purpose of code blocks is to further restrict matches, it'd lead to a bunch of implicit ordering rules. My second idea might work. Essentially, I'd like it if code blocks and captures always returned a type: when there are no captures, this is simply a string ( let parser = peg:
rule <- expression: Type =
# arbitrary code execution
return capture This would still allow the above example - mutating a global-ish variable - but it'd be an antipattern and better solved by a sequence of returning types. For an example of what this would look like, I rewrote the original PEG I had problems with above below. Example with code blocks returning typesimport questionable
let parser = peg("uri"):
unreserved <- {'a'..'z', '0'..'9', '-', '.', '_', '~'}
gendelims <- {':', '/', '?', '#', '[', ']', '@'}
subdelims <- {'!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='}
reserved <- gendelims | subdelims
percentenc <- "%" * Xdigit[2]
pathchars <- unreserved | percentenc | subdelims
decpiece <- Digit |
("1" * Digit * Digit) |
("2" * {'0'..'4'} * Digit) |
("25" * {'0'..'5'})
ipv4addr <- decpiece * "." * decpiece * "." * decpiece * "." * decpiece
ipv6piece <- {'0'..'9', 'a'..'f'}[1..4]
ipv6addr <- (ipv6piece * ":")[7] * ipv6piece |
(ipv6piece * ":")[0..6] * ":" * (":" * ipv6piece)[1] |
(ipv6piece * ":")[0..5] * ":" * (":" * ipv6piece)[2] |
(ipv6piece * ":")[0..4] * ":" * (":" * ipv6piece)[3] |
(ipv6piece * ":")[0..3] * ":" * (":" * ipv6piece)[4] |
(ipv6piece * ":")[0..2] * ":" * (":" * ipv6piece)[5] |
(ipv6piece * ":")[0..1] * ":" * (":" * ipv6piece)[6] |
":" * (":" * ipv6piece)[7]
scheme <- Alpha * *(Alpha | Digit | '+' | '-' | '.')
userinfo <- *(pathchars | ":")
host <- ipv4addr | "[" * ipv6addr * "]" | *pathchars
port <- >*Digit: int =
return parseInt($1)
authority <- ?(>userinfo * "@") * >host * ?(":" * >port): (string, ?string, string, ?int) =
return ($0, $1, $2, $3)
path <- (?("/") * *(+pathchars * *("/" * *pathchars)))
query <- *(pathchars | ":" | "@" | "/" | "?")
fragment <- *(pathchars | ":" | "@" | "/" | "?")
uri <- ?(>scheme * ":") * ?("//" * >authority) * >path * ?("?" * >query) * ?("#" * >fragment): Url =
assert $1 is ?string and $2 is ?(string, ?string, string, ?int)
assert $3 is string and $4 is ?string and $5 is ?string
return Url(scheme: $1 |? "", authority: $2.?0 |? "",
userinfo = $2.?1 |? "", host = $2.?2, port = $2.?3 |? 0
path: $3, query: $4 |? "", fragment: $5 |? "") To go into more detail: Instead of having a
I'm not sure how much I like my second proposal. It's pretty syntactically dense, albeit with mostly standard syntax and significantly aided by questionable. The idea of rules having a type that can then be captured has stuck in my head though: I don't know what the nicest cleanest way to do it would be. Sorry if this is a bit of a brain dump, I don't have all the semantics of this approach thought through, but wanted to get them down somewhere. |
First, I would like to thank you for creating the powerful npeg project. It is really intuitive and easy to understand. After using it for a while, I found this situation, which led to this issue. Here is my immature idea, and referring to the above discussion, I think it is indeed necessary to represent each capture as import npeg
type
TokenKind = enum
Id, Ellipsis, Comma
Token = ref object
kind: TokenKind
str: string
Context = ref object
proc `==`(t: Token, tk: TokenKind): bool = t.kind == tk
let lexer = peg(input, tokens: seq[Token]):
id <- +Alpha:
tokens.add Token(kind: Id, str: $0)
ellipsis <- "...":
tokens.add Token(kind: Ellipsis, str: $0)
comma <- *Blank * >',' * *Blank:
tokens.add Token(kind: Comma, str: $1)
args <- id * *(comma * id) * ?(comma * ellipsis)
input <- '(' * ?args * ')' * !1
let parser = peg(input, Token, s: seq[string]):
arg <- [Id]:
s.add ($0).str
comma <- [Comma]:
s.add ($0).str
ellipsis <- [Ellipsis]:
s.add ($0).str
args <- arg * *(comma * arg)
argList <- args * ?(comma * ellipsis)
input <- argList * !1
var tokens: seq[Token]
doAssert lexer.match("(a, ...)", tokens).ok
var s: seq[string]
doAssert parser.match(tokens, s).ok In the lexer, I do not want to re-analyze the indeterminate-length result of In addition, in the This might help to access indeterminate-length result, but not the issues generated by traceback. |
Here's a request to all NPeg users: I'm looking for some ideas and feedback for the proper design of the API for grammar code blocks. At this time the available functions feel ad hoc and scattered, and I don't dare to call this v1.0 before a better interface is in place.
Here is a list of things a code block capture can do, and how this works now:
Access captures: when a code block executes, all closed captures that were made inside the rule, and the code block capture itself are available to the code block as the variable
capture
which is of typeseq[Capture]
. TheCapture
object has a few public members that can be used from the code block:s
is the captured string andsi
is the index into the original subject where the capture was matched. The capture stringss
are also accessible through a sugar macro that rewrites the code block as$n
:capture[0]
or$0
is the total capture of the rulecapture[1..]
or$1
are the explicit string captures inside the ruleForce a rule to fail:
fail()
which will cause the match to fail - this can be used to validate matches with Nim code, for example to check integer ranges, enum values, or symbols that were captured earlier.validate(b: bool)
, which is nothing more thenif not b: fail()
For example:
All of the above features were introduced in NPeg one at a time, growing the API lacking a proper design.
My question: does the above feel as kludgy as I think it is, or is it fine as it is now? Any ideas for improvement or what this should look like?
The text was updated successfully, but these errors were encountered: