Best way of handling languages that have unicode identifiers? #72
I'm considering writing a PEG grammar for Ada, which requires Unicode identifiers. (The grammar is going to be another huge one, so we'll see if npeg can handle it; I saw the issue about Nim choking on an SQL grammar, so...)
I see a couple of ways to get this to work:
1. Generating explicit rules that spell out every allowed codepoint as sequences of raw `char`s.
2. Keeping state in the peg macro and checking identifiers from Nim code.
The first one would be very, very ugly. Doable, but very ugly indeed. Granted, if I go top-down instead of the order the reference manual goes in (bottom-up), those rules would be way at the bottom, so probably nobody would actually have to read them. But the second one is a bit challenging. One of the closed issues mentioned being able to keep state in the peg macro, but I didn't see anything about that in the README, so is that still a thing?
I suppose a third option would be doing some kind of hack where I figure out how to match identifiers, but hijack the parsing to do that in the code blocks (this might actually be a good feature to add if it isn't already possible, for issues just like this one). If I could do that, I could just import nim-unicodedb and check against its version of the UCD.
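A rough sketch of what that third option could look like, assuming NPeg's `validate()` helper for code blocks; `isAlpha` from std/unicode stands in here for a real nim-unicodedb category lookup, and all rule names are invented:

```nim
import npeg, std/unicode

let ident = peg "ident":
  # Any ASCII identifier byte, or any non-ASCII byte (part of a
  # multi-byte UTF-8 sequence); the code block sorts out validity.
  identByte <- {'A'..'Z', 'a'..'z', '0'..'9', '_'} | {'\x80'..'\xff'}
  ident <- >(+identByte):
    var ok = true
    for r in runes($1):          # decode the capture rune by rune
      if not (r.isAlpha or $r == "_" or ($r)[0] in {'0'..'9'}):
        ok = false
    validate ok                  # make the match fail on a bad rune

doAssert ident.match("π_élan2").ok
doAssert not ident.match("…").ok   # U+2026 is punctuation, not a letter
```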
Interesting task you set yourself! Will this be limited to UTF-8 encoding only?
Well, size-wise I don't expect any big problems, I've generated some humongous parsers in the past without running into any issues, although compilation time might increase a bit. Run time performance and memory usage of NPeg parsers should not suffer from grammar size.
I need more details to get a clear mental picture of what you're up to, but I guess one important question is how strict you need your parser to be - do you want to accept valid Ada, and only valid Ada, or is it acceptable to make the parser more lenient and accept things that might violate the spec? The latter case would probably massively simplify the efforts you need to put into this.
Which issue are you referring to? Would https://github.com/zevv/npeg?tab=readme-ov-file#passing-state be what you are looking for?
That's where code blocks can usually help: there is no reason to try to solve everything inside the NPeg parser grammar itself; you are free to delegate parts to Nim code if that helps you. If you find the current API lacking, I'd be more than happy to see if we can mold NPeg to support your use case.
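For illustration, a minimal sketch of such delegation, with the rule names and the range check invented for the example (`validate` aborts the match when its argument is false):

```nim
import npeg, std/strutils

let p = peg "nums":
  nums <- num * *(',' * num) * !1
  num <- >(+Digit):
    # Plain Nim runs in this code block: reject anything over a byte.
    validate parseInt($1) <= 255

doAssert p.match("10,42,255").ok
doAssert not p.match("10,999").ok
```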
Yes; I could expand it to UTF-16 and UTF-32 later, though that wouldn't be very hard, since I believe every code point expressible in UTF-8 is also expressible in UTF-16 and UTF-32. So I'd still need to support all Unicode code points. (IMO converting the data to UTF-32 would make it easier to work with, if only because then I wouldn't need to worry about problems like "is this byte in the middle of a code point?".) However, since Nim mainly supports UTF-8, that might be a good place to start.
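For what it's worth, the UTF-8 byte structure itself is easy to express directly in a PEG, so "is this byte in the middle of a code point?" can be answered without converting to UTF-32. A hand-rolled sketch, checking structure only (no overlong or surrogate rejection):

```nim
import npeg

let utf8cp = peg "cp":
  cp    <- ascii | two | three | four
  ascii <- {'\x00'..'\x7f'}
  cont  <- {'\x80'..'\xbf'}        # continuation byte 10xxxxxx
  two   <- {'\xc2'..'\xdf'} * cont
  three <- {'\xe0'..'\xef'} * cont * cont
  four  <- {'\xf0'..'\xf4'} * cont * cont * cont

doAssert utf8cp.match("π").ok           # one two-byte code point
doAssert not utf8cp.match("\x80").ok    # stray continuation byte
```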
Can you define "things that might violate the spec"? Do you mean accepting a subset of the overall Unicode codepoint space (like extended ASCII)? (For context, I think being more lenient might bite me later, because the spec's math library defines both a constant Pi and a constant π, so Unicode identifiers appear in the standard library itself.)
As for making it more complicated, I don't think it necessarily would. I can generate the Unicode parsing rules using ucd-generate, taking the Rust code it emits and dumping a set of Unicode general category rules from it, since I'm too lazy to try to parse the UCD myself lol.
I was referring to the ability to change the token type used, where someone suggested you could use square brackets in the grammar... I think passing state might be it, but I might be wrong.
Can you explain this more? How could I override the parser like that in parts, if I need to specify the rule to match first?
If you want to accept only identifiers as specified, you will need a strict definition of all the individual parts that make up an identifier. But of course this depends on your requirements, and indeed it might bite you later.
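To make "strict definition of the individual parts" concrete, here is roughly the shape of the Ada identifier rule with its category sets shrunk to ASCII; a real strict parser would swap `Alpha`/`Digit` for the full Unicode general-category rules:

```nim
import npeg

let adaIdent = peg "identifier":
  # letter {[underline] letter_or_digit}: underscores only singly,
  # never leading or trailing
  identifier <- letter * *(?'_' * alnum) * !1
  letter     <- Alpha              # stands in for identifier_start
  alnum      <- Alpha | Digit      # stands in for identifier_extend

doAssert adaIdent.match("Foo_Bar_2").ok
doAssert not adaIdent.match("Trailing_").ok
doAssert not adaIdent.match("Double__underscore").ok
```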
Sounds like a solid plan.
Aah, right, I just realized I lied. In an earlier version of NPeg the internal parser state was accessible to the code blocks (although not documented), so it was possible to inspect and modify the subject string and parser index from Nim code, basically allowing you to say: I'll take over from here, parse a bit with Nim code, update the NPeg state, and then give control back to NPeg. See the bottom example in the first message at #14. This is no longer possible in the current implementation, although I once planned to bring this back as a proper feature when the need arises.
Yeah, so I don't know how well this is going to work. Is there a way I can include rules from other parsers? Or write rules in a separate file? The size of the UCD means that if I include all general categories, the resulting list of rules is about 70 MB, and I don't even know if it'll parse lol. (No, I didn't think it'd be that big, but that's what you get when you have to split multi-byte characters into their individual bytes...)
This is particularly problematic since the Ada specification uses the graphic character alias category for character literals, which, according to Wikipedia, is "those with General Category Letter, Mark, Number, Punctuation, Symbol or Zs=space."
Yes, see https://github.com/zevv/npeg?tab=readme-ov-file#composing-grammars-with-libraries
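A rough sketch of how a generated file could slot in as a library; the `ada` namespace and its single rule are invented for the example, and the real generated rules would be the huge byte-range choices:

```nim
import npeg

# This grammar block could live in its own (generated) file.
grammar "ada":
  identStart <- {'A'..'Z', 'a'..'z'}

let p = peg "ident":
  ident <- ada.identStart * *(ada.identStart | {'0'..'9'} | '_') * !1

doAssert p.match("Ada_2022").ok
```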
@zevv Oh good. Though figuring out how to debug this is going to be an utter pain... Especially if it doesn't compile. I really hope I got the syntax right on the first go.
My advice would be to build your grammar bottom-up; first write the parsers for your basic atoms like identifiers, keywords and operators. Write little tests and keep these in place. From there, slowly build up the grammar itself, parsing larger and larger fragments. So, what is the end product of this parser for you? Will you be generating an AST of the code?
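For instance, a little anchored test for a single atom might look like this (names invented for the example):

```nim
import npeg

let keyword = peg "kw":
  kw <- ("begin" | "end" | "procedure") * !1

doAssert keyword.match("begin").ok
doAssert not keyword.match("beginning").ok  # !1 anchors the whole input
```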
Most likely, yes. I've written a compiler before (not the cleanest, and for university), but we didn't cover ASTs/semantic analysis/code generation as it is done now. I don't know how far I'll get, but I'm willing to give it a go anyway lol
@zevv Well, we might have an issue: when I build this small parser, I get this error in debug builds of the compiler, and in release builds it crashes: `Error: call depth limit reached in a debug build (2000 function calls). You can change it with -d:nimCallDepthLimit=<int>`. I have no idea where the recursion is happening, but it may be happening in the peg macro. Or maybe the grammar one... I pinged you on Matrix.