Implement initial roc tokenizer in zig #7569
Conversation
.zg = .{
    .url = "https://codeberg.org/dude_the_builder/zg/archive/v0.13.2.tar.gz",
    .hash = "122055beff332830a391e9895c044d33b15ea21063779557024b46169fb1984c6e40",
},
},
What are we using unicode text processing for in here? I can't find it. Maybe this isn't being used?
nvm, I think I found it: `const GenCatData = @import("GenCatData");`
Yup, that's it!
I think these are strays?
These are intentional, and I think they may actually be bugs. Somehow the current parser seems to allow numbers with arbitrary suffixes, which I didn't want to repeat (supposing I am right and it is a bug...).
Either way, I'll need to either correct the test or implement this behavior, since I'm using the existing snapshot tests as a smoke test here (and that one was smoking!)
I don't think we need this?
I didn't really review the correctness of a lot of the actual tokenization logic, but I think that will come with tests (hopefully many snapshot tests or similar will eventually be added).
Given this is just a quick prototype, I'm not sure my comments truly need to be addressed, but if we want this to be the base of the Roc Zig compiler, we probably should try to address many of them and start with practices as solid as possible.
src/check/parse/tokenize/build.zig
Not that any of this needs to be done in this PR, but we probably want only a single build.zig and build.zig.zon for the entire compiler. So it would either be at the root level, or it would be src/build.zig.
The tokenizer should not need its own build file, I don't think. Instead it would just be included from the root main file of the compiler (which obviously doesn't exist yet). But for now, we probably could just set this up as a library with tests instead of an executable.
crates/compiler/test_syntax/tests/snapshots/pass/tuple_function_annotation.expr.formatted.roc
return try self.tokenizeStringLikeLiteralBody(false, term, start, multiline);
}

pub fn tokenizeStringLikeLiteralBody(self: *Tokenizer, already_started: bool, term: u8, start: usize, multiline: bool) !T {
I think it would be really nice to change the bool inputs to this function into enums. It is really hard to read a call to this function and understand the args (especially with two bool inputs).
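A minimal sketch of what that could look like; the enum names (`StringStart`, `Multiline`) are invented here for illustration and aren't from the PR:

```zig
// Hypothetical: replace each bool parameter with a named two-state enum
// so call sites are self-documenting.
const StringStart = enum { not_started, already_started };
const Multiline = enum { single_line, multiline };

pub fn tokenizeStringLikeLiteralBody(
    self: *Tokenizer,
    start_state: StringStart,
    term: u8,
    start: usize,
    multiline: Multiline,
) !T {
    // ...body unchanged; branch on `start_state == .already_started` etc.
}

// A call then reads naturally instead of passing two bare bools:
// try self.tokenizeStringLikeLiteralBody(.not_started, term, start, .single_line);
```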
}
};

/// Indent holds counts for spaces and tabs.
I was going to leave the same feedback. I think you can't possibly compare indents without knowing the order of tabs and spaces (and you also have to make assumptions about tab width). I think it's normal in whitespace-sensitive languages to have tokens to handle this, like:
- Indent (a newline followed by the next increment of indentation)
- Dedent (a newline followed by the next decrement of indentation)
- Newline (a newline followed by the same level of indentation)
- Comment (just to allow easy collection later)
- Spaces (to collect spaces between levels of indentation, i.e. for alignment; should only appear after Indent, Dedent, or Newline)
Then in your parser it's a simple left-to-right run over a single array (or, in the SoA approach, using a single cursor).
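The token kinds described above could be sketched as a Zig enum; this shape is purely illustrative, not what the PR implements:

```zig
// Hypothetical indentation-aware token kinds for a whitespace-sensitive
// tokenizer, following the list in the comment above.
const LayoutToken = enum {
    indent, // newline followed by the next increment of indentation
    dedent, // newline followed by the next decrement of indentation
    newline, // newline followed by the same level of indentation
    comment, // kept as a token to allow easy collection later
    spaces, // alignment spaces; only valid after indent/dedent/newline
};
```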
}
};

/// Indent holds counts for spaces and tabs.
Spaces as a token is really only relevant if we want to either a) allow arbitrary line alignment and/or b) be very forgiving in the whitespace sensitivity. They can also be thrown away (cheap) or "rounded up" to a new level of indentation.
while (self.pos < self.buf.len) {
    const c = self.buf[self.pos];
    if ((c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or (c >= '0' and c <= '9') or c == '_') {
        self.pos += 1;
    } else {
I'd rather see consistent use of peek and advance (to ensure we stay in bounds).
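A sketch of what bounds-checked cursor helpers might look like; the names follow the suggestion above, but the bodies (and the `isIdentChar` helper in the usage comment) are invented for illustration:

```zig
// Hypothetical cursor helpers: peek returns null at end-of-buffer, and
// advance refuses to move past it, so callers can't index out of bounds.
fn peek(self: *const Tokenizer) ?u8 {
    if (self.pos >= self.buf.len) return null;
    return self.buf[self.pos];
}

fn advance(self: *Tokenizer) void {
    if (self.pos < self.buf.len) self.pos += 1;
}

// The identifier loop then stays in bounds by construction:
// while (self.peek()) |c| {
//     if (isIdentChar(c)) self.advance() else break;
// }
```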
pub const Parser = struct {
    pos: usize,
    tokens: tokenize.TokenizedBuffer,
    nodes: std.MultiArrayList(Node),
Just to level set: are you expecting all nodes to be pushed in a flat, linear fashion? And we would have to have a cursor on the nodes array as well?
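For concreteness, the flat node layout being asked about could look roughly like this, modeled on the Zig compiler's own index-based AST; the `Node` fields and `pushNode` helper are invented for illustration, not taken from the PR:

```zig
const std = @import("std");

// Hypothetical flat AST node: children are referenced by u32 index into
// the same array rather than by pointer, so nodes can live in a
// MultiArrayList (SoA) and be appended linearly during parsing.
const Node = struct {
    tag: enum { root, ident, call },
    main_token: u32, // index into the token buffer
    lhs: u32, // child node indices (meaning depends on tag)
    rhs: u32,
};

// Appending returns the new node's index, which the parser stores in a
// parent's lhs/rhs; a separate cursor walks the array later.
fn pushNode(nodes: *std.MultiArrayList(Node), gpa: std.mem.Allocator, node: Node) !u32 {
    const idx: u32 = @intCast(nodes.len);
    try nodes.append(gpa, node);
    return idx;
}
```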
I looked through the Zig parser and I think I understand where you are going.