
Implement initial roc tokenizer in zig #7569

Open · wants to merge 10 commits into base: main

Conversation

joshuawarner32
Collaborator

No description provided.

Comment on lines 26 to 30
.zg = .{
.url = "https://codeberg.org/dude_the_builder/zg/archive/v0.13.2.tar.gz",
.hash = "122055beff332830a391e9895c044d33b15ea21063779557024b46169fb1984c6e40",
},
},
Collaborator

What are we using Unicode text processing for here? I can't find it. Maybe this isn't being used?

Collaborator

nvm, I think I found it: `const GenCatData = @import("GenCatData");`

Collaborator Author

Yup, that's it!

Collaborator

I think these are strays?

Collaborator Author

These are intentional, and I think they may actually be bugs. Somehow the current parser seems to allow numbers with arbitrary suffixes, which I didn't want to repeat (supposing I'm right and it is a bug...).

Either way, I'll need to either correct the test or implement this behavior, since I'm using the existing snapshot tests as a smoke test here (and that one was smoking!)

Collaborator

I don't think we need this?

Collaborator

@bhansconnect bhansconnect left a comment

I didn't really review the correctness of a lot of the actual tokenization logic, but I think that will come with tests (hopefully many snapshot tests or similar will eventually be added).

Given this is just a quick prototype, I'm not sure my comments truly need to be addressed, but if we want this to be the base of the Roc Zig compiler, we should probably try to address many of them and start with as solid a set of practices as possible.

Collaborator

Not that any of this needs to be done in this PR, but we probably want only a single build.zig and build.zig.zon for the entire compiler. So it would either be at the root level or at src/build.zig.

I don't think the tokenizer should need its own build file. Instead it would just be included from the root main file of the compiler (which obviously doesn't exist yet). But for now, we could probably just set this up as a library with tests instead of an executable.
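A hedged sketch of the library-with-tests setup described above, using the Zig 0.13-era build API (module name, paths, and step description are assumptions, not from the PR):

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Expose the tokenizer as a module rather than an executable.
    _ = b.addModule("tokenize", .{
        .root_source_file = b.path("src/check/parse/tokenize/src/main.zig"),
    });

    // `zig build test` runs the tokenizer's unit tests.
    const tests = b.addTest(.{
        .root_source_file = b.path("src/check/parse/tokenize/src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
    const test_step = b.step("test", "Run tokenizer tests");
    test_step.dependOn(&b.addRunArtifact(tests).step);
}
```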

src/check/parse/tokenize/src/root.zig (outdated; resolved)
src/check/parse/tokenize/build.zig (outdated; resolved)
src/check/parse/tokenize/build.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
return try self.tokenizeStringLikeLiteralBody(false, term, start, multiline);
}

pub fn tokenizeStringLikeLiteralBody(self: *Tokenizer, already_started: bool, term: u8, start: usize, multiline: bool) !T {
Collaborator

I think it would be really nice to change the bool inputs to this function into enums. It is really hard to read a call to this function and understand the args (especially with two bool inputs).
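A minimal sketch of that suggestion; the enum names (`StringState`, `LineKind`) are illustrative, not from the PR:

```zig
// Hypothetical enums replacing the two bool parameters.
const StringState = enum { fresh, already_started };
const LineKind = enum { single_line, multiline };

pub fn tokenizeStringLikeLiteralBody(
    self: *Tokenizer,
    state: StringState,
    term: u8,
    start: usize,
    kind: LineKind,
) !T {
    // ... body unchanged ...
}

// The call site from the diff above then reads unambiguously:
// return try self.tokenizeStringLikeLiteralBody(.fresh, term, start, .multiline);
```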

src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
}
};

/// Indent holds counts for spaces and tabs.
Collaborator

I was going to leave the same feedback. I think you can't possibly compare indents without knowing the order of tabs and spaces (and also without making assumptions about tab width). I think it's normal in whitespace-sensitive languages to have tokens to handle this, like:

  • Indent (a newline followed by the next increment of indentation)
  • Dedent (a newline followed by the next decrement of indentation)
  • Newline (a newline followed by the same level of indentation)
  • Comment (just to allow easy collection later)
  • Spaces (to collect spaces between levels of indentation - ie, for alignment - should only appear after Indent, Dedent, or Newline)

Then in your parsers it's a simple left-to-right run over a single array (or, in an SoA approach, using a single cursor).
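The token scheme described above could be sketched as the following tag set (names are illustrative, not from the PR):

```zig
// Hypothetical indentation-aware token tags for a whitespace-sensitive tokenizer.
const Tag = enum {
    indent, // newline followed by the next increment of indentation
    dedent, // newline followed by the next decrement of indentation
    newline, // newline followed by the same level of indentation
    comment, // kept as a token to allow easy collection later
    spaces, // alignment spaces; should only appear after indent, dedent, or newline
    // ... the remaining token kinds ...
};
```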

}
};

/// Indent holds counts for spaces and tabs.
Collaborator

Spaces as a token is really only relevant if we want to either a) allow arbitrary line alignment and/or b) be very forgiving about whitespace sensitivity. They can also be thrown away (cheap) or "rounded up" to a new level of indentation.

src/check/parse/tokenize/src/main.zig (outdated; resolved)
Comment on lines 649 to 653
while (self.pos < self.buf.len) {
const c = self.buf[self.pos];
if ((c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or (c >= '0' and c <= '9') or c == '_') {
self.pos += 1;
} else {
Collaborator

I'd rather see consistent use of peek and advance (to ensure we stay in bounds).
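A sketch of what that could look like, assuming hypothetical `peek`/`advance` helpers on `Tokenizer` that are not in the PR as written:

```zig
// Hypothetical bounds-checked helpers; peek returns null at end of input.
fn peek(self: *Tokenizer) ?u8 {
    return if (self.pos < self.buf.len) self.buf[self.pos] else null;
}

fn advance(self: *Tokenizer) void {
    std.debug.assert(self.pos < self.buf.len);
    self.pos += 1;
}

// The identifier loop from the diff, rewritten on top of them:
while (self.peek()) |c| {
    const is_ident = (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or
        (c >= '0' and c <= '9') or c == '_';
    if (!is_ident) break;
    self.advance();
}
```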

pub const Parser = struct {
pos: usize,
tokens: tokenize.TokenizedBuffer,
nodes: std.MultiArrayList(Node),
Collaborator

Just to level set: are you expecting all nodes to be pushed in a flat, linear fashion? And we would have to have a cursor on the nodes array as well?

Collaborator

I looked through the Zig parser and I think I understand where you are going.
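For reference, the flat/SoA layout discussed above might look like this (a sketch; the `Node` fields and module name are assumptions loosely modeled on Zig's own `std.zig.Ast`, not the PR's code):

```zig
const std = @import("std");
const tokenize = @import("tokenize.zig"); // assumed module name

// Hypothetical flat node: children are indices back into the same array,
// so the whole tree lives in one MultiArrayList.
pub const Node = struct {
    tag: Tag,
    main_token: u32, // index into the token buffer
    data: Data, // lhs/rhs child indices into the nodes array

    pub const Tag = enum { root, ident, num_literal };
    pub const Data = struct { lhs: u32, rhs: u32 };
};

pub const Parser = struct {
    pos: usize,
    tokens: tokenize.TokenizedBuffer,
    nodes: std.MultiArrayList(Node),

    // Consumers walk the SoA columns with a single cursor.
    pub fn walkTags(self: *const Parser) void {
        const tags = self.nodes.items(.tag);
        var cursor: usize = 0;
        while (cursor < tags.len) : (cursor += 1) {
            _ = tags[cursor];
        }
    }
};
```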
