
Implement initial roc tokenizer in zig #7569

Open · wants to merge 10 commits into base: main

Conversation

joshuawarner32
Collaborator

No description provided.

Comment on lines 26 to 30
.zg = .{
.url = "https://codeberg.org/dude_the_builder/zg/archive/v0.13.2.tar.gz",
.hash = "122055beff332830a391e9895c044d33b15ea21063779557024b46169fb1984c6e40",
},
},
Collaborator

What are we using Unicode text processing for here? I can't find it. Maybe this isn't being used?

Collaborator

nvm, I think I found it: `const GenCatData = @import("GenCatData");`

Collaborator Author

Yup, that's it!

Collaborator

I think these are strays?

Collaborator Author

These are intentional, and I think they may actually be bugs. Somehow the current parser seems to allow numbers with arbitrary suffixes, which I didn't want to repeat (supposing I'm right and it is a bug...).

Either way, I'll need to either correct the test or implement this behavior, since I'm using the existing snapshot tests as a smoke test here (and that one was smoking!)

Collaborator

I don't think we need this?

Collaborator

@bhansconnect bhansconnect left a comment

I didn't really review the correctness of a lot of the actual tokenization logic, but I think that will come with tests (hopefully many snapshot tests or similar will eventually be added).

Given this is just a quick prototype, I'm not sure my comments truly need to be addressed, but if we want this to be the base of the Roc Zig compiler, we should probably try to address many of them and start with as solid a set of practices as possible.

Collaborator

Not that any of this needs to be done in this PR, but we probably want only a single build.zig and build.zig.zon for the entire compiler. So it would either be at the root level or at src/build.zig.

I don't think the tokenizer should need its own build file. Instead it would just be included from the root main file of the compiler (which obviously doesn't exist yet). But for now, we could probably just set this up as a library with tests instead of an executable.
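A hedged sketch of the library-with-tests setup described above, using the Zig 0.13-era build API (module name, paths, and step description are assumptions, not from the PR):

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Expose the tokenizer as a module rather than an executable.
    _ = b.addModule("tokenize", .{
        .root_source_file = b.path("src/check/parse/tokenize/src/main.zig"),
    });

    // `zig build test` runs the tokenizer's unit tests.
    const tests = b.addTest(.{
        .root_source_file = b.path("src/check/parse/tokenize/src/main.zig"),
        .target = target,
        .optimize = optimize,
    });
    const test_step = b.step("test", "Run tokenizer tests");
    test_step.dependOn(&b.addRunArtifact(tests).step);
}
```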

src/check/parse/tokenize/src/root.zig (outdated; resolved)
src/check/parse/tokenize/build.zig (outdated; resolved)
src/check/parse/tokenize/build.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
return try self.tokenizeStringLikeLiteralBody(false, term, start, multiline);
}

pub fn tokenizeStringLikeLiteralBody(self: *Tokenizer, already_started: bool, term: u8, start: usize, multiline: bool) !T {
Collaborator

I think it would be really nice to change the bool inputs to this function into enums. It is really hard to read a call to this function and understand the args (especially with two bool inputs).
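A minimal sketch of that suggestion; the enum names (`StringState`, `LineKind`) are illustrative, not from the PR:

```zig
// Hypothetical enums replacing the two bool parameters.
const StringState = enum { fresh, already_started };
const LineKind = enum { single_line, multiline };

pub fn tokenizeStringLikeLiteralBody(
    self: *Tokenizer,
    state: StringState,
    term: u8,
    start: usize,
    kind: LineKind,
) !T {
    // ... body unchanged ...
}

// The call site from the diff above then reads unambiguously:
// return try self.tokenizeStringLikeLiteralBody(.fresh, term, start, .multiline);
```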

src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
src/check/parse/tokenize/src/main.zig (outdated; resolved)
}
};

/// Indent holds counts for spaces and tabs.
Collaborator

I was going to leave the same feedback. I think you can't possibly compare indents without knowing the order of tabs and spaces (and also without making assumptions about tab width). I think it's normal in whitespace-sensitive languages to have tokens to handle this, like:

  • Indent (a newline followed by the next increment of indentation)
  • Dedent (a newline followed by the next decrement of indentation)
  • Newline (a newline followed by the same level of indentation)
  • Comment (just to allow easy collection later)
  • Spaces (to collect spaces between levels of indentation - ie, for alignment - should only appear after Indent, Dedent, or Newline)

Then in your parsers it's a simple left-to-right run over a single array (or, in an SoA approach, using a single cursor).
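The token scheme described above could be sketched as the following tag set (names are illustrative, not from the PR):

```zig
// Hypothetical indentation-aware token tags for a whitespace-sensitive tokenizer.
const Tag = enum {
    indent, // newline followed by the next increment of indentation
    dedent, // newline followed by the next decrement of indentation
    newline, // newline followed by the same level of indentation
    comment, // kept as a token to allow easy collection later
    spaces, // alignment spaces; should only appear after indent, dedent, or newline
    // ... the remaining token kinds ...
};
```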

}
};

/// Indent holds counts for spaces and tabs.
Collaborator

Spaces as a token is really only relevant if we want to either a) allow arbitrary line alignment and/or b) be very forgiving about whitespace sensitivity. They can also be thrown away (cheap) or "rounded up" to a new level of indentation.

src/check/parse/tokenize/src/main.zig (outdated; resolved)
Comment on lines 649 to 653
while (self.pos < self.buf.len) {
const c = self.buf[self.pos];
if ((c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or (c >= '0' and c <= '9') or c == '_') {
self.pos += 1;
} else {
Collaborator

I'd rather see consistent use of peek and advance (to ensure we stay in bounds).
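A sketch of what that could look like, assuming hypothetical `peek`/`advance` helpers on `Tokenizer` that are not in the PR as written:

```zig
// Hypothetical bounds-checked helpers; peek returns null at end of input.
fn peek(self: *Tokenizer) ?u8 {
    return if (self.pos < self.buf.len) self.buf[self.pos] else null;
}

fn advance(self: *Tokenizer) void {
    std.debug.assert(self.pos < self.buf.len);
    self.pos += 1;
}

// The identifier loop from the diff, rewritten on top of them:
while (self.peek()) |c| {
    const is_ident = (c >= 'a' and c <= 'z') or (c >= 'A' and c <= 'Z') or
        (c >= '0' and c <= '9') or c == '_';
    if (!is_ident) break;
    self.advance();
}
```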

pub const Parser = struct {
pos: usize,
tokens: tokenize.TokenizedBuffer,
nodes: std.MultiArrayList(Node),
Collaborator

Just to level set: are you expecting all nodes to be pushed in a flat, linear fashion? And we would have to have a cursor on the nodes array as well?

Collaborator

I looked through the Zig parser and I think I understand where you are going.
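For reference, the flat/SoA layout discussed above might look like this (a sketch; the `Node` fields and module name are assumptions loosely modeled on Zig's own `std.zig.Ast`, not the PR's code):

```zig
const std = @import("std");
const tokenize = @import("tokenize.zig"); // assumed module name

// Hypothetical flat node: children are indices back into the same array,
// so the whole tree lives in one MultiArrayList.
pub const Node = struct {
    tag: Tag,
    main_token: u32, // index into the token buffer
    data: Data, // lhs/rhs child indices into the nodes array

    pub const Tag = enum { root, ident, num_literal };
    pub const Data = struct { lhs: u32, rhs: u32 };
};

pub const Parser = struct {
    pos: usize,
    tokens: tokenize.TokenizedBuffer,
    nodes: std.MultiArrayList(Node),

    // Consumers walk the SoA columns with a single cursor.
    pub fn walkTags(self: *const Parser) void {
        const tags = self.nodes.items(.tag);
        var cursor: usize = 0;
        while (cursor < tags.len) : (cursor += 1) {
            _ = tags[cursor];
        }
    }
};
```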
