Tolk v0.7: overhaul compiler internals and the type system; bool type #1477

Open: tolk-vm wants to merge 4 commits into master

@tolk-vm tolk-vm commented Jan 13, 2025

Two months have passed since the announcement of Tolk. You might be wondering what was going on and why there have been no releases yet.

Throughout all November, I've been working on the vision of the future. My goal was to "visualize" what Tolk v1.0 should look like: what language we're all aiming for, so that it solves lots of practical problems, avoids manual cell/slice manipulation, and provides sufficient mechanisms for ABI generation, while still being zero-overhead. I have created a giant roadmap (40 PDF pages!) describing the vision and how, step by step, we're going to reach it.

Throughout all December, I've been constantly working on the compiler's kernel. As you know, Tolk is a fork of FunC. FunC compiler internals are very challenging to extend and modify. The way FunC looks is just a mirror of its internal implementation. Heading towards the future, I had to partially "untangle" this "legacy FunC core", so that in the future it will be able to "interbreed" with features it was not originally designed for.

Currently I am done with this preparation. Tolk v0.7 contains a fully rewritten semantic analysis kernel (though almost invisible to the end user, huh).

Notable changes in Tolk v0.7

  1. Under the hood: refactor and revamp compiler internals. AST-level semantic analysis kernel
  2. Under the hood: rewrite the type system from Hindley-Milner to static typing
  3. Clear and readable error messages on type mismatch
  4. Generic functions fun f<T>(...) and instantiations like f<int>(...)
  5. The bool type
  6. Type casting via value as T

The documentation and IDE plugins have been updated accordingly (see related pull requests below).

Now, let's cover every bullet in detail.

Refactor and revamp compiler internals

As of Tolk v0.6, I had managed to implement parsing source files to AST (completely missing in FunC), which gave me control over syntax. That's why changes in Tolk v0.6 were almost purely syntactical. The AST, after being parsed, was transformed to a "legacy core", so that the rest of the forked FunC core could keep working.

Heading towards the future, the AST should be converted directly to IR (intermediate representation), performing all semantic analysis at the AST level.

At the AST level, it is necessary to handle: lvalue/rvalue semantics; mutability analysis; unreachable code detection; symbol resolving; type inference and checks; various other validity checks.

This step is primarily about creating the foundation for semantic analysis, laying the groundwork for future enhancements.

Implementation details:

  • AST is converted directly to Op (a kind of IR representation), doing all code analysis (see above) at AST level
  • values of const variables are now calculated NOT based on CodeBlob, but via a newly-introduced AST-based constant evaluator
  • AST vertices are now inherited from expression/statement/other; expression vertices have common properties (TypeExpr, lvalue/rvalue)
  • symbol table is rewritten completely, SymDef/SymVal no longer exist, lexer now doesn't need to register identifiers
  • AST vertices have references to symbols, filled at different stages of pipeline
  • the remaining "FunC legacy part" is almost unchanged besides Expr which was fully dropped
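To make the idea of an AST-based constant evaluator more concrete, here is a minimal Python sketch. It is only a toy model, not the compiler's actual C++ code: the node classes `Num`, `ConstRef`, `BinOp` and the function `eval_const` are hypothetical names invented for this illustration. It shows a `const` value being computed by walking the expression tree directly, including references to other constants.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Num:          # hypothetical AST vertex: integer literal
    value: int


@dataclass(frozen=True)
class ConstRef:     # hypothetical AST vertex: reference to another const
    name: str


@dataclass(frozen=True)
class BinOp:        # hypothetical AST vertex: binary operator
    op: str
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Num, ConstRef, BinOp]

_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "<<": lambda a, b: a << b,
    "|": lambda a, b: a | b,
    "&": lambda a, b: a & b,
}


def eval_const(expr: Expr, consts: Dict[str, Expr],
               in_progress: Tuple[str, ...] = ()) -> int:
    """Evaluate a constant expression by walking the AST, with no IR involved."""
    if isinstance(expr, Num):
        return expr.value
    if isinstance(expr, ConstRef):
        if expr.name in in_progress:
            raise ValueError(f"const `{expr.name}` depends on itself")
        if expr.name not in consts:
            raise ValueError(f"undefined const `{expr.name}`")
        return eval_const(consts[expr.name], consts, in_progress + (expr.name,))
    if isinstance(expr, BinOp):
        if expr.op not in _OPS:
            raise ValueError(f"operator `{expr.op}` is not allowed in a constant")
        lhs = eval_const(expr.lhs, consts, in_progress)
        rhs = eval_const(expr.rhs, consts, in_progress)
        return _OPS[expr.op](lhs, rhs)
    raise TypeError("not a constant expression")


consts = {
    "FLAG_A": Num(1),
    "FLAG_B": BinOp("<<", Num(1), Num(3)),
    "MASK": BinOp("|", ConstRef("FLAG_A"), ConstRef("FLAG_B")),
}
print(eval_const(ConstRef("MASK"), consts))  # 1 | (1 << 3)
```

The `in_progress` tuple is a design choice of this sketch: it detects constants that refer to themselves, directly or through other constants.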

Rewrite the type system from Hindley-Milner to static typing

You know that FunC is "functional C". But do you know what makes it "functional"? Not the fact that FunC is very close to TVM. Not its peculiar syntax. And not even the ~ tilde. "Functional" is mostly about the Hindley-Milner type system, which had no conceptual changes in earlier Tolk versions but is now fully replaced.

The Hindley-Milner type system is a common approach for functional languages, where types are inferred from usage through unification. As a result, type declarations are not necessary:

int f(a, b) {
    return a + b;   // a and b now int, since `+` (int, int)
}

Another example:

() f(slice s) {} 

var s = null;
f(s);  // infer s as slice, since f accepts slice

One more example:

int f(x) {
    (a, b) = (0, x);
    return a + b;   // x becomes int, since x is assigned to b
}

In the FunC codebase, te_Indirect is about this, along with forall, which comes with its own nuances.

While this approach works for now, problems arise with the introduction of new types like bool, where !x must handle both int and bool. It will also become incompatible with int32 and other strict integers.

Example. When nullable types are introduced, we want null not to be assignable to int. However, with unification, the following would be valid:

var x: int = 0;   // unify(Hole, Int) = Int
...
x = null;         // unify(Int, Nullable<Hole>) = Nullable<Int>

Instead of an error, Hindley-Milner would perform unification and accept it. This will clash with structure methods, struggle with proper generics, and become entirely impractical for union types (despite claims that it was "designed for union types").
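The difference between unifying and validating can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the FunC or Tolk implementation: the classes `Prim`, `Nullable`, `Hole` and the functions `unify` and `check_assign` are hypothetical, and `unify` implements only the rule from the two-line example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prim:
    name: str


@dataclass(frozen=True)
class Hole:            # a "not yet known" type, as in unify(Hole, Int)
    pass


@dataclass(frozen=True)
class Nullable:
    inner: object


INT = Prim("int")
SLICE = Prim("slice")
NULL = Nullable(Hole())   # the type of a bare `null` literal in this model


def show(t) -> str:
    if isinstance(t, Prim):
        return t.name
    if isinstance(t, Hole):
        return "_"
    return "null" if isinstance(t.inner, Hole) else show(t.inner) + "?"


def unify(a, b):
    """Unification-style: find a type that satisfies both sides."""
    if isinstance(a, Hole):
        return b
    if isinstance(b, Hole):
        return a
    if a == b:
        return a
    if isinstance(a, Nullable) and isinstance(b, Nullable):
        return Nullable(unify(a.inner, b.inner))
    if isinstance(b, Nullable):
        return Nullable(unify(a, b.inner))
    if isinstance(a, Nullable):
        return Nullable(unify(a.inner, b))
    raise TypeError(f"cannot unify type {show(a)} with {show(b)}")


def check_assign(declared, value) -> None:
    """Static-typing-style: the declared type is fixed, the value must fit it."""
    ok = declared == value
    if not ok and isinstance(declared, Nullable):
        ok = value == NULL or value == declared.inner
    if not ok:
        raise TypeError(
            f"can not assign `{show(value)}` to variable of type `{show(declared)}`")


# var x: int = 0;  x = null;
print(show(unify(unify(Hole(), INT), NULL)))   # unification silently widens x
try:
    check_assign(INT, NULL)                    # validation rejects the assignment
except TypeError as e:
    print(e)
```

With unification, the type of `x` silently changes after the assignment. With validation, the declared type stays fixed and the assignment is reported where it happens.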

A fun fact: this is not noticeable now, because the current type system is very limited. But as soon as we add bool, fixed-width integers, nullability, structures, and generics, these problems will become significant.

The goal is to have predictable, explicit, and positionally-checked static typing. While Hindley-Milner is powerful, it's actually "type inference for the poor" — simple to implement when there's no time to fundamentally design the language.

Static typing (similar to TypeScript without any or Rust) is a must-have, even though implementing it is quite complex. Key aspects include:

  • variable types are inferred from declarations explicitly. var i = 0 is int (not "unify int" as now); var c = null is forbidden, use var c: int? = null; var a = b is okay since the type of b is known at that point
  • on every modification, type of a variable is validated, not unified
  • no auto types; function parameters must be strictly typed
  • function return types, if unspecified, inferred from return statements through a mechanism similar to unification; in case of recursion (direct or indirect), the return type must be explicitly declared somewhere
  • elimination of te_Indirect, each node's type will be directly inferred during analysis
  • no forall types, generic functions need to be resolved differently, since types are known during node analysis; saving a generic function into a variable is denied
  • generic parameters can be inside functions, like fun f<T>(a: T) { var b: [T] = [a]; }
  • thoroughly define what is null and how it interacts with assignments
  • constructions like t.tupleAt(0) (it's a generic method where T doesn't depend on arguments) should have "external hint" propagated, see below about generics
  • instead of unreadable unification errors, type mismatch should result in clear errors for assignment edges
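The rule about unspecified return types and recursion can be illustrated with a small Python model. It is a sketch under assumptions, not the compiler's algorithm: `Fun`, `Call` and `infer_return_type` are hypothetical names, types are plain strings, and each function is reduced to the list of things it returns.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Union


@dataclass(frozen=True)
class Call:
    callee: str            # `return callee(...)`: the type comes from the callee


ReturnExpr = Union[str, Call]   # either an already known type name or a call


@dataclass
class Fun:
    declared: Optional[str]     # explicit return type, or None if omitted
    returns: List[ReturnExpr]   # one entry per `return` statement


def infer_return_type(name: str, funs: Dict[str, Fun],
                      in_progress: FrozenSet[str] = frozenset()) -> str:
    fun = funs[name]
    if fun.declared is not None:
        return fun.declared                  # explicit declaration always wins
    if name in in_progress:
        # direct or indirect recursion with nothing declared along the cycle
        raise TypeError(f"recursive function `{name}` needs an explicit return type")
    result: Optional[str] = None
    for ret in fun.returns:
        if isinstance(ret, Call):
            t = infer_return_type(ret.callee, funs, in_progress | {name})
        else:
            t = ret
        if result is None:
            result = t
        elif t != result:
            raise TypeError(f"`{name}` returns both `{result}` and `{t}`")
    return result if result is not None else "void"


funs = {
    "one": Fun(None, ["int"]),
    "wrap": Fun(None, [Call("one")]),
    "fact": Fun("int", ["int", Call("fact")]),      # recursion, but declared
    "ping": Fun(None, [Call("pong")]),              # mutual recursion,
    "pong": Fun(None, [Call("ping"), "int"]),       # nothing declared
}
print(infer_return_type("wrap", funs))
print(infer_return_type("fact", funs))
```

In this model, `ping` and `pong` cannot be inferred until one of them declares its return type, which mirrors the "must be explicitly declared somewhere" requirement.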

Ideally, type inference should rely on a control flow graph, which we currently lack. It will be implemented later. For now, the existing AST representation will suffice.

Implementation details:

  • type of any expression is inferred and never changed
  • this is available because dependent expressions are already inferred
  • forall completely removed, generic functions introduced (they work like template functions actually, instantiated while inferring)
  • instantiation <...> syntax, example: t.tupleAt<int>(0)
  • as keyword, for example t.tupleAt(0) as int
  • methods binding is done along with type inferring, not before ("before", as worked previously, was always a wrong approach)

Clear and readable error messages on type mismatch

In FunC, due to Hindley-Milner, type mismatch errors are very hard to understand:

error: previous function return type (int, int) cannot be unified 
with implicit end-of-block return type (int, ()): cannot unify type () with int

After full reconsideration of the type system, they became human-readable:

1) can not assign `(int, slice)` to variable of type `(int, int)`
2) can not call method for `builder` with object of type `int`
3) can not use `builder` as a boolean condition
4) missing `return`
...

Generic functions fun f<T>(...) and instantiations like f<int>(...)

In FunC, there were "forall" functions:

forall X -> tuple tpush(tuple t, X value) asm "TPUSH";

In Tolk v0.6, the syntax changed to resemble mainstream languages:

fun tuplePush<T>(mutate self: tuple, value: T): void
    asm "TPUSH";

But the change was only about the syntax. Under the hood, it was transformed to exactly the same representation, since forall was a part of the type system.

To replace the Hindley-Milner type system, I had to implement support for generic functions. When f<T> is called, T is detected (in most cases) from the provided arguments:

t.tuplePush(1);     // detected T=int
t.tuplePush(cs);    // detected T=slice
t.tuplePush(null);  // error, need to specify "null of what type"

The syntax f<int>(...) is also supported:

t.tuplePush<int>(1);     // ok
t.tuplePush<int>(cs);    // error, can not pass slice to int
t.tuplePush<int>(null);  // ok, null is "null of type int"

User-defined functions may also be generic:

fun replaceLast<T>(mutate self: tuple, value: T) {
    val size = self.tupleSize();
    self.tupleSetAt(value, size - 1);
}

Calling replaceLast<int> and replaceLast<slice> will result in TWO generated asm (Fift) functions. Actually, they mostly resemble "template" functions. At each unique instantiation, the function's body is fully cloned under a new name.
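The "clone the body under a new name" behaviour can be modelled roughly as follows. This Python sketch is an illustration only: the `Instantiator` class, the token-list representation of a body, and the `name<args>` naming scheme are assumptions made for the example, not how the compiler actually names generated functions.

```python
from typing import Dict, List, Sequence, Tuple


class Instantiator:
    """Toy model of template-like instantiation of generic functions."""

    def __init__(self) -> None:
        self._instances: Dict[Tuple[str, Tuple[str, ...]], str] = {}
        self.emitted: List[Tuple[str, List[str]]] = []   # (name, cloned body)

    def instantiate(self, name: str, type_params: Sequence[str],
                    body_tokens: Sequence[str], type_args: Sequence[str]) -> str:
        if len(type_params) != len(type_args):
            raise TypeError(f"`{name}` expects {len(type_params)} type arguments")
        key = (name, tuple(type_args))
        cached = self._instances.get(key)
        if cached is not None:
            return cached                       # same T seen before: reuse it
        subst = dict(zip(type_params, type_args))
        # clone the body, replacing every occurrence of a generic parameter
        cloned = [subst.get(tok, tok) for tok in body_tokens]
        inst_name = f"{name}<{', '.join(type_args)}>"   # hypothetical naming
        self._instances[key] = inst_name
        self.emitted.append((inst_name, cloned))
        return inst_name


# body of `fun duplicate<T>(value: T) { var copy: T = value; ... }` as tokens
body = ["var", "copy", ":", "T", "=", "value", ";"]
inst = Instantiator()
print(inst.instantiate("duplicate", ["T"], body, ["int"]))
print(inst.instantiate("duplicate", ["T"], body, ["slice"]))
print(inst.instantiate("duplicate", ["T"], body, ["int"]))   # reused, not cloned
print(len(inst.emitted))
```

Two distinct instantiations produce two emitted bodies, while a repeated instantiation with the same type arguments reuses the earlier one.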

There may be multiple generic parameters:

fun replaceNulls<T1, T2>(tensor: (T1, T2), v1IfNull: T1, v2IfNull: T2): (T1, T2) {
    var (a, b) = tensor;
    return (a == null ? v1IfNull : a, b == null ? v2IfNull : b);
}

A generic parameter T may be something complex.

fun duplicate<T>(value: T): (T, T) { 
    var copy: T = value;
    return (value, copy); 
}

duplicate(1);         // duplicate<int>
duplicate([1, cs]);   // duplicate<[int, slice]>
duplicate((1, 2));    // duplicate<(int, int)>

Or even functions, it also works:

fun callAnyFn<TObj, TResult>(f: (TObj) -> TResult, arg: TObj) { 
    return f(arg); 
}

fun callAnyFn2<TObj, TCallback>(f: TCallback, arg: TObj) { 
    return f(arg); 
}

Note that while a generic T is mostly deduced from arguments, there are less obvious corner cases where T does not depend on the arguments:

fun tupleLast<T>(self: tuple): T
    asm "LAST";

var last = t.tupleLast();    // error, can not deduce T

To make this valid, T should be provided externally:

var last: int = t.tupleLast();       // ok, T=int
var last = t.tupleLast<int>();       // ok, T=int
var last = t.tupleLast() as int;     // ok, T=int

someF(t.tupleLast());       // ok, T=(parameter's declared type)
return t.tupleLast();       // ok if function specifies return type
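The deduction rules shown in the examples above (T from arguments, explicit `<...>`, or an external hint) can be approximated by the following Python sketch. It is a simplified model with hypothetical names (`resolve_generics`, `substitute`). Types are strings, tensors are Python tuples, and `"null"` stands for an untyped null literal.

```python
from typing import Dict, Optional, Sequence


def substitute(t, subst: Dict[str, object]):
    """Replace generic parameter names inside a (possibly nested) type."""
    if isinstance(t, tuple):
        return tuple(substitute(x, subst) for x in t)
    return subst.get(t, t)


def _deduce(param, arg, type_params, subst) -> None:
    if isinstance(param, tuple):
        if isinstance(arg, tuple) and len(arg) == len(param):
            for p, a in zip(param, arg):
                _deduce(p, a, type_params, subst)
        return
    # a bare `null` tells nothing about T, so it never binds a parameter
    if param in type_params and param not in subst and arg != "null":
        subst[param] = arg


def resolve_generics(type_params: Sequence[str], params: Sequence, ret,
                     args: Sequence, explicit: Optional[Sequence] = None,
                     hint=None) -> Dict[str, object]:
    if explicit is not None:                 # f<int>(...)
        subst = dict(zip(type_params, explicit))
    else:
        subst = {}
        for p, a in zip(params, args):       # deduce from arguments
            _deduce(p, a, type_params, subst)
        if hint is not None:                 # `var x: int = ...`, `... as int`
            _deduce(ret, hint, type_params, subst)
    for tp in type_params:
        if tp not in subst:
            raise TypeError(f"can not deduce {tp}")
    for p, a in zip(params, args):           # validate, never unify
        expected = substitute(p, subst)
        if a != "null" and a != expected:
            raise TypeError(f"can not pass `{a}` to `{expected}`")
    return subst


# fun tuplePush<T>(mutate self: tuple, value: T): void
push = (["T"], ["tuple", "T"], "void")
print(resolve_generics(*push, args=["tuple", "int"]))
print(resolve_generics(*push, args=["tuple", "null"], explicit=["int"]))

# fun tupleLast<T>(self: tuple): T
last = (["T"], ["tuple"], "T")
print(resolve_generics(*last, args=["tuple"], hint="int"))
```

In this sketch, a missing binding for T is an error rather than something to be unified later. This matches the "can not deduce T" case in the text.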

Also note that T for asm functions must occupy 1 stack slot (otherwise, the asm body is unable to handle it properly), whereas for a user-defined function, T could be of any shape.

In the future, when structures and generic structures are implemented, all the power of generic functions will come into play. Implementing them now was a necessary step towards getting rid of Hindley-Milner.

bool type, casting boolVar as int

With controlled type checking operating directly on the AST, it became possible to introduce a proper bool type. Under the hood, bool is still -1 and 0 at TVM level, but from the type system's perspective, bool and int are now different.

Comparison operators == / >= /... return bool. Logical operators && || return bool. Constants true and false have the bool type. Lots of stdlib functions now return bool, not int (having -1 and 0 at runtime):

var valid = isSignatureValid(...);    // bool
var end = cs.isEndOfSlice();          // bool
var isHyphen = char == 45;            // bool

Operator !x supports both int and bool. Condition of if and similar accepts both int (!= 0) and bool. Logical && and || accept both bool and int, preserving compatibility with constructs like a && b where a and b are integers (!= 0).

Arithmetic operators are restricted to integers; only bitwise and logical operators are allowed for bools:

valid && end;          // ok
valid & end;           // ok, bitwise & | ^ also work if both are bools
if (!end)              // ok

if (~end)              // error, use !end
valid + end;           // error
8 & valid;             // error, int & bool not allowed
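The operator rules listed above can be summarized as a small type table. The Python sketch below is a hedged model, not the compiler's checker: the function names and error texts are invented, and allowing `==` / `!=` between two bools is an assumption inferred from the `boolVar == true` optimization mentioned in this description.

```python
INT, BOOL = "int", "bool"

ARITHMETIC = {"+", "-", "*", "/", "%"}
COMPARISON = {"==", "!=", "<", ">", "<=", ">="}
BITWISE = {"&", "|", "^"}
LOGICAL = {"&&", "||"}


def binary_result(op: str, lhs: str, rhs: str) -> str:
    """Return the result type of `lhs op rhs`, or raise on a forbidden mix."""
    if op in ARITHMETIC and lhs == INT and rhs == INT:
        return INT
    if op in COMPARISON and lhs == INT and rhs == INT:
        return BOOL
    if op in ("==", "!=") and lhs == BOOL and rhs == BOOL:
        return BOOL                      # assumption, see the lead-in
    if op in BITWISE and lhs == rhs and lhs in (INT, BOOL):
        return lhs                       # int & int -> int, bool & bool -> bool
    if op in LOGICAL and lhs in (INT, BOOL) and rhs in (INT, BOOL):
        return BOOL                      # ints are treated as `!= 0`
    raise TypeError(f"can not apply operator `{op}` to `{lhs}` and `{rhs}`")


def unary_result(op: str, operand: str) -> str:
    if op == "!" and operand in (INT, BOOL):
        return BOOL
    if op in ("~", "-") and operand == INT:
        return INT
    raise TypeError(f"can not apply operator `{op}` to `{operand}`")


print(binary_result("&&", BOOL, BOOL))   # valid && end
print(binary_result("&", BOOL, BOOL))    # valid & end
print(unary_result("!", BOOL))           # !end
for bad in (lambda: unary_result("~", BOOL),         # ~end
            lambda: binary_result("+", BOOL, BOOL),  # valid + end
            lambda: binary_result("&", INT, BOOL)):  # 8 & valid
    try:
        bad()
    except TypeError as e:
        print("error:", e)
```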

This is a breaking change since in many real-world contracts, values previously treated as integers will now be booleans, and invalid operations on them will result in compilation errors.

The compiler does some optimizations for booleans. Example: boolVar == true -> boolVar. Example: !!boolVar -> boolVar. Example: !x for int results in asm 0 EQINT, but !x for bool results in asm NOT.
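These rewrites are only possible because the type of each operand is known. A minimal Python sketch of the idea follows. The AST classes and the functions `simplify` and `asm_for_not` are hypothetical, and the real compiler works on its own AST in C++. The last lines show why `NOT` is enough for a bool but an int needs a comparison with zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str
    type: str          # "int" or "bool"


@dataclass(frozen=True)
class BoolLit:
    value: bool


@dataclass(frozen=True)
class Not:
    operand: object


@dataclass(frozen=True)
class Eq:
    lhs: object
    rhs: object


def type_of(e) -> str:
    # variables carry their declared type; literals, `!x` and `==` yield bool
    return e.type if isinstance(e, Var) else "bool"


def simplify(e):
    if isinstance(e, Not):
        inner = simplify(e.operand)
        if isinstance(inner, Not) and type_of(inner.operand) == "bool":
            return inner.operand                 # !!boolVar -> boolVar
        return Not(inner)
    if isinstance(e, Eq):
        lhs, rhs = simplify(e.lhs), simplify(e.rhs)
        if rhs == BoolLit(True) and type_of(lhs) == "bool":
            return lhs                           # boolVar == true -> boolVar
        return Eq(lhs, rhs)
    return e


def asm_for_not(operand_type: str) -> str:
    # a bool is guaranteed to be -1 or 0, so a bitwise NOT flips it correctly
    return "NOT" if operand_type == "bool" else "0 EQINT"


b, i = Var("b", "bool"), Var("i", "int")
print(simplify(Eq(b, BoolLit(True))) == b)
print(simplify(Not(Not(b))) == b)
print(simplify(Not(Not(i))) == Not(Not(i)))      # !!intVar is NOT the same as intVar
print(asm_for_not("bool"), "/", asm_for_not("int"))
print(~(-1), ~0, ~5)                             # bitwise NOT on -1, 0 and on 5
```

For -1 and 0, bitwise NOT gives 0 and -1, which is exactly logical negation. For an arbitrary integer like 5 it gives -6, which is still "true", hence the separate `0 EQINT` path for int.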

Note that the logical operators && || (missing in FunC) always use the IF/ELSE asm representation. In the future, for optimization, they could be automatically replaced by & | when it's safe (example: a > 0 && a < 10). To manually optimize gas consumption, you can still use & | (allowed for bools), but remember that they do not short-circuit.
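The practical difference between `&&` and `&` on booleans is whether the right-hand side is evaluated at all. The Python sketch below models this with TVM-style booleans (-1 / 0). The helper names are hypothetical, and the operands are passed as callables only so that the evaluation order becomes observable.

```python
from typing import Callable, List

calls: List[str] = []


def check(name: str, result: bool) -> int:
    """A side-effecting operand; returns a TVM-style boolean (-1 or 0)."""
    calls.append(name)
    return -1 if result else 0


def logical_and(lhs: Callable[[], int], rhs: Callable[[], int]) -> int:
    # models `a && b`: IF/ELSE, the right side runs only when needed
    if lhs() == 0:
        return 0
    return -1 if rhs() != 0 else 0


def bitwise_and(lhs: Callable[[], int], rhs: Callable[[], int]) -> int:
    # models `a & b`: both sides are always evaluated
    left = lhs()
    right = rhs()
    return left & right


calls.clear()
print(logical_and(lambda: check("a", False), lambda: check("b", True)), calls)

calls.clear()
print(bitwise_and(lambda: check("a", False), lambda: check("b", True)), calls)
```

Both calls produce the same value. Only the second one runs `b`, which matters when `b` is expensive or has side effects.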

Assigning bool to int is prohibited to avoid unintentional errors:

var isHyphen: int = char == 45;   // error, can not assign bool to int

If you really need it, bool can be cast to int via the as operator:

var i = boolValue as int;  // -1 / 0

There are no runtime transformations. bool is guaranteed to be -1/0 at TVM level, so this is type-only casting. But generally, if you need such a cast, probably you're doing something wrong (unless you're doing a tricky bitwise optimization).

Related pull requests

What's coming next?

I spent lots of time on creating the detailed Roadmap and preparing the compiler's kernel for future language changes. Finally, we'll reach structures with auto packing to/from cells.

There will be several publicly available releases while heading this way, mostly dedicated to type system enrichment and stack management. The next will be available quite soon, stay tuned.

This is a huge refactoring focusing on untangling compiler internals
(previously forked from FunC).
The goal is to convert AST directly to Op (a kind of IR representation),
doing all code analysis at AST level.

Notable changes:
- AST-based semantic kernel includes: registering global symbols,
  scope handling and resolving local/global identifiers,
  lvalue/rvalue calc and check, implicit return detection,
  mutability analysis, pure/impure validity checks,
  simple constant folding
- values of `const` variables are calculated NOT based on CodeBlob,
  but via a newly-introduced AST-based constant evaluator
- AST vertices are now inherited from expression/statement/other;
  expression vertices have common properties (TypeExpr, lvalue/rvalue)
- symbol table is rewritten completely, SymDef/SymVal no longer exist,
  lexer now doesn't need to register identifiers
- AST vertices have references to symbols, filled at different
  stages of pipeline
- the remaining "FunC legacy part" is almost unchanged besides Expr
  which was fully dropped; AST is converted to Ops (IR) directly
@tolk-vm tolk-vm added the Tolk Related to Tolk Language / compiler / tooling label Jan 13, 2025
FunC's (and Tolk's before this PR) type system is based on Hindley-Milner.
This is a common approach for functional languages, where
types are inferred from usage through unification.
As a result, type declarations are not necessary:
int f(a,b) { return a+b; } // a and b now int, since `+` (int, int)

While this approach works for now, problems arise with the introduction
of new types like bool, where `!x` must handle both int and bool.
It will also become incompatible with int32 and other strict integers.
This will clash with structure methods, struggle with proper generics,
and become entirely impractical for union types.

This PR completely rewrites the type system targeting the future.
1) type of any expression is inferred and never changed
2) this is available because dependent expressions are already inferred
3) forall completely removed, generic functions introduced
   (they work like template functions actually, instantiated while inferring)
4) instantiation `<...>` syntax, example: `t.tupleAt<int>(0)`
5) `as` keyword, for example `t.tupleAt(0) as int`
6) methods binding is done along with type inferring, not before
   ("before", as worked previously, was always a wrong approach)
Comparison operators `== / >= /...` return `bool`.
Logical operators `&& ||` return bool.
Constants `true` and `false` have the `bool` type.
Lots of stdlib functions return `bool`, not `int`.

Operator `!x` supports both `int` and `bool`.
Condition of `if` accepts both `int` and `bool`.
Arithmetic operators are restricted to integers.
Logical `&&` and `||` accept both `bool` and `int`.

No arithmetic operations with bools allowed (only bitwise and logical).
Totally, v0.7 will include:
- AST-level semantic kernel, transform AST to Ops directly
- fully rewritten type system, drop Hindley-Milner
- `bool` type support
Labels
Tolk Related to Tolk Language / compiler / tooling