Tolk v0.7: overhaul compiler internals and the type system; `bool` type #1477

tolk-vm · 2025-01-13T16:34:02Z

Two months have passed since the announcement of Tolk. You might be wondering what was going on and why there were no releases yet.

Throughout November, I worked on the vision of the future. My goal was to "visualize" what Tolk v1.0 should look like: what language are we all aiming for, so that it solves lots of practical problems, avoids manual cells/slices manipulation, provides sufficient mechanisms for ABI generation, and still has zero overhead? I created a giant roadmap (40 PDF pages!) describing the vision and how, step by step, we're going to reach it.

Throughout December, I worked constantly on the compiler's kernel. As you know, Tolk is a fork of FunC. The FunC compiler internals are very hard to extend and modify. The way FunC looks is just a mirror of its internal implementation. Heading towards the future, I had to partially "untangle" this "legacy FunC core", so that it will be able to "interbreed" with features it was not originally designed for.

Currently I am done with this preparation. Tolk v0.7 contains a fully rewritten semantic analysis kernel (though almost invisible to the end user, huh).

Notable changes in Tolk v0.7

Under the hood: refactor and revamp compiler internals. AST-level semantic analysis kernel
Under the hood: rewrite the type system from Hindley-Milner to static typing
Clear and readable error messages on type mismatch
Generic functions fun f<T>(...) and instantiations like f<int>(...)
The bool type
Type casting via value as T

The documentation and IDE plugins have been updated accordingly (see related pull requests below).

Now, let's cover every bullet in detail.

Refactor and revamp compiler internals

In Tolk v0.6, I implemented parsing source files into an AST (completely missing in FunC), which gave me control over the syntax. That's why the changes in Tolk v0.6 were almost purely syntactic. The AST, after being parsed, was transformed into the "legacy core" representation, so that the rest of the forked FunC core kept working.

Heading towards the future, the AST should be converted directly to IR (intermediate representation), with all semantic analysis performed at the AST level.

At the AST level, it is necessary to handle: lvalue/rvalue semantics; mutability analysis; unreachable code detection; symbol resolving; type inference and checks; various other validity checks.

This step is primarily about creating the foundation for semantic analysis, laying the groundwork for future enhancements.

Implementation details:

AST is converted directly to Op (a kind of IR representation), doing all code analysis (see above) at AST level
values of const variables are now calculated NOT based on CodeBlob, but via a newly-introduced AST-based constant evaluator
AST vertices are now inherited from expression/statement/other; expression vertices have common properties (TypeExpr, lvalue/rvalue)
symbol table is rewritten completely, SymDef/SymVal no longer exist, lexer now doesn't need to register identifiers
AST vertices have references to symbols, filled at different stages of pipeline
the remaining "FunC legacy part" is almost unchanged besides Expr which was fully dropped
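
To picture what an AST-based constant evaluator does, here is a toy model in Python. It is only a sketch, not the compiler's C++ code: the node classes `IntLit`, `ConstRef`, `BinOp` and the function `eval_const` are hypothetical names invented for this illustration. The point is that a constant's value is computed by walking the tree, without generating any IR first.

```python
import operator
from dataclasses import dataclass

# Hypothetical AST node classes for this sketch (not the compiler's real ones).
@dataclass
class IntLit:
    value: int

@dataclass
class ConstRef:
    name: str

@dataclass
class BinOp:
    op: str
    lhs: object
    rhs: object

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "<<": operator.lshift}

def eval_const(node, consts, in_progress=()):
    """Evaluate a constant expression by walking the AST only."""
    if isinstance(node, IntLit):
        return node.value
    if isinstance(node, ConstRef):
        if node.name in in_progress:
            raise ValueError(f"const `{node.name}` depends on itself")
        if node.name not in consts:
            raise ValueError(f"undefined const `{node.name}`")
        return eval_const(consts[node.name], consts, in_progress + (node.name,))
    if isinstance(node, BinOp) and node.op in OPS:
        lhs = eval_const(node.lhs, consts, in_progress)
        rhs = eval_const(node.rhs, consts, in_progress)
        return OPS[node.op](lhs, rhs)
    raise ValueError("not a constant expression")

# Models constants such as: ONE = 1; FLAG = ONE << 4; MASK = FLAG - ONE
CONSTS = {
    "ONE": IntLit(1),
    "FLAG": BinOp("<<", ConstRef("ONE"), IntLit(4)),
    "MASK": BinOp("-", ConstRef("FLAG"), ConstRef("ONE")),
}
print(eval_const(ConstRef("MASK"), CONSTS))  # 15
```

Tracking the names currently being evaluated (`in_progress`) is one simple way to reject a constant that refers to itself; whether the real evaluator does it this way is not stated here.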

Rewrite the type system from Hindley-Milner to static typing

You know that FunC is "functional C". But do you know what makes it "functional"? Not the fact that FunC is very close to TVM. Not its peculiar syntax. And not even the ~ tilde. "Functional" is mostly about the Hindley-Milner type system, which had no conceptual changes in earlier Tolk versions but is fully replaced now.

The Hindley-Milner type system is a common approach for functional languages, where types are inferred from usage through unification. As a result, type declarations are not necessary:

int f(a, b) {
    return a + b;   // a and b are inferred as int, since `+` takes (int, int)
}

For example,

() f(slice s) {} 

var s = null;
f(s);  // infer s as slice, since f accepts slice

For example,

int f(x) {
    (a, b) = (0, x);
    return a + b;   // x becomes int, since x and b are linked by an assignment edge
}

In the FunC codebase, te_Indirect is about this, along with forall, which comes with its own nuances.

While this approach works for now, problems arise with the introduction of new types like bool, where !x must handle both int and bool. It will also become incompatible with int32 and other strict integers.

Example. When nullable types are introduced, we want null not to be assignable to int. However, with unification, the following would be valid:

var x: int = 0;   // unify(Hole, Int) = Int
...
x = null;         // unify(Int, Nullable<Hole>) = Nullable<Int>

Instead of an error, Hindley-Milner would perform unification and accept it. This will clash with structure methods, struggle with proper generics, and become entirely impractical for union types (despite claims that it was "designed for union types").
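
The difference between "unify" and "validate" can be shown with a toy model in Python. This is only a sketch under loud assumptions: types are plain strings, `"hole"` stands for a not-yet-known type, `("nullable", T)` stands for `T?`, and the functions `unify` and `can_assign` are hypothetical. Nullable types are not part of v0.7, so the nullable rules here are a guess at the intended behaviour.

```python
# Toy type model: "int", "slice", "hole" (unknown), or ("nullable", T).

def unify(a, b):
    """Hindley-Milner-like merge: widens types until both sides fit."""
    if a == "hole":
        return b
    if b == "hole":
        return a
    if a == b:
        return a
    a_null = isinstance(a, tuple)
    b_null = isinstance(b, tuple)
    if a_null or b_null:
        inner_a = a[1] if a_null else a
        inner_b = b[1] if b_null else b
        return ("nullable", unify(inner_a, inner_b))
    raise TypeError(f"cannot unify type {a} with {b}")

def can_assign(declared, value):
    """Static check: the declared type is fixed, the value must fit it."""
    if declared == value:
        return True
    if isinstance(declared, tuple):
        # `T?` accepts a bare null (modeled as nullable hole) and a plain `T`
        return value == ("nullable", "hole") or value == declared[1]
    return False

NULL = ("nullable", "hole")

x = unify("hole", "int")        # var x: int = 0
print(unify(x, NULL))           # x = null silently turns x into ("nullable", "int")
print(can_assign("int", NULL))  # False: with validation, `x = null` is an error
```

With unification the variable's type quietly changes after the fact; with validation the declared type never changes, so the mismatch can be reported right at the assignment.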

A fun fact: this is not noticeable now, because the current type system is very limited. But as soon as we add bool, fixed-width integers, nullability, structures, and generics, these problems will become significant.

The goal is to have predictable, explicit, and positionally-checked static typing. While Hindley-Milner is powerful, it's actually "type inference for the poor": simple to implement when there's no time to fundamentally design the language.

Static typing (similar to TypeScript without any or Rust) is a must-have, even though implementing it is quite complex. Key aspects include:

variable types are inferred from declarations explicitly. var i = 0 is int (not "unify int" as now); var c = null is forbidden, use var c: int? = null; var a = b is okay since the type of b is known at that point
on every modification, type of a variable is validated, not unified
no auto types; function parameters must be strictly typed
function return types, if unspecified, inferred from return statements through a mechanism similar to unification; in case of recursion (direct or indirect), the return type must be explicitly declared somewhere
elimination of te_Indirect, each node's type will be directly inferred during analysis
no forall types, generic functions need to be resolved differently, since types are known during node analysis; saving a generic function into a variable is denied
generic parameters can be inside functions, like fun f<T>(a: T) { var b: [T] = [a]; }
thoroughly define what is null and how it interacts with assignments
constructions like t.tupleAt(0) (it's a generic method where T doesn't depend on arguments) should have "external hint" propagated, see below about generics
instead of unreadable unification errors, type mismatch should result in clear errors for assignment edges

Ideally, type inference should rely on a control flow graph, which we currently lack. It will be implemented later. For now, the existing AST representation will suffice.

Implementation details:

type of any expression is inferred and never changed
this is possible because the expressions it depends on are already inferred
forall completely removed, generic functions introduced (they work like template functions actually, instantiated while inferring)
instantiation <...> syntax, example: t.tupleAt<int>(0)
as keyword, for example t.tupleAt(0) as int
method binding is done along with type inferring, not before ("before", as it worked previously, was always a wrong approach)

Clear and readable error messages on type mismatch

In FunC, due to Hindley-Milner, type mismatch errors are very hard to understand:

error: previous function return type (int, int) cannot be unified 
with implicit end-of-block return type (int, ()): cannot unify type () with int

After full reconsideration of the type system, they became human-readable:

1) can not assign `(int, slice)` to variable of type `(int, int)`
2) can not call method for `builder` with object of type `int`
3) can not use `builder` as a boolean condition
4) missing `return`
...

Generic functions `fun f<T>(...)` and instantiations like `f<int>(...)`

In FunC, there were "forall" functions:

forall X -> tuple tpush(tuple t, X value) asm "TPUSH";

In Tolk v0.6, the syntax changed to resemble mainstream languages:

fun tuplePush<T>(mutate self: tuple, value: T): void
    asm "TPUSH";

But the change was only about the syntax. Under the hood, it was transformed to exactly the same representation, since forall was a part of the type system.

To replace the Hindley-Milner type system, I had to implement support for generic functions. When f<T> is called, T is detected (in most cases) from the provided arguments:

t.tuplePush(1);     // detected T=int
t.tuplePush(cs);    // detected T=slice
t.tuplePush(null);  // error, need to specify "null of what type"

The syntax f<int>(...) is also supported:

t.tuplePush<int>(1);     // ok
t.tuplePush<int>(cs);    // error, can not pass slice to int
t.tuplePush<int>(null);  // ok, null is "null of type int"

User-defined functions may also be generic:

fun replaceLast<T>(mutate self: tuple, value: T) {
    val size = self.tupleSize();
    self.tupleSetAt(value, size - 1);
}

Calling both replaceLast<int> and replaceLast<slice> will result in TWO generated asm (Fift) functions. Actually, they mostly resemble "template" functions. For each unique instantiation, the function's body is fully cloned under a new name.
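
The "cloned under a new name" behaviour can be pictured with a small Python model. It is a sketch only: `GenericFn`, `instantiate`, the token-list body, and the `name<args>` naming scheme are all hypothetical and are not taken from the compiler.

```python
from dataclasses import dataclass

@dataclass
class GenericFn:
    name: str
    type_params: list   # e.g. ["T"]
    body: list          # body tokens; a stand-in for a real AST

def instantiate(fn, type_args, generated):
    """Clone fn's body with type parameters substituted, once per unique type list."""
    key = f"{fn.name}<{', '.join(type_args)}>"
    if key not in generated:
        subst = dict(zip(fn.type_params, type_args))
        generated[key] = [subst.get(tok, tok) for tok in fn.body]
    return key

# A one-line body in the spirit of `var copy: T = value;`
duplicate = GenericFn("duplicate", ["T"], ["var", "copy", ":", "T", "=", "value"])

generated = {}
instantiate(duplicate, ["int"], generated)
instantiate(duplicate, ["slice"], generated)
instantiate(duplicate, ["int"], generated)   # already instantiated, reused
print(sorted(generated))   # ['duplicate<int>', 'duplicate<slice>']
```

Two distinct type arguments produce two independent copies of the body, which is why two separate functions appear in the output, while repeated calls with the same T reuse one copy.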

There may be multiple generic parameters:

fun replaceNulls<T1, T2>(tensor: (T1, T2), v1IfNull: T1, v2IfNull: T2): (T1, T2) {
    var (a, b) = tensor;
    return (a == null ? v1IfNull : a, b == null ? v2IfNull : b);
}

A generic parameter T may be something complex.

fun duplicate<T>(value: T): (T, T) { 
    var copy: T = value;
    return (value, copy); 
}

duplicate(1);         // duplicate<int>
duplicate([1, cs]);   // duplicate<[int, slice]>
duplicate((1, 2));    // duplicate<(int, int)>

Or even functions, it also works:

fun callAnyFn<TObj, TResult>(f: (TObj) -> TResult, arg: TObj) { 
    return f(arg); 
}

fun callAnyFn2<TObj, TCallback>(f: TCallback, arg: TObj) { 
    return f(arg); 
}

Note that while a generic T is mostly detected from arguments, there are less obvious corner cases where T does not depend on the arguments:

fun tupleLast<T>(self: tuple): T
    asm "LAST";

var last = t.tupleLast();    // error, can not deduce T

To make this valid, T should be provided externally:

var last: int = t.tupleLast();       // ok, T=int
var last = t.tupleLast<int>();       // ok, T=int
var last = t.tupleLast() as int;     // ok, T=int

someF(t.tupleLast());       // ok, T=(parameter's declared type)
return t.tupleLast();       // ok if function specifies return type
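
The deduction cases above can be summarised as a small decision procedure. The Python sketch below is a simplified model, not the compiler's algorithm: `deduce_t` and its parameters are hypothetical, only a single generic parameter named T is handled, and the order "explicit type, then arguments, then external hint" is an assumption that happens to reproduce the examples in this section.

```python
def deduce_t(params, args, explicit=None, hint=None):
    """Return the type substituted for T, or raise if it cannot be deduced.

    params:   declared parameter types, "T" marks the generic parameter
    args:     types of the passed arguments, "null" for a bare null literal
    explicit: type given as f<int>(...), if any
    hint:     type expected by the context (declared variable type, `as` cast,
              parameter type of an outer call, declared return type)
    """
    if explicit is not None:
        for p, a in zip(params, args):
            if p == "T" and a not in (explicit, "null"):
                raise TypeError(f"can not pass {a} to {explicit}")
        return explicit
    for p, a in zip(params, args):
        if p == "T" and a != "null":
            return a
    if hint is not None:
        return hint
    raise TypeError("can not deduce T")

print(deduce_t(["T"], ["int"]))                    # t.tuplePush(1)          -> int
print(deduce_t(["T"], ["null"], explicit="int"))   # t.tuplePush<int>(null)  -> int
print(deduce_t([], [], hint="int"))                # var last: int = t.tupleLast() -> int
```

Calling `deduce_t(["T"], ["null"])` or `deduce_t([], [])` raises, mirroring `t.tuplePush(null)` and a bare `t.tupleLast()`.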

Also note, that T for asm functions must occupy 1 stack slot (otherwise, asm body is unable to handle it properly), whereas for a user-defined function, T could be of any shape.

In the future, when structures and generic structures are implemented, all the power of generic functions will come into play. Implementing them now was a necessary step towards getting rid of Hindley-Milner.

`bool` type, casting `boolVar as int`

With controlled type checking operating directly on the AST, it became possible to introduce a proper bool type. Under the hood, bool is still -1 and 0 at TVM level, but from the type system's perspective, bool and int are now different.

Comparison operators == / >= /... return bool. Logical operators && || return bool. Constants true and false have the bool type. Lots of stdlib functions now return bool, not int (having -1 and 0 at runtime):

var valid = isSignatureValid(...);    // bool
var end = cs.isEndOfSlice();          // bool
var isHyphen = char == 45;            // bool

Operator !x supports both int and bool. Condition of if and similar accepts both int (!= 0) and bool. Logical && and || accept both bool and int, preserving compatibility with constructs like a && b where a and b are integers (!= 0).

Arithmetic operators are restricted to integers; only bitwise and logical operators are allowed for bools:

valid && end;          // ok
valid & end;           // ok, bitwise & | ^ also work if both are bools
if (!end)              // ok

if (~end)              // error, use !end
valid + end;           // error
8 & valid;             // error, int & bool not allowed
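
These rules can be written down as a small type-rule table. The Python sketch below is a model of the rules described in this section, not the compiler's code: the function names and the error text are hypothetical, and the result type of `bool & bool` is assumed to stay `bool`.

```python
ARITH = {"+", "-", "*", "/"}
COMPARE = {"==", "!=", "<", ">", "<=", ">="}
BITWISE = {"&", "|", "^"}
LOGICAL = {"&&", "||"}

def binary_result(op, lhs, rhs):
    """Result type of `lhs op rhs`, or TypeError if the operands are not allowed."""
    both_int = lhs == "int" and rhs == "int"
    both_bool = lhs == "bool" and rhs == "bool"
    if op in ARITH and both_int:
        return "int"
    if op in COMPARE and both_int:
        return "bool"
    if op == "==" and both_bool:          # e.g. `boolVar == true`
        return "bool"
    if op in BITWISE and (both_int or both_bool):
        return lhs                        # assumed: bool & bool stays bool
    if op in LOGICAL and {lhs, rhs} <= {"int", "bool"}:
        return "bool"                     # ints are treated as `!= 0`
    raise TypeError(f"can not apply operator `{op}` to `{lhs}` and `{rhs}`")

def unary_result(op, operand):
    """Result type of `op operand`, or TypeError if the operand is not allowed."""
    if op == "!" and operand in ("int", "bool"):
        return "bool"
    if op in ("~", "-") and operand == "int":
        return "int"
    raise TypeError(f"can not apply operator `{op}` to `{operand}`")

print(binary_result("&&", "bool", "bool"))  # bool
print(binary_result("&", "bool", "bool"))   # bool
print(unary_result("!", "bool"))            # bool
```

In this model `binary_result("+", "bool", "bool")`, `binary_result("&", "int", "bool")` and `unary_result("~", "bool")` all raise, matching the three error lines above.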

This is a breaking change since in many real-world contracts, values previously treated as integers will now be booleans, and invalid operations on them will result in compilation errors.

The compiler does some optimizations for booleans. Example: boolVar == true -> boolVar. Example: !!boolVar -> boolVar. Example: !x for int results in asm 0 EQINT, but !x for bool results in asm NOT.
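
As an illustration of the first two rewrites, here is a toy simplifier in Python. It is a sketch under assumptions, not the compiler's optimizer: expressions are modeled as variable names or tuples `("!", e)` / `("==", a, b)`, and `type_of` and `simplify` are hypothetical helpers. Note that `!!x` may only be dropped when `x` is a bool, since for an int it normalizes the value to -1/0.

```python
def type_of(expr, var_types):
    if isinstance(expr, tuple):
        return "bool"                 # both `!e` and `a == b` produce bool
    if expr in ("true", "false"):
        return "bool"
    return var_types[expr]

def simplify(expr, var_types):
    """Apply `b == true -> b` and `!!b -> b`, for bool-typed b only."""
    if not isinstance(expr, tuple):
        return expr
    if expr[0] == "!":
        inner = simplify(expr[1], var_types)
        if (isinstance(inner, tuple) and inner[0] == "!"
                and type_of(inner[1], var_types) == "bool"):
            return inner[1]           # !!b -> b
        return ("!", inner)
    if expr[0] == "==":
        lhs = simplify(expr[1], var_types)
        rhs = simplify(expr[2], var_types)
        if rhs == "true" and type_of(lhs, var_types) == "bool":
            return lhs                # b == true -> b
        return ("==", lhs, rhs)
    raise ValueError(f"unknown node {expr[0]}")

TYPES = {"b": "bool", "n": "int"}
print(simplify(("==", "b", "true"), TYPES))   # b
print(simplify(("!", ("!", "b")), TYPES))     # b
print(simplify(("!", ("!", "n")), TYPES))     # ('!', ('!', 'n')), kept for int
```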

Note that logical operators && || (missing in FunC) always use the IF/ELSE asm representation. In the future, for optimization, they could be automatically replaced by & | when it's safe (example: a > 0 && a < 10). To manually optimize gas consumption, you can still use & | (allowed for bools), but remember that they do not short-circuit.
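
The practical difference between the two forms is whether the right-hand side is evaluated at all. The Python snippet below shows it by analogy only: Python's `and` plays the role of `&&`, and `&` on Python bools plays the role of the bitwise `&`; the helper names `check` and `run` are made up for this sketch.

```python
calls = []

def check(name, result):
    """Record that this operand was evaluated, then return its value."""
    calls.append(name)
    return result

def run(expr):
    calls.clear()
    expr()
    return list(calls)

short = run(lambda: check("a", False) and check("b", True))   # like a && b
eager = run(lambda: check("a", False) & check("b", True))     # like a & b

print(short)   # ['a']       the right side is skipped
print(eager)   # ['a', 'b']  both sides are always evaluated
```

So swapping `&&` for `&` is only safe when the right-hand side has no side effects and is cheap enough that always evaluating it does not cost more than the branch it removes.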

Assigning bool to int is prohibited to avoid unintentional errors:

var isHyphen: int = char == 45;   // error, can not assign bool to int

If you really need it, bool can be cast to int via the as operator:

var i = boolValue as int;  // -1 / 0

There are no runtime transformations. bool is guaranteed to be -1/0 at TVM level, so this is type-only casting. But generally, if you need such a cast, probably you're doing something wrong (unless you're doing a tricky bitwise optimization).

Related pull requests

What's coming next?

I spent lots of time on creating the detailed Roadmap and preparing the compiler's kernel for future language changes. Eventually, we'll reach structures with auto packing to/from cells.

There will be several publicly available releases while heading this way, mostly dedicated to type system enrichment and stack management. The next will be available quite soon, stay tuned.

This is a huge refactoring focusing on untangling compiler internals (previously forked from FunC). The goal is to convert AST directly to Op (a kind of IR representation), doing all code analysis at AST level.

Notable changes:
- AST-based semantic kernel includes: registering global symbols, scope handling and resolving local/global identifiers, lvalue/rvalue calc and check, implicit return detection, mutability analysis, pure/impure validity checks, simple constant folding
- values of `const` variables are calculated NOT based on CodeBlob, but via a newly-introduced AST-based constant evaluator
- AST vertices are now inherited from expression/statement/other; expression vertices have common properties (TypeExpr, lvalue/rvalue)
- symbol table is rewritten completely, SymDef/SymVal no longer exist, lexer now doesn't need to register identifiers
- AST vertices have references to symbols, filled at different stages of pipeline
- the remaining "FunC legacy part" is almost unchanged besides Expr which was fully dropped; AST is converted to Ops (IR) directly

FunC's (and Tolk's before this PR) type system is based on Hindley-Milner. This is a common approach for functional languages, where types are inferred from usage through unification. As a result, type declarations are not necessary:

int f(a,b) { return a+b; } // a and b are inferred as int, since `+` takes (int, int)

While this approach works for now, problems arise with the introduction of new types like bool, where `!x` must handle both int and bool. It will also become incompatible with int32 and other strict integers. This will clash with structure methods, struggle with proper generics, and become entirely impractical for union types.

This PR completely rewrites the type system targeting the future.
1) type of any expression is inferred and never changed
2) this is possible because the expressions it depends on are already inferred
3) forall completely removed, generic functions introduced (they work like template functions actually, instantiated while inferring)
4) instantiation `<...>` syntax, example: `t.tupleAt<int>(0)`
5) `as` keyword, for example `t.tupleAt(0) as int`
6) method binding is done along with type inferring, not before ("before", as it worked previously, was always a wrong approach)

Comparison operators `== / >= /...` return `bool`. Logical operators `&& ||` return bool. Constants `true` and `false` have the `bool` type. Lots of stdlib functions return `bool`, not `int`. Operator `!x` supports both `int` and `bool`. Condition of `if` accepts both `int` and `bool`. Arithmetic operators are restricted to integers. Logical `&&` and `||` accept both `bool` and `int`. No arithmetic operations with bools allowed (only bitwise and logical).

In total, v0.7 will include:
- AST-level semantic kernel, transform AST to Ops directly
- fully rewritten type system, drop Hindley-Milner
- `bool` type support

tolk-vm added the Tolk label (related to Tolk Language / compiler / tooling) Jan 13, 2025

This was referenced Jan 13, 2025

Tolk v0.7 wasm and stdlib ton-blockchain/tolk-js#2 (Draft)

Tolk v0.7 grammar and updates ton-blockchain/tolk-vscode#3 (Draft)

tolk-vm added 3 commits January 15, 2025 15:38

[Tolk] Bump version to v0.7

2997c02


tolk-vm force-pushed the tolk-v0.7 branch from 3ac5a3c to 2997c02 January 15, 2025 13:29
