Refactor `swayfmt` to generate code from arbitrary lexed trees, not necessarily backed by source code #6779

ironcev · 2024-12-09T09:16:19Z

In the (near) future, we will want to develop tools that will analyze and transform Sway source code. Two typical examples are:

the forc migrate tool that will migrate existing source code to new versions of Sway, assisting at breaking-changes.
Sway's equivalent of the Rust's Clippy tool.

Such tools usually operates on the AST, traversing or pattern matching it during the analysis, and transforming it to produce the changed source code.

Sway trees (lexed, parsed, and typed) are not full-fidelity trees in the sense that they do not carry the full source information (e.g. trivia like white-spaces and comments). Also, manipulating the trees "manually" in a program, does not update the spans already contained in those trees.

Supporting full-fidelity trees would require a significant change in the compiler. This issue describes a simpler solution for generating changed code from altered trees.

Although the lexed tree is the least convenient for analysis and tree manipulation, it allows us to reuse swayfmt for code generation, assuming the swayfmt implementation gets changed, as proposed below.

To get the idea about the kind of changes, e.g., the forc migrate tool will be doing in the lexed tree, here are a few concrete examples.

References will require changing self function parameters to &self: fn method(self) -> fn method(&self).
References will require changing ref mut function parameters to &mut: fn fun(ref mut x: MyType) -> fn method(x: &mut MyType).
Supporting PartialEq and Eq will require inserting trait implementations, e.g., impl Eq for MyType {}.

In short, we can have new elements in the tree entered, elements removed, renamed, etc.

The proposal for obtaining the altered source code from the altered lexed tree is to reuse the swayfmt. It requires adapting the implementation of the swayfmt to not necessarily parse the code from files, but to take any lexed tree as input, no matter how the tree is created, or if there is actual existing source code behind it.

The issue with getting formatted code generated by swayfmt from an arbitrary lexed tree lays in the current implementation of the swayfmt that expects that every element of the lexed tree has a valid Span that points to the valid text in the source code, that contains the textual representation of that lexical element. This is done also in the cases of keywords and e.g. punctuation where the text is actually known. Essentially, when rendering the formatted output, the text is always taken from the spans, means from the original source code.

E.g., referencing like &mut <type> is formatted this way:

write!(
    formatted_code,
    "{}{}",
    ampersand_token.span().as_str(),  // <<<--- Taking the string "&" from the source code, over `span()`.
    if let Some(mut_token) = mut_token {
        format!("{} ", mut_token.span().as_str())  // <<<--- Taking the string "mut" from the source code, over `span()`.
    } else {
        "".to_string()
    },
)?;

E.g., Idents are formatted this way (in the example below, self is Ident):

write!(formatted_code, "{}", self.span().as_str())?  // <<<--- Taking the ident from the source code, over `span()`.

The problem is, that in the case of the altered trees, the existing spans can become obsolete or completely meaningless (e.g., ref mut is removed from the tree), and the new elements will not have their textual representation in the source file at all (e.g., the inserted &mut).

The proposed solution to the problem is fairly straightforward. Instead of taking the strings from the spans/source code, for punctuations, keywords, etc. we can just use their string values, and for Idents the as_str() method, which will use the overwritten ident, if available.

Essentially, we would have something like:

write!(
    formatted_code,
    "{}{}",
    AmpersandToken::AS_STR,  // AS_STR is an str constant `&`;
    if let Some(mut_token) = mut_token {
        format!("{} ", MutToken::AS_STR) // <<<--- AS_STR is an str constant `mut`;
    } else {
        "".to_string()
    },
)?;

write!(formatted_code, "{}", self.as_str())?

I've played with this proposal quite some, and tried out various combinations of altered lexed trees and I can confirm that everything what is needed by the forc migrate tool can we achieved if we do the above change in the implementation of the swayfmt. Once available, this approach of altering trees and getting source code out of them can be used in all tools that want to structurally change the code. E.g., our own Clippy equivalent one day.

To produce altered trees that will play well with such changed swayfmt, one will still need to provide "suitable faked spans". But this will also be straightforward, and we will provide rules for obtaining such spans. Also, altering lexed trees can we supported with an API that will encapsulate the rules.

The proposed change shouldn't affect how the swayfmt operates, and all the existing clients, like forc-fmt, will remain unchanged.

The text was updated successfully, but these errors were encountered:

tritao · 2024-12-09T19:17:25Z

Re-using swayfmt sounds like a good idea, but long term I think we may want to consider extract some of the more generic code and introducing a new general library for AST rewriting, some references for possible inspiration from Clang below.

https://clang.llvm.org/docs/LibTooling.html
https://clang.llvm.org/docs/ClangTransformerTutorial.html

ironcev · 2024-12-18T10:27:27Z

Thanks for the links @tritao! As someone who wrote a lot of static analysis code I couldn't agree more! The approach taken in #6790 is motivated by pragmatic reasons, to get a tool of production quality with acceptable effort. Using low level compiler API (lexed and typed trees) for analysis and swayfmt for code generation is surely not optimal.

On the long term, I would love to have:

a high-level abstraction for matching parts of the AST. Something like Clang AST matchers or even nicer Clippy syntax tree patterns proposal. Matching AST nodes using low-level APIs is, in my experience, cumbersome, time-consuming, and error-prone.
API for rewriting AST. E.g., like the proposed Clang Transformer. I've used Roslyn's builder pattern a lot in the past and found it good. Sometimes too verbose though 😄
Generating code out of the modified AST, without the need to manually construct the lexed tree (as it is the case in Add forc-migrate tool #6790, with the swayfmt reuse).

ironcev self-assigned this Dec 9, 2024

ironcev added the formatter label Dec 9, 2024

ironcev mentioned this issue Dec 16, 2024

Add forc-migrate tool #6790

Open

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor `swayfmt` to generate code from arbitrary lexed trees, not necessarily backed by source code #6779

Refactor `swayfmt` to generate code from arbitrary lexed trees, not necessarily backed by source code #6779

ironcev commented Dec 9, 2024

tritao commented Dec 9, 2024

ironcev commented Dec 18, 2024

Refactor swayfmt to generate code from arbitrary lexed trees, not necessarily backed by source code #6779

Refactor swayfmt to generate code from arbitrary lexed trees, not necessarily backed by source code #6779

Comments

ironcev commented Dec 9, 2024

tritao commented Dec 9, 2024

ironcev commented Dec 18, 2024

Refactor `swayfmt` to generate code from arbitrary lexed trees, not necessarily backed by source code #6779

Refactor `swayfmt` to generate code from arbitrary lexed trees, not necessarily backed by source code #6779