Skip to content

A compiler for a Java subset written in Rust, for RIT CSCI-742

Notifications You must be signed in to change notification settings

nicholastmosher/emjc

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

95 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

emjc - Extended MiniJava Compiler

Build Status

This project was built as a course assignment for RIT's CSCI-742: compiler construction. It was my first exposure to compiler concepts and one of my earlier Rust projects, so it's not perfect by any means but I learned a lot from it.

Installation

This project is written in Rust, so if you'd like to build it from source you'll need to install the rust toolchain, which will include the compiler and the build tool, "cargo".

To build the project, simply run cargo build --release. This will output the binary into target/release/emjc. You can either directly execute that binary by cding into target/releases and running emjc <args>, or you can use cargo run --release -- <args> to compile and run all in one step.

Command line interface

If you execute emjc without any arguments, you'll see a help display:

$ cargo run
Compiling emjc v0.1.0 (file:///home/nick/Documents/classes/compilers/emjc)
    Finished dev [unoptimized + debuginfo] target(s) in 1.47 secs
    Running `target/debug/emjc`
emjc
USAGE:
    emjc <file>
FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information
ARGS:
    <file>

To see the output of the lexer, just provide a path to a .emj file.

cargo run -- my_emj_program.emj

Design

This project is split into two halves, a library and a binary. The core functionality of the compiler is written in the library, whose entry point is src/lib.rs. The binary half's entry point is src/bin/main.rs, and is in charge of interpreting command line arguments and loading any necessary files.

Within the library, the compiler implementation is further divided into "modules". The lexer is one module, and each subsequent subsystem will have one or more modules of its own.

Lexer

The Lexer implementation is located in src/lexer.rs. In it I make heavy use of Rust's "Regex" library as well as Rust enums for representing the Tokens. Enums in Rust can hold values, so any Tokens which have variable content (such as INTLIT, for example) simply store the string representation of the text they matched inside.

Parser

I originally started implementing the parser non-recursively, using a stack for pushing and popping symbols. It worked really well, until I realized I had no idea how to build the AST using this method. The remnants of that version of the parser are in src/syntax/stack_parser.rs and src/emj_grammar.rs.

I then switched gears and went for the recursive descent approach. This is the parser I'll be using from now on (or perhaps until I can make the stack parser work). It's in src/syntax/recursive_parser.rs. The AST is defined in src/syntax/ast.rs, and I've made visitor traits and a printer implementation which are in src/syntax/visitor/mod.rs and src/syntax/visitor/printer.rs, respectively.

Some known issues with the parser: currently I have a grammar ambiguity I have yet to resolve which can't tell the difference between a variable and an assignment statement. I also haven't implemented if statements yet. But I believe the rest should work fine, and the printer works as expected on a known good AST.

Name Analysis

The name analysis takes place in src/semantics/name_analysis.rs. The analysis is performed in two passes, one to declare all of the classes, functions, and variables, and another to verify that all the usages of these items are valid. Some data structures that were used are defined in src/semantics/mod.rs, and include GlobalScope, ClassScope, and FunctionScope.

One known issue right now is that while each identifier is correctly assigned a unique symbol name, for some reason there are places where the symbol number will increment by more than one. Besides this, all of the linking works correctly.

Type Analysis

The type analysis takes place in src/semantics/type_analysis.rs. When running the program with the --type option, all name errors and type errors are printed.

Code Generation

The only prerequisite to running the code generation phase is to have a version of the java runtime (JRE) installed. Upon being run with the --cgen switch, the compiler will check for a jasmin.jar file in the system's $HOME directory (~/jasmin.jar on UNIX-like systems). If ~/jasmin.jar does not already exist, the compiler has a version of jasmin bundled into the binary which it will write to disk so it can invoke it with java. This does require that Java is in the system's PATH.

When generating code, the compiler outputs one .j file per class declared in the program. These files are named after the class symbols, and get placed in a '.jasmin' directory created in the current directory when the compiler was executed. Then, the compiler automatically executes Jasmin and places the generated .class files into a directory called .class/, also created in the current directory. So the file structure after running the compiler with the --cgen option would look something like this:

my_program.emj
.jasmin/
  Main_0_.j
  Foo_6_.j
.class/
  Main_0_.class
  Foo_6_.class

To execute the compiled program, then, navigate to the .class directory and run java Main_0_.

Optimization

The control flow processing for the optimization stage is located in the directory src/control_flow/. The control flow graph data structure is defined in src/control_flow/mod.rs and is called Cfg.

The cfg command will output the .dot files representing the control flow graphs into a directory called .dot/ in the location where the command is executed. Each dot file is named after the name-analysis name for each function in each class of the program.

The optimization chosen for this assignment is live variable analysis. As an additional feature for this implementation, there is a command-line switch --cfgopt which will output an enhanced set of .dot graph files. The graphs produced with this switch are annotated with the variables which are live at each point in the graph.

Finally, as per the requirements, the --opt switch will produce bytecode generated directly from the control-flow graph (with dead variables eliminated), and --optinfo will print statistics for the bytecode count before and after optimizations.

Debugging info

If you're interested, you can view various debugging print statements by first running export RUST_LOG=emjc_lib=debug in your shell. Then each subsequent execution of the compiler will print information regarding steps in the compilation process. This is rather verbose, so it's recommended to keep this off unless you're curious about internal happenings. To turn debug messages back off, run export RUST_LOG=0.

Benchmarks

In the benches/resources folder I included all of the .emj files that were provided for reference. The benches/emjc_benches.rs file runs the lexer on each file and measures the time taken, reporting some statistics on the performance of emjc. To run the benchmarks, just run cargo bench.

About

A compiler for a Java subset written in Rust, for RIT CSCI-742

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages