- A Map Of The Territory
- Scanning
- or lexing (lexical analysis)
- taking a string of characters and chunking them into a series of tokens
- ignores whitespace and comments
- Parsing
- where our syntax gets a grammar: composing larger expressions out of smaller parts
- it builds a tree structure that mirrors the nested nature of the grammar
- called a Parse Tree or an Abstract Syntax Tree (AST)
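- as a rough sketch (close in shape to the classes jlox generates; `Token` is the type described under Scanning below), a binary expression node could look like this:
```java
// A binary expression like `1 + 2` becomes a tree node holding its two
// operand subtrees and the operator token.
abstract class Expr {
  static class Binary extends Expr {
    final Expr left;
    final Token operator;
    final Expr right;

    Binary(Expr left, Token operator, Expr right) {
      this.left = left;
      this.operator = operator;
      this.right = right;
    }
  }

  static class Literal extends Expr {
    final Object value;  // e.g. a Double or String

    Literal(Object value) { this.value = value; }
  }
}
```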
- Static Analysis
- first binding or resolution
- for each identifier, find out where it's defined and wire the two together
- this is where scope comes into play
- this is also where we type check if the language is statically typed
- then we store the results either back into the syntax tree or into a symbol table or into a new data structure
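- a toy sketch of the symbol-table option (names here are illustrative, not from the book):
```java
import java.util.HashMap;
import java.util.Map;

// Toy symbol table: each declaration registers an entry, and each later
// use of an identifier is wired back to that entry.
class SymbolTable {
  private final Map<String, Symbol> symbols = new HashMap<>();

  void declare(String name, Symbol symbol) {
    symbols.put(name, symbol);
  }

  Symbol resolve(String name) {
    Symbol symbol = symbols.get(name);
    if (symbol == null) throw new RuntimeException("Undefined: " + name);
    return symbol;
  }
}

class Symbol {
  final String name;
  final int declarationLine;  // where the identifier was defined

  Symbol(String name, int declarationLine) {
    this.name = name;
    this.declarationLine = declarationLine;
  }
}
```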
- Front End includes
- scanning, parsing, static analysis
- Intermediate representations
- a compiler is like a pipeline where each stage's job is to organise the data in a way that makes the next stage simpler to implement
- the front end of the pipeline is specific to the source language
- the back end is concerned with the CPU architecture
- in the middle the code may be stored in some intermediate representation
- sharing an intermediate representation means we can write one front end per source language and one back end per target CPU architecture
- we can then mix and match those easily without having to rewrite everything for each specific case
- Code generation
- generating the CPU instructions or VM bytecode
- it's called bytecode because each instruction is often a single byte long
- Virtual Machine
- a program that emulates a hypothetical chip supporting your virtual architecture at runtime
- it's slower because every instruction must be simulated at runtime
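- a toy sketch of such a dispatch loop, with a made-up three-instruction bytecode (each opcode fits in a single byte, which is where the name bytecode comes from):
```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy stack-based VM: decode and simulate one instruction per loop
// iteration. That per-instruction dispatch is the overhead noted above.
class TinyVM {
  static final byte OP_CONST = 0;  // push the next byte as a constant
  static final byte OP_ADD   = 1;  // pop two values, push their sum
  static final byte OP_PRINT = 2;  // pop a value and print it

  void run(byte[] code) {
    Deque<Integer> stack = new ArrayDeque<>();
    int ip = 0;  // instruction pointer
    while (ip < code.length) {
      switch (code[ip++]) {
        case OP_CONST: stack.push((int) code[ip++]); break;
        case OP_ADD:   stack.push(stack.pop() + stack.pop()); break;
        case OP_PRINT: System.out.println(stack.pop()); break;
      }
    }
  }

  public static void main(String[] args) {
    // 1 + 2, then print.
    new TinyVM().run(new byte[] {OP_CONST, 1, OP_CONST, 2, OP_ADD, OP_PRINT});
  }
}
```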
- [[Runtime]]
- implements the services that we need while the program is running
- for example, garbage collection or "instance of" tests, etc...
- in a compiled language the runtime gets inserted in the executable
- otherwise the runtime lives in the [[VM]] or the interpreter
- Shortcuts and Alternate Routes
- Single-pass compilers
- interleaves parsing, analysis and code generation
- no syntax trees or intermediate representation
- this restricts the design of the language
- for example, C or Pascal
- Tree-walk interpreters
- executes code right after parsing it to an AST
- the interpreter traverses the tree, evaluating each node as it goes to run the program
- it tends to be slow
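- a minimal sketch of that traversal, building on the `Expr` classes sketched earlier and assuming a `TokenType` enum with `PLUS`/`MINUS`; the real jlox uses the visitor pattern rather than `instanceof` checks:
```java
// Recursively evaluate an expression tree: evaluate the children first,
// then apply the operator at this node.
Object evaluate(Expr expr) {
  if (expr instanceof Expr.Literal) {
    return ((Expr.Literal) expr).value;
  }
  if (expr instanceof Expr.Binary) {
    Expr.Binary binary = (Expr.Binary) expr;
    double left = (double) evaluate(binary.left);
    double right = (double) evaluate(binary.right);
    switch (binary.operator.type) {
      case PLUS:  return left + right;
      case MINUS: return left - right;
    }
  }
  throw new RuntimeException("Unhandled expression.");
}
```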
- [[Transpiler]]s
- writing a front end for your language that produces source code for some other language
- then you can use the existing compilation tools of this other language
- old name was source-to-source compiler or transcompiler
- for example, there are tons of transpilers that target Javascript
- [[Just In Time Compilation]]
- what the JVM and most Javascript interpreters do
- it compiles the program to native code when loaded
- so it doesn't need to know the user's computer architecture in advance
- [[Compiler]]s and [[Interpreter]]s
- What's the difference between a compiler and an interpreter?
- by compiler we mean that it translates source code to some other form without executing it
- by interpreter we mean that it takes source code and executes it immediately
- What about CPython?
- the code is parsed, converted to bytecode, and run in a VM
- so it looks like an interpreter from the user's perspective
- but there is also some compiling going on
- so it's both, it's an interpreter and it has a compiler
- In practice, most scripting languages are like that
- it's an interpreter and it has a compiler
- Go, Scala, JS V8, C#, etc...
- the second interpreter we'll build will be like that as well
- Difference between arguments and parameters
- an argument is the actual value you pass to a function when you call it
- so a function call has an argument list
- a parameter is a variable that holds the value of the argument inside the body of the function
- so a function declaration has a parameter list
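- a quick Java illustration of the distinction:
```java
class ArgsVsParams {
  // `a` and `b` are parameters: variables that hold the argument values
  // inside the function body (the declaration has a parameter list).
  static int add(int a, int b) {
    return a + b;
  }

  public static void main(String[] args) {
    // 1 and 2 are arguments: the actual values passed at the call site
    // (the call has an argument list).
    int sum = add(1, 2);
    System.out.println(sum);  // 3
  }
}
```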
- Why might any language want to be object-oriented?
- it's true that "all inheritance all the time" can create monstrous class hierarchies
- but objects are pretty handy, we need some way of defining compound data types
- having methods on objects lets us avoid the need to prefix all our functions with the name of the data type they operate on
- [[Class]]es or [[Prototype]]s
- Classes came first (C++, Java, ...)
- Prototypes got popular with Javascript
- Class-based languages have two core concepts: instances and classes
- instances store the state for each object and have a reference to the class
- classes contain the methods and inheritance chain
- so to call a method on an instance, you look up the instance's class and then find the method there
- Prototype-based languages merge the two concepts: there are only objects, no classes
- each object may contain state and methods
- objects can directly inherit from (delegate to) each other
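- a toy model of the class-based lookup described above (names are illustrative; `Runnable` stands in for a real method representation):
```java
import java.util.HashMap;
import java.util.Map;

// Methods and the inheritance chain live in the class...
class ToyClass {
  final ToyClass superclass;
  final Map<String, Runnable> methods = new HashMap<>();

  ToyClass(ToyClass superclass) { this.superclass = superclass; }

  Runnable findMethod(String name) {
    if (methods.containsKey(name)) return methods.get(name);
    if (superclass != null) return superclass.findMethod(name);
    return null;  // walked the whole chain without finding it
  }
}

// ...while the instance stores state plus a reference to its class.
class ToyInstance {
  final ToyClass klass;
  final Map<String, Object> fields = new HashMap<>();

  ToyInstance(ToyClass klass) { this.klass = klass; }

  Runnable method(String name) {
    return klass.findMethod(name);  // lookup goes through the class
  }
}
```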
- Building a Tree-Walk Interpreter
- Scanning
- Lexemes and Tokens
- lexemes are the raw substrings of the source code (`var` `=` `"something"` `;` etc...)
- a token is a bundle of a lexeme with other data about that lexeme
- Data found in a token
- Token type (keyword, operator, punctuation, etc...)
- Literal value
- Location information (line, column, length ...)
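- in the book's Java implementation the bundle looks roughly like this (`TokenType` is an enum of all the token kinds; jlox keeps only the line for location):
```java
// A token: a lexeme plus the data the later stages need about it.
class Token {
  final TokenType type;    // keyword, operator, punctuation, ...
  final String lexeme;     // the raw substring of source code
  final Object literal;    // parsed value, e.g. a Double or String
  final int line;          // location information for error reporting

  Token(TokenType type, String lexeme, Object literal, int line) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
  }
}
```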
- Regular Languages and Expressions
- The core of the Scanner
- it's a loop
- it consumes characters until it emits a token
- and then loops back and does it again until it reaches the end of the input
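- the shape of that loop, close to jlox's `scanTokens()` (assumes the scanner's `start`/`current`/`line` fields, a `tokens` list, and the helpers used below):
```java
List<Token> scanTokens() {
  while (!isAtEnd()) {
    start = current;  // we're at the beginning of the next lexeme
    scanToken();      // consume characters and maybe emit a token
  }

  tokens.add(new Token(TokenType.EOF, "", null, line));
  return tokens;
}
```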
- Lexical grammar
- it's the rules that determine how a particular language groups characters into lexemes
- so we can use a regex for each kind of lexeme and use those to match characters
- Recognizing Lexemes
- a lot of lexemes are single characters: `(`, `)`, `{`, `}`, `-`, `+`, `;`, etc...
- But single characters don't cover all operators
- `!` `=` `<` `>` can all be followed by another character that changes the operator
- so we have to consider those as two-character lexemes
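- jlox handles this with a conditional `match()` helper:
```java
// Consume the next character only if it's the one we expect; this is how
// `!=`, `==`, `<=` and `>=` become single two-character lexemes.
private boolean match(char expected) {
  if (isAtEnd()) return false;
  if (source.charAt(current) != expected) return false;

  current++;  // it matched, so it's part of the current lexeme
  return true;
}

// Inside scanToken()'s switch:
// case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
```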
- Longer Lexemes
- Handling `/` for division
- needs a bit more care because comments begin with a slash too
- when it's a comment, we keep consuming until the end of the line without adding a token
- so the scanner basically ignores comments (see the sketch below)
- this makes me think that maybe it's what [[Brendan Eich]] did
- he basically checked the scanner rule for end of comment (here `\n`) and used that to inject his [[Javascript]] inside [[HTML]] ??
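- the slash arm of `scanToken()`'s switch, close to the book's code:
```java
case '/':
  if (match('/')) {
    // A comment goes until the end of the line; no token is added.
    while (peek() != '\n' && !isAtEnd()) advance();
  } else {
    addToken(SLASH);
  }
  break;
```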
- Lookahead
- looking at an upcoming character without consuming it
- our `peek()` function has a one-character lookahead
- more characters of lookahead means a slower scanner
- most languages only peek one or two characters ahead
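- `peek()` itself, as in jlox:
```java
// Look at the current unconsumed character without consuming it.
private char peek() {
  if (isAtEnd()) return '\0';
  return source.charAt(current);
}
```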
- New lines and whitespace
- whitespace is ` ` `\r` `\t` and new line is `\n`
- whitespace does nothing in the scanner
- new line just increments the line count in the scanner
- both don't add any tokens
- String literals
- Strings
- they always begin with `"`
- then we consume characters until we hit the ending `"`
- if we reach the end before the string is closed, we throw an error
- we also update `line` each time we hit a newline
- when we create the token, we also pass the value of the string along with it, which we didn't do for other tokens
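- the whole method, close to jlox's `string()` (`Lox.error` is the book's error-reporting helper):
```java
private void string() {
  while (peek() != '"' && !isAtEnd()) {
    if (peek() == '\n') line++;  // strings can span lines
    advance();
  }

  if (isAtEnd()) {
    Lox.error(line, "Unterminated string.");
    return;
  }

  advance();  // the closing "

  // Trim the surrounding quotes and pass the value along with the token.
  String value = source.substring(start + 1, current - 1);
  addToken(TokenType.STRING, value);
}
```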
- Number literals
- only thing to note here is that looking past the decimal point requires a second character of lookahead
- because we don't want to consume the `.` until we know there is a digit after it
- then we use Java's `Double.parseDouble` method to parse the value for our token
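- close to jlox's `number()`, with the second character of lookahead in `peekNext()`:
```java
private void number() {
  while (isDigit(peek())) advance();

  // Only consume the `.` if a digit actually follows it.
  if (peek() == '.' && isDigit(peekNext())) {
    advance();  // consume the "."
    while (isDigit(peek())) advance();
  }

  addToken(TokenType.NUMBER, Double.parseDouble(source.substring(start, current)));
}

// Second character of lookahead.
private char peekNext() {
  if (current + 1 >= source.length()) return '\0';
  return source.charAt(current + 1);
}
```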
- Reserved Words and Identifiers
- we can't match keywords the way we matched multi-character operators
- you can't emit a token as soon as you encounter `or`: it could be the start of a variable named `orchid`
- maximal munch
- when two lexical grammar rules can both match a chunk of code that the scanner is looking at
- whichever grammar matches the most characters wins
- so in the previous example, `orchid` would win
- it means that we can't detect a reserved word until we've reached the end of what might instead be an identifier
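- close to jlox's approach (plus `java.util.Map`/`HashMap` imports): scan the whole identifier first, then check whether it's a reserved word; only a couple of the keyword entries are shown:
```java
private static final Map<String, TokenType> keywords = new HashMap<>();
static {
  keywords.put("or", TokenType.OR);
  keywords.put("var", TokenType.VAR);
  // ... one entry per reserved word
}

// Maximal munch: `orchid` consumes past `or`, so it ends up an identifier.
private void identifier() {
  while (isAlphaNumeric(peek())) advance();

  String text = source.substring(start, current);
  TokenType type = keywords.get(text);
  if (type == null) type = TokenType.IDENTIFIER;
  addToken(type);
}
```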