Overhaul the lexer parse and interpreter architecture to a better scheme #6

YJDoc2 · 2022-09-19T05:48:11Z

Currently the overall architecture works, but it is pretty hideous. The choices were made for various (not necessarily correct) reasons at the time this was written, and they now need to be replaced with better ones.

There is one lalrpop file which generates the "initial" parser, whose job is to take raw text input, make sure the syntax is correct, and generate a list of label maps, data, and code instructions (as strings).
Next is the interpreter, which again takes the instructions as strings, parses them, and runs them.
Finally there is a print parser, dedicated to parsing and running print instructions.

This is horrible for many reasons:

- Three parsers means it takes a lot of time to generate each one from its lalrpop file, which is pretty irritating during development.
- The print parser does not need to exist; its syntax is simple enough to lex and parse by hand.
- The initial parser uses the default lexer from lalrpop, which
  - does not report \n, so we have to go through the input before everything else to build a newline mapping
  - defines tokens using regexes, so there are conflicts when two token definitions overlap. E.g. see issue blind student #5 (comment): there, the regex for db text gobbles up everything up to the last quote it can find, so it includes the next 'db' and its text in the string. If we try to fix that regex, it collides with the string regex. We cannot stop the db string at EOL, as lalrpop does not give access to \n.
- We really don't need to parse the instructions again from text for interpreting. We can just use an enum to indicate instructions, store the related params in it, and match on it, which will be a much better scheme overall.
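The enum-and-match idea from the last point could look roughly like the sketch below. This is only an illustration under assumptions: the names `Reg`, `Instruction`, and `Vm`, and the handful of instructions shown, are hypothetical and not taken from the project; flags and memory are ignored.

```rust
/// Hypothetical 16-bit registers; the real project may model these differently.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Reg {
    Ax,
    Bx,
    Cx,
    Dx,
}

/// One variant per instruction, with operands stored as typed fields
/// instead of being re-parsed from text at run time.
#[derive(Debug, Clone, PartialEq)]
pub enum Instruction {
    MovImm { dst: Reg, value: u16 },
    MovReg { dst: Reg, src: Reg },
    AddReg { dst: Reg, src: Reg },
    Jmp { target: usize },
    Hlt,
}

/// Minimal machine state for the sketch: four registers and an instruction pointer.
#[derive(Debug, Default)]
pub struct Vm {
    regs: [u16; 4],
    ip: usize,
}

impl Vm {
    pub fn reg(&self, r: Reg) -> u16 {
        self.regs[r as usize]
    }

    /// Runs until `Hlt` or until the instruction pointer leaves the program.
    pub fn run(&mut self, program: &[Instruction]) {
        while let Some(ins) = program.get(self.ip) {
            self.ip += 1;
            match *ins {
                Instruction::MovImm { dst, value } => self.regs[dst as usize] = value,
                Instruction::MovReg { dst, src } => {
                    self.regs[dst as usize] = self.regs[src as usize];
                }
                Instruction::AddReg { dst, src } => {
                    // 16-bit addition wraps on overflow (flags are ignored here).
                    let sum = self.regs[dst as usize].wrapping_add(self.regs[src as usize]);
                    self.regs[dst as usize] = sum;
                }
                Instruction::Jmp { target } => self.ip = target,
                Instruction::Hlt => break,
            }
        }
    }
}
```

With this shape, the initial parser would build a `Vec<Instruction>` once, and the interpreter becomes a single `match` with no second parsing pass.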

Currently the two strategies are:

- Make a custom lexer which will be used with lalrpop as the parser to do the initial parsing. The custom lexer will take care of newline mapping, as well as handling upper/lower case letters.
- Make a custom lexer + (recursive descent?) parser, and remove the lalrpop dependency completely.

Even though the second option is desirable, it is also trickier, so first shifting to a custom lexer, and then separately to a custom parser, might be a better way.
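A hand-written lexer for the first step might look like the sketch below. It is written under assumptions: the `Token`, `LexError`, and `Lexer` names and the token set are hypothetical, and only decimal numbers are handled. It yields `Result<(start, token, end), error>` triples, which is the shape lalrpop's documentation describes for external lexers. Newlines come out as real tokens, words are lowercased, and a string is never allowed to run past the end of its line.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Word(String), // mnemonics, registers, labels; always lowercased
    Number(u16),  // decimal only in this sketch
    Str(String),
    Comma,
    Colon,
    Newline,
}

#[derive(Debug, Clone, PartialEq)]
pub enum LexError {
    UnterminatedString(usize),
    BadNumber(usize),
    UnexpectedChar(usize, char),
}

pub struct Lexer<'a> {
    src: &'a str,
    pos: usize, // byte offset into `src`
}

impl<'a> Lexer<'a> {
    pub fn new(src: &'a str) -> Self {
        Lexer { src, pos: 0 }
    }

    /// Advances while `pred` holds and returns the consumed slice.
    fn take_while(&mut self, pred: impl Fn(char) -> bool) -> &'a str {
        let start = self.pos;
        let rest = &self.src[start..];
        let len = rest.find(|c: char| !pred(c)).unwrap_or(rest.len());
        self.pos += len;
        &self.src[start..self.pos]
    }
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Result<(usize, Token, usize), LexError>;

    fn next(&mut self) -> Option<Self::Item> {
        // Skip horizontal whitespace only; '\n' is a real token.
        self.take_while(|c| c == ' ' || c == '\t' || c == '\r');
        let start = self.pos;
        let c = self.src[start..].chars().next()?;
        let tok = match c {
            '\n' => { self.pos += 1; Token::Newline }
            ',' => { self.pos += 1; Token::Comma }
            ':' => { self.pos += 1; Token::Colon }
            '"' => {
                self.pos += 1;
                // Stop at the closing quote OR at end of line, whichever comes first.
                let body = self.take_while(|c| c != '"' && c != '\n');
                if !self.src[self.pos..].starts_with('"') {
                    return Some(Err(LexError::UnterminatedString(start)));
                }
                self.pos += 1;
                Token::Str(body.to_string())
            }
            c if c.is_ascii_digit() => {
                let text = self.take_while(|c| c.is_ascii_digit());
                match text.parse() {
                    Ok(n) => Token::Number(n),
                    Err(_) => return Some(Err(LexError::BadNumber(start))),
                }
            }
            c if c.is_ascii_alphabetic() || c == '_' => {
                let text = self.take_while(|c| c.is_ascii_alphanumeric() || c == '_');
                Token::Word(text.to_ascii_lowercase())
            }
            other => {
                self.pos += other.len_utf8();
                return Some(Err(LexError::UnexpectedChar(start, other)));
            }
        };
        Some(Ok((start, tok, self.pos)))
    }
}
```

Because the string rule stops at `\n`, two consecutive `db "..."` lines lex as two separate strings, which is the overlap problem described for the regex-based lexer.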

Either way, we should make the initial parser generate an enum instead of text, remove the "interpreter parser", and remove the print parser as well.

Tracking:

- Replace the print parser with a custom lexer + parser
- Add a lexer for "normal" asm, i.e. the main lexer (possibly integrate with the print parser somehow?)
- Integrate this lexer's tokens into lalrpop using the custom token support which lalrpop provides, so at least the issue mentioned above can be mitigated in the short term
- Define an enum for asm opcodes, so the original / lalrpop parser can (eventually) emit this instead of text
- Port the "initial" parser from emitting text instructions to the enum defined above; simultaneously port the "interpreter" from lalrpop to a giant match statement on this enum's values
- Add a custom (recursive descent) parser for the "initial" parser, so that the lalrpop dependency will be completely removed. This is still up for discussion; we need to see if that will actually provide any benefit, otherwise with custom tokens the lalrpop parser file will be much simpler anyway.

Just noticed that the 8086 manual also includes hex codes for instructions. If we can use them, we can actually store instructions in memory and remove that barrier.
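To make the hex-code idea concrete, here is a rough sketch of fetching and decoding instructions straight out of a byte array. The `Decoded`, `decode`, and `run` names are hypothetical. Only four one-operand-free encodings are covered (0x90 NOP, 0xF4 HLT, 0x40 INC AX, and 0xB8 followed by a little-endian 16-bit immediate for MOV AX, imm16); everything else, including ModR/M decoding, is left out.

```rust
/// A tiny subset of 8086 encodings, decoded from raw memory bytes.
#[derive(Debug, PartialEq)]
pub enum Decoded {
    Nop,
    Hlt,
    IncAx,
    MovAxImm(u16),
}

/// Decodes one instruction at `ip`, returning it with its length in bytes.
/// Returns `None` for an unknown opcode or a truncated instruction.
pub fn decode(mem: &[u8], ip: usize) -> Option<(Decoded, usize)> {
    match *mem.get(ip)? {
        0x90 => Some((Decoded::Nop, 1)),
        0xF4 => Some((Decoded::Hlt, 1)),
        0x40 => Some((Decoded::IncAx, 1)),
        0xB8 => {
            // The immediate is stored low byte first.
            let lo = *mem.get(ip + 1)?;
            let hi = *mem.get(ip + 2)?;
            Some((Decoded::MovAxImm(u16::from_le_bytes([lo, hi])), 3))
        }
        _ => None,
    }
}

/// Executes from address 0 until HLT and returns the final AX value,
/// or `None` if a byte cannot be decoded.
pub fn run(mem: &[u8]) -> Option<u16> {
    let (mut ax, mut ip) = (0u16, 0usize);
    loop {
        let (ins, len) = decode(mem, ip)?;
        ip += len;
        match ins {
            Decoded::Nop => {}
            Decoded::Hlt => return Some(ax),
            Decoded::IncAx => ax = ax.wrapping_add(1),
            Decoded::MovAxImm(v) => ax = v,
        }
    }
}
```

For example, the bytes `B8 29 00 40 90 F4` would load 41 into AX, increment it, skip a NOP, and halt.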

ADIthaker · 2022-09-21T06:40:49Z

I would love to work on this issue. Can you send me some resources so I can learn the relevant parts of compiler design for this project?

YJDoc2 · 2022-09-21T07:45:58Z

Hey, thanks a lot for your interest. https://craftinginterpreters.com/ is my go-to suggestion for learning about compilers.

Do you have any preference for approach 1 or 2?

I have a basic lexer written for another project, which I can share in a gist, and you can adapt it for this. A good idea might be to start with the print interpreter, and convert it from lalrpop to hand-written code. It's pretty small and pretty easy. Then we can think about converting the rest.

What do you think?

ADIthaker · 2022-09-23T08:41:52Z

I think we can implement strat 1 first, and then think about removing lalrpop completely. Starting with the print parser sounds good to me; it will help me understand this project better, piece by piece.

YJDoc2 · 2022-09-26T06:03:01Z

Great! So how do you want to proceed? Do you need any help?

YJDoc2 · 2023-03-09T06:06:14Z

FYI, ADIthaker will not be proceeding with this due to some personal issues.

Any reader who is interested can comment and take up parts 👍

YJDoc2 added the enhancement and help wanted labels Sep 19, 2022

YJDoc2 pinned this issue Sep 21, 2022

YJDoc2 changed the title ~~Overhall the lexer parse and interpreter architecture to a better scheme~~ Overhaul the lexer parse and interpreter architecture to a better scheme Sep 21, 2022

YJDoc2 mentioned this issue Mar 9, 2023

next doesn't print some instructions correctly #15

Open
