Overhaul the lexer parse and interpreter architecture to a better scheme #6

Open
6 tasks
YJDoc2 opened this issue Sep 19, 2022 · 5 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@YJDoc2
Owner

YJDoc2 commented Sep 19, 2022

The current overall architecture works, but it is pretty hideous. The choices were made for various (not necessarily correct) reasons at the time this was written, and they now need to be replaced with better ones.

  • There is one lalrpop file which generates the "initial" parser, whose job is to take the raw text input, make sure the syntax is correct, and generate a list of label maps, data, and code instructions (as strings).
  • Next is the interpreter, which again takes the instructions as strings, parses them, and runs them.
  • Finally, there is a print parser, dedicated to parsing and running print instructions.

This is horrible for many reasons:

  • Three parsers means it takes a lot of time to generate each one from its lalrpop file, which is pretty irritating during development.
  • The print parser does not need to exist; its syntax is simple enough to lex and parse by hand.
  • The initial parser uses the default lexer from lalrpop, which:
    • does not report \n, so we have to make a pass over the input before anything else to build the newline mapping
    • causes conflicts when two token definitions overlap, because of the way we define tokens using regexps. For example, see issue blind student #5 (comment): the regexp for db text gobbles up everything up to the last quote it can find, so it includes the next 'db' and its text in the string. If we try to fix that regexp, it collides with the string regexp. We cannot stop the db string at EOL, as lalrpop does not give access to \n.
  • We really don't need to parse the instructions again from text for interpreting. We can just use an enum to represent instructions, store the related params in it, and match on it, which will be a much better scheme overall.
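
To make the last point concrete, here is a rough sketch of what "an enum plus a match" could look like. All the names here (Reg, Instruction, Machine) are made up for illustration and are not the project's actual types; flags and memory are left out entirely.

```rust
// Hypothetical types, for illustration only; not the project's real ones.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reg {
    Ax,
    Bx,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Instruction {
    MovImm(Reg, u16), // MOV reg, imm16
    AddReg(Reg, Reg), // ADD dst, src
    Hlt,
}

#[derive(Debug, Default)]
struct Machine {
    ax: u16,
    bx: u16,
    halted: bool,
}

impl Machine {
    fn reg(&self, r: Reg) -> u16 {
        match r {
            Reg::Ax => self.ax,
            Reg::Bx => self.bx,
        }
    }

    fn reg_mut(&mut self, r: Reg) -> &mut u16 {
        match r {
            Reg::Ax => &mut self.ax,
            Reg::Bx => &mut self.bx,
        }
    }

    // One match on the enum replaces the second, string-based parse.
    fn step(&mut self, ins: Instruction) {
        match ins {
            Instruction::MovImm(dst, value) => *self.reg_mut(dst) = value,
            Instruction::AddReg(dst, src) => {
                let rhs = self.reg(src);
                let slot = self.reg_mut(dst);
                // Flags are not modelled in this sketch.
                *slot = slot.wrapping_add(rhs);
            }
            Instruction::Hlt => self.halted = true,
        }
    }

    // Runs instructions in order until the list ends or HLT is executed.
    fn run(&mut self, program: &[Instruction]) {
        for &ins in program {
            if self.halted {
                break;
            }
            self.step(ins);
        }
    }
}
```

The initial parser would build a Vec<Instruction> once, and the interpreter would only ever see typed values, so no instruction text needs to be parsed a second time.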

Currently the two strategies are:

  • Make a custom lexer which will be used with lalrpop as the parser to do the initial parsing. The custom lexer will take care of newline mapping, as well as handling capital/small letters.
  • Make a custom lexer + (recursive descent?) parser, and remove the lalrpop dependency completely.

Even though the second option is desirable, it is equally tricky, so shifting to a custom lexer first, and then to a custom parser as a separate step, might be a better way.
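
As a rough idea of what the custom lexer could do, here is a minimal hand-written sketch. It is only an assumption of how things might look: the Token variants, LexError, and lex are hypothetical names, only decimal numbers are handled, and adapting the output to the (start, token, end) triples that lalrpop's external lexer support expects is left out. The point is that a hand-written lexer sees \n, so it can track line numbers, stop a quoted string at end of line, and normalise letter case.

```rust
// Hypothetical token set, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Ident(String), // mnemonics, labels, register names; upper-cased
    Number(u32),   // decimal literal only in this sketch
    Str(String),   // quoted text; never spans a line break
    Comma,
    Newline,
}

#[derive(Debug, PartialEq)]
struct LexError {
    line: usize,
    message: String,
}

/// Returns each token paired with its 1-based line number.
fn lex(src: &str) -> Result<Vec<(Token, usize)>, LexError> {
    let chars: Vec<char> = src.chars().collect();
    let mut out = Vec::new();
    let mut line = 1;
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            '\n' => {
                out.push((Token::Newline, line));
                line += 1;
                i += 1;
            }
            ' ' | '\t' | '\r' => i += 1,
            ',' => {
                out.push((Token::Comma, line));
                i += 1;
            }
            '"' => {
                // Scan to the closing quote, but give up at end of line.
                let start = i + 1;
                let mut end = start;
                while end < chars.len() && chars[end] != '"' && chars[end] != '\n' {
                    end += 1;
                }
                if end >= chars.len() || chars[end] != '"' {
                    return Err(LexError { line, message: "unterminated string".into() });
                }
                out.push((Token::Str(chars[start..end].iter().collect()), line));
                i = end + 1;
            }
            c if c.is_ascii_digit() => {
                let start = i;
                while i < chars.len() && chars[i].is_ascii_digit() {
                    i += 1;
                }
                let text: String = chars[start..i].iter().collect();
                let value = text.parse::<u32>().map_err(|_| LexError {
                    line,
                    message: format!("number out of range: {}", text),
                })?;
                out.push((Token::Number(value), line));
            }
            c if c.is_ascii_alphabetic() || c == '_' => {
                let start = i;
                while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                    i += 1;
                }
                let text: String = chars[start..i].iter().collect();
                // Upper-casing here makes `mov` and `MOV` the same token.
                out.push((Token::Ident(text.to_ascii_uppercase()), line));
            }
            other => {
                return Err(LexError {
                    line,
                    message: format!("unexpected character {:?}", other),
                });
            }
        }
    }
    Ok(out)
}
```

With this, two consecutive db lines each yield their own Str token, and a missing closing quote is reported on the line where the string started instead of swallowing the following lines.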

Either way, we should make the initial parser generate an enum instead of text, remove the "interpreter parser", and remove the print parser as well.

Tracking:

  • Replace the print parser with a custom lexer + parser
  • Add a lexer for "normal" asm, i.e. the main lexer (possibly integrate with the print parser somehow?)
  • Integrate this lexer's tokens into lalrpop using the custom token support which lalrpop provides, so at least the issue mentioned above can be mitigated in the short term
  • Define an enum for asm opcodes, so the original / lalrpop parser can (eventually) emit this instead of text
  • Port the "initial" parser from emitting text instructions to emitting the enum defined above; simultaneously port the "interpreter" from lalrpop to a giant match statement on the enum's values
  • Add a custom (recursive descent) parser for the "initial" parser, so that the lalrpop dependency will be completely removed. This is still up for discussion; we need to see if that will actually provide any benefit, otherwise with custom tokens the lalrpop parser file will be much simpler anyway.

Just noticed that the 8086 manual also includes hex codes for instructions; if we can use them, we can actually store instructions in memory and remove that barrier.
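
A small sketch of that idea, assuming a tiny hypothetical subset of instructions: each enum value is turned into its opcode bytes so the program can live in the emulated memory. The names (Reg16, Instr, encode, assemble) are made up for illustration, and the byte values used (0x90 for NOP, 0xF4 for HLT, 0xB8 + register for MOV r16, imm16, 0xCD for INT) should be double-checked against the manual before relying on them.

```rust
// Hypothetical subset; the discriminant is the register number
// used in the MOV r16, imm16 opcode.
#[derive(Debug, Clone, Copy)]
enum Reg16 {
    Ax = 0,
    Cx = 1,
    Dx = 2,
    Bx = 3,
}

#[derive(Debug, Clone, Copy)]
enum Instr {
    Nop,
    Hlt,
    MovImm16(Reg16, u16), // MOV reg, imm16
    Int(u8),              // INT imm8
}

// Appends the machine-code bytes for one instruction to `mem`.
fn encode(ins: Instr, mem: &mut Vec<u8>) {
    match ins {
        Instr::Nop => mem.push(0x90),
        Instr::Hlt => mem.push(0xF4),
        Instr::MovImm16(reg, imm) => {
            mem.push(0xB8 + reg as u8);
            // The immediate is stored low byte first.
            mem.extend_from_slice(&imm.to_le_bytes());
        }
        Instr::Int(n) => mem.extend_from_slice(&[0xCD, n]),
    }
}

// Encodes a whole program into one contiguous byte buffer.
fn assemble(program: &[Instr]) -> Vec<u8> {
    let mut mem = Vec::new();
    for &ins in program {
        encode(ins, &mut mem);
    }
    mem
}
```

For example, assembling MOV BX, 0x1234 followed by INT 0x21 and HLT gives the bytes BB 34 12 CD 21 F4. Running from memory would also need a decoder going the other way, which is not shown here.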

@YJDoc2 YJDoc2 added enhancement New feature or request help wanted Extra attention is needed labels Sep 19, 2022
@ADIthaker

I would love to work on this issue. Can you send me some resources so I can learn the relevant parts of compiler design for this project?

@YJDoc2
Owner Author

YJDoc2 commented Sep 21, 2022

Hey, thanks a lot for your interest. https://craftinginterpreters.com/ is my go-to suggestion for learning about compilers.

Do you have any preference for approach 1 or 2?

I have a basic lexer written for another project, which I can share in a gist, and you can adapt it for this. A good idea might be to start with the print interpreter and convert it from lalrpop to hand-written code. It's pretty small and pretty easy. Then we can think about converting the rest.

What do you think?

@YJDoc2 YJDoc2 pinned this issue Sep 21, 2022
@YJDoc2 YJDoc2 changed the title from "Overhall the lexer parse and interpreter architecture to a better scheme" to "Overhaul the lexer parse and interpreter architecture to a better scheme" Sep 21, 2022
@ADIthaker

I think we can implement strategy 1 first and then think about removing lalrpop completely. Starting with the print parser sounds good to me; it will help me understand this project better, piece by piece.

@YJDoc2
Owner Author

YJDoc2 commented Sep 26, 2022

Great! So how do you want to proceed? Do you need any help?

@YJDoc2
Owner Author

YJDoc2 commented Mar 9, 2023

FYI, ADIthaker will not be proceeding with this due to some personal issues.

Any reader who is interested can comment and take up parts 👍
