Any better way for whitespace to depend on context? #538
I am considering moving a handwritten parser into Ohm for improved extensibility/maintainability. One feature of the target language is that the top level consists of a list of expressions that are separated by newlines, but within delimited subexpressions (e.g. parenthesized expressions), newlines are allowed to occur and are ignored like all other whitespace. In the handwritten parser this feature is very conveniently handled via a tokenizer in which the whitespace recognizer checks the nesting level, and either includes or excludes newlines from whitespace accordingly. However, so far as I can tell, there is no way to connect Ohm to a handwritten tokenizer and even if so, there's no api for the parsing state for such a possible tokenizer to query the "nesting level". In other words, there don't seem to be any hooks to provide this sort of context sensitivity in an Ohm grammar. However, I have managed to get something like this working starting with the Arithmetic example grammar, via doubling the grammar using parameterized rules: This grammar seems to pretty much correctly parse the analogue for the Arithmetic language of this feature in my target language. In other words But it is rather cumbersome to have given up the automatic whitespace ignoring of syntactic rules, and instead to be forced to explicitly write and then the automatic whitespace ignoring would use the incoming space parameter as its definition of whitespace, as opposed to the ambient-grammar rule So my questions are |
Replies: 2 comments 3 replies
I faced a similar issue with parsing Markdown and ended up parsing in two phases. The first phase chops the document up into blocks, and the second phase parses a block into individual lines and spans.
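The commenter's actual code isn't shown; as a minimal sketch of the two-phase idea (the function names and the blank-line block delimiter are assumptions for illustration):

```javascript
// Phase 1: chop the document into blocks separated by blank lines.
function splitIntoBlocks(doc) {
  return doc
    .split(/\n\s*\n/)            // blank line = block boundary (an assumption)
    .map((b) => b.trim())
    .filter((b) => b.length > 0);
}

// Phase 2: parse each block into its individual lines.
// (A real Markdown parser would further split lines into spans.)
function parseBlock(block) {
  return block.split('\n').map((line) => line.trim());
}

const doc = 'First paragraph,\nstill first.\n\nSecond paragraph.';
const blocks = splitIntoBlocks(doc);
// blocks → ['First paragraph,\nstill first.', 'Second paragraph.']
```

In an Ohm setting, each phase could be its own grammar, with the first grammar's semantics producing the block strings that are fed to the second.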
@gwhitney I think this is probably the best solution at the moment. FYI, to completely eliminate the implicit space skipping, you can change all of your rule names to begin with a lowercase letter. (Spaces are implicitly skipped in the body of syntactic rules, which are rules that begin with an uppercase letter. Rules that begin with a lowercase letter are lexical rules, and don't have implicit space skipping.) We don't currently have any explicit support for a separate tokenization steps. However, you can write a tokenizer that produces a new string, which is then parsed by a separate grammar. Of course, to do that, you need to figure out how to represent the post-tokenization result as a string. Also, if it's not a 1-to-1 character mapping, then any errors that occur during the 2nd phase may need to be converted to the proper line/column in the original string. E.g., your original input text might be I'm open to adding features to improve these issues. I think it would be nice to have some explicit support for separate tokenization/preprocessing steps. |