Any better way for whitespace to depend on context? #538
I am considering moving a handwritten parser into Ohm for improved extensibility/maintainability. One feature of the target language is that the top level consists of a list of expressions that are separated by newlines, but within delimited subexpressions (e.g. parenthesized expressions), newlines are allowed to occur and are ignored like all other whitespace. In the handwritten parser this feature is very conveniently handled via a tokenizer in which the whitespace recognizer checks the nesting level, and either includes or excludes newlines from whitespace accordingly. However, so far as I can tell, there is no way to connect Ohm to a handwritten tokenizer and even if so, there's no api for the parsing state for such a possible tokenizer to query the "nesting level". In other words, there don't seem to be any hooks to provide this sort of context sensitivity in an Ohm grammar. However, I have managed to get something like this working starting with the Arithmetic example grammar, via doubling the grammar using parameterized rules: This grammar seems to pretty much correctly parse the analogue for the Arithmetic language of this feature in my target language. In other words But it is rather cumbersome to have given up the automatic whitespace ignoring of syntactic rules, and instead to be forced to explicitly write and then the automatic whitespace ignoring would use the incoming space parameter as its definition of whitespace, as opposed to the ambient-grammar rule So my questions are |
Replies: 2 comments 3 replies
I faced a similar issue with parsing Markdown and ended up parsing in two phases. The first phase chops the document up into blocks, and the second phase parses a block into individual lines and spans.
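The commenter's actual code isn't shown; as a minimal sketch of the two-phase idea (the function names and the blank-line block delimiter are assumptions for illustration):

```javascript
// Phase 1: chop the document into blocks separated by blank lines.
function splitIntoBlocks(doc) {
  return doc
    .split(/\n\s*\n/)            // blank line = block boundary (an assumption)
    .map((b) => b.trim())
    .filter((b) => b.length > 0);
}

// Phase 2: parse each block into its individual lines.
// (A real Markdown parser would further split lines into spans.)
function parseBlock(block) {
  return block.split('\n').map((line) => line.trim());
}

const doc = 'First paragraph,\nstill first.\n\nSecond paragraph.';
const blocks = splitIntoBlocks(doc);
// blocks → ['First paragraph,\nstill first.', 'Second paragraph.']
```

In an Ohm setting, each phase could be its own grammar, with the first grammar's semantics producing the block strings that are fed to the second.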
@gwhitney I think this is probably the best solution at the moment. FYI, to completely eliminate the implicit space skipping, you can change all of your rule names to begin with a lowercase letter. (Spaces are implicitly skipped in the body of syntactic rules, which are rules that begin with an uppercase letter. Rules that begin with a lowercase letter are lexical rules, and don't have implicit space skipping.) We don't currently have any explicit support for a separate tokenization steps. However, you can write a tokenizer that produces a new string, which is then parsed by a separate grammar. Of course, to do that, you need to figure out how to represent the post-tokenization result as a string. Also, if it's not a 1-to-1 character mapping, then any errors that occur during the 2nd phase may need to be converted to the proper line/column in the original string. E.g., your original input text might be I'm open to adding features to improve these issues. I think it would be nice to have some explicit support for separate tokenization/preprocessing steps. |