Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add context free proto file grammar #2

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

ObsidianMinor
Copy link

@ObsidianMinor ObsidianMinor commented Aug 24, 2019

This repository is looking lacking, so I thought I'd get it started. I also plan to write specs for proto2 and proto3, but in the meantime I decided to write this context free grammar to document a grammar that works for every proto file regardless of whether the syntax is valid in context (like incorrect syntax versions or invalid values for certain options)

spec/grammar.ebnf Show resolved Hide resolved

end_statement = ";" ;

aggregate_literal = "{" , { identifier , ":" , literal } , "}" ; (* whitespace and or comments seperate each field value *)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is a little tricky. Aggregate values referenced by an option must start with {. But what goes inside the braces is actually the protobuf text format, which is much more lenient than what you have here.

In particular:

  • You can use < and > around nested message values (not just { and })
  • The colon is optional for nested messages: foo { a: 1 } is valid
  • Repeated fields can use array-like notation: foos: ["a", "b", "c"] is valid
  • The identifier is not just a simple identifier -- it is a fully-qualified identifier and can optionally be surrounded by brackets or parentheses ([name] or (name)) to indicate that it is the name of an extension, nor a regular field.

import = "import" , ["weak" | "public"] , string_literal , end_statement ;
package = "package" , full_identifier , end_statement ;

option_name = (identifier | "(" , full_identifier , ")") , {"." , identifier } ;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trailing portions are defined here only as identifier, but they are also allowed to be "(", full_identifier, ")", to indicate extension fields inside of a nested message.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After checking the parser source code, it appears that an easier way to represent it would be to have an at-least one name_part rule like so.

name_part = identifier | ("(" , full_identifier , ")") ;
option_name = name_part , { "." , name_part } ;

If I'm correct then this allows for everything the parser allows, including

foo // simple identifiers
foo.bar // full identifiers
(foo).bar // extensions
(foo.bar).baz // extensions with full identifiers
(foo.bar).(baz) // extensions on extensions
(foo).(bar).(baz) // deep extensions on extensions

;

identifier = letter , { letter | decimal_digit | "_" } ;
full_identifier = identifier , { "." , identifier } ;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most places that accept a full_identifier do in fact support a preceding ".", to explicitly indicate an absolute fully-qualified name (e.g. not relative to the current context). This includes references to custom option names.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe this works for custom option names (at least anymore). I have a rule for those cases however where you can have a preceding dot called type_name and it's used in areas where preceding dot is allowed.

extend = "extend" , identifier , "{" , { field | group | end_statement } , "}" ;

rpc = "rpc" , identifier , "(" , ["stream"] , type_name , ")" , "returns" , "(" , ["stream"] , type_name , ")" , (("{" , { option , end_statement } , "}") | end_statement) ;
stream = "stream" , identifier , "(" , type_name , "," , type_name , ")" , (("{" , { option , end_statement } , "}") | end_statement) ;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is indicated in one of the grammars listed on the main docs site, protoc does not actually accept it. So it should be removed (from this grammar and from the docs site). I'm pretty sure this was transcribed from an internal version of protoc to support earlier versions of streaming stubby (the internal RPC framework at Google, pre-cursor to gRPC).

Streams are instead defined only with rpc keyword and the use of stream next to request or response type (or both(.

 * Allow seperated string literals
 * Expand upon aggregate literals format
 * Fixed incorrect option name rule
 * Removed stream service method type
@derekperkins
Copy link

Was there a reason this never got merged?

@teijeong
Copy link

+1, Was there a reason this never got merged?

@Logofile
Copy link
Member

Logofile commented Feb 1, 2022

ObsidianMinor, we were going to turn down this repository when we discovered this outstanding PR. Sorry for the losing track of this!

Would you mind redirecting the location for the file to https://github.com/protocolbuffers/protobuf/blob/master/docs/grammar.ebnf? I'll work to get the doc reviewed by our engineering team in the meantime, in case any changes are needed before we can accept the PR.

@jhump
Copy link

jhump commented Feb 1, 2022

FWIW, here's another alternative that describes the language, also in EBNF.
https://github.com/jhump/protocompile/blob/master/grammar/README.md

Unlike this document, it separates lexical analysis from the rest of the grammar productions in order to describe the nuance in tokenization relating to handling of whitespace and comments. It also has a slightly different way to interpret numeric literals, in an attempt to codify some of protoc's behavior. For example, in the face of "1to1000", the grammar in this PR suggests a tokenization of "1", "to", "1000", but protoc considers this a syntax error.

I have high confidence in the grammar as it is based on a yacc grammar that powers what I think might be the closest implementation to protoc itself.

@Logofile
Copy link
Member

Logofile commented Feb 1, 2022

Joshua, if you'd like to submit a PR for the same target location (so we can get all of the CLAs and such covered), we'd love to take a look at adding it to our docs. I'm still coming up-to-speed on EBNF, so I'll need to get some SWE eyes on it once the PR comes in if you do submit one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants