Nearley parser for address format #13

baudehlo · 2017-02-11T19:46:52Z

This allows us to support all mail formats, including groups.

Includes all changes up to RFC 6854.

codecov-io · 2017-02-11T19:46:52Z

Codecov Report

Merging #13 into master will increase coverage by 4.55%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
+ Coverage   89.47%   94.02%   +4.55%     
==========================================
  Files           1        3       +2     
  Lines         171      251      +80     
  Branches       44       49       +5     
==========================================
+ Hits          153      236      +83     
+ Misses         18       15       -3

Impacted Files	Coverage Δ
lib/flatten.js	`91.3% <91.3%> (ø)`
index.js	`90.9% <96.42%> (+1.43%)`	✅
lib/address_format.js	`97.45% <ø> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b00588e...1b015f7.

msimerson

This morning I looked at this for a bit. I read the docs and a couple examples for nearley and it looks like a great tool. The problem with bodg [email protected] is annoying but I couldn't pin down where that issue is.

So I looked at it from the other end. Did Mail::Address actually return anything useful from that format?

$ perl -e 'use Data::Dumper; use Mail::Address; print Data::Dumper::Dumper(Mail::Address->parse("bodg [email protected]"));'
$VAR1 = bless( [
                 '',
                 'bodg',
                 ''
               ], 'Mail::Address' );
$VAR2 = bless( [
                 '',
                 'fred.ti.com',
                 ''
               ], 'Mail::Address' );

Nope. I spot checked a few others, just for fun against Mail::Address and, uh, I think there's a good reason it's no longer being developed.

So then I thought to myself, node-address-rfc2822 is a pretty lousy name for a module that's attempting to implement RFC 5322 & 6854. Perhaps it should have a more general name, like email-address or some such. And, there should be a more up-to-date set of valid email addresses one can test their parser against...

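So then I found email-addresses (https://github.com/jackbowman/email-addresses) on NPM. And part of its test suite is the corpus from is_email (https://github.com/dominicsayers/isemail), which has a spiffy web site (https://isemail.info) where you can test email addresses against their validator. Nifty.

There's another set of addresses here (https://github.com/snoj/email-validation/blob/master/tests/test-addresses.js) that would be good to drop into the test suite. At the very least, it'll increase our confidence in the parser.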

baudehlo · 2017-02-11T22:37:22Z

Yeah I checked Email::Address too, which is more up to date. It fails on that address too. So I just stripped it from the test suite (along with a few others).

I don't know about renaming it - people know it as RFCs 821 and 822 pretty well, so it's not a terrible name. And given email::addresses is used, I can't see what to do.

Most of those tests (particularly the is_email ones) are RFC 821 addresses, so won't help extend the test suite much.

Mostly I'd like feedback on the concept of switching to a grammar, rather than hacky regexps. The grammar is probably slower, but it seems to parse everything reasonably quickly now.

msimerson · 2017-02-11T22:51:21Z

> Mostly I'd like feedback on the concept of switching to a grammar, rather
> than hacky regexps. The grammar is probably slower, but it seems to parse
> everything reasonably quickly now.

I'm thinking that not too far down the road someone will want Haraka to join Postfix, Gmail, and the other modern SMTP agents that support SMTPUTF8. That will involve:

- RFC 6530 (https://tools.ietf.org/html/rfc6530) Overview and Framework for Internationalized Email
- RFC 6531 (https://tools.ietf.org/html/rfc6531) SMTP Extension for Internationalized Email
- RFC 6532 (https://tools.ietf.org/html/rfc6532) Internationalized Email Headers

When that day arrives, are we better off with nearley or a more purpose-built implementation like the one in email-addresses.js? (That's an honest question, I don't know the answer.) From out here in the cheap seats, a really awesome feature of the purpose-built parser is that we can get back useful error messages about specifically why an address is invalid.
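To make the error-reporting point concrete, here is a minimal sketch of a hand-written parser that says where and why an input was rejected. It covers only a deliberately tiny subset (dot-atom local part, "@", dot-atom domain); quoted strings, comments, groups and the obs- rules are left out. The function name parseAddrSpec and the result shape are hypothetical and are not the API of email-addresses or of this module.

```javascript
'use strict';

// Characters allowed in an RFC 5322 "atext" run (ASCII only in this sketch).
const ATEXT = /[A-Za-z0-9!#$%&'*+\-\/=?^_`{|}~]/;

// Hypothetical helper: parse `local@domain` where both sides are dot-atoms.
// Returns { ok: true, local, domain } or { ok: false, pos, reason }.
function parseAddrSpec(input) {
  let pos = 0;

  // Consume one dot-atom: atext runs separated by single dots.
  function dotAtom(what) {
    const start = pos;
    for (;;) {
      const runStart = pos;
      while (pos < input.length && ATEXT.test(input[pos])) pos++;
      if (pos === runStart) {
        return { error: `expected ${what} character` };
      }
      if (input[pos] !== '.') break;
      pos++; // skip the dot and require another atext run
    }
    return { text: input.slice(start, pos) };
  }

  const local = dotAtom('local-part');
  if (local.error) return { ok: false, pos, reason: local.error };

  if (input[pos] !== '@') return { ok: false, pos, reason: "expected '@'" };
  pos++;

  const domain = dotAtom('domain');
  if (domain.error) return { ok: false, pos, reason: domain.error };

  if (pos !== input.length) {
    return { ok: false, pos, reason: `unexpected character '${input[pos]}'` };
  }
  return { ok: true, local: local.text, domain: domain.text };
}

console.log(parseAddrSpec('fred@example.com'));
console.log(parseAddrSpec('bodg fred@example.com'));
```

Because every failure path knows the current position and which rule it was trying to match, the caller gets a message tied to a specific offset rather than a bare "no match".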

Performance would be interesting to compare, but so long as neither is incredibly slow, any implementation should be good enough.

baudehlo · 2017-02-11T23:48:06Z

I've added those other tests in btw, to the best of my ability. I'm not sure if an out of the box parser or nearley is better. In my experience a hand-built one is faster, but if you look at the internals of email-addresses, it's almost identical to what a parser would do (and is built the same way). I suspect using nearley makes it easier to support newer changes in the future, and the error reporting isn't terrible.

msimerson

+1

msimerson · 2017-02-11T23:54:20Z

It might be nice to update the README and specify precisely which RFCs this parser currently supports, which ones we mostly (I'm thinking legacy here) support, and which ones we anticipate supporting in the future (internationalization, maybe?).

baudehlo · 2017-02-12T21:02:15Z

Thinking about this a bit more - maybe nearley is the wrong answer for us. It had exponentially bad behaviour on longer strings and a recursive descent parser is probably better. We may be better off using email-addresses as a basis anyway and just use our module to keep the interface the same.

msimerson · 2017-02-12T22:42:30Z

> We may be better off using email-addresses as a basis anyway

That's the way this cat leans. I looked through the code in email-addresses and it's clean, tidy, well designed, and well tested. It looks straightforward to modify and maintain. It has exponentially more users, making it likely to be well maintained into the distant future.

> use our module to keep the interface the same.

I wouldn't even do that. There's only one other dependent on this module (besides Haraka). I'd update Haraka to use email-addresses and slap a great big "no longer maintained, we've moved to email-addresses and suggest that you do too" sign on the README.

baudehlo · 2017-02-12T22:46:20Z

Well for one, address-rfc2822 does much nicer things with the format() method. I don't think deprecating that completely is the right thing to do. Perhaps merging that code into email-addresses would be better?

msimerson · 2017-02-12T23:05:17Z

> Perhaps merging that code into email-addresses would be better?

Yes, that's exactly what I was thinking. It could even be a separate API file over there that wraps lib/email-addresses.js and externally returns the same responses as this module. Then this module is no longer necessary.

msimerson · 2017-02-12T23:13:17Z

Of course, it's also possible that the nice formatting stuff we do (I had forgotten that, and I'm kinda fond of nameCase) would be considered out-of-scope over there. If merging in those features and their syntactic sugar is not possible, maybe, as you said, the better approach is to just use email-addresses as our parser and continue maintaining this module. 🤷‍♂️

baudehlo · 2017-02-12T23:34:17Z

Let's ask the author. @jackbowman what are your thoughts?

msimerson · 2017-02-13T00:34:26Z

Also, email-addresses already has RFC 6532 support.

baudehlo · 2017-02-13T01:34:48Z

Yep. I didn't see a way to switch to Sender or Reply-To mode. And also not sure if it would parse Unicode content (the strict parser I first made did not).

jackbearheart · 2017-02-17T15:42:07Z

Hi folks. You're wondering if "pretty formatting" is out of scope from email-addresses, correct? I lean towards, yes, it's out of scope, as email-addresses is currently just a parser. I don't know exactly what you intend though so if you have an example I would be curious to hear it.

One thing email-addresses can currently do is pull the name and address out of the string, so maybe if we had jackbearheart/email-addresses#21 then "pretty printing" would be just pulling out the name and address from the parse and passing it to that function? That could be trivial and I would accept that kind of PR.

(Thanks for your interest in the parser side; I'm very happy to accept PRs for the parser.)

jackbearheart · 2017-02-17T17:09:53Z

One more note I forgot, since I saw you discussing performance. To be perfectly honest I have never evaluated the performance of email-addresses, so I suggest you test it for your use case. I've always wondered if building up the whole AST when parsing is horribly slow, compared to turning that off (see the wrap and add functions and their call sites).

baudehlo · 2017-02-17T17:47:56Z

We're looking to change the internals to use email-addresses, but one thing holding us back is that you dump comments on the floor. For formatting (which we're happy to keep in address-rfc2822) we need them. Is there any chance you can make them available?

Regarding performance, your parser is probably a lot quicker than nearley anyway, as nearley returns every possible parse result, which a recursive descent parser wouldn't, and email address grammar is ambiguous so it is important to use a left-first approach to the grammar.
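As a rough illustration of why reporting every parse gets expensive on an ambiguous grammar, the sketch below uses a toy grammar, S -> S S | "a", and counts how many distinct parse trees exist for a run of "a"s, next to a first-match parser that commits to a single tree. This is not nearley and not the address grammar; both function names are made up for the example.

```javascript
'use strict';

// Toy ambiguous grammar (not the address grammar):  S -> S S | "a"
// countAllParses(n) counts the distinct parse trees for a string of n "a"s,
// i.e. how many results an "every derivation" parser would have to report.
function countAllParses(n, memo = new Map()) {
  if (n === 1) return 1; // S -> "a"
  if (memo.has(n)) return memo.get(n);
  let total = 0;
  // S -> S S: try every split point between the two halves.
  for (let left = 1; left < n; left++) {
    total += countAllParses(left, memo) * countAllParses(n - left, memo);
  }
  memo.set(n, total);
  return total;
}

// A first-match parser commits to one derivation instead. This one builds a
// single left-nested tree in one pass over the input.
function parseFirstMatch(input) {
  if (!/^a+$/.test(input)) return null;
  let tree = 'a';
  for (let i = 1; i < input.length; i++) tree = [tree, 'a'];
  return tree;
}

for (const n of [2, 4, 8, 16]) {
  console.log(n, countAllParses(n));
}
console.log(JSON.stringify(parseFirstMatch('aaa')));
```

The counts grow very quickly with input length, while the first-match parser always does one pass and returns one tree, which is the trade-off being described here.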

jackbearheart · 2017-02-17T17:59:21Z

Re: comments, at the end of the parse, the whole AST is available, so it should be pretty easy to pull out comments. The comment nodes are named "comment" (https://github.com/jackbowman/email-addresses/blob/master/lib/email-addresses.js#L343), so one could find the comments for an address by adding that to the giveResult function, with some code like comments = findAllNodes('comment', addr).

I'll leave it to you to say if that's the kind of API you want, feel free to submit a PR.

Edit: also, without modification to the library as it is right now, you can search the AST yourself and pull the comments out.
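Searching the AST yourself could look roughly like the sketch below. The node shape ({ name, children } with an optional semantic string) is an assumption made for this example, and this findAllNodes is a stand-in written for illustration, not the library's own implementation; the addr object is a hand-built tree rather than real parser output.

```javascript
'use strict';

// Assumed node shape for this sketch: { name: string, children: Node[] },
// plus an optional `semantic` string. The real AST may differ.
function findAllNodes(name, root) {
  const found = [];
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (!node) continue;
    if (node.name === name) found.push(node);
    const children = node.children || [];
    // Push in reverse so nodes are visited in document order.
    for (let i = children.length - 1; i >= 0; i--) stack.push(children[i]);
  }
  return found;
}

// Hand-built stand-in for the AST of: Fred (work) <fred@example.com> (home)
const addr = {
  name: 'mailbox',
  children: [
    { name: 'display-name', semantic: 'Fred', children: [] },
    { name: 'comment', semantic: 'work', children: [] },
    { name: 'angle-addr', children: [
      { name: 'addr-spec', semantic: 'fred@example.com', children: [] },
    ] },
    { name: 'comment', semantic: 'home', children: [] },
  ],
};

const comments = findAllNodes('comment', addr).map((node) => node.semantic);
console.log(comments);
```

An iterative walk with an explicit stack avoids deep recursion on long address lists; a recursive version would be just as valid for typical input.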

baudehlo · 2017-02-17T18:11:18Z

I'll take a poke, thanks. Nice parser btw. Another thing we supported (which isn't part of the grammar) is Outlook separators: ";" rather than ",". It's seen every now and then, sadly. Also, would you consider an option to disable the obs- rules?
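One way to tolerate ";" separators without touching the grammar is a pre-processing split, sketched below. splitAddressList is a hypothetical helper, not something either module provides. It treats "," and ";" as separators only outside quoted strings, comments and angle brackets. Note the caveat in the code: RFC 5322 group syntax also ends with ";", so this only suits input known not to contain groups.

```javascript
'use strict';

// Hypothetical pre-processing step: split an address list on "," and ";"
// while ignoring separators inside quoted strings, comments and <...>.
// Caveat: RFC 5322 group syntax ("name: a@x, b@y;") also uses ";", so this
// sketch is only suitable for input known not to contain groups.
function splitAddressList(input, separators = [',', ';']) {
  const parts = [];
  let current = '';
  let inQuotes = false;
  let commentDepth = 0;
  let inAngle = false;

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === '\\' && (inQuotes || commentDepth > 0) && i + 1 < input.length) {
      current += ch + input[++i]; // keep a quoted-pair intact
      continue;
    }
    if (inQuotes) {
      if (ch === '"') inQuotes = false;
    } else if (commentDepth > 0) {
      if (ch === '(') commentDepth++;
      else if (ch === ')') commentDepth--;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === '(') {
      commentDepth = 1;
    } else if (ch === '<') {
      inAngle = true;
    } else if (ch === '>') {
      inAngle = false;
    } else if (!inAngle && separators.includes(ch)) {
      if (current.trim() !== '') parts.push(current.trim());
      current = '';
      continue; // drop the separator itself
    }
    current += ch;
  }
  if (current.trim() !== '') parts.push(current.trim());
  return parts;
}

console.log(splitAddressList(
  '"Doe; John" <john@example.com>; jane@example.com (a; b), bob@example.com'
));
```

Each resulting piece can then be handed to the real parser one at a time, which keeps the Outlook quirk out of the grammar itself.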

jackbearheart · 2017-02-17T18:19:22Z

Thanks.

Could add an override option for ';' separators. I've hidden a few things behind flags and I think that's acceptable. Would be nice if there was some documentation on what grammar Outlook really expects, although that may or may not exist.

I have an option "strict" that does exactly that (disables the obs- rules): https://github.com/jackbowman/email-addresses/blob/master/lib/email-addresses.js#L932. Search for strict throughout the code.

Matt Sergeant and others added 7 commits February 9, 2017 17:18

Nearley grammar

12b3794

More work on the grammar parser

3f9d778

More flattening work

b3ed5f3

Simplify grammar a lot

6996337

Tests passing - quite noisy so far

e328ee7

All formats now supported, including some obsolete

410c058

Dump old parser

182b6a1

baudehlo requested a review from msimerson February 11, 2017 19:48

Matt Sergeant added 2 commits February 11, 2017 14:52

Fixup the grammar a tiny bit

3099104

Update docs

6e34ba2

msimerson reviewed Feb 11, 2017

Added the extra tests msimerson asked for

1b015f7

msimerson approved these changes Feb 11, 2017

msimerson merged commit 7dc8721 into master Feb 12, 2017

msimerson deleted the nearley_parse branch February 12, 2017 00:46

msimerson restored the nearley_parse branch February 18, 2017 22:48

msimerson deleted the nearley_parse branch February 18, 2017 23:24

msimerson added a commit that referenced this pull request Feb 18, 2017

Revert "Nearley parser for address format (#13)"

93da295

This reverts commit 7dc8721.

msimerson added a commit that referenced this pull request Feb 18, 2017

Revert "Nearley parser for address format (#13)"

dd8c305

This reverts commit 7dc8721.

msimerson added a commit that referenced this pull request Feb 19, 2017

Revert "Nearley parser for address format (#13)"

d1d1c75

This reverts commit 7dc8721.

baudehlo added a commit that referenced this pull request Feb 22, 2017

Merge pull request #19 from haraka/nearley-revert-take2

095c2e5

Revert "Nearley parser for address format (#13)"
