Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nearley parser for address format #13

Merged
merged 10 commits into from
Feb 12, 2017
Merged

Nearley parser for address format #13

merged 10 commits into from
Feb 12, 2017

Conversation

baudehlo
Copy link
Contributor

This allows us to support all mail formats, including groups.

Includes all changes up to RFC 6854.

@codecov-io
Copy link

codecov-io commented Feb 11, 2017

Codecov Report

Merging #13 into master will increase coverage by 4.55%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
+ Coverage   89.47%   94.02%   +4.55%     
==========================================
  Files           1        3       +2     
  Lines         171      251      +80     
  Branches       44       49       +5     
==========================================
+ Hits          153      236      +83     
+ Misses         18       15       -3
Impacted Files Coverage Δ
lib/flatten.js 91.3% <91.3%> (ø)
index.js 90.9% <96.42%> (+1.43%)
lib/address_format.js 97.45% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b00588e...1b015f7. Read the comment docs.

Copy link
Member

@msimerson msimerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This morning I looked at this for a bit. I read the docs and a couple examples for nearley and it looks like a great tool. The problem with bodg [email protected] is annoying but I couldn't pin down where that issue is.

So I looked at it from the other end. Did Mail::Address actually return anything useful from that format?

$ perl -e 'use Data::Dumper; use Mail::Address; print Data::Dumper::Dumper(Mail::Address->parse("bodg [email protected]"));'
$VAR1 = bless( [
                 '',
                 'bodg',
                 ''
               ], 'Mail::Address' );
$VAR2 = bless( [
                 '',
                 'fred.ti.com',
                 ''
               ], 'Mail::Address' );

Nope. I spot checked a few others, just for fun against Mail::Address and, uh, I think there's a good reason it's no longer being developed.

So then I thought to myself, node-address-rfc2822 is a pretty lousy name for a module that's attempting to implement RFC 5322 & 6854. Perhaps it should have a more general name, like email-address or some such. And, there should be a more up-to-date set of valid email addresses one can test their parser against...

So then I found email-addresses on NPM. And part of its test suite is the corpus from is_email which has a spiffy web site where you can test email addresses against their validator. Nifty.

There's another set of addresses here that would be good to drop into the test suite. At the very least, it'll increase our confidence in the parser.

@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 11, 2017 via email

@msimerson
Copy link
Member

msimerson commented Feb 11, 2017

Mostly I'd like feedback on the concept of switching to a grammar, rather
than hacky regexps. The grammar is probably slower, but it seems to parse
everything reasonably quickly now.

I'm thinking probably not too far down the road someone will want Haraka to join Postfix, gmail, and the other modern SMTP agents that support SMTPUTF8. That will involve:

  • RFC 6530 Overview and Framework for Internationalized Email
  • RFC 6531 SMTP Extension for Internationalized Email
  • RFC 6532 Internationalized Email Headers

When that day arrives, are we better off with nearley or a more purpose-built implementation like the one in email-addresses.js? (That's an honest question, I don't know the answer.) From out here in the cheap seats, a really awesome feature of the purpose-build parser is that we can get back useful error messages about specifically why an address is invalid.

Performance would be interesting to compare, but so long as neither is incredibly slow, any implementation should be good enough.

@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 11, 2017 via email

Copy link
Member

@msimerson msimerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@msimerson
Copy link
Member

It might be nice to update the README, and specify precisely which RFCs we expect this parser currently supports, and perhaps which ones we mostly (I'm thinking legacy here) support, and which ones we anticipate supporting in the future (internationalization, maybe?).

@msimerson msimerson merged commit 7dc8721 into master Feb 12, 2017
@msimerson msimerson deleted the nearley_parse branch February 12, 2017 00:46
@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 12, 2017 via email

@msimerson
Copy link
Member

We may be better off using email-addresses as a basis anyway

That's the way this cat leans. I looked through the code in email-addresses and its clean, tidy, well designed, and well tested. It looks straight forward to modify and maintain. It has exponentially more users making it likely to be well maintained into the distant future.

use our module to keep the interface the same.

I wouldn't even do that. There's only one other dependent on this module (besides Haraka). I'd update Haraka to use email-addresses and slap a great big "no longer maintained, we've moved to email-addresses and suggest that you do too" sign on the README.

@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 12, 2017 via email

@msimerson
Copy link
Member

msimerson commented Feb 12, 2017

Perhaps merging that code into email-addresses would be better?

Yes, that's exactly what I was thinking. It could even be a separate API file over there that wraps lib/email-addresses.js and externally returns the same responses as this module. Then this module is no longer necessary.

@msimerson
Copy link
Member

msimerson commented Feb 12, 2017

Of course, it's also possible that the nice formatting stuff we do (I had forgotten that, and I'm kinda fond of nameCase) would be considered out-of-scope over there. If merging in those features and their syntactic sugar is not possible, maybe as you said the better approach is just use email-addresses as our parser and continue maintaining this module. 🤷‍♂️

@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 12, 2017 via email

@msimerson
Copy link
Member

Also, email-addresses already has RFC-6532 support.

@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 13, 2017 via email

@jackbearheart
Copy link

Hi folks. You're wondering if "pretty formatting" is out of scope from email-addresses, correct? I lean towards, yes, it's out of scope, as email-addresses is currently just a parser. I don't know exactly what you intend though so if you have an example I would be curious to hear it.

One thing email-addresses can currently do is pull the name and address out of the string, so maybe if we had jackbearheart/email-addresses#21 then "pretty printing" would be just pulling out the name and address from the parse and passing it to that function? That could be trivial and I would accept that kind of PR.

(Thanks for interest in the parser side and I'm very happy to accept PRs for the parser.)

@jackbearheart
Copy link

One more note I forgot, since I saw you discussing performance. To be perfectly honest I have never evaluated the performance of email-addresses, so I suggest you test it for your use case. I've always wondered if building up the whole AST when parsing is horribly slow, compared to turning that off (see the wrap and add functions and their call sites).

@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 17, 2017 via email

@jackbearheart
Copy link

jackbearheart commented Feb 17, 2017

Re: comments, at the end of the parse, the whole ast is available, so it should be pretty easy to pull out comments. The comments are called "comment" https://github.com/jackbowman/email-addresses/blob/master/lib/email-addresses.js#L343 , so one could find the comments for an address by adding that to the giveResult function, with some code like comments = findAllNodes('comment', addr).

I'll leave it to you to say if that's the kind of API you want, feel free to submit a PR.

Edit: also, without modification to the library as it is right now, you can search the ast yourself and pull the comments out.

@baudehlo
Copy link
Contributor Author

baudehlo commented Feb 17, 2017 via email

@jackbearheart
Copy link

Thanks.

Could add an override option for ';' separators. I've hidden a few things behind flags and I think that's acceptable. Would be nice if there was some documentation on what grammar outlook really expects, although that may or may not exist.

I have an option "strict" that does exactly that https://github.com/jackbowman/email-addresses/blob/master/lib/email-addresses.js#L932 . Search strict throughout the code.

@msimerson msimerson restored the nearley_parse branch February 18, 2017 22:48
@msimerson msimerson deleted the nearley_parse branch February 18, 2017 23:24
msimerson added a commit that referenced this pull request Feb 18, 2017
msimerson added a commit that referenced this pull request Feb 18, 2017
msimerson added a commit that referenced this pull request Feb 19, 2017
baudehlo added a commit that referenced this pull request Feb 22, 2017
Revert "Nearley parser for address format (#13)"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants