Nearley parser for address format #13
Codecov Report
@@            Coverage Diff             @@
##           master      #13      +/-   ##
==========================================
+ Coverage   89.47%   94.02%   +4.55%
==========================================
  Files           1        3       +2
  Lines         171      251      +80
  Branches       44       49       +5
==========================================
+ Hits          153      236      +83
+ Misses         18       15       -3
|
This morning I looked at this for a bit. I read the docs and a couple examples for nearley and it looks like a great tool. The problem with bodg [email protected]
is annoying but I couldn't pin down where that issue is.
So I looked at it from the other end. Did Mail::Address actually return anything useful from that format?
$ perl -e 'use Data::Dumper; use Mail::Address; print Data::Dumper::Dumper(Mail::Address->parse("bodg [email protected]"));'
$VAR1 = bless( [
'',
'bodg',
''
], 'Mail::Address' );
$VAR2 = bless( [
'',
'fred.ti.com',
''
], 'Mail::Address' );
Nope. I spot checked a few others, just for fun against Mail::Address and, uh, I think there's a good reason it's no longer being developed.
So then I thought to myself, node-address-rfc2822 is a pretty lousy name for a module that's attempting to implement RFC 5322 & 6854. Perhaps it should have a more general name, like email-address or some such. And, there should be a more up-to-date set of valid email addresses one can test their parser against...
So then I found email-addresses <https://github.com/jackbowman/email-addresses> on NPM. And part of its test suite is the corpus from is_email <https://github.com/dominicsayers/isemail> which has a spiffy web site <https://isemail.info> where you can test email addresses against their validator. Nifty.
There's another set of addresses here <https://github.com/snoj/email-validation/blob/master/tests/test-addresses.js> that would be good to drop into the test suite. At the very least, it'll increase our confidence in the parser.
Yeah I checked Email::Address too, which is more up to date. It fails on
that address too. So I just stripped it from the test suite (along with a
few others).
I don't know about renaming it - people know it as rfcs 821 and 822 pretty
well, so it's not a terrible name. And given email::addresses is used, I
can't see what to do.
Most of those tests (particularly the is_email ones) are rfc 821 addresses,
so won't help extend the test suite much.
Mostly I'd like feedback on the concept of switching to a grammar, rather
than hacky regexps. The grammar is probably slower, but it seems to parse
everything reasonably quickly now.
|
I'm thinking probably not too far down the road someone will want Haraka to join Postfix, gmail, and the other modern SMTP agents that support SMTPUTF8. That will involve:
- RFC 6530 <https://tools.ietf.org/html/rfc6530> Overview and Framework for Internationalized Email
- RFC 6531 <https://tools.ietf.org/html/rfc6531> SMTP Extension for Internationalized Email
- RFC 6532 <https://tools.ietf.org/html/rfc6532> Internationalized Email Headers
When that day arrives, are we better off with nearley or a more purpose-built implementation like the one in email-addresses.js? (That's an honest question, I don't know the answer.) From out here in the cheap seats, a really awesome feature of the purpose-built parser is that we can get back useful error messages about specifically why an address is invalid. Performance would be interesting to compare, but so long as neither is incredibly slow, any implementation should be good enough. |
I've added those other tests in btw, to the best of my ability.
I'm not sure if an out of the box parser or nearley is better. In my
experience a hand-built one is faster, but if you look at the internals of
email-addresses, it's almost identical to what a parser would do (and is
built the same way). I suspect using nearley makes it easier to support
newer changes in the future, and the error reporting isn't terrible.
|
+1
It might be nice to update the README and specify precisely which RFCs this parser currently supports, which ones we mostly support (I'm thinking legacy here), and which ones we anticipate supporting in the future (internationalization, maybe?). |
Thinking about this a bit more - maybe nearley is the wrong answer for us. It had exponentially bad behaviour on longer strings and a recursive descent parser is probably better. We may be better off using email-addresses as a basis anyway and just use our module to keep the interface the same.
… On Feb 11, 2017, at 7:41 PM, Matt Simerson ***@***.***> wrote:
Merged #13.
|
That's the way this cat leans. I looked through the code in email-addresses and it's clean, tidy, well designed, and well tested. It looks straightforward to modify and maintain. It has exponentially more users, making it likely to be well maintained into the distant future.
I wouldn't even do that. There's only one other dependent on this module (besides Haraka). I'd update Haraka to use email-addresses and slap a great big "no longer maintained, we've moved to email-addresses and suggest that you do too" sign on the README. |
Well for one, address-rfc2822 does much nicer things with the format()
method. I don't think deprecating that completely is the right thing to do.
Perhaps merging that code into email-addresses would be better?
|
Yes, that's exactly what I was thinking. It could even be a separate API file over there that wraps |
Of course, it's also possible that the nice formatting stuff we do (I had forgotten that, and I'm kinda fond of nameCase) would be considered out-of-scope over there. If merging in those features and their syntactic sugar is not possible, maybe as you said the better approach is just use email-addresses as our parser and continue maintaining this module. 🤷♂️ |
Let's ask the author.
@jackbowman what are your thoughts?
|
Also, email-addresses already has RFC-6532 support. |
Yep. I didn't see a way to switch to Sender or Reply-To mode. And also not sure if it would parse Unicode content (the strict parser I first made did not).
|
Hi folks. You're wondering if "pretty formatting" is out of scope from email-addresses, correct? I lean towards, yes, it's out of scope, as One thing email-addresses can currently do is pull the name and address out of the string, so maybe if we had jackbearheart/email-addresses#21 then "pretty printing" would be just pulling out the name and address from the parse and passing it to that function? That could be trivial and I would accept that kind of PR. (Thanks for interest in the parser side and I'm very happy to accept PRs for the parser.) |
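For illustration, the "pull out the name and address and pass them to a function" idea could look roughly like the sketch below. `formatAddress` is a hypothetical helper, not part of email-addresses or address-rfc2822. It only covers the RFC 5322 rule that a display name containing "specials" has to be written as a quoted-string; non-ASCII names and the nameCase behaviour mentioned earlier are out of scope here.

```javascript
// Hypothetical helper for illustration; not an API of either library.

// Characters RFC 5322 calls "specials". A display name containing any of
// them cannot be written as bare atoms and must be a quoted-string.
const SPECIALS = /[()<>\[\]:;@\\,."]/;

function formatAddress(name, address) {
  // No display name: the bare address is already a valid mailbox.
  if (!name) return address;
  if (!SPECIALS.test(name)) return `${name} <${address}>`;
  // Inside a quoted-string, backslash and double quote are backslash-escaped.
  const escaped = name.replace(/(["\\])/g, '\\$1');
  return `"${escaped}" <${address}>`;
}

console.log(formatAddress('Fred Bloggs', 'fred@example.com'));
console.log(formatAddress('Bloggs, Fred', 'fred@example.com'));
```

The inputs would come from whatever the parser reports as the name and address of a mailbox, so the formatter never needs to look at the raw header again.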
One more note I forgot, since I saw you discussing performance. To be perfectly honest I have never evaluated the performance of email-addresses, so I suggest you test it for your use case. I've always wondered if building up the whole AST when parsing is horribly slow, compared to turning that off (see the wrap and add functions and their call sites). |
We're looking to change the internals to use email-addresses, but one thing
holding us back is that you dump comments on the floor. For formatting
(which we're happy to keep in address-rfc2822) we need them. Is there any
chance you can make them available?
Regarding performance, your parser is probably a lot quicker than nearley
anyway, as nearley returns every possible parse result, which a recursive
descent parser wouldn't, and email address grammar is ambiguous so it is
important to use a left-first approach to the grammar.
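To make the point about ambiguity concrete, here is a toy sketch. It uses a deliberately ambiguous grammar, S -> S S | "a", which has nothing to do with the real address grammar, and the function names are made up for the example. It counts how many parse trees an "all parses" strategy would have to return and contrasts that with a first-match recursive descent parser, which yields at most one.

```javascript
// Toy illustration only; the grammar is an assumption chosen to show ambiguity.

// Count every distinct parse tree for a string of n "a" characters under
// S -> S S | "a". A parser that returns all parses must produce this many.
function countAllParses(n, memo = new Map()) {
  if (n === 1) return 1; // S -> "a"
  if (memo.has(n)) return memo.get(n);
  let total = 0;
  // S -> S S: every split point between the two halves is a separate parse.
  for (let left = 1; left < n; left++) {
    total += countAllParses(left, memo) * countAllParses(n - left, memo);
  }
  memo.set(n, total);
  return total;
}

// Recursive descent over the equivalent unambiguous form S -> "a" S | "a".
// It commits to the first alternative that matches, so it returns a single
// nested tree for the leading run of "a" characters (or null if there is none).
function parseFirstMatch(input, pos = 0) {
  if (input[pos] !== 'a') return null;
  const rest = parseFirstMatch(input, pos + 1);
  return rest === null ? ['a'] : ['a', rest];
}

for (const n of [2, 5, 10, 20]) {
  console.log(n, countAllParses(n));
}
console.log(JSON.stringify(parseFirstMatch('aaa')));
```

The counts grow as the Catalan numbers, which is one way a parser that enumerates every derivation can degrade badly on longer inputs.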
|
Re: comments, at the end of the parse, the whole ast is available, so it should be pretty easy to pull out comments. The comments are called "comment" https://github.com/jackbowman/email-addresses/blob/master/lib/email-addresses.js#L343 , so one could find the comments for an address by adding that to the giveResult function, with some code like comments = findAllNodes('comment', addr). I'll leave it to you to say if that's the kind of API you want, feel free to submit a PR. Edit: also, without modification to the library as it is right now, you can search the ast yourself and pull the comments out. |
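Searching the AST yourself might look something like the sketch below. The node shape used here (`name`, `tokens`, `children`) is an assumption made for the example, and `collectNodes` is a hypothetical stand-in rather than the library's own findAllNodes, so check the actual AST before relying on it.

```javascript
// Assumed node shape (illustrative, not a documented API):
//   { name: string, tokens: string, children: Node[] }

// Collect every node with the given name, in source (pre-order) order.
function collectNodes(name, root) {
  const found = [];
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node.name === name) found.push(node);
    // Push children in reverse so the leftmost child is visited first.
    for (let i = node.children.length - 1; i >= 0; i--) {
      stack.push(node.children[i]);
    }
  }
  return found;
}

// Hand-built sample tree standing in for a real parse result.
const sampleAst = {
  name: 'mailbox', tokens: '', children: [
    { name: 'display-name', tokens: 'Fred', children: [] },
    { name: 'comment', tokens: '(work)', children: [] },
    { name: 'addr-spec', tokens: 'fred@example.com', children: [
      { name: 'comment', tokens: '(nested)', children: [] },
    ] },
  ],
};

console.log(collectNodes('comment', sampleAst).map((n) => n.tokens));
```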
I'll take a poke, thanks. Nice parser btw.
Another thing we supported (which isn't part of the grammar) is outlook
separators: ";" rather than ",". It's seen every now and then, sadly.
Also would you consider an option allowing to disable obs- rules?
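One way to tolerate Outlook-style ";" separators without touching the grammar is to split the list before parsing. The sketch below is a hypothetical pre-processing helper (`splitAddressList` is not taken from either library). It treats "," and ";" as separators only when they appear outside quoted strings, comments, and angle brackets. Note the trade-off: RFC 5322 group syntax ends with ";", so this approach would break group parsing and would need to sit behind an option.

```javascript
// Hypothetical helper for illustration; not an API of either library.
function splitAddressList(input, separators = ',;') {
  const parts = [];
  let current = '';
  let inQuotes = false;  // inside "..."
  let commentDepth = 0;  // comments (...) may nest
  let inAngle = false;   // inside <...>

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    // Keep backslash-escaped pairs intact inside quotes and comments.
    if (ch === '\\' && (inQuotes || commentDepth > 0) && i + 1 < input.length) {
      current += ch + input[++i];
      continue;
    }
    if (inQuotes) {
      if (ch === '"') inQuotes = false;
    } else if (commentDepth > 0) {
      if (ch === '(') commentDepth++;
      else if (ch === ')') commentDepth--;
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === '(') {
      commentDepth++;
    } else if (ch === '<') {
      inAngle = true;
    } else if (ch === '>') {
      inAngle = false;
    } else if (!inAngle && separators.includes(ch)) {
      // A separator at top level ends the current address.
      if (current.trim()) parts.push(current.trim());
      current = '';
      continue;
    }
    current += ch;
  }
  if (current.trim()) parts.push(current.trim());
  return parts;
}

console.log(splitAddressList('"Bloggs; Fred" <fred@example.com>; jane@example.com, bob@example.com (Bob; home)'));
```

Each resulting piece could then be handed to the strict parser on its own, which keeps the leniency in one small, testable place.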
|
Thanks. Could add an override option for ';' separators. I've hidden a few things behind flags and I think that's acceptable. Would be nice if there was some documentation on what grammar outlook really expects, although that may or may not exist. I have an option "strict" that does exactly that https://github.com/jackbowman/email-addresses/blob/master/lib/email-addresses.js#L932 . Search |
This reverts commit 7dc8721.
Revert "Nearley parser for address format (#13)"
This allows us to support all mail formats, including groups.
Includes all changes up to RFC 6854.