Commit e4f5abda authored by Jonathan Kamens, committed by Richard P. Curnow

Fix deficiencies in the parsing of mbox From lines

1. The local part of email addresses can actually have a lot more characters in
it than you were recognizing.
2. In addition, quotation marks can be used to include even more characters in
the local part.
3. Valid characters in the domain part of the address are actually much more
restricted than valid characters in the local part.
4. Some software wraps the email address in angle brackets.
5. You don't see it all that often anymore, but sometimes you will see domain
routing notation in From lines, e.g., "@domain1:foo@domain2".
6. Finally, the domain part of an email address could be an IP address wrapped
in square brackets rather than a DNS domain name.
parent ea052b5d
@@ -33,35 +33,76 @@
# LOWER : [a-z]
# UPPER : [A-Z]
# PLUSMINUS : [+-]
# OTHER_EMAIL : other stuff valid in an address, at least [_.]
# OTHER_EMAIL : other stuff valid in the LHS of an address
# DOMAIN : stuff valid in the RHS of an address
Abbrev LF = [\n]
Abbrev CR = [\r]
Abbrev DIGIT = [0-9]
Abbrev PERIOD = [.]
Abbrev AT = [@]
Abbrev LOWER = [a-z]
Abbrev UPPER = [A-Z]
Abbrev COLON = [:]
Abbrev WHITE = [ \t]
Abbrev PLUSMINUS = [+\-]
Abbrev OTHER_EMAIL = [_.=]
# Explained clearly at
# http://en.wikipedia.org/wiki/E-mail_address#RFC_specification
Abbrev OTHER_EMAIL = [.!#$%&'*/=?^_`{|}~]
Abbrev LT = [<]
Abbrev GT = [>]
Abbrev EMAIL = LOWER | UPPER | DIGIT | PLUSMINUS | OTHER_EMAIL
Abbrev OTHER_DOMAIN = [\-_.]
Abbrev DOMAIN = LOWER | UPPER | DIGIT | OTHER_DOMAIN
Abbrev DQUOTE = ["]
Abbrev OTHER_QUOTED = [@:<>]
Abbrev LEFTSQUARE = [[]
Abbrev RIGHTSQUARE = [\]]
BLOCK email {
STATE in
EMAIL -> in, before_at
DQUOTE -> quoted_before_at
AT -> domain_route
STATE domain_route
DOMAIN -> domain_route
COLON -> in
STATE quoted_before_at
EMAIL | WHITE | OTHER_QUOTED -> quoted_before_at
DQUOTE -> before_at
STATE before_at
EMAIL -> before_at
DQUOTE -> quoted_before_at
# Local part only : >=1 characters will suffice, which we've already
# matched.
-> out
AT -> after_at
AT -> start_of_domain
STATE start_of_domain
LEFTSQUARE -> dotted_quad
DOMAIN -> after_at
STATE dotted_quad
DIGIT | PERIOD -> dotted_quad
RIGHTSQUARE -> out
STATE after_at
EMAIL -> after_at, out
DOMAIN -> after_at, out
}
BLOCK angled_email {
STATE in
LT -> in_angles
STATE in_angles
<email:in->out> -> before_gt
STATE before_gt
GT -> out
}
BLOCK zone {
@@ -145,6 +186,7 @@ BLOCK main {
# Real return address.
WHITE -> in
<email:in->out> -> before_date
<angled_email:in->out> -> before_date
# Cope with Mozilla mbox folder format which just uses a '-' as
# the return address field.
......
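
To make the new `email` and `angled_email` blocks easier to follow, here is a rough Python sketch that simulates them as a nondeterministic state machine. It is not part of the commit. It assumes that `-> a, b` means "move to both states" and that a bare `-> out` is a transition taken without consuming input, and it uses only the new-side rules where the diff shows an old and a new version of a line. The names `matches_email` and `matches_angled` are hypothetical helpers invented for this illustration.

```python
import string

# Character classes transcribed from the Abbrev lines in the diff.
ALNUM = set(string.ascii_letters + string.digits)
EMAIL = ALNUM | set("+-") | set(".!#$%&'*/=?^_`{|}~")
DOMAIN = ALNUM | set("-_.")
QUOTED = EMAIL | set(" \t") | set("@:<>")

# state -> list of (character set, set of next states)
TRANSITIONS = {
    "in": [(EMAIL, {"in", "before_at"}),
           (set('"'), {"quoted_before_at"}),
           (set("@"), {"domain_route"})],
    "domain_route": [(DOMAIN, {"domain_route"}), (set(":"), {"in"})],
    "quoted_before_at": [(QUOTED, {"quoted_before_at"}),
                         (set('"'), {"before_at"})],
    "before_at": [(EMAIL, {"before_at"}),
                  (set('"'), {"quoted_before_at"}),
                  (set("@"), {"start_of_domain"})],
    "start_of_domain": [(set("["), {"dotted_quad"}),
                        (DOMAIN, {"after_at"})],
    "dotted_quad": [(set(string.digits + "."), {"dotted_quad"}),
                    (set("]"), {"out"})],
    "after_at": [(DOMAIN, {"after_at", "out"})],
}

# Transitions taken without consuming a character (assumed reading of "-> out").
EPSILON = {"before_at": {"out"}}


def closure(states):
    """Add states reachable by one input-free transition (one level suffices here)."""
    result = set(states)
    for state in states:
        result |= EPSILON.get(state, set())
    return result


def matches_email(text):
    """Return True if the whole string drives the 'email' block from in to out."""
    states = closure({"in"})
    for ch in text:
        following = set()
        for state in states:
            for chars, targets in TRANSITIONS.get(state, []):
                if ch in chars:
                    following |= targets
        states = closure(following)
        if not states:
            return False
    return "out" in states


def matches_angled(text):
    """Return True for '<' + email + '>', mirroring the 'angled_email' block."""
    return len(text) >= 2 and text[0] == "<" and text[-1] == ">" \
        and matches_email(text[1:-1])
```

Under those assumptions, `matches_email` accepts `foo@example.com`, `"john doe"@example.com`, `@domain1:foo@domain2`, `user@[192.168.0.1]`, and a bare local part such as `foo`, while it rejects `foo@` and the empty string. `matches_angled` accepts `<foo@example.com>`. Like the grammar itself, the sketch is deliberately permissive: for example, it does not check that a dotted quad has four octets.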