
I am trying to parse an RFC 5322-compliant "From: " field of an e-mail message into two parts, the display-name and the e-mail address, in Python 2.7 (the display-name may be empty). The familiar example is something like

John Smith <[email protected]>

In the above, John Smith is the display-name and [email protected] is the e-mail address. But the following is also a valid "From: " field:

"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>

In this example, the display-name should come out as

"unusual" 

and

"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com

is the email address.

You can use grammars to parse this in Perl (as explained in these questions: Using a regular expression to validate an email address and The recognizing power of “modern” regexes), but I'd like to do this in Python 2.7. I have tried the email.parser module, but it only seems able to separate header fields from one another (the parts introduced by a name and a colon), not to split the value of a single field. So, if you do something like

from email.parser import Parser
headers = Parser().parsestr('From: "John Smith" <[email protected]>')
print headers['from'] 

it will print

"John Smith" <[email protected]> 

while if you replace the last line in the above code with

print headers['display-name']

it will print

None

I would very much appreciate any suggestions and comments.

  • I'd suggest getting it to work? You need to give more information about the problem before anyone can give more specific help. Commented Oct 6, 2013 at 23:11
  • Thanks. You're right. I'll try to clarify. Commented Oct 6, 2013 at 23:13
  • The headers['display-name'] does not make sense. The display-name is not a field of the header, but of the 1st email address in the From: ... header. Commented Oct 6, 2013 at 23:54

2 Answers


headers['display-name'] is not part of the email.parser API; the display-name lives inside the value of the From header, not in a header of its own.

Try email.utils.parseaddr:

In [17]: email.utils.parseaddr("[email protected]")
Out[17]: ('', '[email protected]')

In [18]: email.utils.parseaddr("(John Smith) [email protected]")
Out[18]: ('John Smith', '[email protected]')

In [19]: email.utils.parseaddr("John Smith <[email protected]>")
Out[19]: ('John Smith', '[email protected]')
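
To tie this back to the Parser snippet in the question, here is a minimal sketch that assumes the whole message is available as a string. The helper name split_from_header and the address john.smith@example.com are placeholders I made up for illustration. It is written to run on Python 2.7 and on Python 3.

```python
from __future__ import print_function

from email.parser import Parser
from email.utils import parseaddr


def split_from_header(raw_message):
    """Return (display_name, address) taken from the From: header.

    Both parts are empty strings when the message has no From: header.
    """
    # headersonly=True stops parsing at the end of the header block.
    headers = Parser().parsestr(raw_message, headersonly=True)
    # .get() avoids handing None to parseaddr when From: is missing.
    return parseaddr(headers.get('from', ''))


message = 'From: "John Smith" <john.smith@example.com>\n\nHello'
display_name, address = split_from_header(message)
print(display_name)  # John Smith
print(address)       # john.smith@example.com
```

If the header can contain several addresses, email.utils.getaddresses returns a list of such pairs instead of just the first one.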

It also handles your unusual address:

In [21]: email.utils.parseaddr('''"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>''')
Out[21]: ('unusual', '"very.(),:;<>[]".VERY."very@ "very".unusual"@strange.example.com')
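
One caveat worth checking: in a normal (non-raw) Python string literal, \" collapses to " and \\ collapses to \, so the string passed in In [21] is not character-for-character the header from the question. A small sketch, assuming you want the backslashes to reach parseaddr untouched, is to use a raw string. The variable names plain and raw are mine, and I deliberately do not show the parse result, since it is best confirmed on your own Python version.

```python
from __future__ import print_function

from email.utils import parseaddr

# Same characters between the quotes; only the r prefix differs.
plain = '''"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>'''
raw = r'''"unusual" <"very.(),:;<>[]\".VERY.\"very@\\ \"very\".unusual"@strange.example.com>'''

print(plain == raw)       # False
print(plain.count('\\'))  # 1 backslash survives in the normal literal
print(raw.count('\\'))    # 6 backslashes, as in the original header

# Parse the header text exactly as it would appear in a message.
name, address = parseaddr(raw)
print(name)
print(address)
```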

1 Comment

Thanks! This is perfect! It's exactly what I was looking for.

I wrote such a parser in libtld, in C++. If you really want to be complete, the lex and yacc definitions are below (although I do not use those tools myself). My C++ code may help you write your own version in Python.

(lex part)
[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+                                          atom_text_repeat (ALPHA+DIGIT+some other characters)
([\x09\x0A\x0D\x20-\x27\x2A-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+           comment_text_repeat
([\x21-\x5A\x5E-\x7E])+                                                  domain_text_repeat (RFC 5322 dtext: %d33-90 / %d94-126)
([\x21\x23-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+                            quoted_text_repeat
\x22                                                                     DQUOTE
[\x20\x09]*\x0D\x0A[\x20\x09]+                                           FWS
.                                                                        any other character

(lex definitions merged in more complex lex definitions)
[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]                                         NO_WS_CTL
[()<>[\]:;@\\,.]                                                         specials
[\x01-\x09\x0B\x0C\x0E-\x7F]                                             text
\\[\x09\x20-\x7E]                                                        quoted_pair ('\\' text)
[A-Za-z]                                                                 ALPHA
[0-9]                                                                    DIGIT
[\x20\x09]                                                               WSP
\x20                                                                     SP
\x09                                                                     HTAB
\x0D\x0A                                                                 CRLF
\x0D                                                                     CR
\x0A                                                                     LF

(yacc part)
address_list: address
            | address ',' address_list
address: mailbox
       | group
mailbox_list: mailbox
            | mailbox ',' mailbox_list
mailbox: name_addr
       | addr_spec
group: display_name ':' mailbox_list ';' CFWS
     | display_name ':' CFWS ';' CFWS
name_addr: angle_addr
         | display_name angle_addr
display_name: phrase
angle_addr: CFWS '<' addr_spec '>' CFWS
addr_spec: local_part '@' domain
local_part: dot_atom
          | quoted_string
domain: dot_atom
      | domain_literal
domain_literal: CFWS '[' FWS domain_text_repeat FWS ']' CFWS
phrase: word
      | word phrase
word: atom
    | quoted_string
atom: CFWS atom_text_repeat CFWS
dot_atom: CFWS dot_atom_text CFWS
dot_atom_text: atom_text_repeat
             | atom_text_repeat '.' dot_atom_text
quoted_string: CFWS DQUOTE quoted_text_repeat DQUOTE CFWS
CFWS: <empty>
    | FWS comment
    | CFWS comment FWS
comment: '(' comment_content ')'
comment_content: comment_text_repeat
               | comment
               | comment_content comment_content
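
If you do go the hand-written route in Python, the lex patterns above carry over to the re module almost verbatim. This is a rough sketch, assuming you only want to check a bare local_part and are willing to ignore CFWS/FWS entirely (so an unescaped space inside a quoted string is rejected here). The names ATOM_TEXT, LOCAL_PART_RE and is_local_part are mine, not from libtld.

```python
import re

# atom_text_repeat and quoted_text_repeat, copied from the lex part.
ATOM_TEXT = r"[-A-Za-z0-9!#$%&'*+/=?^_`{|}~]+"
QUOTED_TEXT = r'(?:[\x21\x23-\x5B\x5D-\x7E]|\\[\x09\x20-\x7E])+'

# dot_atom_text: atom_text_repeat ('.' atom_text_repeat)*
DOT_ATOM_TEXT = ATOM_TEXT + r'(?:\.' + ATOM_TEXT + r')*'

# quoted_string without the surrounding CFWS: DQUOTE text DQUOTE
QUOTED_STRING = r'\x22' + QUOTED_TEXT + r'\x22'

# local_part: dot_atom | quoted_string, anchored at the end of input.
LOCAL_PART_RE = re.compile(
    r'(?:' + DOT_ATOM_TEXT + r'|' + QUOTED_STRING + r')\Z')


def is_local_part(text):
    """True if text is a local_part under the simplified rules above."""
    return LOCAL_PART_RE.match(text) is not None


print(is_local_part('john.smith'))   # True
print(is_local_part('john..smith'))  # False: empty atom between dots
print(is_local_part('"a@b"'))        # True: '@' is allowed when quoted
```

A full parser still needs the recursive parts of the grammar (nested comments in particular), which a single regular expression in Python's re module cannot express.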

2 Comments

Thanks! I was trying to avoid writing a parser.
Ah! It wasn't clear in the question that you did not want to write the actual parser. 8-)
