Python regular expression for HTTP Request header

Question

I have a question about Python regular expressions, which I don't know well. I am working with HTTP request messages and parsing them with a regex. HTTP GET messages have this format:

GET / HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: 10.2.0.12
Connection: Keep-Alive

I want to parse the URI, method, user-agent, and the host areas of the message. My regex for this job is:

re.compile(r'^({0})\s+(\S+)\s+[^\n]*$\n.*^User-Agent:\s*(\S+)[^\n]*$\n.*^Host:\s*(\S+)[^\n]*$\n'.format('|'.join(methods)), re.MULTILINE|re.DOTALL)

But when the message arrives like this:

GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive

I cannot match them, because the positions of Host and User-Agent have changed. So I need a generic regex that matches all of them, even if header fields such as Host and User-Agent appear in a different order in the message.

@tuxuday +1 Searching the freakin' web is the most powerful skill in a developer's toolbox.
– Adam Matan, Commented May 31, 2012 at 12:16
I like the method that @tuxuday suggests, like this: m = re.findall(r"(?P<name>.*?): (?P<value>.*?)\r\n", req). But with this method I cannot parse "GET" and the HTTP version. Is it better to add them at the beginning?
– barp, Commented May 31, 2012 at 12:31
Do the job the right way. Use cgi.parse_header() to get values from the string, or use a tool like WebOb.
– kimjxie, Commented May 31, 2012 at 15:49
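One way to combine the findall() idea from the comments with the request line is to match the first line separately and then collect the header pairs from the rest of the message. The following is only a sketch: the names REQUEST_LINE, HEADER_LINE and parse_request are made up for illustration, and the patterns assume a well-formed request without folded (multi-line) header values.

```python
import re

# Hypothetical names; these patterns are an illustration, not code from the thread.
REQUEST_LINE = re.compile(r"(?P<method>\S+)\s+(?P<uri>\S+)\s+HTTP/(?P<version>\S+)")
HEADER_LINE = re.compile(r"^(?P<name>[^:\r\n]+):[ \t]*(?P<value>.*?)\r?$", re.MULTILINE)


def parse_request(req):
    """Return (method, uri, version, headers) for a raw HTTP request string."""
    first = REQUEST_LINE.match(req)
    if first is None:
        raise ValueError("not an HTTP request line")
    # Start after the request line so a ':' inside an absolute URI is not
    # mistaken for a header. findall yields (name, value) tuples in order of
    # appearance, so the position of Host or User-Agent does not matter.
    headers = dict(HEADER_LINE.findall(req, first.end()))
    return first.group("method"), first.group("uri"), first.group("version"), headers


req = (
    "GET / HTTP/1.0\r\n"
    "Host: 10.2.0.12\r\n"
    "User-Agent: Wget/1.12 (linux-gnu)\r\n"
    "Accept: */*\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)
method, uri, version, headers = parse_request(req)
print(method, uri, version)
print(headers["Host"], headers["User-Agent"])
```

The optional \r before $ lets the same pattern accept both \r\n and bare \n line endings.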

Adam Matan · Accepted Answer · 2012-05-31 13:37:01Z

4

Readability Counts (The Zen of Python)

Use findall() for each subexpression you want to find. This way your regex will be short, readable, and independent of the location of the subexpression.

Define a simple, readable regex:

>>> import re
>>> user=re.compile("User-Agent: (.*?)\n")

Test it with two different http headers:

>>> s1='''GET / HTTP/1.0
    Host: 10.2.0.12
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Connection: Keep-Alive'''
>>> s2='''GET / HTTP/1.0
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Host: 10.2.0.12
    Connection: Keep-Alive'''
>>> user.findall(s1)
['Wget/1.12 (linux-gnu)']
>>> user.findall(s2)
['Wget/1.12 (linux-gnu)']
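The same idea extends to any header, not just User-Agent. Here is a minimal sketch with one small search per field; header_value is a hypothetical helper name, and the case-insensitive match is my assumption based on HTTP header names being case-insensitive.

```python
import re


def header_value(name, message):
    """Return the first value of header `name` in `message`, or None if absent."""
    # One small pattern per header: anchored to a line start, tolerant of
    # extra spaces and of either \r\n or \n line endings.
    pattern = re.compile(
        r"^" + re.escape(name) + r":[ \t]*(.*?)[ \t]*\r?$",
        re.MULTILINE | re.IGNORECASE,
    )
    match = pattern.search(message)
    return match.group(1) if match else None


s1 = "GET / HTTP/1.0\nHost: 10.2.0.12\nUser-Agent: Wget/1.12 (linux-gnu)\nAccept: */*\nConnection: Keep-Alive"
s2 = "GET / HTTP/1.0\nUser-Agent: Wget/1.12 (linux-gnu)\nAccept: */*\nHost: 10.2.0.12\nConnection: Keep-Alive"

for message in (s1, s2):
    print(header_value("Host", message), header_value("User-Agent", message))
```

Both messages print the same Host and User-Agent values, regardless of the order in which the headers appear.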

edited May 31, 2012 at 13:37

answered May 31, 2012 at 11:54

Adam Matan


1 Comment

Sergey Kandaurov Over a year ago

The HTTP spec requires \r\n line endings, so I'd suggest using re.compile("User-Agent: (.*?)\r\n") instead
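If the input may use either line ending, an optional \r covers both cases. This is a small sketch of the difference; the variable names are made up for illustration.

```python
import re

lf_only = re.compile("User-Agent: (.*?)\n")
either = re.compile("User-Agent: (.*?)\r?\n")

crlf_request = "GET / HTTP/1.0\r\nUser-Agent: Wget/1.12 (linux-gnu)\r\nHost: 10.2.0.12\r\n\r\n"
lf_request = crlf_request.replace("\r\n", "\n")

# '.' matches '\r', so the \n-only pattern leaves a stray '\r' in the value.
print(lf_only.findall(crlf_request))   # ['Wget/1.12 (linux-gnu)\r']
print(either.findall(crlf_request))    # ['Wget/1.12 (linux-gnu)']
print(either.findall(lf_request))      # ['Wget/1.12 (linux-gnu)']
```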

Maria Zverina · Accepted Answer · 2012-05-31 12:50:41Z

3

Parse all of the headers into a dictionary, like so:

headers = """GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive"""


# Split the request into lines; the first one is the request line.
headers = headers.splitlines()
firstLine = headers.pop(0)
(verb, url, version) = firstLine.split()
d = {'verb': verb, 'url': url, 'version': version}
for h in headers:
    # Split on the first ': ' only, so values that contain ': ' stay intact.
    h = h.split(': ', 1)
    if len(h) < 2:
        continue
    field = h[0]
    value = h[1]
    d[field] = value

print(d)

print(d['User-Agent'])
print(d['url'])
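A different technique from the loop above, shown only as a sketch: HTTP header lines use the same "Name: value" layout as email headers, so the standard library's email.parser can parse the header block once the request line has been split off. parse_request is a hypothetical name, and the code assumes a well-formed request.

```python
from email.parser import Parser


def parse_request(raw):
    """Split a raw request into method, URL, version and a header mapping."""
    # Everything before the first newline is the request line.
    request_line, _, header_block = raw.partition("\n")
    verb, url, version = request_line.split()
    # headersonly=True stops the parser from treating the rest as a message body.
    headers = Parser().parsestr(header_block, headersonly=True)
    return verb, url, version, headers


raw = (
    "GET / HTTP/1.0\r\n"
    "Host: 10.2.0.12\r\n"
    "User-Agent: Wget/1.12 (linux-gnu)\r\n"
    "Accept: */*\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)
verb, url, version, headers = parse_request(raw)
print(verb, url, version)
# Lookups on the returned message object ignore the case of the header name.
print(headers["user-agent"])
print(headers["Host"])
```

A missing header returns None instead of raising KeyError, which may or may not be what you want.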

edited May 31, 2012 at 12:50

answered May 31, 2012 at 11:57

Maria Zverina


4 Comments

Adam Matan Over a year ago

Remember to strip your values, ignore lines that don't contain ':', and split on the first ':' only: d=dict([[i.strip() for i in l.split(':', 1)] for l in s1.splitlines() if ":" in l])

Adam Matan Over a year ago

And +1 - liked your approach. As I wrote, Readability counts.

barp Over a year ago

I also need to parse the request line (the "GET" method), which has no ':'.

Maria Zverina Over a year ago

Updated to include the first line :)
