I have a question about Python regular expressions, which I don't know well. I am parsing HTTP request messages with a regex. An HTTP GET message has this format:

GET / HTTP/1.0
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Host: 10.2.0.12
Connection: Keep-Alive

I want to extract the method, URI, User-Agent, and Host fields from the message. My regex for this is:

re.compile(r'^({0})\s+(\S+)\s+[^\n]*$\n.*^User-Agent:\s*(\S+)[^\n]*$\n.*^Host:\s*(\S+)[^\n]*$\n'.format('|'.join(methods)), re.MULTILINE|re.DOTALL)

But when the message arrives with the headers in a different order, like this:

GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive

I cannot match it, because the positions of Host and User-Agent have changed. So I need a generic regex that matches all of these messages, regardless of the order in which fields such as Host and User-Agent appear.

  • This should help you Commented May 31, 2012 at 11:53
  • @tuxuday +1 Searching the freakin' web is the most powerful skill in a developer's toolbox. Commented May 31, 2012 at 12:16
  • I like the method that @tuxuday suggests, like this: m = re.findall(r"(?P<name>.*?): (?P<value>.*?)\r\n", req). But with this method I cannot parse "GET" and the HTTP version. Is it better to add them at the beginning? Commented May 31, 2012 at 12:31
  • Do the job the right way. Use cgi.parse_header() to get values from the string, or use some tools like WebOb. Commented May 31, 2012 at 15:49
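
The findall() idea from the comments can be paired with a separate match for the request line, which is one way to handle "GET" and the HTTP version. What follows is a minimal sketch, not a full HTTP parser: the names REQUEST_LINE, HEADER and parse_request are made up for illustration, header names are lowercased so lookups ignore case, and a repeated header simply overwrites the earlier one.

```python
import re

# Hypothetical names for illustration. Both patterns tolerate CRLF or bare LF.
REQUEST_LINE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<uri>\S+) (?P<version>HTTP/\d\.\d)\r?$",
    re.MULTILINE,
)
HEADER = re.compile(
    r"^(?P<name>[^:\s]+):[ \t]*(?P<value>.*?)[ \t]*\r?$",
    re.MULTILINE,
)


def parse_request(req):
    """Return (method, uri, version, headers) for a raw request string."""
    first = REQUEST_LINE.match(req)
    if first is None:
        raise ValueError("input does not start with an HTTP request line")
    # findall() returns (name, value) tuples in whatever order they occur.
    headers = {name.lower(): value for name, value in HEADER.findall(req)}
    return first.group("method"), first.group("uri"), first.group("version"), headers


req = (
    "GET / HTTP/1.0\r\n"
    "Host: 10.2.0.12\r\n"
    "User-Agent: Wget/1.12 (linux-gnu)\r\n"
    "Accept: */*\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)
method, uri, version, headers = parse_request(req)
print(method, uri, version)   # GET / HTTP/1.0
print(headers["user-agent"])  # Wget/1.12 (linux-gnu)
```

Because the headers end up in a dictionary, their order in the message no longer matters.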

2 Answers

Readability Counts (The Zen of Python)

Use findall() for each subexpression you want to find. This way your regex will be short, readable, and independent of the location of the subexpression.

Define a simple, readable regex:

>>> import re
>>> user=re.compile("User-Agent: (.*?)\n")

Test it with two different http headers:

>>> s1='''GET / HTTP/1.0
    Host: 10.2.0.12
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Connection: Keep-Alive'''
>>> s2='''GET / HTTP/1.0
    User-Agent: Wget/1.12 (linux-gnu)
    Accept: */*
    Host: 10.2.0.12
    Connection: Keep-Alive'''
>>> user.findall(s1)
['Wget/1.12 (linux-gnu)']
>>> user.findall(s2)
['Wget/1.12 (linux-gnu)']
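
The same one-pattern-per-field idea can be extended to the other fields the question asks for. The sketch below is only an illustration under a few assumptions: the names REQUEST_LINE, HOST, USER_AGENT and extract are hypothetical, each header is assumed to start at the beginning of a line, and only the first occurrence of each field is used.

```python
import re

# Hypothetical pattern names; each pattern inspects a single line only,
# so the order of the header lines does not matter.
REQUEST_LINE = re.compile(r"^([A-Z]+)[ \t]+(\S+)[ \t]+HTTP/\d\.\d", re.MULTILINE)
HOST = re.compile(r"^Host:[ \t]*(\S+)", re.MULTILINE | re.IGNORECASE)
USER_AGENT = re.compile(
    r"^User-Agent:[ \t]*(.*?)[ \t]*\r?$", re.MULTILINE | re.IGNORECASE
)


def extract(request):
    """Return method, URI, host and user agent; missing fields become None."""
    line = REQUEST_LINE.search(request)
    host = HOST.search(request)
    agent = USER_AGENT.search(request)
    return {
        "method": line.group(1) if line else None,
        "uri": line.group(2) if line else None,
        "host": host.group(1) if host else None,
        "user_agent": agent.group(1) if agent else None,
    }


a = "GET / HTTP/1.0\nHost: 10.2.0.12\nUser-Agent: Wget/1.12 (linux-gnu)\nAccept: */*\n"
b = "GET / HTTP/1.0\nUser-Agent: Wget/1.12 (linux-gnu)\nAccept: */*\nHost: 10.2.0.12\n"
print(extract(a) == extract(b))  # True
```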

1 Comment

The HTTP spec requires \r\n line endings, so I'd suggest using re.compile("User-Agent: (.*?)\r\n") instead.
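
One way to follow this advice without breaking on input that uses a bare \n, such as the triple-quoted test strings in the answer, is to make the \r optional. This is a small sketch with illustrative variable names. Note that it still needs a line break after the header, so a User-Agent on the very last line without a trailing newline would not match.

```python
import re

# \r?\n accepts both CRLF (used by HTTP on the wire) and bare LF.
user = re.compile(r"User-Agent: (.*?)\r?\n")

crlf = "GET / HTTP/1.0\r\nUser-Agent: Wget/1.12 (linux-gnu)\r\nHost: 10.2.0.12\r\n\r\n"
lf = "GET / HTTP/1.0\nUser-Agent: Wget/1.12 (linux-gnu)\nHost: 10.2.0.12\n\n"

print(user.findall(crlf))  # ['Wget/1.12 (linux-gnu)']
print(user.findall(lf))    # ['Wget/1.12 (linux-gnu)']
```

With a pattern that ends in a plain \n, CRLF input would leave a trailing \r inside the captured value.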

Why not parse all of the headers into a dictionary, like so?

headers = """GET / HTTP/1.0
Host: 10.2.0.12
User-Agent: Wget/1.12 (linux-gnu)
Accept: */*
Connection: Keep-Alive"""


headers = headers.splitlines()
firstLine = headers.pop(0)
(verb, url, version) = firstLine.split()
d = {'verb' : verb, 'url' : url, 'version' : version}
for h in headers:
    # Split on the first ': ' only, so values that contain ': ' stay intact.
    h = h.split(': ', 1)
    if len(h) < 2:
        continue
    field = h[0]
    value = h[1]
    d[field] = value

print(d)

print(d['User-Agent'])
print(d['url'])
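
If hand-written splitting is not a requirement, the header block can also be handed to the standard library's email parser, because HTTP header fields use the same "Name: value" line syntax as email headers. This is a swapped-in technique, not what the code above does. The helper name parse_request is hypothetical, and the sketch assumes a well-formed request line with exactly three parts.

```python
from email import message_from_string


def parse_request(raw):
    """Split a raw request into method, URI, version and a header object."""
    # Everything before the first line break is the request line.
    request_line, _, header_block = raw.partition("\n")
    # split() also discards the trailing '\r' of a CRLF line ending.
    method, uri, version = request_line.split()
    # The returned Message object supports case-insensitive header lookup.
    headers = message_from_string(header_block)
    return method, uri, version, headers


raw = (
    "GET / HTTP/1.0\r\n"
    "Host: 10.2.0.12\r\n"
    "User-Agent: Wget/1.12 (linux-gnu)\r\n"
    "Accept: */*\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)
method, uri, version, headers = parse_request(raw)
print(method, uri, version)   # GET / HTTP/1.0
print(headers["user-agent"])  # Wget/1.12 (linux-gnu)
```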

4 Comments

Remember to strip your values and ignore lines that don't contain ":", for example: d=dict([[i.strip() for i in l.split(':', 1)] for l in s1.splitlines() if ":" in l])
And +1 - liked your approach. As I wrote, Readability counts.
I also need to parse the request line with the GET method, which has no ':'.
Updated to include the first line :)
