How to parse multiple lines with one regex command?

Question

I have two lines that look something like this:

Content-Type: text/plain
Content-Type: text/plain; charset=UTF-8

To parse the first line, I used a pattern like ("^Content-Type:\s(.*)") to capture the text/plain portion. For the second line, I used a pattern like ("^Content-Type:\s(.*)[;]") to capture the same string (text/plain). Is there a single pattern that will work in both cases? I am using Python and I am new to regex. Thanks.

Tags should inform users about your language. I edited that in for you this time.
– Mad Physicist, Commented Jul 24, 2017 at 20:01
^Content-Type:\s+(.*?)(?=;|$) although you don't need regex at all for such a simple case.
– zwer, Commented Jul 24, 2017 at 20:05

score 2 · Accepted Answer · 2017-07-24 20:16:46Z


You can just modify your Regex a bit:

Content-Type:\s([^;\s]*)

Here is a working link: Regex101
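For reference, here is a minimal sketch of how this pattern might be used from Python. It assumes each header arrives as its own string; the `lines` list and the variable names are illustrative and not part of the question.

```python
import re

# Pattern from this answer: after "Content-Type:" and one whitespace
# character, capture everything up to the first ';' or whitespace.
pattern = re.compile(r"Content-Type:\s([^;\s]*)")

# Hypothetical input: the two header lines from the question.
lines = [
    "Content-Type: text/plain",
    "Content-Type: text/plain; charset=UTF-8",
]

results = []
for line in lines:
    m = pattern.match(line)  # match() only tries the start of the string
    results.append(m.group(1) if m else None)

print(results)  # ['text/plain', 'text/plain']
```

The value of interest is in group 1; group 0 is the whole matched text, including the header name.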

edited Jul 24, 2017 at 20:16

answered Jul 24, 2017 at 20:06

user5684647


1 Comment

Mad Physicist Over a year ago

OP is trying to capture the content type string, not the whole header line: text/plain, not Content-Type: text/plain.

Mad Physicist · Accepted Answer · 2017-07-24 20:13:22Z

It looks like you are looking for the ? quantifier (6th down in the list in the docs). It will allow the trailing portion to appear once or not at all, covering both cases:

^Content-Type:\s+([^;]+)(?:;.*)?

Here are the changes I would recommend:

Do not match . in your capture group. Because * is greedy, .* will sometimes pick up characters you do not want: for example, if the string contains two semicolons, the first one will be captured. Instead, match [^;], which means "anything but a semicolon".
Change the quantifier in the main capture group from * to +. You want at least one character to match, which is what + expresses.
I would also add the + quantifier to the preceding \s just to be safe. It will allow you to match multiple spaces, should that ever happen.
Make the part that matches the ; into a non-capturing group (a group starting with (?:). This allows you to apply the ? quantifier to it.
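Put together, the recommendations above could be tried out roughly like this. It is only a sketch: the helper name `content_type_from_line` and the extra test strings are made up for illustration, and it assumes you process one header line at a time.

```python
import re

# The full pattern from this answer, applied to one line at a time.
pattern = re.compile(r"^Content-Type:\s+([^;]+)(?:;.*)?")

def content_type_from_line(line):
    """Return the captured content type, or None if the line does not match."""
    m = pattern.search(line)
    return m.group(1) if m else None

print(content_type_from_line("Content-Type: text/plain"))                 # text/plain
print(content_type_from_line("Content-Type: text/plain; charset=UTF-8"))  # text/plain
print(content_type_from_line("Content-Type:   text/html;q=1;x=2"))        # text/html
print(content_type_from_line("Content-Length: 42"))                       # None
```

Two caveats. Because [^;] and \s both match newlines, this pattern is best applied to individual lines (for example via str.splitlines()) rather than to a whole block of headers at once. And because [^;]+ also matches spaces, a value followed by a space before the semicolon keeps that trailing space; calling .strip() on the result is one way to deal with it.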

As @RudyTheHunter indirectly points out, if you use plain re.match, you don't need the leading ^ or the trailing portion after the semicolon at all, since match only matches at the beginning of the string and does not have to consume the rest of it.

You can therefore use just

Content-Type:\s+([^;]+)
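A short sketch of the difference between re.match and re.search with this shortened pattern. The "X-Original-Content-Type" header is a made-up example, used only to show a string where the text appears somewhere other than the start.

```python
import re

short = re.compile(r"Content-Type:\s+([^;]+)")

line = "Content-Type: text/plain; charset=UTF-8"
# match() succeeds even though the pattern stops before "; charset=UTF-8".
print(short.match(line).group(1))       # text/plain

# match() only tries position 0, so a string that merely contains
# "Content-Type:" later on is rejected...
shifted = "X-Original-Content-Type: text/plain"
print(short.match(shifted))             # None

# ...whereas search() scans the whole string and finds it.
print(short.search(shifted).group(1))   # text/plain
```

If you switch to re.search (or to re.findall on a larger string), the leading ^ becomes relevant again.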

zwer · Accepted Answer · 2017-07-24 20:41:04Z

0

As I've stated in the comment, regex is overkill for such a simple match, so for the sake of completeness, here is a version without it:

def parse_content_type(data):
    if data.lower()[:13] == "content-type:":  # HTTP headers are case-insensitive by spec.
        index = data.find(";")  # find the position of `;`
        return data[13:index if index > -1 else len(data)].strip()  # slice and strip

print(parse_content_type("Content-Type: text/plain"))  # text/plain
print(parse_content_type("Content-Type: text/plain; charset=UTF-8"))  # text/plain

It's more verbose but, in theory, it should be faster.
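If you want to check the speed claim on your own machine, a rough timeit sketch along these lines could be used. The function names `with_regex` and `with_slicing` and the call count are arbitrary choices for illustration, and the numbers will vary between systems and Python versions.

```python
import re
import timeit

# Precompiled pattern (the shortened regex from the other answer).
PATTERN = re.compile(r"Content-Type:\s+([^;]+)")

def with_regex(data):
    m = PATTERN.match(data)
    return m.group(1) if m else None

def with_slicing(data):
    # Same logic as parse_content_type above.
    if data.lower()[:13] == "content-type:":
        index = data.find(";")
        return data[13:index if index > -1 else len(data)].strip()

sample = "Content-Type: text/plain; charset=UTF-8"

# Sanity check: both approaches agree on the sample before timing them.
assert with_regex(sample) == with_slicing(sample) == "text/plain"

for fn in (with_regex, with_slicing):
    seconds = timeit.timeit(lambda: fn(sample), number=100_000)
    print(f"{fn.__name__}: {seconds:.3f}s for 100,000 calls")
```

Note that the two are not exactly equivalent: the slicing version ignores the case of the header name, while the regex as written does not unless you compile it with re.IGNORECASE.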

