0

I have two lines that look something like this:

Content-Type: text/plain
Content-Type: text/plain; charset=UTF-8

To parse the first line, I used a regex like ("^Content-Type:\s(.*)") to capture the text/plain portion. For the second line, I used a regex like ("^Content-Type:\s(.*)[;]") to capture the same string (text/plain). Is there a single regex that will work in both cases? I am using Python and I am new to regex. Thanks.

2
  • Tags should inform users about your language. I edited that in for you this time. Commented Jul 24, 2017 at 20:01
  • ^Content-Type:\s+(.*?)(?=;|$) although you don't need regex at all for such a simple case. Commented Jul 24, 2017 at 20:05

3 Answers

2

You can just modify your regex a bit:

Content-Type:\s([^;\s]*)

Here is a working link: Regex101
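For reference, here is a minimal sketch of how that pattern could be used from Python. The names CONTENT_TYPE_RE, headers and captured are made up for this illustration, and re.search is just one reasonable way to apply the pattern:

```python
import re

# Pattern from the answer: one whitespace character after the colon,
# then everything up to the first semicolon or whitespace character.
CONTENT_TYPE_RE = re.compile(r"Content-Type:\s([^;\s]*)")

# The two sample lines from the question.
headers = [
    "Content-Type: text/plain",
    "Content-Type: text/plain; charset=UTF-8",
]

# group(1) is the capture group, i.e. only the media type,
# not the whole header line.
captured = [CONTENT_TYPE_RE.search(line).group(1) for line in headers]
print(captured)
```

Both lines should yield the same captured value, because the character class stops at the first semicolon if there is one and otherwise runs to the end of the line.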

1 Comment

OP is trying to capture the content type string, not the whole header line: text/plain, not Content-Type: text/plain.
0

It looks like you are looking for the ? quantifier (6th down in the list in the docs). It will allow the trailing portion to appear once or not at all, covering both cases:

^Content-Type:\s+([^;]+)(?:;.*)?

Here are the changes I would recommend:

  • Do not capture . in your capture group. * is greedy, so you will get undesirable characters sometimes: e.g. if you have two semicolons in the string, the first one will get captured. Instead, capture [^;], which means "anything but semicolons".
  • Change the quantifier in the main capture group from * to +. You want at least one character to match, which is what + expresses.
  • I would also add the + quantifier to the preceding \s just to be safe. It will allow you to match multiple spaces, should that ever happen.
  • Make the part that matches the ; into a non-capturing group (a group starting with (?:). This allows you to apply the ? quantifier to it.
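
To make the difference concrete, here is a small sketch comparing the greedy pattern from the question with the pattern recommended above. The header with two semicolons is a made-up example, and the variable names are only for illustration:

```python
import re

# Greedy pattern from the question: (.*) runs as far as it can and then
# backtracks to the LAST semicolon in the string.
greedy = re.compile(r"^Content-Type:\s(.*)[;]")

# Recommended pattern: [^;]+ stops at the first semicolon, and the
# optional non-capturing group (?:;.*)? covers any trailing parameters.
negated = re.compile(r"^Content-Type:\s+([^;]+)(?:;.*)?")

# Hypothetical header with two semicolons.
line = "Content-Type: text/plain; charset=UTF-8; format=flowed"

greedy_value = greedy.match(line).group(1)
negated_value = negated.match(line).group(1)

# The same pattern also works when there is no semicolon at all.
plain_value = negated.match("Content-Type: text/plain").group(1)

print(repr(greedy_value))   # includes the first semicolon and the charset
print(repr(negated_value))  # just the media type
print(repr(plain_value))    # just the media type
```

With the greedy version the first semicolon ends up inside the capture, which is the "undesirable characters" problem described in the first bullet.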

As @RudyTheHunter indirectly points out, if you use plain re.match, you need neither the leading ^ nor the trailing portion after the semicolon: match only looks at the beginning of the string, and it does not require the pattern to consume the whole string.

You can therefore use just

Content-Type:\s+([^;]+)
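
A quick sketch of that shorter pattern with re.match. The helper name content_type is hypothetical, and the X-Content-Type line is a made-up input that shows why the ^ is not needed:

```python
import re

# No leading ^ and no trailing group: re.match anchors at the start of the
# string by itself and is happy to stop before the end of the string.
pattern = re.compile(r"Content-Type:\s+([^;]+)")


def content_type(line):
    """Return the media type from a Content-Type line, or None if no match."""
    m = pattern.match(line)
    return m.group(1) if m else None


print(content_type("Content-Type: text/plain"))
print(content_type("Content-Type: text/plain; charset=UTF-8"))

# match() only looks at the beginning, so a line that merely contains
# the header name somewhere later does not match...
print(content_type("X-Content-Type: text/plain"))

# ...whereas search() without ^ would find it mid-string.
print(pattern.search("X-Content-Type: text/plain") is not None)
```

If you switch to re.search, put the ^ back so the pattern stays anchored to the start of the line.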

0

As I've stated in the comment, regex is overkill for such a simple match, so for the sake of completeness, here is a version that uses only plain string operations:

def parse_content_type(data):
    if data.lower()[:13] == "content-type:":  # HTTP headers are case-insensitive by spec.
        index = data.find(";")  # find the position of `;`
        return data[13:index if index > -1 else len(data)].strip()  # slice and strip

print(parse_content_type("Content-Type: text/plain"))  # text/plain
print(parse_content_type("Content-Type: text/plain; charset=UTF-8"))  # text/plain

It's more verbose but, in theory, it should be faster.
