Say I want to extract the hostname and the port number from a string like this:

stackoverflow.com:443

That is pretty easy. I could do something like this:

(?<host>.*):(?<port>\d*)

I am not worried about protocol schemes, valid host names/IP addresses, or TCP/UDP ports; they are not important to my request.

However, I also need to support one twist that takes this beyond my knowledge of regular expressions - the host name without the port:

stackoverflow.com

I want to use a single regular expression for this, and I want to use named capture groups such that the host group will always exist in a positive match, while the port group exists if and only if we have a colon followed by a number of digits.

I have tried doing a positive lookbehind from my feeble understanding of it:

(?<host>.*)(?<=:)(?<port>\d*)

This comes close, but the colon (:) is included at the end of the host capture. So I tried to change the host to include anything but the colon like this:

(?<host>[^:]*)(?<=:)(?<port>\d*)

That gives me an empty host capture.

Any suggestions on how to accomplish this, i.e. make the colon and the port number optional, but if they are there, include the port number capture and make the colon "vanish"?

Edit: All the four answers I have received work well for me, but pay attention to the comments in some of them. I accepted sln's answer because of the nice layout and explanation of the regexp structure. Thanks to all that replied!

  • EDITED - Not tested, but try for example this: (?<host>[^:]+)(:(?<port>\d+))? Remember that the question mark itself can be used to define optional characters or whole groups. Commented Mar 27, 2014 at 19:39
  • Jerry: I should have mentioned - this is part of a bigger, more complex regexp that does more than just the host/port stuff. So I just wanted to isolate the part I'm having trouble with. Commented Mar 27, 2014 at 19:47
  • Zoltán: So basically a nested expression? Wow, that takes regular expressions to the next headache level. :) Thanks, will try! Commented Mar 27, 2014 at 19:51
  • @RuneJacobsen, yes, because you want a whole optional group (the colon followed by the port), and want to catch the number part of it, so one group inside another makes sense. Commented Mar 27, 2014 at 19:55
  • Not really a nested expression, an optional capture group that should be an optional cluster group, especially if you are counting named capture groups and/or named groups last in a larger expression. Commented Mar 27, 2014 at 20:00

5 Answers

6

I suggest using the Uri class instead of regular expressions.

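// Use the Uri class for parsing only; "http://" is a dummy scheme added so the input parses
var uri = new Uri("http://" + fullAddress);
// Get the host
host = uri.DnsSafeHost;
// Get the port (Uri.Port falls back to the scheme's default, 80 for http, when none is given)
portNum = (ushort)uri.Port;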

The benefits are:

  • It supports:
    • IPv4 and IPv6
    • Internationalized domain names (IDN)
  • It can be extended to take the scheme into account in the future
  • The code is short and standardised, so there are fewer mistakes

See a usage sample on .NET Fiddle.
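
As a rough sketch of how this behaves for the two inputs from the question: the "http://" prefix is only a dummy scheme added for parsing, and the variable names below are made up for illustration.

```csharp
using System;

// Parse both forms from the question by prefixing a dummy scheme.
var withPort = new Uri("http://" + "stackoverflow.com:443");
var withoutPort = new Uri("http://" + "stackoverflow.com");

// The host is available in both cases.
string hostA = withPort.DnsSafeHost;
string hostB = withoutPort.DnsSafeHost;

// Uri.Port is always populated; IsDefaultPort tells whether it is the scheme default.
int portA = withPort.Port;
bool defaultA = withPort.IsDefaultPort;
int portB = withoutPort.Port;
bool defaultB = withoutPort.IsDefaultPort;

Console.WriteLine($"{hostA} port={portA} default={defaultA}");
Console.WriteLine($"{hostB} port={portB} default={defaultB}");
```

One caveat with this approach: an input that explicitly names the scheme's default port (":80" with the dummy "http://" prefix) also reports IsDefaultPort as true, so "port absent" and "port 80 given" cannot be told apart this way.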

2 Comments

This question has been added to the Stack Overflow Regular Expressions FAQ as a non-regex alternative, under "Common Tasks > Validation".
Didn't consider this since the data I am parsing is not necessarily a Uri, but of course it makes sense that you can do it this way as well. :)
2

Maybe this: (?<host>[^:]+)(?::(?<port>\d+))?

 (?<host> [^:]+ )               # (1), Host, required
 (?:                            # Cluster group start, optional
      :                              # Colon ':'
      (?<port> \d+ )                 # (2), Port number
 )?                             # Cluster group end

Edit: If you do not use the cluster group and use a capture group in its place instead, this is how .NET "counts" the groups in its default configuration (unnamed groups are numbered first, then the named ones):

 (?<host> [^:]+ )         #_(2), Host, required                           
 (                        # (1 start), Unnamed capture group, optional
      :                        # Colon ':'
      (?<port> \d+ )           #_(3), Port number                           
 )?                       # (1 end)
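
The numbering in the two listings above can be checked with a short sketch like the following. It only uses Regex.GroupNumberFromName; the variable names are made up for illustration, and default RegexOptions are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Variant with a non-capturing cluster group around the colon and port.
var clustered = new Regex(@"(?<host>[^:]+)(?::(?<port>\d+))?");
// Variant with an unnamed capture group in place of the cluster group.
var captured = new Regex(@"(?<host>[^:]+)(:(?<port>\d+))?");

// Ask .NET which number each named group was assigned.
int hostClustered = clustered.GroupNumberFromName("host");
int portClustered = clustered.GroupNumberFromName("port");
int hostCaptured = captured.GroupNumberFromName("host");
int portCaptured = captured.GroupNumberFromName("port");

Console.WriteLine($"cluster group: host={hostClustered}, port={portClustered}");
Console.WriteLine($"capture group: host={hostCaptured}, port={portCaptured}");
```

With the unnamed capture group, that group takes number 1 and pushes host and port to 2 and 3, which matters if the surrounding, larger expression refers to groups by number.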

2 Comments

This answer has been added to the Stack Overflow Regular Expression FAQ, under "Common Validation Tasks".
@aliteralmind, please consider this answer (stackoverflow.com/a/24399003/968003) for that FAQ instead.
1

If your host name doesn't contain a colon (as an IPv6 address would), then try this one:

(?<host>[^:]*):?(?<port>\d*)

4 Comments

This would match "stackoverflow.com8080", wouldn't it?
@ZoltánTamási But OP says not worried about protocol schemes or valid host names
I thought the colon between hostname and port is one level lower than valid host names and protocol schemas :)
Zoltán is right, it would match this, but Sabuj is also right - for this regexp, I want to parse this as well as possible, given potentially malformed input. In other regexps at other points in the code I will validate and warn about illegal/wrong input.
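
The behaviour discussed in these comments can be reproduced with a small sketch; the variable names are hypothetical, and the second input is the malformed example from the first comment.

```csharp
using System;
using System.Text.RegularExpressions;

// The lenient pattern from this answer: colon and port are both optional.
var lenient = new Regex(@"(?<host>[^:]*):?(?<port>\d*)");

Match normal = lenient.Match("stackoverflow.com:443");
Match noColon = lenient.Match("stackoverflow.com8080");

string normalHost = normal.Groups["host"].Value;
string normalPort = normal.Groups["port"].Value;
// Without a colon, the greedy host group swallows the digits as well.
string noColonHost = noColon.Groups["host"].Value;
string noColonPort = noColon.Groups["port"].Value;
// \d* can match the empty string, so the port group still takes part in the match.
bool noColonPortSuccess = noColon.Groups["port"].Success;

Console.WriteLine($"host={normalHost}, port={normalPort}");
Console.WriteLine($"host={noColonHost}, port='{noColonPort}', success={noColonPortSuccess}");
```

Because the port group always participates here, an empty Value rather than Success is what signals a missing port with this pattern.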
1

Try this:

(?<host>[^:]+)(:(?<port>\d+))?

This makes the whole colon-and-port part an optional group and captures the port number inside it. Also, I used the plus sign to ensure that the host name and the port number each contain at least one character.
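
A minimal sketch of how the pattern could be used from C#, checking Group.Success to see whether a port was present. The variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var pattern = new Regex(@"(?<host>[^:]+)(:(?<port>\d+))?");

Match withPort = pattern.Match("stackoverflow.com:443");
Match hostOnly = pattern.Match("stackoverflow.com");

// The host group is part of every successful match.
string hostA = withPort.Groups["host"].Value;
string hostB = hostOnly.Groups["host"].Value;

// The port group only succeeds when a colon followed by digits was found.
Group portA = withPort.Groups["port"];
Group portB = hostOnly.Groups["port"];
string portTextA = portA.Success ? portA.Value : "(none)";
string portTextB = portB.Success ? portB.Value : "(none)";

Console.WriteLine($"host={hostA}, port={portTextA}");
Console.WriteLine($"host={hostB}, port={portTextB}");
```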

Comments

1

You can use this:

(?<host>[^:]+)(:(?<port>\\d+))?

4 Comments

This works, but could you possibly explain the reason for the two backslashes in front of the d? I.e. I understand that \d represents a digit. The difference between one and two backslashes seems to be the number of capture groups returned.
It's for escaping the backslash in C# strings. It shouldn't be there in this context, but in a normal C# string you have to escape it, as you know.
@user3246354, a regex should almost always be declared as a verbatim string using the at sign (@), so you don't need to worry about escaping the backslashes. Usually a regex is complex enough without that too.
Yes, it was my mistake.
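
To illustrate the point made in these comments, here is a short sketch comparing the three ways the pattern might be written in C#. The variable names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Verbatim string: backslashes are taken literally, so \d reaches the regex engine as-is.
string verbatim = @"(?<host>[^:]+)(:(?<port>\d+))?";
// Regular string: the backslash must be escaped, but the resulting text is identical.
string escaped = "(?<host>[^:]+)(:(?<port>\\d+))?";
bool samePattern = verbatim == escaped;

// Doubling the backslash in a verbatim string changes the regex:
// \\d+ now means a literal backslash followed by one or more 'd' characters.
string doubled = @"(?<host>[^:]+)(:(?<port>\\d+))?";
Match m = Regex.Match("stackoverflow.com:443", doubled);
string doubledHost = m.Groups["host"].Value;
bool doubledPortFound = m.Groups["port"].Success;

Console.WriteLine($"same pattern: {samePattern}");
Console.WriteLine($"doubled: host={doubledHost}, port found={doubledPortFound}");
```

With the doubled backslash the optional group never matches ":443", so only the host is captured, which would explain seeing a different set of captures.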
