12

Suppose I have a string of the format host:port, where :port is optional. How can I reliably extract the two components?

The host can be any of:

  • A hostname (localhost, www.google.com)
  • An IPv4 literal (1.2.3.4)
  • An IPv6 literal ([aaaa:bbbb::cccc]).

In other words, this is the standard format used across the internet (such as in URIs: complete grammar at https://www.rfc-editor.org/rfc/rfc3986#section-3.2, excluding the "User Information" component).

So, some possible inputs, and desired outputs:

'localhost' -> ('localhost', None)
'my-example.com:1234' -> ('my-example.com', 1234)
'1.2.3.4' -> ('1.2.3.4', None)
'[0abc:1def::1234]' -> ('[0abc:1def::1234]', None)
6
  • It's sort-of hard to do this in Python (purely) where the delimiter is a factor inside the actual IPv6 address itself. Could you revise that? Commented Oct 22, 2017 at 17:00
  • my best attempt so far is to use a regex to special-case the ipv6 literal case, and otherwise use split. Commented Oct 22, 2017 at 17:08
  • клйкбаутоь мажаз вайкукас: well, host:port is a fairly common format, so I can't really change that. Commented Oct 22, 2017 at 17:10
  • honestly, I'm disappointed by the downvotes here. I was surprised this wasn't a common question, and thought that it might be useful to collect a few replies to see whether anyone could come up with an elegant solution. Commented Oct 22, 2017 at 17:12
  • 1
    Well, there we go. Two sets of codez in the answers. Commented Oct 22, 2017 at 17:27

7 Answers

15

Well, this is Python, with batteries included. You mention that the format is the standard one used in URIs, so how about urllib.parse?

import urllib.parse

def parse_hostport(hp):
    # urlparse() and urlsplit() insist on absolute URLs starting with "//"
    result = urllib.parse.urlsplit('//' + hp)
    return result.hostname, result.port

This should handle any valid host:port you can throw at it.
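
For input that may not be valid, a slightly stricter variant might look like the sketch below. The name parse_hostport_strict is hypothetical, and the extra checks are only an assumption about what "reliably" should mean here: they reject input that urlsplit would otherwise quietly treat as a path, query, fragment or user info. Note that result.hostname lowercases the host and strips the square brackets from an IPv6 literal, so the result differs slightly from the bracketed form shown in the question.

```python
import urllib.parse

def parse_hostport_strict(hp):
    # Hypothetical stricter wrapper; not part of urllib itself.
    result = urllib.parse.urlsplit('//' + hp)
    # Anything beyond a plain host[:port] ends up in these fields.
    if result.path or result.query or result.fragment or '@' in result.netloc:
        raise ValueError("Invalid host:port %r" % hp)
    if result.hostname is None:
        raise ValueError("Missing host in %r" % hp)
    # Accessing .port raises ValueError for a non-numeric port.
    return result.hostname, result.port

print(parse_hostport_strict('my-example.com:1234'))  # ('my-example.com', 1234)
print(parse_hostport_strict('[0abc:1def::1234]'))    # ('0abc:1def::1234', None)
```

If the brackets need to be preserved, they have to be re-added by the caller when the returned host contains a colon.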


4 Comments

good idea, although if it's important to detect and reject invalid host:port specifications, this won't cut the mustard
This does not work if hp is a bare IPv6 address (e.g. ::1)
If you need to parse a bare IPv6 literal address without square brackets, then no, this solution does not work. The OP specifies IPv6 literals in square brackets, which is required in host names.
Indeed, this was more a side note than a complaint.
2

This should handle the whole parse in a single regex

import re

regex = re.compile(r'''
(                            # first capture group = Addr
  \[                         # literal open bracket                       IPv6
    [:a-fA-F0-9]+            # one or more of these characters
  \]                         # literal close bracket
  |                          # ALTERNATELY
  (?:                        #                                            IPv4
    \d{1,3}\.                # one to three digits followed by a period
  ){3}                       # ...repeated three times
  \d{1,3}                    # followed by one to three digits
  |                          # ALTERNATELY
  [-a-zA-Z0-9.]+              # one or more hostname chars ([-\w\d\.])      Hostname
)                            # end first capture group
(?:                          
  :                          # a literal :
  (                          # second capture group = PORT
    \d+                      # one or more digits
  )                          # end second capture group
 )?                          # ...or not.''', re.X)

All that's needed then is to cast the second group to int.

def parse_hostport(hp):
    # regex from above should be defined here.
    m = regex.match(hp)
    addr, port = m.group(1, 2)
    try:
        return (addr, int(port))
    except TypeError:
        # port is None
        return (addr, None)
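
One caveat: regex.match only anchors at the start of the string, so input such as 'localhost:abc' matches 'localhost' and the trailing ':abc' is silently ignored. If invalid input should be rejected instead, a sketch using fullmatch could look like this. The pattern is a condensed stand-in for the one above (the IPv4 branch is folded into the hostname branch), and the names HOSTPORT_RE and parse_hostport_fullmatch are made up for illustration.

```python
import re

# Condensed, illustrative pattern: bracketed IPv6 literal, or hostname/IPv4.
HOSTPORT_RE = re.compile(r'''
    (
        \[[:a-fA-F0-9]+\]     # bracketed IPv6 literal
        |
        [-a-zA-Z0-9.]+        # hostname or IPv4 literal
    )
    (?: : (\d+) )?            # optional :port
''', re.X)

def parse_hostport_fullmatch(hp):
    # fullmatch requires the whole string to fit the pattern
    m = HOSTPORT_RE.fullmatch(hp)
    if m is None:
        raise ValueError("Invalid host:port %r" % hp)
    addr, port = m.group(1, 2)
    return (addr, int(port) if port is not None else None)

print(parse_hostport_fullmatch('[0abc:1def::1234]'))  # ('[0abc:1def::1234]', None)
print(parse_hostport_fullmatch('[::1]:8080'))         # ('[::1]', 8080)
```

Checking port against None explicitly also removes the need for the try/except around int().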

7 Comments

Sorry, I should have been clearer: the host can be any hostname, not just localhost. Will amend the question.
Nicely explained with comments on each row. +1
@richvdh fixed.
@AdamSmith think you forgot dashes!
@richvdh I'm not sure what you mean. This code interprets all your test cases correctly. Try it here
1
def split_host_port(string):
    if not string.rsplit(':', 1)[-1].isdigit():
        return (string, None)

    string = string.rsplit(':', 1)

    host = string[0]  # 1st index is always host
    port = int(string[1])

    return (host, port)

Actually confused on whether this is what you wanted, but I rewrote it up a bit and it still seems to follow the ideal output:

>>> split_host_port("localhost")
('localhost', None)
>>> split_host_port("example.com:1234")
('example.com', 1234)
>>> split_host_port("1.2.3.4")
('1.2.3.4', None)
>>> split_host_port("[0abc:1def::1234]")
('[0abc:1def::1234]', None)

I don't really like the chained calls on the first line (expanded, that is getattr(getattr(getattr(string, 'rsplit')(':', 1), '__getitem__')(-1), 'isdigit')()), especially since the same rsplit is repeated two lines later. Perhaps I should store the result in a variable so the call is only made once.

But I'm nitpicking here so feel free to call me out on that, heh.
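
For what it's worth, here is a sketch of that single-call idea using str.rpartition, which returns the text before the last colon, the colon itself, and the text after it in one go. The function name is hypothetical. The isascii() check (Python 3.7+) is an addition of mine: str.isdigit() also accepts characters such as '²', which int() rejects.

```python
def split_host_port_once(string):
    # Split on the last ':' exactly once; sep is '' when there is no colon.
    host, sep, port = string.rpartition(':')
    if sep and port.isascii() and port.isdigit():
        return (host, int(port))
    # No colon, or the part after it is not a decimal number
    # (e.g. the tail of a bracketed IPv6 literal such as '1234]').
    return (string, None)

print(split_host_port_once("example.com:1234"))   # ('example.com', 1234)
print(split_host_port_once("[0abc:1def::1234]"))  # ('[0abc:1def::1234]', None)
```

Like the version above, this still assumes IPv6 literals are bracketed; a bare ::1 would be split into (':', 1).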

2 Comments

This is interesting because it avoids the need for exceptions. Some ideas for enhancement: (1) do the isdigit test before the split, so that if it fails we just return (string, None); (2) use rsplit(':', 1) to avoid having to stick the host back together again afterwards.
Thanks, I'll change it now-- actually never knew str.rsplit existed, would've done some weird code like string[::-1].split(':') and joined it back after doing the necessary stuff, TIL I guess :D
1

Here's my final attempt, with credit to other answerers who provided inspiration:

def parse_hostport(s, default_port=None):
    if s.endswith(']'):
        # ipv6 literal (with no port)
        return (s, default_port)

    out = s.rsplit(":", 1)
    if len(out) == 1:
        # No port
        port = default_port
    else:
        try:
            port = int(out[1])
        except ValueError:
            raise ValueError("Invalid host:port '%s'" % s)

    return (out[0], port)
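
If stricter validation is wanted, the same structure extends naturally to a range check. The sketch below is an assumption about what callers might need rather than part of the answer above: the name parse_hostport_checked is hypothetical, and it rejects ports outside 0-65535, the range a 16-bit TCP/UDP port number can take.

```python
def parse_hostport_checked(s, default_port=None):
    # Hypothetical variant of parse_hostport with a port range check.
    if s.endswith(']'):
        # bracketed IPv6 literal with no port
        return (s, default_port)
    host, sep, port_text = s.rpartition(':')
    if not sep:
        # no colon at all, so no port
        return (s, default_port)
    try:
        port = int(port_text)
    except ValueError:
        raise ValueError("Invalid host:port %r" % s) from None
    if not 0 <= port <= 65535:
        raise ValueError("Port out of range in %r" % s)
    return (host, port)

print(parse_hostport_checked('localhost', default_port=80))  # ('localhost', 80)
print(parse_hostport_checked('[::1]:8080'))                  # ('[::1]', 8080)
```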


1

Came up with a dead simple regexp that seems to work in most cases:

import re

def get_host_pair(value):
    return re.search(r'^(.*?)(?::(\d+))?$', value).groups()

get_host_pair('localhost')
get_host_pair('localhost:80')
get_host_pair('[::1]')
get_host_pair('[::1]:8080')

It probably doesn't work when the input is invalid, however.
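
To spell out how the pattern behaves: the lazy (.*?) takes as little as possible for the host, and the optional group (?::(\d+))? only matches a colon followed by digits that run right up to the end of the string. The sketch below wraps the same regex and converts the port to an int (the original returns it as a string); the name get_host_pair_int is made up for illustration.

```python
import re

def get_host_pair_int(value):
    # Same pattern as get_host_pair, with the port converted to int.
    m = re.search(r'^(.*?)(?::(\d+))?$', value)
    if m is None:
        # can happen for multi-line input, since '.' does not match newlines
        raise ValueError("Invalid host:port %r" % value)
    host, port = m.groups()
    return (host, int(port) if port is not None else None)

print(get_host_pair_int('localhost'))      # ('localhost', None)
print(get_host_pair_int('localhost:80'))   # ('localhost', 80)
print(get_host_pair_int('[::1]'))          # ('[::1]', None)
print(get_host_pair_int('[::1]:8080'))     # ('[::1]', 8080)
print(get_host_pair_int('localhost:80 '))  # ('localhost:80 ', None)
print(get_host_pair_int('::1'))            # (':', 1)
```

The last two calls show the limits: trailing whitespace makes the whole string the host, and a bare (unbracketed) IPv6 address is split at its last colon.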

1 Comment

Can you please edit the post to clarify how this works (so that one can gauge whether it is reliable enough for a use-case)? At a glance, this relies on the port being "digits after a : and before end of string", right? So it would be sensitive to whitespace but handle the usual, compact format?
0

Here's my attempt at this so far:

import re

def parse_hostport(hp):
    """ parse a host:port pair
    """
    # start by special-casing the ipv6 literal case
    # (raw string, so that \[ and \d are not treated as string escapes)
    x = re.match(r'^(\[[0-9a-fA-F:]+\])(:(\d+))?$', hp)
    if x is not None:
        return x.group(1, 3)

    # otherwise, just split at the (hopefully only) colon
    splits = hp.split(':')

    if len(splits) == 1:
        return splits + [None,]
    elif len(splits) == 2:
        return splits

    raise ValueError("Invalid host:port input '%s'" % hp)


0

Here's a terser implementation which relies on attempting to parse the last component as an int:

def parse_hostport(s):
    out = s.rsplit(":", 1)
    try:
        out[1] = int(out[1])
    except (IndexError, ValueError):
        # couldn't parse the last component as a port, so let's
        # assume there isn't a port.
        return (s, None)
    # return a tuple in both cases rather than a list here
    return tuple(out)

1 Comment

I'm not a huge fan of this implementation, largely because my instinct is to avoid code which throws exceptions in unexceptional circumstances. Still, I think it's better than the regex version.
