Python domain name list regex

Question

I want to extract all the domain names from a given string using Python. I tried the code below, but I am not getting the expected output:

str = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"
list = re.findall(r'([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', str)
print list

I want the output as:

asu.edu , tarantino.cs.ucsb.edu

but what I get is:

[('asu.', ''), ('ucsb.', '')]

What am I missing?

Please don't overwrite built-in types; use names such as my_str and my_list instead if you don't have more meaningful ones.
– Aprillion, Commented Feb 6, 2016 at 21:29

tjohnson · Accepted Answer · 2016-02-06 21:33:48Z


This should work:

import re
my_str = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"
my_list = re.findall(r'(([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10})', my_str)
print [i[0] for i in my_list]

As pointed out in the comments on the question, you shouldn't use str and list as variable names because doing so shadows the built-in types of the same names in Python.
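For readers on Python 3, here is a minimal sketch of the same approach. It assumes only that print is called as a function; the names pattern and domains are introduced for illustration and are not part of the original answer. The outermost parentheses make the whole match the first capture group, so the first element of each tuple is the full domain name.

```python
import re

my_str = ("ctcO6OgnWRAxLtu+akRCFwM asu.edu "
          "zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb "
          "tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1")

# Three capturing groups: the outer one wraps the entire pattern,
# so re.findall returns one 3-tuple per match.
pattern = r'(([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10})'
my_list = re.findall(pattern, my_str)

# Keep only the first group of each tuple, i.e. the full match.
domains = [groups[0] for groups in my_list]
print(domains)
```

The IP address 128.111.48.123 is skipped because the final label must consist of 2 to 10 lowercase letters.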

edited Feb 6, 2016 at 21:33

answered Feb 6, 2016 at 21:29

tjohnson


1 Comment

Aprillion Over a year ago

You can just use non-capturing groups, r'(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', and print my_list directly.

Community · Accepted Answer · 2017-05-23 12:24:07Z

In [63]: text = "ctcO6OgnWRAxLtu+akRCFwM asu.edu zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1"

In [64]: re.findall(r'(?:[a-zA-Z0-9]+\.)+[a-z]{2,10}', text)
Out[64]: ['asu.edu', 'tarantino.cs.ucsb.edu']

Use (?:...) to create a non-capturing group. When the pattern contains more than one capturing group (i.e. a subpattern surrounded by plain parentheses), re.findall returns a tuple of the groups for each match; with exactly one capturing group it returns only that group's text. To make re.findall return the full matches as a list of strings, use non-capturing groups.
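A small sketch of how the return value of re.findall changes with the number of capturing groups. The simplified patterns and variable names here are made up for illustration and are not the patterns from the question.

```python
import re

# No groups: each element is the whole match.
no_groups = re.findall(r'[a-z]+\.[a-z]{2,10}', "asu.edu")

# Exactly one capturing group: each element is that group's text only.
one_group = re.findall(r'([a-z]+\.)[a-z]{2,10}', "asu.edu")

# Two capturing groups: each element is a tuple of the groups.
two_groups = re.findall(r'([a-z]+)\.([a-z]{2,10})', "asu.edu")

# Non-capturing group: grouping for repetition, but whole matches are returned.
non_capturing = re.findall(r'(?:[a-z]+\.)+[a-z]{2,10}', "cs.ucsb.edu")

print(no_groups)      # ['asu.edu']
print(one_group)      # ['asu.']
print(two_groups)     # [('asu', 'edu')]
print(non_capturing)  # ['cs.ucsb.edu']
```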
For the text you posted, the subpattern (-[a-zA-Z0-9]+)* is unnecessary. None of the domain names in text contains a hyphen (the only - is in ssh-rsa, which is not followed by a dot), so (-[a-zA-Z0-9]+)* never contributes to a match. Of course, you could add (?:-[a-zA-Z0-9]+)* to the pattern if you wish (note the use of the non-capturing group (?:...)), but that part of the pattern is not exercised by the text you posted. It would allow you to match names with hyphens, however:
```
In [73]: re.findall(r'(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', 'asu-psu.edu but not initial hyphens like -psu-asu.edu')
Out[73]: ['asu-psu.edu', 'psu-asu.edu']
```
And as Aprillion noted:
```
In [74]: re.findall(r'(?:[a-zA-Z0-9]+(?:-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}', text)
Out[74]: ['asu.edu', 'tarantino.cs.ucsb.edu']
```
See regex101 for an explanation of the pattern (?:[a-zA-Z0-9]+\.)+[a-z]{2,10}
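As an alternative that is not from either answer, re.finditer can be used when you would rather keep the capturing groups: each match object's group(0) is the whole match regardless of how many groups the pattern has. A minimal Python 3 sketch using the pattern from the question unchanged (the names pattern and domains are illustrative):

```python
import re

text = ("ctcO6OgnWRAxLtu+akRCFwM asu.edu "
        "zOiV6Wo6nDnUhQkZO4XTySrTRwLMgozM9R/LyQs2r+Pb "
        "tarantino.cs.ucsb.edu,128.111.48.123 ssh-rsa 9SMF4U+qJW03Bh1")

# The original pattern, capturing groups and all.
pattern = re.compile(r'([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*\.)+[a-z]{2,10}')

# group(0) always refers to the entire match, not to any capture group.
domains = [m.group(0) for m in pattern.finditer(text)]
print(domains)
```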
