Regular expression for matching numbers and strings

Question

I have a series of words, as given below:

476pe
e586
9999
rrr
ABCF

I have to write a regular expression that matches numbers and numbers mixed with letters. From the above strings, I have to match only

476pe
e586
9999

I tried to write a regular expression as given below

^[\D]*[0-9]+[\D]*$

but it doesn't work. I tried this with an online regex tool, http://rubular.com/r/HQE2vG0pbu and it shows that the entire string is matching.

That regular expression, while inelegant, should work in principle (if your goal is to match any string that contains exactly one number (with any number of digits)). How are you using it?
– Tim Pietzcker, Commented Apr 22, 2014 at 7:53
In other words, should a string like "12a34" match or not? Also, do you really want to match strings like "(/3)!§" (which your regex currently matches)?
– Tim Pietzcker, Commented Apr 22, 2014 at 7:55
Also, what is your regex matching that you don't want it to match? What is it failing to match that you do want it to match?
– Kyle Strand, Commented Apr 22, 2014 at 7:56
Tim, I was trying it with an online regex tool, rubular.com/r/HQE2vG0pbu, and it shows all strings are matching.
– mystack, Commented Apr 22, 2014 at 7:56
"but it didn't works." What do you mean by that? Help us to help you.
– thefourtheye, Commented Apr 22, 2014 at 7:57

Community · Accepted Answer · 2017-05-23 11:57:28Z

Since other answers have given a lot of ways to solve your problem, let me try to explain the behavior you witnessed.

First of all, Rubular is specific to Ruby's regular expression semantics. (I don't have exact information about how the Ruby and Python RegEx engines differ.) Since you have tagged python, you might want to use regex101 or debuggex. I'll be using both of these to explain.

Now, let us look at your actual RegEx and the data, here. Your input string is like this

476dn
e586
9999
rrr
ABCF

A regular expression can treat this input in two ways: as one long string with newlines in it, or as a list of strings separated by newlines. We can control this behavior with a RegEx flag known as the multiline flag (in Python, re.MULTILINE or re.M). Quoting from the Python docs,

re.M

re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

For example, in our case, if this flag is NOT enabled, the input string will be treated as a long string with newlines in it and ^ will match the position before 4 in the first line, $ will match the position after F in the last line.

When that flag is enabled, ^ can match at the start of any line and $ at the end of any line. A pattern anchored with both can therefore span any of the following pairs of positions

when ^ is the position before 4, $ will be the position after n
when ^ is the position before 4, $ will be the position after 6
when ^ is the position before 4, $ will be the position after 9
when ^ is the position before 4, $ will be the position after r
when ^ is the position before 4, $ will be the position after F

when ^ is the position before e, $ will be the position after 6
when ^ is the position before e, $ will be the position after 9
when ^ is the position before e, $ will be the position after r
when ^ is the position before e, $ will be the position after F

when ^ is the position before 9, $ will be the position after 9
when ^ is the position before 9, $ will be the position after r
when ^ is the position before 9, $ will be the position after F

when ^ is the position before r, $ will be the position after r
when ^ is the position before r, $ will be the position after F

when ^ is the position before A, $ will be the position after F
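The effect of the flag on the two anchors can be checked directly. Here is a minimal sketch, assuming Python 3 and the five strings from the question joined by newlines with no trailing newline. The helper name anchor_positions is made up for this illustration; it lists the offsets at which a zero-width anchor matches.

```python
import re

# Assumption: the question's five strings joined by "\n", no trailing newline.
data = "476pe\ne586\n9999\nrrr\nABCF"

def anchor_positions(pattern, text, flags=0):
    """Return every offset at which a zero-width anchor pattern matches."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

# Without the flag: only the very start and the very end of the string.
starts_default = anchor_positions(r"^", data)
ends_default = anchor_positions(r"$", data)

# With the flag: the start and the end of every line.
starts_multiline = anchor_positions(r"^", data, re.MULTILINE)
ends_multiline = anchor_positions(r"$", data, re.MULTILINE)

print(starts_default, ends_default)      # [0] [24]
print(starts_multiline, ends_multiline)  # [0, 6, 11, 16, 20] [5, 10, 15, 19, 24]
```

Any start offset paired with a later end offset is a span the anchored pattern is allowed to cover, which is what the list above enumerates.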

Since the anchors can match at multiple positions, we have to tell the RegEx engine explicitly to find every match, line by line, when we use multiline strings. In Python, we can use re.findall or re.finditer. In the RegEx world, this is normally represented by the flag g (search globally).
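As a rough sketch of the difference (assuming Python 3; the pattern ^.+$ is only a stand-in that matches any non-empty line, not the pattern from the question), search stops at the first match, while findall and finditer keep going:

```python
import re

# Assumption: the question's five strings joined by "\n".
data = "476pe\ne586\n9999\nrrr\nABCF"

# Illustrative pattern: one whole non-empty line ("." does not match "\n").
line_pattern = re.compile(r"^.+$", re.MULTILINE)

# search() reports only the first match.
first_only = line_pattern.search(data).group()

# findall() returns the text of every match.
all_lines = line_pattern.findall(data)

# finditer() yields match objects, so offsets are available too.
spans = [(m.start(), m.end()) for m in line_pattern.finditer(data)]

print(first_only)  # 476pe
print(all_lines)   # ['476pe', 'e586', '9999', 'rrr', 'ABCF']
print(spans)       # [(0, 5), (6, 10), (11, 15), (16, 19), (20, 24)]
```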

With this basic understanding, let us look at your data again. I believe Rubular has both of these enabled by default. We can see the matches clearly with a capture group, like in this demo, with the RegEx

^([\D]*[0-9]+[\D]*)$

We can find the matches with Python, like this

import re
data = "476pe\ne586\n9999\nrrr\nABCF"  # the five input lines from the question
regex = re.compile(r"^[\D]*[0-9]+[\D]*$", re.MULTILINE)
print(regex.findall(data))
# ['476pe', 'e586', '9999\nrrr\nABCF']

That the pattern matches the first and the second lines should be unsurprising. But the third match might be difficult to understand at first. ^[\D]* means zero or more characters that are not digits, so [\D]* can also match an empty string. At the beginning of 9999, [\D]* matches the empty string before 9999, then [0-9]+ matches the digits 9999, and the rest of the string, up to the very end, is matched by the second [\D]*. It matches the newlines as well, because \D is anything but a digit. Since a newline is not a digit, it gets matched too.
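One way to see this, and to keep each match on a single line, is to exclude the newline from the non-digit class. The following is only a sketch under the assumption of Python 3 and the question's data; the [^\d\n] variant is my illustration, not something proposed in the question.

```python
import re

# Assumption: the question's five strings joined by "\n".
data = "476pe\ne586\n9999\nrrr\nABCF"

# \D accepts a newline, because a newline is not a digit.
newline_is_non_digit = re.fullmatch(r"\D", "\n") is not None

# The question's pattern: the trailing [\D]* can run across line breaks.
original = re.compile(r"^[\D]*[0-9]+[\D]*$", re.MULTILINE)

# Illustrative variant: [^\d\n] is "not a digit and not a newline".
line_bound = re.compile(r"^[^\d\n]*[0-9]+[^\d\n]*$", re.MULTILINE)

print(newline_is_non_digit)      # True
print(original.findall(data))    # ['476pe', 'e586', '9999\nrrr\nABCF']
print(line_bound.findall(data))  # ['476pe', 'e586', '9999']
```

Note that the variant still accepts punctuation and other special characters around the digits.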

Also note that \D allows other special characters as well. Quoting from the Docs,

When the UNICODE flag is not specified, matches any non-digit character; this is equivalent to the set [^0-9]. With UNICODE, it will match anything other than character marked as digits in the Unicode character properties database.
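A quick check of what \D accepts, as a sketch assuming Python 3 (where str patterns use Unicode matching by default). The string "(/3)!§" is the one mentioned in the comments on the question; the other sample is made up.

```python
import re

# Made-up sample containing a letter, punctuation, a newline, a symbol,
# a space and one digit.
sample = "a-\n§ 7"

# Everything except the digit counts as \D.
non_digits = re.findall(r"\D", sample)

# So the question's pattern accepts a string made mostly of punctuation.
punctuated = re.match(r"^[\D]*[0-9]+[\D]*$", "(/3)!§")

print(non_digits)          # ['a', '-', '\n', '§', ' ']
print(punctuated.group())  # (/3)!§
```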

So, you might want to be more explicit like in tobias_k's answer

^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$

This can be used in Python, like this

import re
data = "476pe\ne586\n9999\nrrr\nABCF"  # the five input lines from the question
regex = re.compile(r"^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$", re.MULTILINE)
print(regex.findall(data))
# ['476pe', 'e586', '9999']

Or, if you can break the string into multiple strings, then you can do

import re
data = "476pe\ne586\n9999\nrrr\nABCF"  # the five input lines from the question
regex = re.compile(r"^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$")
print([item for item in data.split() if regex.match(item)])
# ['476pe', 'e586', '9999']

+1 Great explanation. Not sure whether that's intentional, but this will not match 12abc34.
@tobias_k You are correct, Thanks for pointing that out. Edited my answer to use your RegEx :)

tobias_k · Accepted Answer · 2014-04-22 08:10:34Z

3

The problem with your regex is that \D can be anything except a digit, so it will wrongly match strings with special characters in that position, and it will fail to match strings with more than one group of digits.

Instead, try something like ^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$. This will match any number of digits or letters, followed by a digit, followed again by any number of digits or letters.
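The two patterns can be compared side by side. This is a small sketch assuming Python 3; the sample strings are taken from the question and its comments, and each string is tested on its own rather than as part of a multiline string.

```python
import re

original = re.compile(r"^[\D]*[0-9]+[\D]*$")
explicit = re.compile(r"^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$")

samples = ["476pe", "12abc34", "(/3)!§", "rrr"]

# Map each sample to (matched by original, matched by explicit).
results = {s: (bool(original.match(s)), bool(explicit.match(s))) for s in samples}

for text, outcome in results.items():
    print(repr(text), outcome)
# '476pe' (True, True)
# '12abc34' (False, True)
# '(/3)!§' (True, False)
# 'rrr' (False, False)
```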

And here's a demo...

edited Apr 22, 2014 at 8:10

answered Apr 22, 2014 at 7:57


aelor · Accepted Answer · 2014-04-22 08:04:07Z

3

If you are ONLY checking whether a string matches, \d will suffice.

Or use this:

[^a-zA-Z\n]

This will match the first three strings, but not the words that contain only letters.
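A sketch of how the two patterns behave when each word is tested with re.search (Python 3 assumed). The extra word "a-b" is not from the question; I added it to show that [^a-zA-Z\n] looks for any non-letter, not specifically a digit.

```python
import re

# The question's words plus one hypothetical extra entry, "a-b".
words = ["476pe", "e586", "9999", "rrr", "ABCF", "a-b"]

# Keeps a word if it contains at least one digit anywhere.
has_digit = [w for w in words if re.search(r"\d", w)]

# Keeps a word if it contains any character that is not a letter or newline.
has_non_letter = [w for w in words if re.search(r"[^a-zA-Z\n]", w)]

print(has_digit)       # ['476pe', 'e586', '9999']
print(has_non_letter)  # ['476pe', 'e586', '9999', 'a-b']
```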

demo here : http://regex101.com/r/dO1tI1

edited Apr 22, 2014 at 8:04

answered Apr 22, 2014 at 7:58


Roberto Reale · Accepted Answer · 2014-04-22 08:01:02Z

1

The simplest such expression is

[0-9]

This will match each string which contains at least one digit.
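This holds when the pattern is allowed to match anywhere in the string. In Python that means re.search, because re.match only tries at the start of the string. A short sketch, assuming Python 3 and the question's words:

```python
import re

digit = re.compile(r"[0-9]")
words = ["476pe", "e586", "9999", "rrr", "ABCF"]

# search() scans the whole string for a digit.
with_search = [w for w in words if digit.search(w)]

# match() only succeeds if the string starts with a digit.
with_match = [w for w in words if digit.match(w)]

print(with_search)  # ['476pe', 'e586', '9999']
print(with_match)   # ['476pe', '9999']
```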

answered Apr 22, 2014 at 8:01


2 Comments

aelor Over a year ago

awww happy answer, why have you escaped the + ??

aelor Over a year ago

you dont even need that +, if you look closer :)

Pedro Lobito · Accepted Answer · 2014-04-22 08:27:51Z

1

Try this instead:

(?im)^[a-z0-9]+$

EXAMPLE:

import re
subject = "476pe"  # sample input, chosen here for illustration
if re.search(r"^[a-z0-9]+$", subject, re.IGNORECASE | re.MULTILINE):
    print("Successful match")
else:
    print("Match attempt failed")

DEMO:

http://regex101.com/r/oG2wP1

answered Apr 22, 2014 at 8:27


3 Comments

tobias_k Over a year ago

This matches strings with no numbers, which is not what OP wanted.

Pedro Lobito Over a year ago

@tobias_k "...that will match the numbers and numbers with alphabets..."

tobias_k Over a year ago

"From the above strings i have to match only ..." ("numbers and numbers with alphabets", but no alphabets with no numbers)
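The point raised in these comments can be reproduced, and one possible way to require a digit is a lookahead. This is only a sketch assuming Python 3 and the question's data; the lookahead variant is my own illustration and was not proposed in this thread.

```python
import re

# Assumption: the question's five strings joined by "\n".
data = "476pe\ne586\n9999\nrrr\nABCF"
flags = re.IGNORECASE | re.MULTILINE

# The pattern from this answer: any line of letters and digits.
alnum_only = re.compile(r"^[a-z0-9]+$", flags)

# Illustrative variant: the lookahead demands a digit somewhere on the line
# before the same character class consumes it.
needs_digit = re.compile(r"^(?=[a-z]*[0-9])[a-z0-9]+$", flags)

print(alnum_only.findall(data))   # ['476pe', 'e586', '9999', 'rrr', 'ABCF']
print(needs_digit.findall(data))  # ['476pe', 'e586', '9999']
```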

Kei Minagawa · Accepted Answer · 2014-04-22 08:25:10Z

0

You might want this.

^(\D*\d+\D*)+$

Be careful: \D is equivalent to [^0-9], so it also matches punctuation, whitespace and newlines.
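A quick sketch of what this pattern accepts, assuming Python 3 and testing each string on its own. The sample strings come from the question and its comments.

```python
import re

repeated = re.compile(r"^(\D*\d+\D*)+$")

samples = ["9999", "12abc34", "(/3)!§", "rrr"]

# True where the pattern matches the whole string.
checks = {s: bool(repeated.match(s)) for s in samples}

print(checks)
# {'9999': True, '12abc34': True, '(/3)!§': True, 'rrr': False}
```

The repeated group handles several runs of digits, but special characters are still accepted because of \D.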

answered Apr 22, 2014 at 8:25
