1

I have a series of words, as given below

476pe
e586
9999
rrr
ABCF

I have to write a regular expression that matches numbers and numbers mixed with letters. From the strings above, it should match only

476pe
e586
9999

I tried to write a regular expression as given below

^[\D]*[0-9]+[\D]*$

But it doesn't work. I tried it with an online regex tool, http://rubular.com/r/HQE2vG0pbu, and it shows the entire input matching.

6
  • 1
    That regular expression, while inelegant, should work in principle (if your goal is to match any string that contains exactly one number (with any number of digits)). How are you using it? Commented Apr 22, 2014 at 7:53
  • In other words, should a string like "12a34" match or not? Also, do you really want to match strings like "(/3)!§" (which your regex currently matches)? Commented Apr 22, 2014 at 7:55
  • Also, what is your regex matching that you don't want it to match? What is it failing to match that you do want it to match? Commented Apr 22, 2014 at 7:56
  • Tim, i was trying it with an online regex tool, rubular.com/r/HQE2vG0pbu it shows all strings are matching Commented Apr 22, 2014 at 7:56
  • 1
    but it didn't works. What do you mean by that? Help us to help you. Commented Apr 22, 2014 at 7:57

6 Answers

5

Since other answers have given a lot of ways to solve your problem, let me try to explain the behavior you witnessed.

First of all, Rubular is specific to Ruby's regular expression semantics. (I don't have exact information on how Ruby's and Python's RegEx engines differ.) Since you have tagged the question python, you might want to use regex101 or debuggex instead. I'll be using both of these to explain.

Now, let us look at your actual RegEx and the data, here. Your input string is like this

476dn
e586
9999
rrr
ABCF

The regular expression engine can see this input in two ways: as one long string with newlines in it, or as a list of strings separated by newlines. We can control this behavior with a RegEx flag known as the multiline flag (in Python, re.MULTILINE or re.M). Quoting from the Python docs,

re.M

re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

For example, in our case, if this flag is NOT enabled, the input string will be treated as a long string with newlines in it and ^ will match the position before 4 in the first line, $ will match the position after F in the last line.

When that flag is enabled, ^ can match the position before the first character of any line, and $ can match the position after the last character of any line. So, they can match the following pairs of positions

  • when ^ is the position before 4, $ will be the position after n
  • when ^ is the position before 4, $ will be the position after 6
  • when ^ is the position before 4, $ will be the position after 9
  • when ^ is the position before 4, $ will be the position after r
  • when ^ is the position before 4, $ will be the position after F

  • when ^ is the position before e, $ will be the position after 6
  • when ^ is the position before e, $ will be the position after 9
  • when ^ is the position before e, $ will be the position after r
  • when ^ is the position before e, $ will be the position after F

  • when ^ is the position before 9, $ will be the position after 9
  • when ^ is the position before 9, $ will be the position after r
  • when ^ is the position before 9, $ will be the position after F

  • when ^ is the position before r, $ will be the position after r
  • when ^ is the position before r, $ will be the position after F

  • when ^ is the position before A, $ will be the position after F
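
The difference between the two modes can be checked directly. This is only a minimal sketch in Python 3; the variable names and the simple pattern ^[0-9a-zA-Z]+$ are chosen for illustration and are not part of the question.

```python
import re

data = "476pe\ne586\n9999\nrrr\nABCF"

# Without re.MULTILINE, ^ and $ only anchor at the ends of the whole string,
# and the character class cannot cross a newline, so nothing matches.
single = re.findall(r"^[0-9a-zA-Z]+$", data)

# With re.MULTILINE, ^ and $ also anchor at every line boundary.
multi = re.findall(r"^[0-9a-zA-Z]+$", data, re.MULTILINE)

print(single)  # []
print(multi)   # ['476pe', 'e586', '9999', 'rrr', 'ABCF']
```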

Since the anchors can match at multiple positions, we have to tell the RegEx engine explicitly to look for every match when we use multiline strings. In Python, we can use re.findall or re.finditer for that. In the RegEx world, this is normally represented by the flag g (search globally).
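
As a rough illustration of "global" searching in Python 3, the sketch below compares search, which stops at the first match, with finditer, which keeps scanning. The pattern is the explicit one used later in this answer, and the variable names are only for illustration.

```python
import re

data = "476pe\ne586\n9999\nrrr\nABCF"
pattern = re.compile(r"^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$", re.MULTILINE)

# search() returns only the first match it finds.
first = pattern.search(data).group()

# finditer() walks the whole input and yields one match object per match.
spans = [(m.group(), m.span()) for m in pattern.finditer(data)]

print(first)  # 476pe
print(spans)  # [('476pe', (0, 5)), ('e586', (6, 10)), ('9999', (11, 15))]
```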

With this basic understanding, let us look at your data again. I believe Rubular has both of these enabled by default. We can see the matches clearly with a capture group, like in this demo, with the RegEx

^([\D]*[0-9]+[\D]*)$

We can find the matches with Python, like this

import re
data = "476pe\ne586\n9999\nrrr\nABCF"  # the five input lines as one string
regex = re.compile(r"^[\D]*[0-9]+[\D]*$", re.MULTILINE)
print(regex.findall(data))
# ['476pe', 'e586', '9999\nrrr\nABCF']

That the given pattern matches the first and second lines should be unsurprising. But the third match might be difficult to understand at first. ^[\D]* means zero or more characters that are not digits, so [\D]* can also match an empty string. At the beginning of 9999, [\D]* matches the empty string before 9999, then [0-9]+ matches the digits 9999, and the rest of the string up to the end is matched by the second [\D]*. It matches the newlines as well, because \D matches anything but a digit. Since a newline is not a digit, it gets matched too.
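
Both halves of that explanation can be verified in isolation. The following is just a small Python 3 sketch with illustrative variable names: it applies [\D]* on its own to the start of 9999 and to the text that follows it.

```python
import re

# [\D]* may match zero characters, so it succeeds with an empty match before "9999".
leading = re.match(r"[\D]*", "9999").group()
print(repr(leading))   # ''

# A newline is a non-digit, so [\D]* runs straight across line breaks.
trailing = re.match(r"[\D]*", "\nrrr\nABCF").group()
print(repr(trailing))  # '\nrrr\nABCF'
```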

Also note that \D allows other special characters as well. Quoting from the Docs,

When the UNICODE flag is not specified, matches any non-digit character; this is equivalent to the set [^0-9]. With UNICODE, it will match anything other than character marked as digits in the Unicode character properties database.

So, you might want to be more explicit like in tobias_k's answer

^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$

This can be used in Python, like this

regex = re.compile(r"^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$", re.MULTILINE)
print(regex.findall(data))
# ['476pe', 'e586', '9999']

Or, if you can break the string into multiple strings, then you can do

regex = re.compile(r"^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$")
print([item for item in data.split() if regex.match(item)])
# ['476pe', 'e586', '9999']

2 Comments

+1 Great explanation. Not sure whether that's intentional, but this will not match 12abc34.
@tobias_k You are correct, Thanks for pointing that out. Edited my answer to use your RegEx :)
3

The problem with your regex is that \D can be anything except a digit, so it will wrongly match strings with special characters in those positions, and it will fail to match strings with more than one group of digits.

Instead, try something like ^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$. This matches any number of digits or letters, followed by a digit, followed again by any number of digits or letters.

And here's a demo...
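
A quick way to compare the two patterns locally is sketched below in Python 3. The sample strings are picked for illustration (12abc34 and (/3)! come from the comments on this page), and the variable names are hypothetical.

```python
import re

original = re.compile(r"^[\D]*[0-9]+[\D]*$")
explicit = re.compile(r"^[0-9a-zA-Z]*[0-9][0-9a-zA-Z]*$")

samples = ["476pe", "12abc34", "(/3)!", "rrr"]

# For each sample: (matched by the original pattern, matched by the explicit pattern)
results = {s: (bool(original.match(s)), bool(explicit.match(s))) for s in samples}

for s, (orig_ok, explicit_ok) in results.items():
    print(s, orig_ok, explicit_ok)
# 476pe True True
# 12abc34 False True
# (/3)! True False
# rrr False False
```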

3

If you are ONLY checking whether there is a match, \d will suffice.

Or use this:

[^a-zA-Z\n]

This will find a match in the first three strings, but not in the words that consist only of letters.

demo here : http://regex101.com/r/dO1tI1
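
In Python 3 this could be tried roughly as follows. It is only a sketch: the list name and the extra input "a-b" are hypothetical, added to show that any non-letter character, not just a digit, satisfies this character class.

```python
import re

words = ["476pe", "e586", "9999", "rrr", "ABCF"]
non_letter = re.compile(r"[^a-zA-Z\n]")

# re.search succeeds as soon as one non-letter, non-newline character is found.
kept = [w for w in words if non_letter.search(w)]
print(kept)  # ['476pe', 'e586', '9999']

# Caveat: punctuation is also "not a letter", so this hypothetical input passes too.
punctuated = bool(non_letter.search("a-b"))
print(punctuated)  # True
```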

1

The simplest such expression is

[0-9]

This will match each string which contains at least one digit.
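
Note that in Python this only works with a function that scans the string, such as re.search. A minimal Python 3 sketch of the difference, with illustrative variable names:

```python
import re

digit = re.compile(r"[0-9]")

# re.search scans the whole string for a digit.
found_anywhere = bool(digit.search("e586"))
# re.match only tries at position 0, where "e586" has a letter.
found_at_start = bool(digit.match("e586"))
# A string without digits is rejected either way.
found_in_letters = bool(digit.search("rrr"))

print(found_anywhere, found_at_start, found_in_letters)  # True False False
```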

2 Comments

awww happy answer, why have you escaped the + ??
you dont even need that +, if you look closer :)
1

Try this instead:

(?im)^[a-z0-9]+$

EXAMPLE:

import re
subject = "476pe"  # hypothetical input; replace with your own string
if re.search(r"^[a-z0-9]+$", subject, re.IGNORECASE | re.MULTILINE):
    print("Successful match")
else:
    print("Match attempt failed")

DEMO:

http://regex101.com/r/oG2wP1
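
To see what this pattern accepts on the question's data, here is a small Python 3 sketch (variable names are illustrative). It also shows that the inline (?im) prefix behaves like passing the two flags explicitly.

```python
import re

data = "476pe\ne586\n9999\nrrr\nABCF"

# (?im) sets IGNORECASE and MULTILINE inline, equivalent to passing the flags.
inline = re.findall(r"(?im)^[a-z0-9]+$", data)
flagged = re.findall(r"^[a-z0-9]+$", data, re.IGNORECASE | re.MULTILINE)

print(inline == flagged)  # True
print(inline)  # ['476pe', 'e586', '9999', 'rrr', 'ABCF']
```

Every line matches, including the ones made up only of letters.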

3 Comments

This matches strings with no numbers, which is not what OP wanted.
@tobias_k "...that will match the numbers and numbers with alphabets..."
"From the above strings i have to match only ..." ("numbers and numbers with alphabets", but no alphabets with no numbers)
0

You might want this.

^(\D*\d+\D*)+$

Be careful: \D is equivalent to [^0-9], so it also matches punctuation, whitespace, and newlines.
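
A short Python 3 sketch of what that means in practice. The sample strings are illustrative (two of them are taken from the comments under the question), and the variable names are hypothetical.

```python
import re

pattern = re.compile(r"^(\D*\d+\D*)+$")
samples = ["12a34", "(/3)!", "rrr"]

# The repeated group allows several runs of digits, but \D still lets
# special characters through; only strings without any digit are rejected.
results = {s: bool(pattern.match(s)) for s in samples}
print(results)  # {'12a34': True, '(/3)!': True, 'rrr': False}
```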
