pandas: Validating data frame cells using regular expressions

Question

I have a non-indexed data frame with over 50,000 rows, read from a CSV file, that looks like this:

John   Mullen  12/08/1993  Passw0rd
Lisa   Bush    06/12/1990  myPass12
Maria  Murphy  30/03/1989  qwErTyUi
Seth   Black   21/06/1991  LoveXmas

I would like to validate each cell of each row against a specific regular expression:

validate the password with the PassRegex below
validate the first name and last name with the NameRegex below
etc...

and then move the rows where any of the cells do not validate to a new data frame.

import re
PassRegex = re.compile(r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$")
NameRegex = re.compile(r"^[a-zA-Z0-9\s\-]{2,80}$")

For example, the rows below fail the PassRegex check, so I want to move them to a separate data frame:

Maria  Murphy  30/03/1989  qwErTyUi
Seth   Black   21/06/1991  LoveXmas

Is there a way to do this without iterating through the whole data frame row by row, and cell by cell?

Any help is much appreciated.

EdChum · Accepted Answer · 2015-11-30 15:44:21Z

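You can pass each regex to str.contains, which returns a boolean Series (regex=True is the default and is spelled out here only for explicitness):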

In [36]:
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
df[(df['password'].str.contains(passRegex, regex=True)) & (df['first'].str.contains(nameRegex, regex=True)) & (df['last'].str.contains(nameRegex, regex=True))]

Out[36]:
  first    last         dob  password
0  John  Mullen  12/08/1993  Passw0rd
1  Lisa    Bush  06/12/1990  myPass12

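This keeps only the rows of interest: each condition produces a boolean mask, and & combines the masks element-wise. The parentheses around each condition guard against operator-precedence surprises; they become mandatory when a condition is a comparison such as df['x'] > 1, because & binds more tightly than the comparison operators.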

The output from each condition:

In [37]:
df['password'].str.contains(passRegex, regex=True)

Out[37]:
0     True
1     True
2    False
3    False
Name: password, dtype: bool

In [38]:
df['first'].str.contains(nameRegex, regex=True)

Out[38]:
0    True
1    True
2    True
3    True
Name: first, dtype: bool

In [39]:
df['last'].str.contains(nameRegex, regex=True)

Out[39]:
0    True
1    True
2    True
3    True
Name: last, dtype: bool

And then when we combine them:

In [40]:
(df['password'].str.contains(passRegex, regex=True)) & (df['first'].str.contains(nameRegex, regex=True)) & (df['last'].str.contains(nameRegex, regex=True))

Out[40]:
0     True
1     True
2    False
3    False
dtype: bool
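
The question also asks for the rejected rows in their own data frame, which the snippets above do not show. A minimal sketch of one way to get them, assuming the column names from the output above: build the combined mask once, then invert it with ~. The patterns mapping and the names valid, valid_df and invalid_df are hypothetical, introduced only for illustration, and na=False (missing cells count as failures) is an assumption about the desired behaviour.

```python
import pandas as pd

# Sample rows from the question; column names follow the answer's output.
df = pd.DataFrame(
    {
        "first": ["John", "Lisa", "Maria", "Seth"],
        "last": ["Mullen", "Bush", "Murphy", "Black"],
        "dob": ["12/08/1993", "06/12/1990", "30/03/1989", "21/06/1991"],
        "password": ["Passw0rd", "myPass12", "qwErTyUi", "LoveXmas"],
    }
)

passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"

# Hypothetical mapping of column name -> pattern; extend it for other columns.
patterns = {"password": passRegex, "first": nameRegex, "last": nameRegex}

# Start with every row marked valid, then AND in one boolean mask per column.
# na=False treats missing cells as failing validation (an assumption).
valid = pd.Series(True, index=df.index)
for column, pattern in patterns.items():
    valid &= df[column].str.contains(pattern, regex=True, na=False)

valid_df = df[valid]     # rows where every checked cell matched
invalid_df = df[~valid]  # rows where at least one checked cell failed
```

With the sample data, invalid_df should hold the Maria and Seth rows, since their passwords contain no digit. The loop over columns is still vectorised within each column, so no row-by-row iteration is involved.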

3 Comments

forsaken Over a year ago

Hey @EdChum, sorry to be coming so late to this question, but if I want to get this regex from a text file and use it in my code, then how would we do that? Substitution with %s is not working for me :(

EdChum Over a year ago

@ManasJani Please post a new question with your raw data, code to recreate your df, the desired output, and your attempts. Answering questions via comments is counter-productive.

forsaken Over a year ago

Already posted this question a few days ago, stackoverflow.com/questions/53711128/…
