I have a data frame (default index, no custom one) with over 50,000 rows, read from a CSV file as follows:

John   Mullen  12/08/1993  Passw0rd
Lisa   Bush    06/12/1990  myPass12
Maria  Murphy  30/03/1989  qwErTyUi
Seth   Black   21/06/1991  LoveXmas

I would like to validate each cell of each row against a specific regular expression:

  • validate the password with the PassRegex below
  • validate the first name and last name with the NameRegex below
  • etc...

and then move the rows where any of the cells do not validate to a new data frame.

import re
PassRegex = re.compile(r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$")
NameRegex = re.compile(r"^[a-zA-Z0-9\s\-]{2,80}$")

For example, in this case the rows below would fail the PassRegex (neither password contains a digit), so I want to move them to a separate data frame:

Maria  Murphy  30/03/1989  qwErTyUi
Seth   Black   21/06/1991  LoveXmas

Is there a way to do this without iterating through the whole data frame row by row, and cell by cell?

Any help is much appreciated.

1 Answer

You can pass each regex to the vectorized str.contains (here the columns are assumed to be named first, last, dob and password):

In [36]:
passRegex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
nameRegex = r"^[a-zA-Z0-9\s\-]{2,80}$"
df[(df['password'].str.contains(passRegex, regex=True)) & (df['first'].str.contains(nameRegex, regex=True)) & (df['last'].str.contains(nameRegex, regex=True))]

Out[36]:
  first    last         dob  password
0  John  Mullen  12/08/1993  Passw0rd
1  Lisa    Bush  06/12/1990  myPass12

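This keeps only the rows of interest: each condition produces a boolean mask, and & combines the masks element-wise. Wrapping each condition in parentheses is a safe habit, because & binds more tightly than comparison operators such as == or >=. The parentheses are not strictly required around plain method calls like these.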
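To illustrate where the parentheses actually matter, here is a minimal sketch. It is not part of the original answer: the small frame is rebuilt by hand from the question's sample rows, and simple length checks stand in for the regexes, because the precedence problem only shows up when a condition contains a comparison operator.

```python
import pandas as pd

# Sample rows from the question; the column names are assumed.
df = pd.DataFrame(
    {
        "first": ["John", "Lisa", "Maria", "Seth"],
        "password": ["Passw0rd", "myPass12", "qwErTyUi", "LoveXmas"],
    }
)

# Parenthesized: each comparison yields a boolean mask, then & combines them.
ok = (df["first"].str.len() >= 2) & (df["password"].str.len() >= 8)

# Unparenthesized: & binds first, so Python evaluates 2 & <Series> and then
# a chained comparison, which needs the truth value of a whole Series.
raised = False
try:
    df["first"].str.len() >= 2 & df["password"].str.len() >= 8
except ValueError:
    raised = True

print(ok.tolist(), raised)
```

With method calls such as str.contains on both sides of &, as in the answer, the expression is parsed the same way with or without the parentheses.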

The output from each condition:

In [37]:
df['password'].str.contains(passRegex, regex=True)

Out[37]:
0     True
1     True
2    False
3    False
Name: password, dtype: bool

In [38]:
df['first'].str.contains(nameRegex, regex=True)

Out[38]:
0    True
1    True
2    True
3    True
Name: first, dtype: bool

In [39]:
df['last'].str.contains(nameRegex, regex=True)

Out[39]:
0    True
1    True
2    True
3    True
Name: last, dtype: bool

And then when we combine them:

In [40]:
(df['password'].str.contains(passRegex, regex=True)) & (df['first'].str.contains(nameRegex, regex=True)) & (df['last'].str.contains(nameRegex, regex=True))

Out[40]:
0     True
1     True
2    False
3    False
dtype: bool
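
The question also asks for the failing rows to be moved to a separate data frame. One way to get there, sketched under assumptions, is to negate the combined mask with ~. The rules mapping and the names valid_df and invalid_df are hypothetical, the frame is rebuilt by hand from the question's sample, and na=False is an added choice so that missing cells count as failures rather than producing NaN in the mask.

```python
import pandas as pd

# Sample rows from the question; the column names are assumed.
df = pd.DataFrame(
    {
        "first": ["John", "Lisa", "Maria", "Seth"],
        "last": ["Mullen", "Bush", "Murphy", "Black"],
        "dob": ["12/08/1993", "06/12/1990", "30/03/1989", "21/06/1991"],
        "password": ["Passw0rd", "myPass12", "qwErTyUi", "LoveXmas"],
    }
)

pass_regex = r"^(?!.*\s)(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,50}$"
name_regex = r"^[a-zA-Z0-9\s\-]{2,80}$"

# Hypothetical mapping: column name -> pattern that column must satisfy.
rules = {"first": name_regex, "last": name_regex, "password": pass_regex}

# One boolean column per rule; na=False marks missing cells as invalid.
checks = pd.DataFrame(
    {col: df[col].str.contains(pat, regex=True, na=False) for col, pat in rules.items()}
)

# A row is valid only if every checked cell matched its pattern.
valid_mask = checks.all(axis=1)

valid_df = df[valid_mask]      # rows where every rule passed
invalid_df = df[~valid_mask]   # rows where at least one rule failed

print(invalid_df)
```

Keeping the per-column results in checks also makes it possible to see which column caused a row to fail, and the whole thing stays vectorized, with no explicit loop over rows or cells.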
3 Comments

Hey @EdChum, sorry to be coming so late to this question, but if I want to get this regex from a text file and use it in my code, then how would we do that? Substitution with %s is not working for me :(
@ManasJani Please post a new question with your raw data, code to recreate your df, the desired output, and your attempts. Answering questions via comments is counter-productive.
Already posted this question a few days ago, stackoverflow.com/questions/53711128/…
