I have a pandas DataFrame with a column from which I need to retrieve specific names. The problem is that those names are not always in the same position, and the values in that column do not all have the same length, so I cannot use the split function. However, I have noticed that those names are always preceded by a combination of 4 to 7 digits, which I believe is the identifier for the name.
So how can I use a regular expression to go through that column and retrieve the names I need? Here is an example from the Jupyter notebook:

 df['info']
 csx_Gb009_broken screen_231400_Iphone 7
 000345_SamsungS8_tfes_Vodafone_is56t34_3G
 Ins45_56003_Huawei P8_

What I want is something like this:

 df['Phones']
 Iphone 7
 SamsungS8
 Huawei P8

I want something like the above, knowing that those names come after a combination of 4 to 7 digits and an underscore, and end at the next underscore or at the end of the string.

1 Answer

You may use Series.str.extract with a single capturing group:

df['Phones'] = df['info'].str.extract(r'\d{4}_([^_]+)')

The pattern matches:

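  • \d{4} - 4 digits (the pattern is unanchored, so in a longer run it matches the last 4 digits before the underscore, which covers identifiers of 4 to 7 digits)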
  • _ - an underscore
  • ([^_]+) - Capturing group 1 (this value will be returned by str.extract): one or more chars other than _.
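To check the pattern breakdown above against the rows from the question, here is a minimal sketch. The DataFrame is rebuilt by hand from the three sample values, and expand=False is an addition here (not part of the original one-liner) so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "info": [
        "csx_Gb009_broken screen_231400_Iphone 7",
        "000345_SamsungS8_tfes_Vodafone_is56t34_3G",
        "Ins45_56003_Huawei P8_",
    ]
})

# The first match of "4 digits + underscore" is used; group 1 holds the name.
# expand=False (an assumption of this sketch) yields a Series.
df["Phones"] = df["info"].str.extract(r"\d{4}_([^_]+)", expand=False)

print(df["Phones"].tolist())
```

Rows with no match would get NaN in the new column, so it may be worth checking df["Phones"].isna() on real data.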

See the regex demo.
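Note that \d{4}_ accepts any run of 4 or more digits, including runs longer than 7. If the 4 to 7 limit from the question has to be enforced, one possible variant (not part of the answer above) adds a lookbehind and an explicit range. The sketch below compares both patterns with the standard re module; the helper first_name and the 8-digit sample string are made up for illustration.

```python
import re

# Pattern from the answer: any 4 digits directly before an underscore
LOOSE = re.compile(r"\d{4}_([^_]+)")
# Hypothetical stricter variant: the whole digit run must be 4 to 7 digits long
STRICT = re.compile(r"(?<!\d)\d{4,7}_([^_]+)")

def first_name(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None

samples = [
    "csx_Gb009_broken screen_231400_Iphone 7",
    "000345_SamsungS8_tfes_Vodafone_is56t34_3G",
    "Ins45_56003_Huawei P8_",
]
loose_results = [first_name(LOOSE, s) for s in samples]
strict_results = [first_name(STRICT, s) for s in samples]

# Made-up row with an 8-digit identifier, to show where the two patterns differ
long_id = "12345678_Nokia 3310"
loose_long = first_name(LOOSE, long_id)
strict_long = first_name(STRICT, long_id)

print(loose_results, strict_results, loose_long, strict_long)
```

The same STRICT pattern string can be passed to str.extract in place of the original one; on the three rows from the question both patterns give the same names.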