From the following data frame:

import pandas as pd

d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

My goal is to extract the letters a, b or c (as strings) into a pandas Series. For that I am using the .str.findall() method, which mirrors re.findall(), as shown below:

# import the module
import re
# define the patterns
pat = 'a|b|c'

# extract the patterns from the elements in the specified column
df['col1'].str.findall(pat)

The problem is that the output for each row (the letter a, b or c) is wrapped in a single-element list, as shown below:

Out[301]: 
0    [a]
1    [b]
2    [c]
3    [a]

Instead, I would like to get the letters a, b or c as plain strings, as shown below:

0    a
1    b
2    c
3    a

I know that if I combine re.search() with .group() I can get a string, but if I do:

df['col1'].str.search(pat).group()

I will get the following error message:

AttributeError: 'StringMethods' object has no attribute 'search'

Using .str.split() won't do the job because, in my original dataframe, the strings I want to capture might themselves contain the delimiter (e.g. I might want to capture a-b).

Does anyone know a simple solution for that, perhaps avoiding iterative operations such as a for loop or list comprehension?

3 Answers

Use str.extract with a capturing group:

import pandas as pd

d = {'col1':['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}

df = pd.DataFrame.from_dict(d)

result = df['col1'].str.extract('(a|b|c)')

print(result)

Output

   0
0  a
1  b
2  c
3  a
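
Note that str.extract returns a DataFrame by default, which is why the output above has a column labelled 0. A minimal sketch, assuming a Series is what is wanted (as in the question): with a single capturing group, passing expand=False returns a Series instead. The variable name letters is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']})

# With one capturing group and expand=False, str.extract returns a Series
# holding the first match in each row (NaN where nothing matches).
letters = df['col1'].str.extract('(a|b|c)', expand=False)

print(type(letters).__name__)
print(letters.tolist())
```

This is the closest vectorized counterpart to re.search(pat).group() applied row by row, without an explicit loop.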

Fix your code by taking the first element of each list with .str[0]:

pat = 'a|b|c'
df['col1'].str.findall(pat).str[0]
Out[309]: 
0    a
1    b
2    c
3    a
Name: col1, dtype: object
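
One caveat, sketched under the assumption that some rows may contain no match at all (the row 'x-999' is made up for illustration): findall returns an empty list for such a row, and .str[0] then yields NaN rather than raising an error.

```python
import pandas as pd

# 'x-999' is a hypothetical row that matches none of a, b or c
s = pd.Series(['a-1524112-124', 'b-1515', 'x-999'])

matches = s.str.findall('a|b|c')   # one list of matches per row
first = matches.str[0]             # first element of each list, NaN if empty

print(matches.tolist())
print(first.isna().tolist())
```

If missing values are not acceptable downstream, they can be handled afterwards, for example with fillna or dropna.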


Simply try str.split() and keep the part before the first delimiter, like this: df["col1"].str.split("-", n=1, expand=True)[0]

import pandas as pd
d = {'col1': ['a-1524112-124', 'b-1515', 'c-584854', 'a-15154']}
df = pd.DataFrame.from_dict(d)
# expand=True returns a two-column DataFrame; column 0 holds the text before the first "-"
df['col1'] = df["col1"].str.split("-", n=1, expand=True)[0]
print(df.head())

Output:

  col1
0    a
1    b
2    c
3    a

4 Comments

This does work for the sample data, but in my original df splitting won't do the job, because what I want to capture might contain the - symbol, i.e. I also want to capture something like a-b.
@BCArg then edit your question and let us know more about the possible values of your col1.
@BCArg how does df['col1'].str.findall(pat).str[0] capture a-b?
It will if I include it in the pattern. In my original data frame I only have a handful of parameters that I want to capture, so I don't need sophisticated regular expressions.
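
The point about capturing a-b can be made concrete with a short sketch that uses only the re module. It assumes the tokens to capture are a-b, a, b and c (an illustrative set, not the asker's real parameters). Python's regex alternation tries alternatives from left to right, so the longer token has to be listed before the shorter ones it starts with.

```python
import re

text = 'a-b-1524112'

# Alternatives are tried left to right at each position,
# so listing 'a-b' first lets it win over the shorter 'a'.
longest_first = re.search('a-b|a|b|c', text).group()
shortest_first = re.search('a|b|c|a-b', text).group()

print(longest_first)   # a-b
print(shortest_first)  # a
```

The same ordering applies when the pattern is passed to .str.findall, or wrapped in a capturing group for .str.extract.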
