
I have a DataFrame column holding variable-length, comma-separated text, and I am trying to extract the values that also appear in another list. My DataFrame looks like this:

col1 | col2
-----------
 x   | a,b


import re

listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)

def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern
    else:
        return x

#also can use col2.str.contains(pattern) for same results

The filtering above works, but when it finds a match it returns the whole pattern (such as a|b) instead of just the matched value, b. I want to create another column containing the value that was actually matched, such as b.

Here is my final function. It still raises a warning that I would like to get rid of: "UserWarning: This pattern has match groups. To actually get the groups, use str.extract."

def matching_func(fin, fin1):
    # fin: path to the CSV holding the values to search for
    # fin1: path to the Excel file to search in
    file1 = pd.read_csv(fin)
    file2 = pd.read_excel(fin1, 0, skiprows=1)
    pattern = '|'.join(file1[col1].tolist())
    file2['new_col'] = file2[col1].map(lambda x: re.search(pattern, x).group()
                                       if re.search(pattern, x) else None)

I think I understand how pandas extract works now, but I am probably still rusty on regex. How do I create a pattern variable to use in the example below?

df[col1].str.extract('(word1|word2)')

Instead of hard-coding the words in the argument, I want to build a variable such as pattern = 'word1|word2', but that won't work because of the way the string is being created.

My final and preferred version with vectorized string method in pandas 0.13:

Using values from one column to extract from a second column:

df[col1].str.extract('({})'.format('|'.join(df[col2])))

Try re.search(pattern, x).group(0) instead. Commented Mar 28, 2014 at 3:26

1 Answer


You might like to use extract, or one of the other vectorised string methods:

In [11]: s = pd.Series(['a', 'a,b'])

In [12]: s.str.extract('([cdfb])')
Out[12]:
0    NaN
1      b
dtype: object
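
The session above shows a Series coming back from extract. In more recent pandas versions, str.extract returns a DataFrame by default, and passing expand=False gives a Series for a single capture group. A minimal sketch of the same call under that assumption (the variable name matched is only for illustration):

```python
import pandas as pd

s = pd.Series(['a', 'a,b'])

# One capture group around a character class: the first of c, d, f or b
# found in each string is returned; rows with no match become NaN.
# expand=False asks for a Series rather than a one-column DataFrame.
matched = s.str.extract('([cdfb])', expand=False)

print(matched)
```

Note that extract only returns the first match per row. If a row can contain several of the listed values, str.findall or str.extractall may be closer to what is needed.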

8 Comments

extract seems great. How would I use it, though, if I am getting the string matches from another DataFrame column? In other words, for my function above I did '|'.join(df[col1].tolist()) to get my pattern.
Any idea how I can get rid of this message from my program? "UserWarning: This pattern has match groups. To actually get the groups, use str.extract."
@prometheus2305 yup, put parentheses around what you're trying to find (as in my example) :)
@prometheus2305 a DataFrame column is just a Series, so you can do df[col1].str.extract('([cdfb])').
@prometheus2305 I think you're looking for '(%s)' % '|'.join(patterns) where patterns = ['word1', 'word2'] ?
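
Putting the comments above together, here is a rough sketch of building the pattern from a list (or a column) and wrapping it in a capture group for extract. The DataFrame, the column names and the words list are made up for illustration. The re.escape call is an addition that is not in the original discussion: it assumes the listed values should be matched literally, which also keeps stray parentheses in the data from being read as groups.

```python
import re
import pandas as pd

# Hypothetical data standing in for the question's two sources.
words = ['c', 'd', 'f', 'b']
df = pd.DataFrame({'col1': ['x', 'y'], 'col2': ['a,b', 'a,e']})

# Escape each value so regex metacharacters are treated as plain text,
# then join the values into an alternation: c|d|f|b
alternation = '|'.join(re.escape(w) for w in words)

# str.extract needs a capture group, so wrap the alternation in parentheses.
extract_pattern = '({})'.format(alternation)
df['new_col'] = df['col2'].str.extract(extract_pattern, expand=False)

# str.contains warns when the pattern has capture groups; a non-capturing
# group (?:...) keeps the grouping without triggering that warning.
contains_pattern = '(?:{})'.format(alternation)
df['has_match'] = df['col2'].str.contains(contains_pattern)

print(df)
```

To take the values from another column instead of a list, something like words = other_df[some_col].dropna().astype(str).tolist() should work, where other_df and some_col are placeholders.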