Pandas - identifying dataframe values that start with value in a list

Question

Say I have the following dataframe:

>>> import pandas as pd
>>> d=pd.DataFrame()
>>> d['A']=['12345','12354','76','4']
>>> d['B']=['4442','2345','33','5']
>>> d['C']=['5553','4343','33','5']
>>> d
       A     B     C
0  12345  4442  5553
1  12354  2345  4343
2     76    33    33
3      4     5     5

And say I have 2 values of interest:

>>> vals=['123','76']

I am interested in determining which values in my dataframe start with any of the values in my list. There are 3 cases in my example: (0,A) starts with 123; (1,A) starts with 123; and (2,A) starts with 76.

Is there a way I can do this without looping through each of my values?

If I were interested in matching values exactly I could just do:

>>> d.isin(vals)
       A      B      C
0  False  False  False
1  False  False  False
2   True  False  False
3  False  False  False
>>>

And if I were interested in whether the values start with one particular value, I could do:

>>> d.applymap(lambda x:x.startswith('123'))
       A      B      C
0   True  False  False
1   True  False  False
2  False  False  False
3  False  False  False
>>>

But how can I combine these two to find any value that starts with any value in my list?

you can do df.apply(lambda x: x.str.contains('|'.join(vals)))
– EdChum, Commented Oct 14, 2016 at 21:05

EdChum · Accepted Answer · 2016-11-30 23:12:37Z

3

You can construct a regex pattern and test each column in turn using apply with a lambda calling str.contains:

In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.apply(lambda x: x.str.contains('|'.join(v)))

Out[9]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

The resulting regex pattern:

In [10]:
'|'.join(v)

Out[10]:
'^123|^76'
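One caveat with joining raw strings into a pattern: if a prefix ever contains a regex metacharacter (for example '.' or '+'), it would be interpreted as regex syntax. A minimal sketch, assuming the same d and vals as in the question, that passes each prefix through re.escape first (the variable names pattern and result are only illustrative):

```python
import re
import pandas as pd

d = pd.DataFrame({'A': ['12345', '12354', '76', '4'],
                  'B': ['4442', '2345', '33', '5'],
                  'C': ['5553', '4343', '33', '5']})
vals = ['123', '76']

# re.escape leaves plain digits untouched but neutralises characters
# such as '.' or '+', so every prefix is matched literally
pattern = '|'.join('^' + re.escape(v) for v in vals)

# Same column-by-column test as above, using the escaped pattern
result = d.apply(lambda col: col.str.contains(pattern))
print(pattern)
print(result)
```

For the purely numeric prefixes in the question the escaped pattern is identical to the unescaped one, so this only matters for less tidy input.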

update

Actually, you can do this using stack and unstack: first stack all columns into a single column, call str.contains with the regex pattern, and then unstack back to the original form:

In [9]:
vals=['123','76']
v = ['^' + x for x in vals]
d.stack().str.contains('|'.join(v)).unstack()

Out[9]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False

This is a cleaner way of doing it compared to using apply.
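To make the intermediate steps visible, here is a small sketch of the stack / str.contains / unstack chain broken into separate statements, assuming the same d and vals as in the question (the names stacked, mask and result are just for illustration):

```python
import pandas as pd

d = pd.DataFrame({'A': ['12345', '12354', '76', '4'],
                  'B': ['4442', '2345', '33', '5'],
                  'C': ['5553', '4343', '33', '5']})
vals = ['123', '76']
pattern = '|'.join('^' + v for v in vals)

# stack() turns the 4x3 frame into one long Series indexed by (row, column)
stacked = d.stack()

# A single vectorised regex test over all 12 values
mask = stacked.str.contains(pattern)

# unstack() pivots the column level back out, restoring the original layout
result = mask.unstack()
print(result)
```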

edited Nov 30, 2016 at 23:12

answered Oct 14, 2016 at 21:10

EdChum


Julien Spronck · Accepted Answer · 2016-10-14 21:13:46Z

3

You could do this:

d.applymap(lambda x: any([x.startswith(v) for v in vals]))
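A possible simplification: Python's own str.startswith accepts a tuple of prefixes, so the inner list comprehension is not strictly needed. The sketch below assumes the same d and vals as in the question and goes through Series.map column by column rather than applymap, since applymap was deprecated in favour of DataFrame.map in pandas 2.1:

```python
import pandas as pd

d = pd.DataFrame({'A': ['12345', '12354', '76', '4'],
                  'B': ['4442', '2345', '33', '5'],
                  'C': ['5553', '4343', '33', '5']})
vals = ['123', '76']

# str.startswith takes a tuple and returns True if any prefix matches
prefixes = tuple(vals)

# Apply the element-wise test to every column in turn
result = d.apply(lambda col: col.map(lambda x: x.startswith(prefixes)))
print(result)
```

This still calls a Python function once per cell, so it should behave like the applymap version in terms of speed.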

answered Oct 14, 2016 at 21:13

Julien Spronck


MaxU - stand with Ukraine · Accepted Answer · 2016-10-14 21:15:05Z

1

Alternative solution, which doesn't use .apply():

In [66]: search_re = '^(?:{})'.format('|'.join(vals))

In [67]: search_re
Out[67]: '^(?:123|76)'

In [69]: d.astype(str).stack().str.match(search_re).unstack()
Out[69]:
       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False
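A note on why str.match suits a "starts with" test: it is built on re.match, which only matches at the beginning of the string, whereas str.contains searches anywhere. A small sketch with a made-up two-element Series (not taken from the question) shows the difference when the pattern has no leading '^':

```python
import pandas as pd

# Hypothetical sample: '76' is a prefix of the first value only
s = pd.Series(['7612', '1276'])
pattern = '(?:123|76)'   # deliberately no '^' anchor

# str.match anchors at the start of each string
matched = s.str.match(pattern)

# str.contains finds the pattern anywhere in the string
contained = s.str.contains(pattern)
print(matched.tolist(), contained.tolist())
```

So with str.match the explicit '^' in search_re is harmless but redundant.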

answered Oct 14, 2016 at 21:15

MaxU - stand with Ukraine


jezrael · Accepted Answer · 2016-10-14 21:40:13Z

A bit complicated, but it seems the fastest solution.

str.startswith works only on a Series, so use a list comprehension over the columns and concat the output. Because every item in the list vals has to be checked, wrap that in a second list comprehension and combine the resulting numpy arrays with numpy.logical_or.reduce. Finally, create a DataFrame with the same index and columns as the original, using the output of numpy.logical_or.reduce as the data:

print ([pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals])
[array([[ True, False, False],
       [ True, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool), array([[False, False, False],
       [False, False, False],
       [ True, False, False],
       [False, False, False]], dtype=bool)]

import numpy as np

print (np.logical_or.reduce(
      [pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals]))
[[ True False False]
 [ True False False]
 [ True False False]
 [False False False]]

print(pd.DataFrame(np.logical_or.reduce(
      [pd.concat([d[col].str.startswith(i) for col in d],
                 axis=1).values for i in vals]),
                   index=d.index, columns=d.columns))

       A      B      C
0   True  False  False
1   True  False  False
2   True  False  False
3  False  False  False
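Since the one-liner is hard to read, the same steps could be wrapped in a small function. This is only a sketch: starts_with_any is a hypothetical helper name, and np.column_stack is swapped in for pd.concat(...).values to build each 2-D mask:

```python
import numpy as np
import pandas as pd

def starts_with_any(frame, prefixes):
    """Boolean frame: True where a cell starts with any of the prefixes."""
    masks = [
        # One 2-D boolean array per prefix, built column by column
        np.column_stack([frame[col].str.startswith(p).to_numpy(dtype=bool)
                         for col in frame])
        for p in prefixes
    ]
    # Element-wise OR across the per-prefix masks
    combined = np.logical_or.reduce(masks)
    return pd.DataFrame(combined, index=frame.index, columns=frame.columns)

d = pd.DataFrame({'A': ['12345', '12354', '76', '4'],
                  'B': ['4442', '2345', '33', '5'],
                  'C': ['5553', '4343', '33', '5']})
result = starts_with_any(d, ['123', '76'])
print(result)
```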

Timings:

#[40000 rows x 3 columns]
d = pd.concat([d]*10000).reset_index(drop=True)


In [77]: %timeit (d.applymap(lambda x: any([x.startswith(v) for v in vals])))
1 loop, best of 3: 228 ms per loop

In [78]: %timeit (d.apply(lambda x: x.str.contains('|'.join(['^' + x for x in vals]))))
10 loops, best of 3: 147 ms per loop

In [79]: %timeit (d.astype(str).stack().str.match('^(?:{})'.format('|'.join(vals))).unstack())
10 loops, best of 3: 172 ms per loop

In [80]: %timeit (pd.DataFrame(np.logical_or.reduce([pd.concat([d[col].str.startswith(i) for col in d], axis=1).values for i in vals]), index=d.index, columns=d.columns))
10 loops, best of 3: 116 ms per loop
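The numbers above will vary with machine and pandas version. For anyone wanting to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The dictionary labels and the name big are made up for this example, and the per-element variant uses Series.map as a stand-in for applymap, so it is not exactly the same call that was timed above:

```python
import timeit
import numpy as np
import pandas as pd

d = pd.DataFrame({'A': ['12345', '12354', '76', '4'],
                  'B': ['4442', '2345', '33', '5'],
                  'C': ['5553', '4343', '33', '5']})
vals = ['123', '76']
big = pd.concat([d] * 10000).reset_index(drop=True)  # 40000 rows x 3 columns

candidates = {
    'per-element startswith': lambda: big.apply(
        lambda col: col.map(lambda x: any(x.startswith(v) for v in vals))),
    'apply + str.contains': lambda: big.apply(
        lambda col: col.str.contains('|'.join('^' + v for v in vals))),
    'stack + str.match': lambda: big.astype(str).stack().str.match(
        '^(?:{})'.format('|'.join(vals))).unstack(),
    'logical_or.reduce': lambda: pd.DataFrame(
        np.logical_or.reduce(
            [pd.concat([big[col].str.startswith(v) for col in big],
                       axis=1).values for v in vals]),
        index=big.index, columns=big.columns),
}

# Run each candidate once to confirm they agree before timing them
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 single runs per candidate
timings = {name: min(timeit.repeat(fn, number=1, repeat=3))
           for name, fn in candidates.items()}
for name, secs in timings.items():
    print('{}: {:.0f} ms'.format(name, secs * 1000))
```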
