Python pandas how to scan string contains by row?

Question

How do you scan if a pandas dataframe row contains a certain substring?

For example, I have a dataframe with 11 columns, and all the columns contain names:

ID    name1     name2       name3      ...    name10
-------------------------------------------------------
AA    AA_balls  AA_cakee1  AA_lavender ...   AA_purple
AD    AD_cakee  AD_cats    AD_webss    ...   AD_ballss
CS    CS_cakee  CS_cats    CS_webss    ...   CS_purble
.
.
.

I would like to find the rows that contain, say, "ball" anywhere in the dataframe and get their IDs.

So the result would be ID 'AA' and ID 'AD', since AA_balls and AD_ballss are in those rows.

I have searched on Google, but there seems to be no specific result for this. People usually ask about searching for a substring in one specific column, not across all columns (a whole row):

df[df["col_name"].str.contains("ball")]

The methods I have thought of are as follows (you can skip this if you are short on time):

(1) loop through the columns

frames = []
for col_name in col_names:
     frames.append(df[df[col_name].str.contains('ball')])
result = pd.concat(frames)

and then drop duplicate rows that have the same ID value, but this method would be very slow.

(2) Turn the dataframe into a 2-column dataframe by appending the name2 to name10 columns below name1 to form a single column, then use df[df["concat_col"].str.contains("ball")]["ID"] to get the IDs and drop duplicates.

ID  concat_col   
AA    AA_balls 
AA    AA_cakeee
AA    AA_lavender
AA    AA_purple
 .
 .
 .
CS   CS_purble

(3) Use the dataframe from (2) to make a dictionary where

 d = dict(zip(df["concat_col"], df["ID"]))

then get the IDs with

[value for key, value in d.items() if 'ball' in key]

but with this method I need to loop through the dictionary, which becomes slow.

If there is a faster method that avoids these steps, I would prefer it. If anyone knows of one, I would appreciate it a lot if you could let me know :) Thanks!

It's not so big; df.shape is near (4000, 13). But I have already done a lot of preprocessing in my program, so I would like to find less time-consuming methods.
– Winds, Commented Mar 16, 2018 at 7:13
Hmmm, the timings also obviously depend on how many matches there are. What do you think, 50% of rows? Or something else?
– jezrael, Commented Mar 16, 2018 at 7:16
Thanks for your answers below. Let me try them out and reply to you. The matches would be few, below 15 rows.
– Winds, Commented Mar 16, 2018 at 7:17

jezrael · Accepted Answer · 2018-03-16 07:44:39Z

1

One idea is to use melt:

df1 = df.melt('ID')

a = df1.loc[df1['value'].str.contains('ball'), 'ID']
print (a)
0     AA
10    AD
Name: ID, dtype: object
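
For anyone who wants to run the melt approach end to end, here is a minimal sketch. It assumes only the four name columns visible in the question's sample (the elided ones are left out), and the names long_df, hits and matching_ids are mine, not from the answer. It also adds regex=False and a drop_duplicates step, since an ID shows up once per matching cell.

```python
import pandas as pd

# Sample rows copied from the question; only the columns it actually shows.
df = pd.DataFrame({
    "ID": ["AA", "AD", "CS"],
    "name1": ["AA_balls", "AD_cakee", "CS_cakee"],
    "name2": ["AA_cakee1", "AD_cats", "CS_cats"],
    "name3": ["AA_lavender", "AD_webss", "CS_webss"],
    "name10": ["AA_purple", "AD_ballss", "CS_purble"],
})

# Reshape to long format: one row per (ID, original column, value).
long_df = df.melt(id_vars="ID")

# regex=False treats 'ball' as a literal substring rather than a pattern.
hits = long_df.loc[long_df["value"].str.contains("ball", regex=False), "ID"]

# An ID appears once per matching cell, so deduplicate before returning.
matching_ids = hits.drop_duplicates().tolist()
print(matching_ids)
```

Keeping the melted frame in its own variable leaves the original df intact for the other approaches.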

Another:

df1 = df.set_index('ID')
a = df1.index[df1.applymap(lambda x: 'ball' in x).any(axis=1)]
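
A note for newer pandas versions: applymap has been deprecated since pandas 2.1 in favor of DataFrame.map. The sketch below is one way to write the same element-wise check so that it runs on either side of that change. The getattr fallback and the str(cell) conversion are my additions, and the sample frame is a cut-down copy of the question's data.

```python
import pandas as pd

# Cut-down copy of the question's sample data.
df = pd.DataFrame({
    "ID": ["AA", "AD", "CS"],
    "name1": ["AA_balls", "AD_cakee", "CS_cakee"],
    "name2": ["AA_cakee1", "AD_cats", "CS_cats"],
    "name10": ["AA_purple", "AD_ballss", "CS_purble"],
})

names = df.set_index("ID")

# Use DataFrame.map where it exists (pandas >= 2.1), else fall back to applymap.
elementwise = getattr(names, "map", None) or names.applymap

# str(cell) keeps the check from raising on non-string cells such as NaN.
contains_ball = elementwise(lambda cell: "ball" in str(cell))

# Keep the index labels (IDs) of rows where any cell matched.
matching_ids = names.index[contains_ball.any(axis=1)].tolist()
print(matching_ids)
```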

Or:

mask = np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns])
a = df.loc[mask, 'ID']
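
One thing to watch with the column-wise mask: str.contains returns NaN for missing cells, and a mask containing NaN cannot be used for boolean indexing. The sketch below assumes a hypothetical missing value in name2 (not in the question's data) and passes na=False so that missing cells count as "no match". It also skips the ID column, which is my choice rather than something the answer does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["AA", "AD", "CS"],
    "name1": ["AA_balls", "AD_cakee", "CS_cakee"],
    "name2": ["AA_cakee1", None, "CS_cats"],  # hypothetical missing value
    "name10": ["AA_purple", "AD_ballss", "CS_purble"],
})

# Search only the name columns, not the ID column itself.
name_cols = [c for c in df.columns if c != "ID"]

# One boolean Series per column; na=False maps missing cells to False.
masks = [df[c].str.contains("ball", regex=False, na=False) for c in name_cols]

# OR the per-column masks together into a single row mask.
mask = np.logical_or.reduce(masks)

matching_ids = df.loc[mask, "ID"].tolist()
print(matching_ids)
```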

Timings:

import numpy as np
import pandas as pd

np.random.seed(145)
L = list('abcdefgh')
df = pd.DataFrame(np.random.choice(L, size=(4000, 10)))
df.insert(0, 'ID', np.arange(4000).astype(str))
a = np.random.randint(4000, size=15)
b = np.random.randint(1, 10, size=15)
for i, j in zip(a,b):
    df.iloc[i, j] = 'AB_ball_DE'
#print (df)


In [85]: %%timeit
    ...: df1 = df.melt('ID')
    ...: a = df1.loc[df1['value'].str.contains('ball'), 'ID']
    ...: 
10 loops, best of 3: 24.3 ms per loop

In [86]: %%timeit
    ...: df.loc[np.logical_or.reduce([df[x].str.contains('ball', regex=False) for x in df.columns]), 'ID']
    ...: 
100 loops, best of 3: 12.8 ms per loop

In [87]: %%timeit
    ...: df1 = df.set_index('ID')
    ...: df1.index[df1.applymap(lambda x: 'ball' in x).any(axis=1)]
    ...: 
100 loops, best of 3: 11.1 ms per loop
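
The timings above come from IPython's %%timeit. If you want to repeat the comparison in a plain script, something along these lines should work. It is only a sketch: the data generator, the column names and the function names via_melt and via_reduce are assumptions of mine, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data of roughly the size described in the question's comments.
rng = np.random.default_rng(145)
df = pd.DataFrame(
    rng.choice(list("abcdefgh"), size=(4000, 10)),
    columns=[f"name{i}" for i in range(1, 11)],
)
df.insert(0, "ID", np.arange(4000).astype(str))

# Plant up to 15 matching cells at random positions in the name columns.
rows = rng.integers(4000, size=15)
cols = rng.integers(1, 11, size=15)
for i, j in zip(rows, cols):
    df.iloc[i, j] = "AB_ball_DE"


def via_melt():
    long_df = df.melt("ID")
    hit = long_df["value"].str.contains("ball", regex=False)
    return set(long_df.loc[hit, "ID"])


def via_reduce():
    name_cols = [c for c in df.columns if c != "ID"]
    mask = np.logical_or.reduce(
        [df[c].str.contains("ball", regex=False) for c in name_cols]
    )
    return set(df.loc[mask, "ID"])


# Both approaches should agree before their speed is compared.
assert via_melt() == via_reduce()

for fn in (via_melt, via_reduce):
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```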

edited Mar 16, 2018 at 7:44

answered Mar 16, 2018 at 7:09


Winds Over a year ago

Thanks, your answer works very well!! I did not know about the melt method before, and I need to learn more about lambda and applymap...

Kris · Accepted Answer · 2018-03-16 07:12:59Z

1

Maybe this might work?

mask = df.apply(lambda row: row.map(str).str.contains('word').any(), axis=1)
df.loc[mask]

Disclaimer: I haven't tested this. Perhaps the .map(str) isn't necessary.
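
Since the snippet above is untested, here is a small runnable version of the same row-wise idea, searching for 'ball' instead of 'word'. The count column is a hypothetical non-string column I added to show what .map(str) is for: it converts every cell to text first, so the check does not depend on all columns already holding strings.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["AA", "AD", "CS"],
    "name1": ["AA_balls", "AD_cakee", "CS_cakee"],
    "count": [3, 7, 1],  # hypothetical non-string column
    "name10": ["AA_purple", "AD_ballss", "CS_purble"],
})

# For each row: cast every cell to str, then test for the literal substring.
mask = df.apply(
    lambda row: row.map(str).str.contains("ball", regex=False).any(),
    axis=1,
)

matching_ids = df.loc[mask, "ID"].tolist()
print(matching_ids)
```

This calls a Python function once per row, so it is likely to be slower than the column-wise approaches on larger frames.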

answered Mar 16, 2018 at 7:12
