5

I am struggling to understand exactly how df.apply() works.

My problem is as follows: I have a DataFrame df and want to search several columns for certain strings. For each row where the string is found in any of those columns, I want to add a "label" in a new column.

I am able to solve the problem with map and applymap (see below).

However, I would expect the better solution to be apply, as it applies a function to an entire column.

Question: Is this not possible using apply? Where is my mistake?

Here are my solutions for using map and applymap.

df = pd.DataFrame([list("ABCDZ"),list("EAGHY"), list("IJKLA")], columns = ["h1","h2","h3","h4", "h5"])

Solution using map

def setlabel_func(column):
    return df[column].str.contains("A")

mask = sum(map(setlabel_func, ["h1","h5"]))
# mask counts the matching columns per row; > 0 also covers rows where both match
df.loc[mask > 0, "New Column"] = "Label"

Solution using applymap

mask = df[["h1","h5"]].applymap(lambda el: bool(re.match("A", el))).any(axis=1)
df.loc[mask, "New Column"] = "Label"

For apply, I don't know how to pass the two columns into the function, or maybe I don't understand the mechanics at all ;-)

def setlabel_func(column):
    return df[column].str.contains("A")

df.apply(setlabel_func(["h1","h5"]),axis = 1)

The above raises an error:

'DataFrame' object has no attribute 'str'

Any advice? Please note that the search function in my real application is more complex and requires a regex, which is why I use .str.contains in the first place.

2
  • What is your expected output? Commented Feb 11, 2017 at 11:53
  • Hi John, thanks for your response. My expected output is what solutions for map and applymap return. Sorry, I don't know how to paste in my output here? How do you do this? Commented Feb 11, 2017 at 12:00

4 Answers

7

Another solution is to use DataFrame.any to get at least one True per row:

print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')))
      h1     h5
0   True  False
1  False  False
2  False   True

print (df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1))
0     True
1    False
2     True
dtype: bool

df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A')).any(axis=1),
                     'Label', '')

print (df)
  h1 h2 h3 h4 h5    new
0  A  B  C  D  Z  Label
1  E  A  G  H  Y       
2  I  J  K  L  A  Label

mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A')).any(axis=1)
df.loc[mask, 'New'] = 'Label'
print (df)
  h1 h2 h3 h4 h5    New
0  A  B  C  D  Z  Label
1  E  A  G  H  Y    NaN
2  I  J  K  L  A  Label
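
Since the question mentions that the real search needs a regex, here is a minimal sketch of the same column-wise approach with a regular expression. The pattern `r"^[EZ]$"` is purely hypothetical and only stands in for whatever the real pattern is; `str.contains` treats its pattern as a regex by default, and `regex=True` is spelled out here only for clarity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list("ABCDZ"), list("EAGHY"), list("IJKLA")],
                  columns=["h1", "h2", "h3", "h4", "h5"])

# Hypothetical pattern: a cell that is exactly "E" or "Z".
pattern = r"^[EZ]$"

# apply passes each selected column as a Series, so .str is available;
# any(axis=1) then reduces the boolean frame to one value per row.
mask = df[["h1", "h5"]].apply(lambda s: s.str.contains(pattern, regex=True)).any(axis=1)

df["new"] = np.where(mask, "Label", "")
```

Rows 0 and 1 get the label here, because "Z" appears in h5 of row 0 and "E" in h1 of row 1.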

1 Comment

Thanks for your swift reply. np.where is new to me. This will improve all my previous code significantly :-)
5

By default, pd.DataFrame.apply iterates over each column, passing the column as a pd.Series to the function being applied. In your case, the function you're trying to apply doesn't lend itself to being used in apply.

Do this instead to get your idea to work

mask = df[['h1', 'h5']].apply(lambda x: x.str.contains('A').any(), axis=1)
df.loc[mask, 'New Column'] = 'Label'

  h1 h2 h3 h4 h5 New Column
0  A  B  C  D  Z      Label
1  E  A  G  H  Y        NaN
2  I  J  K  L  A      Label
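
To make the point about what apply passes to the function more concrete, here is a small sketch (variable names such as `col_names`, `row_names` and `has_str` are just illustrative). It also shows why the attempt in the question fails: `setlabel_func(["h1","h5"])` is called before apply ever runs, and indexing with a list of labels returns a DataFrame, which has no `.str` accessor.

```python
import pandas as pd

df = pd.DataFrame([list("ABCDZ"), list("EAGHY"), list("IJKLA")],
                  columns=["h1", "h2", "h3", "h4", "h5"])
sub = df[["h1", "h5"]]

# Default axis=0: the function receives each column as a Series,
# so s.name is the column label.
col_names = sub.apply(lambda s: s.name)

# axis=1: the function receives each row as a Series,
# so s.name is the row's index label.
row_names = sub.apply(lambda s: s.name, axis=1)

# A list selection is a DataFrame, and a DataFrame has no .str accessor.
has_str = hasattr(sub, "str")

# Column-wise contains, then reduce across the columns of each row.
mask = sub.apply(lambda s: s.str.contains("A")).any(axis=1)
```

In both modes the function sees a Series, which is why `.str.contains` works inside the lambda but not on `df[["h1","h5"]]` directly.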


1 Comment

Great. Works just fine. Thanks for your swift reply.
3

IIUC you can do it this way:

In [23]: df['new'] = np.where(df[['h1','h5']].apply(lambda x: x.str.contains('A'))
                                             .sum(1) > 0,
                              'Label', '')

In [24]: df
Out[24]:
  h1 h2 h3 h4 h5    new
0  A  B  C  D  Z  Label
1  E  A  G  H  Y
2  I  J  K  L  A  Label

2 Comments

Thanks for your swift reply. np.where is new to me. This will improve all my previous code significantly :-)
@FredMaster, glad i could help :-)
0

Others have given good alternative methods. Here is a way to use apply 'row wise' (axis=1) to get your new column indicating presence of "A" for a bunch of columns.

If you are passed a row, you can join its strings together into one big string and then use a string comparison ("in"), as shown below. Here I am combining all columns, but you can easily do it with just h1 and h5.

df = pd.DataFrame([list("ABCDZ"),list("EAGHY"), list("IJKLA")], columns = ["h1","h2","h3","h4", "h5"])

def dothat(row):
    sep = ""
    return "A" in sep.join(row['h1':'h5'])

df['NewColumn'] = df.apply(dothat, axis=1)

This just squashes each row into one string (e.g. ABCDZ) and looks for "A". It is not that efficient, though: if you want to stop the first time you find the string, combining all the columns could be a waste of time. You could easily change the function to look column by column and quit (return True) when it finds a hit.
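
A rough sketch of that column-by-column variant, restricted to h1 and h5. The function name `has_a` and its `columns` parameter are made up for illustration; `any()` over a generator stops as soon as one column matches.

```python
import pandas as pd

df = pd.DataFrame([list("ABCDZ"), list("EAGHY"), list("IJKLA")],
                  columns=["h1", "h2", "h3", "h4", "h5"])

def has_a(row, columns=("h1", "h5")):
    # Check one column at a time; the generator is abandoned at the first hit.
    return any("A" in row[col] for col in columns)

df["NewColumn"] = df.apply(has_a, axis=1)
```

For the sample frame this marks rows 0 and 2 as True and row 1 as False, since the "A" in row 1 sits in h2, which is not searched.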
