9

I am trying to find if a string exists across multiple columns. I would like to return a 1 if the string exists and 0 if it doesn't as a new series within the dataframe.

After searching the forums, I understand that str.contains could be used, but i'm searching over 100+ columns therefore it isn't efficient for me to work with individual series at a time.

There are some NAs within the columns if this is relevant.

Example simplified dataframe:

d = {'strings_1': ['AA', 'AB', 'AV'], 'strings_2': ['BB', 'BA', 'AG'], 
'strings_1': ['AE', 'AC', 'AI'], 'strings_3': ['AA', 'DD', 'PP'], 
'strings_4': ['AV', 'AB', 'BV']}
simple_df = pd.DataFrame(data=d)

If I am interested in finding 'AA' for example, I would like to return the following dataframe.

Example target dataframe:

d = {'strings_1': ['AA', 'AB', 'AV'], 'strings_2': ['BB', 'BA', 'AG'], 
'strings_1': ['AE', 'AC', 'AI'], 'strings_3': ['AA', 'DD', 'PP'], 
'strings_4': ['AV', 'AB', 'BV'], 'AA_TRUE': [1, 0, 0]}
target_df = pd.DataFrame(data=d)

Many thanks for help.

4
  • Do you have a list of strings ? You say ` 1 if the string exists` which string ? Commented Nov 8, 2017 at 13:44
  • In my example it would be AA, however yes there would be a list i'd be interested in. I'll likely assign them individually as i'd like multiple labels. Commented Nov 8, 2017 at 13:47
  • Next time add that in the question too. How can we guess you have a list that holds ['AA'] ? Commented Nov 8, 2017 at 13:47
  • Thanks will do (is it not obvious in the example?) Commented Nov 8, 2017 at 13:50

1 Answer 1

13

If need check mixed values - numeric with strings compare numpy array created by values, use DataFrame.any for check at least one True per row and last cast to int:

simple_df['new'] = (simple_df.values == 'AA').any(1).astype(int)
#or cast all values to string before comparing
#simple_df['new'] = (simple_df.astype(str)== 'AA').any(1).astype(int)
print (simple_df)
  strings_1 strings_2 strings_3 strings_4  new
0        AE        BB        AA        AV    1
1        AC        BA        DD        AB    0
2        AI        AG        PP        BV    0

Detail:

print ((simple_df.values == 'AA'))
[[False False  True False False]
 [False False False False False]
 [False False False False False]]

print ((simple_df.values == 'AA').any(1))
[ True False False]

If need check substring:

simple_df['new'] = simple_df.applymap(lambda x: 'G' in x).any(1).astype(int)
print (simple_df)
  strings_1 strings_2 strings_3 strings_4  new
0        AE        BB        AA        AV    0
1        AC        BA        DD        AB    0
2        AI        AG        PP        BV    1
Sign up to request clarification or add additional context in comments.

5 Comments

how to use two substrings at a time in entire data. like here using G ,can use E and G at a time ?
simple_df.applymap(lambda x: 'G|E' in x).any(1).astype(int) i writing this way but it not gives the output
@dondapati - need simple_df.applymap(lambda x: any(y in x for y in ['G','E']))
can we apply the above function with selected columns ? just apply the 1 and 2 columns
@dondapati - sure, use simple_df.iloc[:, :1].applymap(lambda x: any(y in x for y in ['G','E']))

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.