pandas: searching for partial strings across multiple columns and outputting values for each row

Question

I have a dataframe like this one:

import pandas as pd

data = {'fce1_1': ['K701', 'Molly', 'Tina', 'K876', 'Amy'],
        'fce1_2': ['K712', 'Molly', 'K709', 'Jape', 'Amy'],
        'fce2_1': ['K703', 'K719', 'Tina', 'I841', 'K987'],
        'fce2_2': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data)
df

fce1_1 fce1_2 fce2_1 fce2_2
K701   K712   K703   25
Molly  Molly  K719   94
Tina   K709   Tina   57
...etc

I would like to search each row of the df for values starting with 'K' and return the 'K***' value from the rightmost column that contains one. For example:

fce1_1 fce1_2 fce2_1 fce2_2 new_col
K701   K712   K703   25     K703
Molly  Molly  K719   94     K719
Tina   K709   Tina   57     K709
...etc

Thanks.

What do you mean when you say that is closest to the column at the right?
– Jan Zeiseweis, Commented May 4, 2016 at 12:37
I meant that col fce2_2 is prioritized over col fce2_1, and so on.
– user, Commented May 4, 2016 at 14:36

EdChum · Accepted Answer · 2016-05-04 14:18:06Z

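You can apply a lambda row-wise on the df that tests whether each value starts with 'K', uses last_valid_index to get the label of the last (rightmost) matching column, and then indexes the row with that label. The astype(str) cast converts the integer column fce2_2 to strings so that str.startswith gives a True/False result for every cell: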

In [35]:
df['new_col'] = df.astype(str).apply(lambda x: x[x[x.str.startswith('K')].last_valid_index()], axis=1)
df

Out[35]:
  fce1_1 fce1_2 fce2_1  fce2_2 new_col
0   K701   K712   K703      25    K703
1  Molly  Molly   K719      94    K719
2   Tina   K709   Tina      57    K709
3   K876   Jape   I841      62    K876
4    Amy    Amy   K987      70    K987

Breakdown of the above:

In [38]:
df.astype(str).apply(lambda x: x.str.startswith('K'), axis=1)

Out[38]:
  fce1_1 fce1_2 fce2_1 fce2_2
0   True   True   True  False
1  False  False   True  False
2  False   True  False  False
3   True  False  False  False
4  False  False   True  False

In [39]:    
df.astype(str).apply(lambda x: x[x.str.startswith('K')].last_valid_index(), axis=1)

Out[39]:
0    fce2_1
1    fce2_1
2    fce1_2
3    fce1_1
4    fce2_1
dtype: object
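
To make the breakdown concrete, here is a minimal sketch that walks a single row (the third row of the example frame) through the same steps. The variable names mask, matches, label and value are only illustrative and do not appear in the answer.

```python
import pandas as pd

# One row of the example frame, cast to strings as in the answer.
row = pd.Series({'fce1_1': 'Tina', 'fce1_2': 'K709',
                 'fce2_1': 'Tina', 'fce2_2': 57}).astype(str)

mask = row.str.startswith('K')      # boolean Series, one entry per column
matches = row[mask]                 # keeps only the cells that start with 'K'
label = matches.last_valid_index()  # label of the last (rightmost) match
value = row[label]                  # the matching value itself

print(label, value)  # fce1_2 K709
```

If no cell matches, matches is empty and last_valid_index() returns None, so row[label] fails. That is the case raised in the comments under this answer.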

EDIT

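To handle rows with no match, we can add a conditional expression inside the lambda. Note that this snippet returns the label of the matching column (as in Out[39]) rather than the value; index x with that label, as in In [35], to get the value itself: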

In [67]:
import numpy as np  # np.NaN was removed in NumPy 2.0; np.nan works in all versions
data = {'fce1_1': [np.nan, 'Molly', 'Tina', 'K876', 'Amy'],
        'fce1_2': [np.nan, 'Molly', 'K709', 'Jape', 'Amy'],
        'fce2_1': [np.nan, 'K719', 'Tina', 'I841', 'K987'],
        'fce2_2': np.nan}
df = pd.DataFrame(data)
df['new_col'] = df.astype(str).apply(lambda x: x[x.str.startswith('K')].last_valid_index() if x.str.startswith('K').any() else 'No Match', axis=1)
df

Out[67]:
  fce1_1 fce1_2 fce2_1  fce2_2   new_col
0    NaN    NaN    NaN     NaN  No Match
1  Molly  Molly   K719     NaN    fce2_1
2   Tina   K709   Tina     NaN    fce1_2
3   K876   Jape   I841     NaN    fce1_1
4    Amy    Amy   K987     NaN    fce2_1
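
Since the question asks for the matching value rather than the column label, one possible variant is sketched below. It is not part of the original answer: the helper name last_k_value and its prefix and default parameters are hypothetical, and na=False is added as a guard so that missing cells never count as matches.

```python
import numpy as np
import pandas as pd

def last_k_value(row, prefix='K', default='No Match'):
    """Return the rightmost value in `row` that starts with `prefix`."""
    # na=False treats any missing cell as a non-match.
    matches = row[row.str.startswith(prefix, na=False)]
    # An empty selection means no cell in this row matched.
    return matches.iloc[-1] if not matches.empty else default

data = {'fce1_1': [np.nan, 'Molly', 'Tina', 'K876', 'Amy'],
        'fce1_2': [np.nan, 'Molly', 'K709', 'Jape', 'Amy'],
        'fce2_1': [np.nan, 'K719', 'Tina', 'I841', 'K987'],
        'fce2_2': np.nan}
df = pd.DataFrame(data)

df['new_col'] = df.astype(str).apply(last_k_value, axis=1)
print(df['new_col'].tolist())
# ['No Match', 'K719', 'K709', 'K876', 'K987']
```

Taking matches.iloc[-1] picks the last matching cell by position, so the mask only has to be computed once per row.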

edited May 4, 2016 at 14:18

answered May 4, 2016 at 12:38


4 Comments

user Over a year ago

This works great on the test df, but in my real data I am sometimes missing a K-value. .last_valid_index() appears to throw an error when this is the case. Is there an easy way to insert np.nan or a string value when there is no 'K***' value? I checked the pandas docs but found limited info on the method.

EdChum Over a year ago

What do you want instead: an empty string, or something like 'No Match' as the entry?

user Over a year ago

Yes empty string or 'no match' is fine.

user Over a year ago

This is perfect. Thanks @EdChum !
