I have a dataframe like this one:

import pandas as pd

data = {'fce1_1': ['K701', 'Molly', 'Tina', 'K876', 'Amy'],
        'fce1_2': ['K712', 'Molly', 'K709', 'Jape', 'Amy'],
        'fce2_1': ['K703', 'K719', 'Tina', 'I841', 'K987'],
        'fce2_2': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data)
df

fce1_1 fce1_2 fce2_1 fce2_2
K701   K712   K703   25
Molly  Molly  K719   94
Tina   K709   Tina   57
...etc

I would like to search each row of the df for values starting with 'K' and return the matching value from the right-most column that contains one. For example:

fce1_1 fce1_2 fce2_1 fce2_2 new_col
K701   K712   K703   25     K703
Molly  Molly  K719   94     K719
Tina   K709   Tina   57     K709
...etc

Thanks.

  • What do you mean by "closest to the column at the right"? Commented May 4, 2016 at 12:37
  • I meant that col fce2_2 is prioritized over col fce2_1, and so on. Commented May 4, 2016 at 14:36

1 Answer

You can apply a lambda to the df row-wise that checks which values start with 'K', takes the last_valid_index of those matches (the label of the right-most matching column), and uses that label to index the row:

In [35]:
df['new_col'] = df.astype(str).apply(lambda x: x[x[x.str.startswith('K')].last_valid_index()], axis=1)
df

Out[35]:
  fce1_1 fce1_2 fce2_1  fce2_2 new_col
0   K701   K712   K703      25    K703
1  Molly  Molly   K719      94    K719
2   Tina   K709   Tina      57    K709
3   K876   Jape   I841      62    K876
4    Amy    Amy   K987      70    K987

Breakdown of the above:

In [38]:
df.astype(str).apply(lambda x: x.str.startswith('K'), axis=1)

Out[38]:
  fce1_1 fce1_2 fce2_1 fce2_2
0   True   True   True  False
1  False  False   True  False
2  False   True  False  False
3   True  False  False  False
4  False  False   True  False

In [39]:    
df.astype(str).apply(lambda x: x[x.str.startswith('K')].last_valid_index(), axis=1)

Out[39]:
0    fce2_1
1    fce2_1
2    fce1_2
3    fce1_1
4    fce2_1
dtype: object
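
To make the breakdown concrete, here is a minimal sketch of what the lambda does for a single row. The row is copied from index 2 of the example df (with 57 already converted to a string, as astype(str) would do); the variable names are only for illustration.

```python
import pandas as pd

# One row from the example frame, as the lambda sees it after astype(str)
row = pd.Series(
    {'fce1_1': 'Tina', 'fce1_2': 'K709', 'fce2_1': 'Tina', 'fce2_2': '57'}
)

mask = row.str.startswith('K')       # boolean mask, True only for 'fce1_2'
matches = row[mask]                  # keep only the cells that start with 'K'
label = matches.last_valid_index()   # label of the right-most remaining cell
value = row[label]                   # look that label up in the full row

print(label, value)
```

Because the columns are scanned in their existing left-to-right order, the last remaining label is the right-most match, which is the priority the question asks for.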

EDIT

To handle rows with no match, we can use a conditional expression inside the lambda. Note that this version stores the label of the matching column rather than the value itself:

In [67]:
import numpy as np

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
data = {'fce1_1': [np.nan, 'Molly', 'Tina', 'K876', 'Amy'],
        'fce1_2': [np.nan, 'Molly', 'K709', 'Jape', 'Amy'],
        'fce2_1': [np.nan, 'K719', 'Tina', 'I841', 'K987'],
        'fce2_2': np.nan}
df = pd.DataFrame(data)
df['new_col'] = df.astype(str).apply(lambda x: x[x.str.startswith('K')].last_valid_index() if x.str.startswith('K').any() else 'No Match', axis=1)
df

Out[67]:
  fce1_1 fce1_2 fce2_1  fce2_2   new_col
0    NaN    NaN    NaN     NaN  No Match
1  Molly  Molly   K719     NaN    fce2_1
2   Tina   K709   Tina     NaN    fce1_2
3   K876   Jape   I841     NaN    fce1_1
4    Amy    Amy   K987     NaN    fce2_1
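
The output above holds column labels. If you want the 'K***' value itself, as in the original question, something along these lines might work. This is only a sketch: last_k_value, prefix and default are names I made up for illustration, and it uses .iloc[-1] on the filtered row instead of last_valid_index(), which returns None when nothing matches and is presumably what led to the error mentioned in the comments.

```python
import numpy as np
import pandas as pd

def last_k_value(row, prefix='K', default='No Match'):
    """Return the right-most value in `row` that starts with `prefix`."""
    # na=False treats any missing cell as a non-match
    matches = row[row.str.startswith(prefix, na=False)]
    # An empty selection means no cell in this row matched
    return matches.iloc[-1] if not matches.empty else default

df = pd.DataFrame({
    'fce1_1': [np.nan, 'Molly', 'Tina', 'K876', 'Amy'],
    'fce1_2': [np.nan, 'Molly', 'K709', 'Jape', 'Amy'],
    'fce2_1': [np.nan, 'K719', 'Tina', 'I841', 'K987'],
    'fce2_2': np.nan,
})

# astype(str) makes every cell a string so the .str accessor works on each row
df['new_col'] = df.astype(str).apply(last_k_value, axis=1)
print(df['new_col'].tolist())
```

Swapping default for '' or np.nan gives the empty-string or NaN placeholder instead of 'No Match'.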

4 Comments

This works great on the test df but in my real data I am sometimes missing a K-value. .last_valid_index() appears to throw an error when this is the case. Is there an easy way to insert a np.NaN or other string value when there is no 'K***' value? Checked the pandas doc but limited info on the method.
What do you want as the entry instead: an empty string, or something like 'No Match'?
Yes, an empty string or 'No Match' is fine.
This is perfect. Thanks @EdChum !
