1

Starting DataFrame

df = pd.DataFrame({'Column A' : ['red','green','yellow', 'orange', 'red', 'blue'],
                   'Column B' : [NaN, 'blue', 'purple', NaN, NaN, NaN],
                   'Column C' : [1, 2, 3, 2, 3, 7]})
Column A Column B Column C
'red' NaN 1
'green' 'blue' 2
'yellow' 'purple' 3
'orange' NaN 2
'red' NaN 3
'blue' NaN 7

Desired Result

Column A Column B Column C
'red' NaN 1
'blue' 'blue' 2
'purple' 'purple' 3
'orange' NaN 2
'red' NaN 3
'blue' NaN 7

I want to replace values in column A only if the value in Column B is not NaN, and to replace column A with the value in Column B

So that I can run the following code:

df[[Column_A, Column_C]].groupby(Column_A).sum()

Which would result in the following DataFrame:

Column A Column C
'red' 4
'blue' 9
'purple' 3
'orange' 2

I am trying to replace categories before doing a groupby call.

Attempts:

The DataFrame I am working with has a sequential numerical based index going from 0 to N.
So I could hard code the following:
df.iloc[[index], column] = some_string
I do not want to do this as it is not dynamic and the DataFrame data could change.

I believe I could use .agg() or .apply() on either the df or the df.groupby() but this is where I have struggled.

Particularly with how to write a function to use with .agg() or .apply()

Say:

def my_func(x):
    print(x)

Then:
df.apply(my_func)
The result is the first column of df printed.
Or:
df.apply(my_func, axis = 1)

The result is the following format for each row:

Column A    red
Column B    Nan
Column C    1
Name: 0, dtype: object
Column A    green
Column B    blue
Column C    2
Name: 1, dtype: object

I am not sure how to access each column per row in my_func.

Edit:
I am trying to find a way to change the value in Column A if the value, for that row, in Column B is not NaN. The value to use for replacing is the value in Column B, the value to replace is the value in Column A if Column B is not NaN.

But I want to do this dynamically, meaning not hardcoded as I showed with:
df.iloc[[index], column] = some_string

3
  • 1
    Your question is not clear, do you want only the first step? df.loc[df['Column B'].notna(), 'Column A'] = df['Column B'] Commented Nov 18, 2022 at 18:41
  • @mozway I did an edit, apologies for not being clear. And thanks for your help. Commented Nov 18, 2022 at 19:03
  • 1
    Not sure what you mean by "dynamic", maybe df.assign(**{'Column A': lambda d: d['Column A'].mask(d['Column B'].notna(), d['Column B'])}) or df.assign(**{'Column A': lambda d: d['Column B'].fillna(d['Column A'])}) Commented Nov 18, 2022 at 19:34

1 Answer 1

1

As you mentioned, you could use pd.apply like this:

df['Column A'] = df.apply(lambda x: x['Column B'] if str(x['Column B']) not in ['nan', 'NaN'] else x['Column A'], axis=1)

  Column A Column B  Column C
0      red      NaN         1
1     blue     blue         2
2   purple   purple         3
3   orange      NaN         2
4      red      NaN         3
5     blue      NaN         7

Notice that apply is not fast at for very large dataset is not advisable. There are some good answers out there for alternative methods

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.