I have a dataframe organized in the following way

    var1   var2   var3   var4
0   A      23     B      7
1   B      13     C      4
2   C      12     A      11
3   A      5      C      15

I now want to create a new variable (column), var5, which takes the value of var2 if var1 == A and the value of var4 if var3 == A. For simplicity, var1 and var3 can never both have the value A. If neither var1 nor var3 takes the value A, then I want NaN. That is, the outcome in this example would be:

    var1   var2   var3   var4  var5
0   A      23     B      7     23
1   B      13     C      4     NaN
2   C      12     A      11    11
3   A      5      C      15    5

How can this be achieved?

2 Answers

Option 1
Sounds like you can use np.where for this -

import numpy as np

i = df.var1 == 'A'  # rows where var1 is 'A'
j = df.var3 == 'A'  # rows where var3 is 'A'
df['var5'] = np.where(i, df.var2, np.where(j, df.var4, np.nan))
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Option 2
An alternative would be np.select -

df['var5'] = np.select([i, j], [df.var2, df.var4], default=np.nan)
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Note, i and j are the same variables defined in the code listing for Option 1.
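
For anyone who wants to run Option 2 on its own, here is a minimal self-contained sketch. It rebuilds the example frame from the question (the construction code is my own, not from the original post) and relies on np.select checking the conditions in order, so the first condition that holds supplies the value.

```python
import numpy as np
import pandas as pd

# Example frame from the question, rebuilt so the snippet runs standalone.
df = pd.DataFrame({
    "var1": ["A", "B", "C", "A"],
    "var2": [23, 13, 12, 5],
    "var3": ["B", "C", "A", "C"],
    "var4": [7, 4, 11, 15],
})

i = df["var1"] == "A"  # rows where var1 is 'A'
j = df["var3"] == "A"  # rows where var3 is 'A'

# Conditions are evaluated in list order; rows matching neither get the default.
df["var5"] = np.select([i, j], [df["var2"], df["var4"]], default=np.nan)
print(df)
```

Because the default is NaN, the new column comes out as float even though var2 and var4 are integers.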


Option 3
pd.Series.mask/where

df.var2.mask(~i, df.var4.mask(~j, np.nan))

0    23.0
1     NaN
2    11.0
3     5.0
Name: var2, dtype: float64
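
If Option 3 looks opaque: Series.mask(cond, other) replaces the values where cond is True, while Series.where(cond, other) keeps the values where cond is True and replaces the rest, with other defaulting to NaN. A sketch of the same logic written with where, which avoids the negations, might look like this (the frame is rebuilt here so the snippet runs by itself; the name fallback is just for illustration):

```python
import pandas as pd

# Example frame from the question, rebuilt so the snippet runs standalone.
df = pd.DataFrame({
    "var1": ["A", "B", "C", "A"],
    "var2": [23, 13, 12, 5],
    "var3": ["B", "C", "A", "C"],
    "var4": [7, 4, 11, 15],
})

i = df["var1"] == "A"
j = df["var3"] == "A"

# Inner step: keep var4 where var3 == 'A'; everything else becomes NaN.
fallback = df["var4"].where(j)

# Outer step: keep var2 where var1 == 'A'; otherwise take the fallback value.
df["var5"] = df["var2"].where(i, fallback)
print(df)
```

Reading it inside out, the inner call builds the "var4 or NaN" column, and the outer call prefers var2 wherever var1 is 'A'.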

2 Comments

Great, the first two options work exactly as intended. As a beginner with pandas, I'm not sure I quite understand the third option.
@matnor Neither do I... now you know my secret ;-p

Here is a simple answer, though it may not be fast. (See the comments and the other answers if you are aiming for performance.)

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 23, 'B', 7], ['B', 13, 'C', 4],
                   ['C', 12, 'A', 11], ['A', 5, 'C', 15]],
                  columns=['v1', 'v2', 'v3', 'v4'])

def get_val(row):
    if row.v1 == 'A':
        return row.v2
    elif row.v3 == 'A':
        return row.v4
    else:
        return np.nan

df["v5"] = df.apply(get_val, axis=1)

The code defines a function that returns a value for a single row, then uses apply with axis=1 to call that function on every row.

5 Comments

apply is fine as a convenience function, but you'd really want to look at alternatives if you want performance. For example, you'll get a huge speedup just by dropping apply and using np.vectorize on this function, just because of the reduced overhead.
@cᴏʟᴅsᴘᴇᴇᴅ Due to function overhead?
Yes. Besides being a glorified loop, apply has many other overheads to the point that a simple for loop might be faster (because it operates at C speed).
@cᴏʟᴅsᴘᴇᴇᴅ Thanks for the information. Learned a lot of cool tricks from you.
@cᴏʟᴅsᴘᴇᴇᴅ Ohhh! Thank you Q_Q
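
The np.vectorize suggestion from the comments above could be sketched roughly as follows. This is an illustration under my own assumptions rather than code from the thread: get_val is rewritten to take the four column values instead of a row, and otypes=[float] is passed because np.vectorize otherwise infers the output type from the first result (an integer here), which cannot hold NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 23, 'B', 7], ['B', 13, 'C', 4],
                   ['C', 12, 'A', 11], ['A', 5, 'C', 15]],
                  columns=['v1', 'v2', 'v3', 'v4'])

def get_val(v1, v2, v3, v4):
    # Scalar version of the row function: receives one value per column.
    if v1 == 'A':
        return v2
    elif v3 == 'A':
        return v4
    return np.nan

# otypes is set explicitly so the result is float and NaN is representable.
vec_get_val = np.vectorize(get_val, otypes=[float])

df['v5'] = vec_get_val(df['v1'].to_numpy(), df['v2'].to_numpy(),
                       df['v3'].to_numpy(), df['v4'].to_numpy())
print(df)
```

Note that np.vectorize still calls the Python function once per element, so any gain comes from skipping the per-row overhead of apply. It is worth timing on your own data, and the np.where and np.select options in the other answer avoid the Python-level loop altogether.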
