Create a variable in a Pandas dataframe based on information in the dataframe

Question

I have a dataframe organized in the following way

    var1   var2   var3   var4
0   A      23     B      7
1   B      13     C      4
2   C      12     A      11
3   A      5      C      15

I now want to create a new variable (column), var5, which takes the value of var2 if var1 == A and the value of var4 if var3 == A. For simplicity, var1 and var3 can never both have the value A. If neither var1 nor var3 takes the value A, then I want NaN. That is, the outcome in this example would be:

    var1   var2   var3   var4  var5
0   A      23     B      7     23
1   B      13     C      4     NaN
2   C      12     A      11    11
3   A      5      C      15    5

How can this be achieved?

cs95 · Accepted Answer · 2017-12-29 19:28:22Z

4

Option 1
Sounds like you can use np.where for this -

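import numpy as np

# Boolean masks for the two conditions
i = df.var1 == 'A'
j = df.var3 == 'A'

# var2 where var1 == 'A', else var4 where var3 == 'A', else NaN
df['var5'] = np.where(i, df.var2, np.where(j, df.var4, np.nan))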
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Option 2
An alternative would be np.select -

df['var5'] = np.select([i, j], [df.var2, df.var4], default=np.nan)
df

  var1  var2 var3  var4  var5
0    A    23    B     7  23.0
1    B    13    C     4   NaN
2    C    12    A    11  11.0
3    A     5    C    15   5.0

Note, i and j are the same variables defined in the code listing for Option 1.
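For anyone who wants to run Options 1 and 2 end to end, here is a minimal self-contained sketch. It assumes the dataframe is built directly from the sample data in the question; the names via_where and via_select are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "var1": ["A", "B", "C", "A"],
    "var2": [23, 13, 12, 5],
    "var3": ["B", "C", "A", "C"],
    "var4": [7, 4, 11, 15],
})

# Boolean masks for the two conditions.
i = df["var1"] == "A"
j = df["var3"] == "A"

# Option 1: nested np.where (if / elif / else).
via_where = np.where(i, df["var2"], np.where(j, df["var4"], np.nan))

# Option 2: np.select picks the choice of the first condition that is True.
via_select = np.select([i, j], [df["var2"], df["var4"]], default=np.nan)

df["var5"] = via_select
print(df)
```

Both calls return a plain NumPy float array, which is why var5 shows 23.0 rather than 23: NaN forces a floating-point dtype. np.select tends to scale more readably than nested np.where once there are more than two conditions.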

Option 3
pd.Series.mask/where

df.var2.mask(~i, df.var4.mask(~j, np.nan))

0    23.0
1     NaN
2    11.0
3     5.0
Name: var2, dtype: float64
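Since Option 3 is the least obvious of the three, here is a hedged step-by-step sketch of what the nested call appears to do. Series.mask(cond, other) replaces values where cond is True, and Series.where(cond, other) is its mirror image, keeping values where cond is True. The intermediate names fallback and var5 are illustrative only, and the dataframe is rebuilt from the question's sample data.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "var1": ["A", "B", "C", "A"],
    "var2": [23, 13, 12, 5],
    "var3": ["B", "C", "A", "C"],
    "var4": [7, 4, 11, 15],
})

i = df["var1"] == "A"
j = df["var3"] == "A"

# Step 1 (inner call): keep var4 only where var3 == 'A';
# every other row is masked out and becomes NaN.
fallback = df["var4"].mask(~j)

# Step 2 (outer call): keep var2 where var1 == 'A';
# every other row takes its value from `fallback`.
var5 = df["var2"].mask(~i, fallback)

# The same logic written with where, which avoids the negations.
var5_where = df["var2"].where(i, df["var4"].where(j))

print(var5)
```

Note that the result keeps the name var2 because it starts from that column; assign it to df['var5'] to store it as in the other options.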

edited Dec 29, 2017 at 19:28

answered Dec 29, 2017 at 19:19

cs95


2 Comments

matnor Over a year ago

Great, the first two options work exactly as intended. As a beginner with pandas, I'm not sure I quite understand the third option.

cs95 Over a year ago

@matnor Neither do I... now you know my secret ;-p

Tai · Answer · 2017-12-29 19:34:49Z

1

Here is a simple answer, though it may not be fast. (See the comments and other answers if you are aiming for performance.)

import numpy as np
import pandas as pd

# Same data as in the question, with shortened column names
df = pd.DataFrame([['A', 23, 'B', 7], ['B', 13, 'C', 4],
                   ['C', 12, 'A', 11], ['A', 5, 'C', 15]],
                  columns=['v1', 'v2', 'v3', 'v4'])

def get_val(row):
    if row.v1 == 'A':
        return row.v2
    elif row.v3 == 'A':
        return row.v4
    else:
        return np.nan

df["v5"] = df.apply(get_val, axis=1)

The code defines a function that returns the right value for a single row, then passes it to apply with axis=1 so that it is called once per row.

edited Dec 29, 2017 at 19:34

answered Dec 29, 2017 at 19:19

Tai


5 Comments

cs95 Over a year ago

apply is fine as a convenience function, but you'd really want to look at alternatives if you want performance. For example, you'll get a huge speedup just by dropping apply and using np.vectorize on this function, just because of the reduced overhead.

Tai Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Due to function overhead?

cs95 Over a year ago

Yes. Besides being a glorified loop, apply has many other overheads to the point that a simple for loop might be faster (because it operates at C speed).

Tai Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Thanks for the information. Learned a lot of cool tricks from you.

Tai Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Ohhh! Thank you Q_Q
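The np.vectorize suggestion from the comments above could look roughly like the sketch below. It is an assumption about what was meant, not code from either answerer: the function is rewritten to take the four column values as separate arguments instead of a row, and otypes=[float] is passed so that the NaN branch does not clash with an integer output type inferred from the first call.

```python
import numpy as np
import pandas as pd

# Same data as in the question, with shortened column names.
df = pd.DataFrame(
    [["A", 23, "B", 7], ["B", 13, "C", 4],
     ["C", 12, "A", 11], ["A", 5, "C", 15]],
    columns=["v1", "v2", "v3", "v4"],
)

def get_val(v1, v2, v3, v4):
    """Scalar version of the row function: receives one value per column."""
    if v1 == "A":
        return v2
    elif v3 == "A":
        return v4
    return np.nan

# Wrap the scalar function so it can be called on whole columns.
# otypes=[float] fixes the output dtype up front so NaN is representable.
get_val_vec = np.vectorize(get_val, otypes=[float])

df["v5"] = get_val_vec(df["v1"], df["v2"], df["v3"], df["v4"])
print(df)
```

NumPy's documentation describes np.vectorize as a convenience wrapper around a loop, so any speedup here comes from skipping apply's per-row overhead rather than from true vectorization. For large frames the np.where and np.select options are likely to remain faster.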
