
*see edits below

I have a dataframe that contains 6 columns and I am using pandas and numpy to edit and work with the data.

id      calv1      calv2      calv3      calv4 
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29
2         NaT        NaT        NaT        NaT         
3  2006-08-29        NaT        NaT        NaT
4  2006-08-29 2007-08-29 2010-08-29        NaT
5  2006-08-29 2013-08-29        NaT        NaT

I want to create another column that counts the number of "calv" that occur for each id.

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2

Here is my last attempt:

nat = np.datetime64('NaT')

#0 calvings
df.loc[
    (df["calv1"] == nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 0
#1 calving
df.loc[
    (df["calv1"] != nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 1
#2 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 2
#3 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] == nat),
    "no_calv"] = 3
#4 or more calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] != nat),
    "no_calv"] = 4

But the result is that the whole "no_calv" column is 4.0

I previously tried things like

..
(df["calv1"] != "NaT")
..

And

..
(df["calv1"] != pd.nat)
..

And the result was always 4.0 for the whole column or just NaN.

Any tips and tricks for a new python user?

*Edit: I got a great answer for just counting the sum, but I realize now that I also want to take into account whether there are missing values in between other values (see row 6):

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2
6  2006-08-29        NaT 2013-08-29 2013-08-29      NaN #or some other value

This is why I was trying to be very clear with the criteria in my original example.

  • Just commenting to let you know: since this is an already-answered question, you will probably get more attention/help if you make a new question with the additional information ('take into account if there are missing values in between') included from the get-go. Commented Jun 10, 2021 at 14:52
  • @dm2 Thank you very much for your help. Still figuring out the best way to use this great website. Should I delete this question since I've posted again and it's so similar? Commented Jun 10, 2021 at 15:04
  • I don't think there's any need to delete it. Yes, it's similar, but the additional requirement came up after it was marked as answered, and that changed the nature of the question (and in turn made the answers useful for the original question, but not for the updated one). I'm still figuring this out too, but you could ask on meta.stackoverflow.com what's the best course of action. Commented Jun 10, 2021 at 15:08

2 Answers


As long as the values are datetimes (and the NaT entries are true missing values, not strings), you can use:

df['no_calv'] = df.notna().sum(axis = 1)

To get:

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2

It checks for non-missing values and counts them in each row (axis = 1, i.e. across the columns).
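For reference, the reason the original `== nat` / `!= nat` attempt assigned 4 everywhere is that NaT follows NaN semantics: it never compares equal to anything, including itself, so every `==` condition was False on every row and only the final all-`!=` condition matched. A quick check (a sketch, not part of the answer above):

```python
import numpy as np
import pandas as pd

# NaT, like NaN, is never equal to itself; != is always True.
print(np.datetime64("NaT") == np.datetime64("NaT"))  # False
print(pd.NaT == pd.NaT)                              # False
print(pd.NaT != pd.NaT)                              # True

# The supported way to test for missing timestamps is isna()/notna():
s = pd.Series(pd.to_datetime(["2006-08-29", None]))
print(s.isna().tolist())   # [False, True]
print(s.notna().tolist())  # [True, False]
```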


Comments

Prefer notna rather than ~isna. +1
Good spot, I'm too used to isna, gonna add an edit @Corralien
Great, I'll try this. But what if there were more columns in the dataframe the I would want to exclude from this count ?
@Thordis you'd need to specify which columns to use or which columns to ignore. Assuming we're only interested in calv1 and calv2; selecting which ones to use: df['no_calv'] = df[['calv1','calv2']].notna().sum(axis = 1) ; selecting which ones to ignore: df['no_calv'] = df.drop(['calv3','calv4'], axis = 1).notna().sum(axis = 1)
@dm2 Another issue I didn't account for my example!! What if I want to take into account that the order of the calv1,calv2,calv3,calv4 has to be specific. As in... I don't want to count 3 if the value for calv1 is missing but values for calv2,calv3 and calv4 are there? Then I would want a NaN or some other type to tell me the row is not correct. That's why I was trying to specify the criteria in my original example.
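One way to handle the ordering requirement from the last comment (a sketch, assuming the calv1–calv4 column names from the question): count the non-missing dates in each row, and separately count the leading run of non-missing dates. Where the two counts differ, a NaT sits between real dates, and the row gets NaN:

```python
import numpy as np
import pandas as pd

# Sample data: row 0 has a gap (NaT between dates), rows 1-2 do not
df = pd.DataFrame({
    "calv1": pd.to_datetime(["2006-08-29", None, "2006-08-29"]),
    "calv2": pd.to_datetime([None,         None, "2007-08-29"]),
    "calv3": pd.to_datetime(["2013-08-29", None, "2010-08-29"]),
    "calv4": pd.to_datetime([None,         None, None]),
})

calv_cols = ["calv1", "calv2", "calv3", "calv4"]
present = df[calv_cols].notna().astype(int)

total = present.sum(axis=1)                    # all non-missing dates
leading = present.cumprod(axis=1).sum(axis=1)  # dates before the first NaT

# Rows where total != leading have a NaT followed by a date -> NaN
df["no_calv"] = total.where(total == leading, other=np.nan)
print(df["no_calv"].tolist())  # [nan, 0.0, 3.0]
```

The `cumprod` trick zeroes out everything from the first missing value onward, so `leading` is the length of the initial unbroken run of dates.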
1

You can do this with apply:

def counting_fun(row):
    return len(row.dropna())  # count of non-missing (non-NaT) values in the row

df['no_calv'] = df.apply(counting_fun, axis=1)
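`row.dropna()` discards the NaT entries, so its length is the count of non-missing values. Note that `apply` with `axis=1` calls a Python function once per row, which gets slow on large frames; a vectorized equivalent (a sketch with assumed sample data) is:

```python
import pandas as pd

df = pd.DataFrame({
    "calv1": pd.to_datetime(["2006-08-29", None]),
    "calv2": pd.to_datetime(["2007-08-29", None]),
})

# Same count as the apply version, computed column-wise without a Python loop
df["no_calv"] = df.notna().sum(axis=1)
print(df["no_calv"].tolist())  # [2, 0]
```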

