
*see edits below

I have a dataframe that contains 6 columns and I am using pandas and numpy to edit and work with the data.

id      calv1      calv2      calv3      calv4 
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29
2         NaT        NaT        NaT        NaT         
3  2006-08-29        NaT        NaT        NaT
4  2006-08-29 2007-08-29 2010-08-29        NaT
5  2006-08-29 2013-08-29        NaT        NaT

I want to create another column that counts the number of "calv" that occur for each id.

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2

Here is my last attempt:

nat = np.datetime64('NaT')

#0 calvings
df.loc[
    (df["calv1"] == nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 0
#1 calving
df.loc[
    (df["calv1"] != nat) & (df["calv2"] == nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 1
#2 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] == nat) & (df["calv4"] == nat),
    "no_calv"] = 2
#3 calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] == nat),
    "no_calv"] = 3
#4 or more calvings
df.loc[
    (df["calv1"] != nat) & (df["calv2"] != nat) &
    (df["calv3"] != nat) & (df["calv4"] != nat),
    "no_calv"] = 4

But the result is that the whole "no_calv" column is 4.0

I previously tried things like

..
(df["calv1"] != "NaT")
..

And

..
(df["calv1"] != pd.nat)
..

And the result was always 4.0 for the whole column or just NaN.

Any tips and tricks for a new python user?

*Edit: I got a great answer for just counting the sum, but I realize now that I also want to take into account whether there are missing values in between other values (see row 6):

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2
6  2006-08-29        NaT 2013-08-29 2013-08-29      NaN #or some other value

This is why I was trying to be very clear with the criteria in my original example.

  • Just commenting to let you know: since this is an already-answered question, you will probably get more attention/help if you make a new question with the additional information ('take into account if there are missing values in between') included from the get-go. Commented Jun 10, 2021 at 14:52
  • @dm2 Thank you very much for your help. Still figuring out the best way to use this great website. Should I delete this question since I've posted again and it's so similar? Commented Jun 10, 2021 at 15:04
  • I don't think there's any need to delete it. Yes, it's similar, but the additional requirement came up after it was marked as answered, and that changed the nature of the question (and in turn made the answers useful for the original question, but not for the updated one). I'm still figuring this out too, but you could ask on meta.stackoverflow.com what's the best course of action. Commented Jun 10, 2021 at 15:08

2 Answers


As long as the values are datetimes (and the NaT entries are true missing values, not strings), you can use:

df['no_calv'] = df.notna().sum(axis = 1)

To get:

id      calv1      calv2      calv3      calv4 no_calv
1  2006-08-29 2007-08-29 2008-08-29 2009-08-29       4
2         NaT        NaT        NaT        NaT       0 
3  2006-08-29        NaT        NaT        NaT       1
4  2006-08-29 2007-08-29 2010-08-29        NaT       3
5  2006-08-29 2013-08-29        NaT        NaT       2

It checks for non-missing values and counts them in each row (axis = 1, i.e. across the columns).
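For reference, the reason the original `== nat` / `!= nat` attempt assigned 4 everywhere is that NaT follows NaN semantics: it never compares equal to anything, including itself, so every `==` condition was False on every row and only the final all-`!=` condition matched. A quick check (a sketch, not part of the answer above):

```python
import numpy as np
import pandas as pd

# NaT, like NaN, is never equal to itself; != is always True.
print(np.datetime64("NaT") == np.datetime64("NaT"))  # False
print(pd.NaT == pd.NaT)                              # False
print(pd.NaT != pd.NaT)                              # True

# The supported way to test for missing timestamps is isna()/notna():
s = pd.Series(pd.to_datetime(["2006-08-29", None]))
print(s.isna().tolist())   # [False, True]
print(s.notna().tolist())  # [True, False]
```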


Comments

Prefer notna rather than ~isna. +1
Good spot, I'm too used to isna, gonna add an edit @Corralien
Great, I'll try this. But what if there were more columns in the dataframe the I would want to exclude from this count ?
@Thordis you'd need to specify which columns to use or which columns to ignore. Assuming we're only interested in calv1 and calv2; selecting which ones to use: df['no_calv'] = df[['calv1','calv2']].notna().sum(axis = 1) ; selecting which ones to ignore: df['no_calv'] = df.drop(['calv3','calv4'], axis = 1).notna().sum(axis = 1)
@dm2 Another issue I didn't account for my example!! What if I want to take into account that the order of the calv1,calv2,calv3,calv4 has to be specific. As in... I don't want to count 3 if the value for calv1 is missing but values for calv2,calv3 and calv4 are there? Then I would want a NaN or some other type to tell me the row is not correct. That's why I was trying to specify the criteria in my original example.
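One way to handle the ordering requirement from the last comment (a sketch, assuming the calv1–calv4 column names from the question): count the non-missing dates in each row, and separately count the leading run of non-missing dates. Where the two counts differ, a NaT sits between real dates, and the row gets NaN:

```python
import numpy as np
import pandas as pd

# Sample data: row 0 has a gap (NaT between dates), rows 1-2 do not
df = pd.DataFrame({
    "calv1": pd.to_datetime(["2006-08-29", None, "2006-08-29"]),
    "calv2": pd.to_datetime([None,         None, "2007-08-29"]),
    "calv3": pd.to_datetime(["2013-08-29", None, "2010-08-29"]),
    "calv4": pd.to_datetime([None,         None, None]),
})

calv_cols = ["calv1", "calv2", "calv3", "calv4"]
present = df[calv_cols].notna().astype(int)

total = present.sum(axis=1)                    # all non-missing dates
leading = present.cumprod(axis=1).sum(axis=1)  # dates before the first NaT

# Rows where total != leading have a NaT followed by a date -> NaN
df["no_calv"] = total.where(total == leading, other=np.nan)
print(df["no_calv"].tolist())  # [nan, 0.0, 3.0]
```

The `cumprod` trick zeroes out everything from the first missing value onward, so `leading` is the length of the initial unbroken run of dates.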
1

You can do this with apply:

def counting_fun(row):
    return len(row.dropna())  # count of non-missing (non-NaT) values in the row

df['no_calv'] = df.apply(counting_fun, axis=1)
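`row.dropna()` discards the NaT entries, so its length is the count of non-missing values. Note that `apply` with `axis=1` calls a Python function once per row, which gets slow on large frames; a vectorized equivalent (a sketch with assumed sample data) is:

```python
import pandas as pd

df = pd.DataFrame({
    "calv1": pd.to_datetime(["2006-08-29", None]),
    "calv2": pd.to_datetime(["2007-08-29", None]),
})

# Same count as the apply version, computed column-wise without a Python loop
df["no_calv"] = df.notna().sum(axis=1)
print(df["no_calv"].tolist())  # [2, 0]
```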

