Pandas - how to add multiple conditional columns to dataframe?

Question

So what I'm trying to do is create several calculations based on conditions and add them to a dataframe. Here is some code as an example:

data = pd.DataFrame({'Project':  ['Project 1', 'Project 2', ' Project 3', 'Project 4', 'Project 5', 'Project 6', 'Project 7', 'Project 8', 'Project 9', 'Project 10'],
        'Date': ['10/1/2020', '10/1/2020', '11/1/2020', '12/1/2020', '12/1/2020', '12/1/2020', '2/1/2021', '2/1/2021', '3/1/2021', '4/1/2021'],
       'Team': ['Team 1', 'Team 2', 'Team 3', 'Team 2', 'Team 2', 'Team 2', 'Team 1', 'Team 1', 'Team 1', 'Team 1'],
       'Timely': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']})
                                       
df1=data.groupby('Team')['Timely'].apply(lambda x: (x=='Yes').sum()).reset_index(name='Yes Count')       

df1

Running this code will give me a table that shows the count of projects with timely = Yes for each team. Say I wanted to add a second column to the table to show the total count where timely = no and a third and fourth column to show the percent yes and percent no for each team. How would I do that?

Does this answer your question? Apply multiple functions to multiple groupby columns — iff_or
– iff_or, Commented Sep 28, 2020 at 23:06
Almost. I tried using the function in the apply example on that page: def f(x): d = {} d['Timely'] = data['Timely'].apply(lambda x: (x=='Yes').sum()) d['Untimely'] = data['Timely'].apply(lambda x: (x=='No').sum()) return pd.Series(d, index=['Timely', 'Untimely']) df.groupby('Team').apply(f) but got the following error: 'bool' object has no attribute 'sum' — hockeyhorror
– hockeyhorror, Commented Sep 28, 2020 at 23:52

Cameron Riddell · Accepted Answer · 2020-10-01 16:08:20Z

1

Seems to me like you just want a crosstab. It's more straightforward to calculate using pd.crosstab than it is using a groupby/apply approach. Though pd.crosstab isn't the most efficient method of doing this, it works just fine.

This first snippet gets us the counts of "yes" and "no" for each team. We also get a "Total" column that is the sum of the "yes" + "no":

xtab = pd.crosstab(data["Team"], data["Timely"], margins=True, margins_name="Total")

print(xtab)
Timely  No  Yes  Total
Team                  
Team 1   1    4      5
Team 2   2    2      4
Team 3   1    0      1
Total    4    6     10

Now let's calculate the percentage breakdown of "No" and "Yes" by dividing those both by our "Total" column:

norm_xtab = xtab[["No", "Yes"]].div(xtab["Total"], axis=0)

print(norm_xtab)
Timely   No  Yes
Team            
Team 1  0.2  0.8
Team 2  0.5  0.5
Team 3  1.0  0.0
Total   0.4  0.6

Then we can combine those two dataframes horizontally with the pd.concat function. We can also give names to each "subdataframe" e.g. one chunk of the dataframe will only be for count data, the other chunk will be for the normalized/percentage based data:

out = pd.concat([xtab, norm_xtab], keys=["count", "percent"], axis=1)
out = out.rename(columns={"No": "untimely", "Yes": "timely"}, level=1)


print(out)
          count               percent       
Timely untimely timely Total untimely timely
Team                                        
Team 1        1      4     5      0.2    0.8
Team 2        2      2     4      0.5    0.5
Team 3        1      0     1      1.0    0.0
Total         4      6    10      0.4    0.6

The result of the concatenation is a DataFrame with a multiindex as the columns. If you are unfamiliar with pd.MultiIndex, you can use this snippet to flatten your result into something you may be more familiar with:

out.columns = ["{}_{}".format(*colname) for colname in out.columns]

print(out)

        count_untimely  count_timely  count_Total  percent_untimely  percent_timely
Team                                                                               
Team 1               1             4            5               0.2             0.8
Team 2               2             2            4               0.5             0.5
Team 3               1             0            1               1.0             0.0
Total                4             6           10               0.4             0.6

To change the column names as in your comment below, you can simply run rename one more time:

out = out.rename(columns=lambda colname: colname.replace("count_", ""))

print(out)
        untimely  timely  Total  percent_untimely  percent_timely
Team                                                             
Team 1         1       4      5               0.2             0.8
Team 2         2       2      4               0.5             0.5
Team 3         1       0      1               1.0             0.0
Total          4       6     10               0.4             0.6

To convert the decimals to whole percentages, you'll need to multiply by 100, then either round to 0 decimal places OR use string formatting to trim the trailing decimals (I'll show you how to do both), and add another string formatting to get the "%" to appear.

percentages = (out
               .filter(like="percent")      # select columns that contain the string "percent"
               .mul(100)                    # Multiply everything by 100
               .round(0)                    # If you don't want to perform any rounding, get rid of this line
               .applymap("{:.0f}%".format)) # Apply string formatting to trim the decimal.

print(percentages)
       percent_untimely percent_timely
Team                                  
Team 1              20%            80%
Team 2              50%            50%
Team 3             100%             0%
Total               40%            60%

If you want these values reflect back in our out dataframe:

out.loc[:, percentages.columns] = percentages

print(out)
        untimely  timely  Total percent_untimely percent_timely
Team                                                           
Team 1         1       4      5              20%            80%
Team 2         2       2      4              50%            50%
Team 3         1       0      1             100%             0%
Total          4       6     10              40%            60%

edited Oct 1, 2020 at 16:08

answered Sep 28, 2020 at 22:28

Cameron Riddell

13.8k14 silver badges21 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

hockeyhorror Over a year ago

So when I try to create the second dataframe on my real dataset I get the following error: Unable to coerce to DataFrame, shape must be (8, 2): given (8, 1) Does this mean there are null values in the data set? And is there a limit on how many dataframes I can concatenate? There are two other metrics I need to include in this dataframe.

Cameron Riddell Over a year ago

Can you show me the code that leads to the error in your case? I'm not certain which piece of code you're referring to

Cameron Riddell Over a year ago

Try this instead, my pandas version is 1.1.1 so a version difference might result in an error, this approach should be a little more robust. `norm_xtab = xtab[["No", "Yes"]].div(xtab["Total"], axis=0)

Cameron Riddell Over a year ago

Yep! Just updated my answer. You can use the rename(columns=...) method to do this. Note that I used rename twice, once right after using pd.concat and once at the end to strip away the "count_" prefix.

Cameron Riddell Over a year ago

You can concatenate any number of dataframes together, all you need is a list of dataframes. If you use axis=1 it will stick them together horizontally, and axis=0 sticks them together vertically.

|

Collectives™ on Stack Overflow

Pandas - how to add multiple conditional columns to dataframe?

1 Answer 1

10 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

10 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related