0

So what I'm trying to do is create several calculations based on conditions and add them to a dataframe. Here is some code as an example:

data = pd.DataFrame({'Project':  ['Project 1', 'Project 2', ' Project 3', 'Project 4', 'Project 5', 'Project 6', 'Project 7', 'Project 8', 'Project 9', 'Project 10'],
        'Date': ['10/1/2020', '10/1/2020', '11/1/2020', '12/1/2020', '12/1/2020', '12/1/2020', '2/1/2021', '2/1/2021', '3/1/2021', '4/1/2021'],
       'Team': ['Team 1', 'Team 2', 'Team 3', 'Team 2', 'Team 2', 'Team 2', 'Team 1', 'Team 1', 'Team 1', 'Team 1'],
       'Timely': ['Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']})
                                       
df1=data.groupby('Team')['Timely'].apply(lambda x: (x=='Yes').sum()).reset_index(name='Yes Count')       

df1

Running this code will give me a table that shows the count of projects with timely = Yes for each team. Say I wanted to add a second column to the table to show the total count where timely = no and a third and fourth column to show the percent yes and percent no for each team. How would I do that?

2
  • 1
    Does this answer your question? Apply multiple functions to multiple groupby columns Commented Sep 28, 2020 at 23:06
  • Almost. I tried using the function in the apply example on that page: def f(x): d = {} d['Timely'] = data['Timely'].apply(lambda x: (x=='Yes').sum()) d['Untimely'] = data['Timely'].apply(lambda x: (x=='No').sum()) return pd.Series(d, index=['Timely', 'Untimely']) df.groupby('Team').apply(f) but got the following error: 'bool' object has no attribute 'sum' Commented Sep 28, 2020 at 23:52

1 Answer 1

1

Seems to me like you just want a crosstab. It's more straightforward to calculate using pd.crosstab than it is using a groupby/apply approach. Though pd.crosstab isn't the most efficient method of doing this, it works just fine.

This first snippet gets us the counts of "yes" and "no" for each team. We also get a "Total" column that is the sum of the "yes" + "no":

xtab = pd.crosstab(data["Team"], data["Timely"], margins=True, margins_name="Total")

print(xtab)
Timely  No  Yes  Total
Team                  
Team 1   1    4      5
Team 2   2    2      4
Team 3   1    0      1
Total    4    6     10

Now let's calculate the percentage breakdown of "No" and "Yes" by dividing those both by our "Total" column:

norm_xtab = xtab[["No", "Yes"]].div(xtab["Total"], axis=0)

print(norm_xtab)
Timely   No  Yes
Team            
Team 1  0.2  0.8
Team 2  0.5  0.5
Team 3  1.0  0.0
Total   0.4  0.6

Then we can combine those two dataframes horizontally with the pd.concat function. We can also give names to each "subdataframe" e.g. one chunk of the dataframe will only be for count data, the other chunk will be for the normalized/percentage based data:

out = pd.concat([xtab, norm_xtab], keys=["count", "percent"], axis=1)
out = out.rename(columns={"No": "untimely", "Yes": "timely"}, level=1)


print(out)
          count               percent       
Timely untimely timely Total untimely timely
Team                                        
Team 1        1      4     5      0.2    0.8
Team 2        2      2     4      0.5    0.5
Team 3        1      0     1      1.0    0.0
Total         4      6    10      0.4    0.6

The result of the concatenation is a DataFrame with a multiindex as the columns. If you are unfamiliar with pd.MultiIndex, you can use this snippet to flatten your result into something you may be more familiar with:

out.columns = ["{}_{}".format(*colname) for colname in out.columns]

print(out)

        count_untimely  count_timely  count_Total  percent_untimely  percent_timely
Team                                                                               
Team 1               1             4            5               0.2             0.8
Team 2               2             2            4               0.5             0.5
Team 3               1             0            1               1.0             0.0
Total                4             6           10               0.4             0.6

To change the column names as in your comment below, you can simply run rename one more time:

out = out.rename(columns=lambda colname: colname.replace("count_", ""))

print(out)
        untimely  timely  Total  percent_untimely  percent_timely
Team                                                             
Team 1         1       4      5               0.2             0.8
Team 2         2       2      4               0.5             0.5
Team 3         1       0      1               1.0             0.0
Total          4       6     10               0.4             0.6

To convert the decimals to whole percentages, you'll need to multiply by 100, then either round to 0 decimal places OR use string formatting to trim the trailing decimals (I'll show you how to do both), and add another string formatting to get the "%" to appear.

percentages = (out
               .filter(like="percent")      # select columns that contain the string "percent"
               .mul(100)                    # Multiply everything by 100
               .round(0)                    # If you don't want to perform any rounding, get rid of this line
               .applymap("{:.0f}%".format)) # Apply string formatting to trim the decimal.

print(percentages)
       percent_untimely percent_timely
Team                                  
Team 1              20%            80%
Team 2              50%            50%
Team 3             100%             0%
Total               40%            60%

If you want these values reflect back in our out dataframe:

out.loc[:, percentages.columns] = percentages

print(out)
        untimely  timely  Total percent_untimely percent_timely
Team                                                           
Team 1         1       4      5              20%            80%
Team 2         2       2      4              50%            50%
Team 3         1       0      1             100%             0%
Total          4       6     10              40%            60%
Sign up to request clarification or add additional context in comments.

10 Comments

So when I try to create the second dataframe on my real dataset I get the following error: Unable to coerce to DataFrame, shape must be (8, 2): given (8, 1) Does this mean there are null values in the data set? And is there a limit on how many dataframes I can concatenate? There are two other metrics I need to include in this dataframe.
Can you show me the code that leads to the error in your case? I'm not certain which piece of code you're referring to
Try this instead, my pandas version is 1.1.1 so a version difference might result in an error, this approach should be a little more robust. `norm_xtab = xtab[["No", "Yes"]].div(xtab["Total"], axis=0)
Yep! Just updated my answer. You can use the rename(columns=...) method to do this. Note that I used rename twice, once right after using pd.concat and once at the end to strip away the "count_" prefix.
You can concatenate any number of dataframes together, all you need is a list of dataframes. If you use axis=1 it will stick them together horizontally, and axis=0 sticks them together vertically.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.