Grouping and aggregating pandas DataFrame to get a summary DataFrame

Question

I have the following detailed DataFrame:

source:

df_detailed = pd.DataFrame([
    ["Fail", "P1", "3 Failed Partition","X001, X002, X003"],
    ["Fail","P1","Late Backup","Late Backup"],
    ["Fail","P1","2 Failed Partition","X001, X002"],
    ["Fail","P2","2 Failed Partition","X001, X002"],
    ["Fail","P2","Late Backup","Late Backup"],
    ["Warn","P2","Huge Size","1GB"],
    ["Warn","P2","Huge Size","2GB"]
], columns = ["Severity", "Partition", "Status", "Comment"])

output:

  Severity Partition              Status           Comment
0     Fail        P1  3 Failed Partition  X001, X002, X003
1     Fail        P1         Late Backup       Late Backup
2     Fail        P1  2 Failed Partition        X001, X002
3     Fail        P2  2 Failed Partition        X001, X002
4     Fail        P2         Late Backup       Late Backup
5     Warn        P2           Huge Size               1GB
6     Warn        P2           Huge Size               2GB

I would like to group and aggregate this and get the below result:

Result:

  Partition                                     Status
0        P1          3 Failed Partition, 2 Late Backup
1        P2  2 Failed Partition, 1 Late Backup, 2 Warn

Note:

The keywords "Late Backup", "Failed Partition", and "Huge Size" are static and will not change.
All rows with severity "Fail" should have granular information in the summary DataFrame.
All other severities, such as "Warn", "Info", etc., should only contribute the count of that severity, as shown in the expected result example.
"Failed Partition" in the detailed DataFrame is prefixed with the number of failures; in the summary, however, each partition (i.e. P1, P2) should show the count of unique failed partition values listed in Comment.
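
To make the last rule concrete, here is a minimal sketch of the unique count for P1, using the two "Failed Partition" comments from the data above. The variable names are only illustrative.

```python
# Comments of the two "Failed Partition" rows for P1 in the detailed DataFrame.
comments = ["X001, X002, X003", "X001, X002"]

# Split each comment on commas, strip whitespace, and collect the IDs in a set
# so that repeated IDs (X001, X002) are only counted once.
unique_ids = {item.strip() for comment in comments for item in comment.split(",")}

failed_count = len(unique_ids)
print(f"{failed_count} Failed Partition")
```

The row prefixes (3 and 2) are not added together; only the distinct IDs are counted.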

Can someone please help? I've been losing sleep over this for 2 days now :(

Yes, correct. And if there are more Late Backups, it should aggregate as 2 Late Backup / 3 Late Backup.
– MagnumCodus, Commented Oct 17, 2019 at 4:52

Artiom Kozyrev · Accepted Answer · 2019-10-17 07:54:43Z

1

Thank you for the interesting task. The problem is solved: find the solution below and follow the comments in the code. Feel free to ask questions.

import pandas as pd
from collections import Counter

df_detailed = pd.DataFrame([
    ["Fail", "P1", "3 Failed Partition", "X001, X002, X003"],
    ["Fail", "P1", "Late Backup", "Late Backup"],
    ["Fail", "P1", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "Late Backup", "Late Backup"],
    ["Warn", "P2", "Huge Size", "1GB"],
    ["Warn", "P2", "Huge Size", "2GB"]
], columns=["Severity", "Partition", "Status", "Comment"])


def change_warn(severity, status):
    """Return the Severity label (e.g. "Warn") for non-Fail rows; keep the real Status for Fail rows."""
    if severity != "Fail":
        return severity
    else:
        return status


df_detailed["Status"] = df_detailed.apply(lambda row: change_warn(row["Severity"], row["Status"]), axis=1)


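def remove_leading_digits(x):
    """Drop a leading count such as "3 " so "3 Failed Partition" becomes "Failed Partition"."""
    if x[:1].isdigit():  # slicing avoids an IndexError on an empty string
        x = " ".join(x.split(" ")[1:])
    return x


df_detailed["Status"] = df_detailed["Status"].apply(remove_leading_digits)

# the trailing comma keeps entries separable once groupby().sum() concatenates the strings
df_detailed["Comment"] = df_detailed["Comment"].apply(lambda x: x + ",")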

# need to combine to distinguish P1 from P2:
df_detailed["TempStatus"] = df_detailed["Partition"] + " " + df_detailed["Status"]

gr_b = df_detailed[["Partition", "TempStatus", "Comment"]].groupby("TempStatus").sum()


def calculate_unique_comment(status, comment):
    comments = []
    if status.endswith("Failed Partition"):
        for c in comment.split(","):
            if c != "":
                comments.append(c.strip())
        counter = Counter(comments)
        return str(len(counter.keys()))
    else:
        return str(0)


del gr_b["Partition"]  # not needed after grouping

gr_b = gr_b.reset_index()  # turn TempStatus back into a regular column

gr_b["CountUnCom"] = gr_b.apply(lambda row: calculate_unique_comment(row["TempStatus"], row["Comment"]), axis=1)

# collect the number of unique comments per partition for "Failed Partition" rows in a dict
part_dict = {}
for i in range(len(gr_b)):
    if gr_b["TempStatus"][i].endswith("Failed Partition"):
        part_dict[gr_b["TempStatus"][i]] = gr_b["CountUnCom"][i]


# let's take only what we need to work with
df_small = pd.DataFrame(df_detailed[["Partition", "Status"]])

df_small["Status"] = df_small["Status"].apply(lambda x: x + ",")  # to sum and split later

gr_df_small = df_small.groupby("Partition").sum()

gr_df_small = gr_df_small.reset_index()


def convert_status_to_list(status):
    new_status = []
    for c in status.split(","):
        if c != "":
            new_status.append(c.strip())
    return new_status


gr_df_small["Status"] = gr_df_small["Status"].apply(lambda x: convert_status_to_list(x))


def calculate_status(partition, status, x):
    result = []
    for k, v in Counter(status).items():
        if k == "Failed Partition":
            v = x[partition + " " + "Failed Partition"]
        result.append(f"{v} {k}")
    return " ".join(result)


gr_df_small["Status"] = gr_df_small.apply(lambda row: calculate_status(row["Partition"], row["Status"], part_dict),  axis=1)


print(gr_df_small)

Output:

  Partition                                   Status
0        P1         3 Failed Partition 1 Late Backup
1        P2  2 Failed Partition 1 Late Backup 2 Warn
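
For comparison, the same rules can be sketched more compactly by walking the rows of each partition once. This is not the code from the answer above: the helper name summarize is hypothetical, and the parts are joined with ", " to match the format shown in the question rather than the space-separated output printed here.

```python
from collections import Counter

import pandas as pd

df_detailed = pd.DataFrame([
    ["Fail", "P1", "3 Failed Partition", "X001, X002, X003"],
    ["Fail", "P1", "Late Backup", "Late Backup"],
    ["Fail", "P1", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "Late Backup", "Late Backup"],
    ["Warn", "P2", "Huge Size", "1GB"],
    ["Warn", "P2", "Huge Size", "2GB"],
], columns=["Severity", "Partition", "Status", "Comment"])


def summarize(group):
    """Build the summary string for one partition (hypothetical helper)."""
    failed_ids = set()   # distinct IDs listed in "Failed Partition" comments
    counts = Counter()   # counts of other Fail statuses and of non-Fail severities
    for severity, status, comment in zip(group["Severity"], group["Status"], group["Comment"]):
        if severity != "Fail":
            counts[severity] += 1          # e.g. "Warn": only the severity is counted
        elif status.endswith("Failed Partition"):
            failed_ids.update(item.strip() for item in comment.split(",") if item.strip())
        else:
            counts[status] += 1            # e.g. "Late Backup"
    parts = []
    if failed_ids:
        parts.append(f"{len(failed_ids)} Failed Partition")
    parts.extend(f"{n} {label}" for label, n in counts.items())
    return ", ".join(parts)


# One summary row per partition.
summary = pd.DataFrame(
    [{"Partition": name, "Status": summarize(group)}
     for name, group in df_detailed.groupby("Partition")]
)
print(summary)
```

Because the sketch starts from the untouched detailed DataFrame, it does not need the temporary columns or the trailing commas used in the step-by-step version.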

edited Oct 17, 2019 at 7:54

answered Oct 16, 2019 at 20:14

Artiom Kozyrev


6 Comments

Artiom Kozyrev Over a year ago

@MagnumCodus I was able to solve the issue, do not forget to upvote it and mark as solution ;)

Artiom Kozyrev Over a year ago

@MagnumCodus hello, did you check the answer?

MagnumCodus Over a year ago

Sorry I couldn't check this earlier... But I am getting an error like this: result.append(f"{v} {k}") SyntaxError: invalid syntax

Artiom Kozyrev Over a year ago

@MagnumCodus I guess you are using a Python version that does not support f-strings. Instead of f"{v} {k}" use "{} {}".format(v, k). Do not forget the space between the two {} (not inside them!).
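
A small sketch of the two spellings side by side, with made-up values for v and k. f-strings were added in Python 3.6, so on older interpreters only the str.format version parses.

```python
v, k = 2, "Late Backup"  # illustrative values, not taken from a real run

# Works on any Python 3 version (and Python 2.7).
with_format = "{} {}".format(v, k)

# Requires Python 3.6 or newer; older versions raise SyntaxError here.
with_fstring = f"{v} {k}"

print(with_format)
print(with_format == with_fstring)
```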

MagnumCodus Over a year ago

Brilliant, it works!!! Let me try it out on a couple of scenarios. Thank you very much!!! :) Which version of Python are you using, by the way?
