import pandas as pd

data = {
    "K": ["A", "A", "B", "B", "B"],
    "LABEL": ["X12", "X12", "X21", "L31", "L31"],
    "VALUE": [1, 3, 1, 2, 5.0]
}

df = pd.DataFrame.from_dict(data)

output = """
   K LABEL  VALUE
0  A   X12    1.0
1  A   X12    3.0
2  B   X21    1.0
3  B   L31    2.0
4  B   L31    5.0
"""

Transformation steps

For each group (grouped by K and LABEL), compute FINAL_VALUE as defined below.

LABEL values are of two types: X__ and L__.

# if LABEL is X___ then FINAL_VALUE = sum(VALUE)
# if LABEL is L___ then FINAL_VALUE = count(VALUE)
# else FINAL_VALUE = 0

Result of transformation

expected_output = """
K  LABEL  FINAL_VALUE
A    X12            4
B    X21            1
B    L31            2
"""

How can I achieve this using pandas?

EDIT1: Partially working

In [17]: df.groupby(["K", "LABEL"]).agg({"VALUE": {"VALUE_SUM": "sum", "VALUE_COUNT": "count"}})
Out[17]: 
              VALUE          
        VALUE_COUNT VALUE_SUM
K LABEL                      
A X12             2       4.0
B L31             2       7.0
  X21             1       1.0

EDIT2: Using reset_index() to fill up the dataframe

In [18]: df2 = df.groupby(["K", "LABEL"]).agg({"VALUE": {"VALUE_SUM": "sum", "VALUE_COUNT": "count"}})

In [21]: df2.reset_index()
Out[21]: 
   K LABEL       VALUE          
           VALUE_COUNT VALUE_SUM
0  A   X12           2       4.0
1  B   L31           2       7.0
2  B   X21           1       1.0

EDIT3: Final solution using df.apply()

In [59]: df3 = df2.reset_index()

In [60]: df3["FINAL_VALUE"] = df3.apply(lambda x: x["VALUE"]["VALUE_SUM"] if x["LABEL"].str.startswith("X").any() else x["VALUE"]["VALUE_COUNT"] , axis=1)

In [61]: df3[["K", "LABEL", "FINAL_VALUE"]]
Out[61]: 
   K LABEL FINAL_VALUE
0  A   X12         4.0
1  B   L31         2.0
2  B   X21         1.0
  • OK, I see that you already got the answer by yourself:) Commented Sep 8, 2016 at 14:41
  • @vlad.rad Not yet :-) I am almost there. I need to get the exact columns. Commented Sep 8, 2016 at 14:53

2 Answers

You could use DataFrameGroupBy.agg as you did before, then write a generic function that picks the right aggregate for each row with the help of str.startswith, and apply it row-wise to build the required frame, as shown:

def compute_multiple_condition(row):
    if row['LABEL'].startswith('X'):
        return row['sum']
    elif row['LABEL'].startswith('L'):
        return row['count']
    else:
        return 0

df = df.groupby(['K','LABEL'])['VALUE'].agg(['sum', 'count']).reset_index()
df['FINAL_VALUE'] = df.apply(compute_multiple_condition, axis=1).astype(int)
df = df[['K', 'LABEL', 'FINAL_VALUE']]
df

   K LABEL  FINAL_VALUE
0  A   X12            4
1  B   L31            2
2  B   X21            1
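
If the row-wise apply becomes a bottleneck on a larger frame, the same rule can probably be expressed column-wise. The following is only a sketch of that idea, assuming the sample data from the question; swapping in `numpy.select` is my substitution, not part of the answer above, and the names `agg`, `conditions` and `choices` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "K": ["A", "A", "B", "B", "B"],
    "LABEL": ["X12", "X12", "X21", "L31", "L31"],
    "VALUE": [1, 3, 1, 2, 5.0],
})

# One row per (K, LABEL) with both aggregates as plain columns.
agg = df.groupby(["K", "LABEL"])["VALUE"].agg(["sum", "count"]).reset_index()

# Boolean masks for the two label types, evaluated on the whole column.
conditions = [
    agg["LABEL"].str.startswith("X").to_numpy(),
    agg["LABEL"].str.startswith("L").to_numpy(),
]
# Matching choices: sum for X labels, count for L labels.
choices = [agg["sum"].to_numpy(), agg["count"].to_numpy()]

# Rows matching neither mask fall back to the default of 0.
agg["FINAL_VALUE"] = np.select(conditions, choices, default=0).astype(int)

result = agg[["K", "LABEL", "FINAL_VALUE"]]
print(result)
```

The first matching condition wins, so the order of `conditions` mirrors the if/elif order in `compute_multiple_condition`.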

3 Comments

For me startswith gave the error: AttributeError: ("'Series' object has no attribute 'startswith'", u'occurred at index 0'). But you gave the right approach.
You need to use it along with the str accessor like series.str.startswith(). In my function, it is calculated on individual strings and not on the entire series as such and hence the accessor isn't required.
Also, check the dtypes. LABEL column should be of type object for it to work.
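
To make the distinction in these comments concrete, here is a small sketch (the `labels` Series is made up for illustration) showing where the `.str` accessor is needed and where the built-in string method is enough:

```python
import pandas as pd

labels = pd.Series(["X12", "L31", "X21"], name="LABEL")

# On a whole column, string methods are reached through the .str accessor.
mask = labels.str.startswith("X")
print(mask.tolist())

# A single cell is a plain Python str, so the built-in method applies directly.
first = labels.iloc[0]
print(first.startswith("X"))

# Calling the built-in method on the Series itself raises AttributeError.
try:
    labels.startswith("X")
    raised = False
except AttributeError as exc:
    raised = True
    print("AttributeError:", exc)
```

Inside `df.apply(..., axis=1)` on a frame with ordinary (single-level) columns, `row['LABEL']` is a single string, which is why the function in the answer does not need `.str`.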

You can try chaining the DataFrame operations:

result = (df.groupby(['K', 'LABEL'])
            .apply(lambda frame: frame.VALUE.sum()
                                if frame.LABEL.iloc[0].startswith("X") else len(frame))
            .to_frame('FINAL_VALUE')
         )
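
Note that the chain above counts rows for every label that does not start with X, so it does not distinguish L labels from the "else" case in the question. A possible way to cover all three cases is to loop over the groups explicitly. This is a sketch only, assuming the question's sample data; `final_value` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "K": ["A", "A", "B", "B", "B"],
    "LABEL": ["X12", "X12", "X21", "L31", "L31"],
    "VALUE": [1, 3, 1, 2, 5.0],
})

def final_value(label, values):
    """Hypothetical helper: apply the X / L / else rule to one group's VALUE column."""
    if label.startswith("X"):
        return int(values.sum())
    if label.startswith("L"):
        return int(values.count())
    return 0

# Iterating a groupby yields (key, group) pairs; with two grouping
# columns the key is a (K, LABEL) tuple.
rows = [
    {"K": k, "LABEL": label, "FINAL_VALUE": final_value(label, values)}
    for (k, label), values in df.groupby(["K", "LABEL"])["VALUE"]
]
result = pd.DataFrame(rows)
print(result)
```

An explicit loop is slower than a vectorized expression, but it keeps the three-way rule readable and easy to extend.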
