Python - Aggregating two rows with different operations for different columns

Question

I don't know where to start but I have data for two stock portfolios that I need to combine to represent one portfolio. Below is the dataframe that I'm starting with and also that I want to end up with.

Here's the data I already have

import pandas as pd

rawdata = {'portfolio': ['port1', 'port2', 'port1', 'port2'],
        'portfolioname': ['portfolioone', 'portfoliotwo', 'portfolioone', 'portfoliotwo'],
        'date': ['04/12/2020', '04/12/2020', '04/12/2020', '04/12/2020'],
        'code': ['ABC', 'ABC', 'XYZ', 'XYZ'],
        'quantity': [2, 3, 10, 11],
        'price': [1.5, 1.5, 0.2, 0.2],
        'value': [3, 4.5, 2, 2.2],
        'weight': [.6, .67, .4, .328]}

df1 = pd.DataFrame(rawdata)

Here's the data that I want to create

finisheddata = {'portfolio': ['port3', 'port3'],
        'portfolioname': ['portfoliothree', 'portfoliothree'],
        'date': ['04/12/2020', '04/12/2020'],
        'code': ['ABC', 'XYZ'],
        'quantity': [5, 21],
        'price': [1.5, 0.2],
        'value': [7.5, 4.2],
        'weight': [.64, .36]}

df2 = pd.DataFrame(finisheddata)

So what I'm trying to do is group the two portfolios together by 'code', where 'portfolio' and 'portfolioname' are arbitrary, 'date' is always the same for both portfolios, 'quantity' is a sum, 'price' is taken from either port1 or port2, 'value' is 'price' x 'quantity', and 'weight' is 'value' divided by the total value of the combined portfolio.

Thanks very very much.

How do you decide the name for 'port1' and 'port2' after they are grouped? Similarly for portfolioname?
– Akshay Sehgal, Commented Dec 4, 2020 at 2:03
I have updated my answer. The first version hardcodes the portfolio and portfolioname values after aggregation; the second implements the logic port5 + port6 = port11 and portfolioone + portfoliofive = portfoliosix. It currently works only for single digits and their sum, so beware.
– Akshay Sehgal, Commented Dec 4, 2020 at 2:26

Aaj Kaal · Accepted Answer · 2020-12-04 03:16:12Z

To keep the columns that are identical within each group when using agg, aggregate them with 'first', as shown below:

Code:

import pandas as pd

rawdata = {'portfolio': ['port1', 'port2', 'port1', 'port2'],
        'portfolioname': ['portfolioone', 'portfoliotwo', 'portfolioone', 'portfoliotwo'],
        'date': ['04/12/2020', '04/12/2020', '04/12/2020', '04/12/2020'],
        'code': ['ABC', 'ABC', 'XYZ', 'XYZ'],
        'quantity': [2, 3, 10, 11],
        'price': [1.5, 1.5, 0.2, 0.2],
        'value': [3, 4.5, 2, 2.2],
        'weight': [.6, .67, .4, .328]}

df1 = pd.DataFrame(rawdata)
print(df1, '\n')

finisheddata = {'portfolio': ['port3', 'port3'],
        'portfolioname': ['portfoliothree', 'portfoliothree'],
        'date': ['04/12/2020', '04/12/2020'],
        'code': ['ABC', 'XYZ'],
        'quantity': [5, 21],
        'price': [1.5, 0.2],
        'value': [7.5, 4.2],
        'weight': [.64, .36]}

df2 = pd.DataFrame(finisheddata) # Desired
print(df2, '\n')

# 'first' keeps one value per group for columns that are identical within a group;
# weight is not aggregated because it has to be recomputed from the combined values
df3 = df1.groupby(['code']).agg({'portfolio' : 'first',  'portfolioname' : 'first',  'date' : 'first', 'quantity': 'sum', 'price' : 'first'}).reset_index()
df3['value'] = df3.price * df3.quantity
df3['weight'] = df3.value / df3.value.sum()  # share of the combined portfolio's total value
df3['portfolio'] = df3['portfolioname'] = 'combined'
df3 = df3[['portfolio', 'portfolioname', 'date', 'code', 'quantity', 'price', 'value', 'weight']]
print(df3)

Output:

  portfolio portfolioname        date code  quantity  price  value  weight
0     port1  portfolioone  04/12/2020  ABC         2    1.5    3.0   0.600
1     port2  portfoliotwo  04/12/2020  ABC         3    1.5    4.5   0.670
2     port1  portfolioone  04/12/2020  XYZ        10    0.2    2.0   0.400
3     port2  portfoliotwo  04/12/2020  XYZ        11    0.2    2.2   0.328

  portfolio   portfolioname        date code  quantity  price  value  weight
0     port3  portfoliothree  04/12/2020  ABC         5    1.5    7.5    0.64
1     port3  portfoliothree  04/12/2020  XYZ        21    0.2    4.2    0.36

  portfolio portfolioname        date code  quantity  price  value    weight
0  combined      combined  04/12/2020  ABC         5    1.5    7.5  0.641026
1  combined      combined  04/12/2020  XYZ        21    0.2    4.2  0.358974
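
A variation on the same idea, offered as a minimal sketch rather than a definitive implementation: pandas named aggregation (available from pandas 0.25) lets each output column be spelled out as (input column, aggregation). Here 'value' and 'weight' are derived after grouping, following the definitions in the question. The variable name combined and the label 'combined' are placeholders chosen for illustration.

```python
import pandas as pd

# Same input rows as df1 in the question.
df1 = pd.DataFrame({
    'portfolio': ['port1', 'port2', 'port1', 'port2'],
    'portfolioname': ['portfolioone', 'portfoliotwo', 'portfolioone', 'portfoliotwo'],
    'date': ['04/12/2020'] * 4,
    'code': ['ABC', 'ABC', 'XYZ', 'XYZ'],
    'quantity': [2, 3, 10, 11],
    'price': [1.5, 1.5, 0.2, 0.2],
    'value': [3, 4.5, 2, 2.2],
    'weight': [.6, .67, .4, .328],
})

# Named aggregation: output_column=(input_column, aggregation).
combined = (
    df1.groupby('code')
       .agg(date=('date', 'first'),
            quantity=('quantity', 'sum'),
            price=('price', 'first'))
       .reset_index()
)

# Derived columns, per the question: value = price x quantity,
# weight = value / total value of the combined portfolio.
combined['value'] = combined['price'] * combined['quantity']
combined['weight'] = combined['value'] / combined['value'].sum()

# Placeholder labels for the merged portfolio (the question calls them arbitrary).
combined.insert(0, 'portfolio', 'combined')
combined.insert(1, 'portfolioname', 'combined')

print(combined)
```

Deriving weight from the combined values, instead of aggregating the input weights, keeps the weights summing to 1.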

Paul Brennan · Accepted Answer · 2020-12-04 02:09:05Z

This is a touch inelegant, but it shows how to use groupby to build the output rows one group at a time and then move them into a dataframe. Once the rest of the output is assembled, the weight is worked out from the 'value' column of that dataframe.

data = []
for cname, dfsub in df1.groupby('code'):
    port = 'portx'
    portname = 'portnew'
    code = cname
    quant = dfsub.quantity.sum()
    date = dfsub.date.iloc[0]
    price = dfsub.price.iloc[0]
    value = quant * price
    data.append([port,portname,date,code,quant,price,value])
dfout = pd.DataFrame(data, columns=['portfolio', 'portfolioname', 'date', 'code', 'quantity', 'price', 'value'])
sumval = dfout.value.sum()
dfout['weight'] = dfout['value'] / sumval

The output looks like:

  portfolio portfolioname        date code  quantity  price  value    weight
0     portx       portnew  04/12/2020  ABC         5    1.5    7.5  0.641026
1     portx       portnew  04/12/2020  XYZ        21    0.2    4.2  0.358974

If you want to reduce the number of digits in weight, use dfout.round({'weight': 3}) to round it to 3 decimal places.
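
One detail worth sketching: DataFrame.round returns a new dataframe and does not modify dfout in place, so the result has to be assigned to something. The small dfout below is a stand-in built from the values in the output above, and the name rounded is only illustrative.

```python
import pandas as pd

# Stand-in for dfout, using the values shown in the output above.
dfout = pd.DataFrame({
    'code': ['ABC', 'XYZ'],
    'value': [7.5, 4.2],
})
dfout['weight'] = dfout['value'] / dfout['value'].sum()

# round() returns a new DataFrame; dfout itself keeps full precision.
rounded = dfout.round({'weight': 3})

print(rounded)
print(dfout)
```

Keeping the unrounded dfout around is useful if the weights are needed for further arithmetic, since rounded weights may no longer sum to exactly 1.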

Akshay Sehgal · Accepted Answer · 2020-12-04 02:20:58Z

You can simply define a dictionary with columns and corresponding aggregations and use agg() with groupby() to get what you need.

g = {'portfolio':lambda x:'portx',
     'portfolioname':lambda x:'portfoliox',
     'date':'first',
     'quantity':'sum',
     'price':'mean',
     'value':'sum'}

out = df1.groupby(['code']).agg(g).reset_index()
# weight is each code's share of the combined value, so it is derived after aggregating
out['weight'] = out['value'] / out['value'].sum()
out

  code portfolio portfolioname        date  quantity  price  value    weight
0  ABC     portx    portfoliox  04/12/2020         5    1.5    7.5  0.641026
1  XYZ     portx    portfoliox  04/12/2020        21    0.2    4.2  0.358974

My confusion is with portx and portfoliox. Right now I have hardcoded them, because you mention they are arbitrary. Is there a logic for combining the port1 and port2 strings that you want to implement during aggregation? Let me know and I can update my answer accordingly.

EDIT: Aggregation over the portx and portfoliox

Since I didn't get a response from OP, here is the code if you want to generate portx and portfoliox from the existing values during aggregation:

word2int = {'one': 1, 
             'two': 2, 
             'three': 3, 
             'four': 4, 
             'five': 5, 
             'six': 6, 
             'seven': 7, 
             'eight': 8, 
             'nine': 9, 
             'zero' : 0}

int2word = {v:k for k,v in word2int.items()}

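# 'portfolio' reads only the last digit of each name; 'portfolioname' works only
# while the summed number is between 0 and 9 (the range covered by int2word)
g = {'portfolio':lambda x: 'port'+str(sum([int(i[-1]) for i in x])),
     'portfolioname':lambda x: 'portfolio'+int2word.get(sum([word2int.get(i[9:]) for i in x])),
     'date':'first',
     'quantity':'sum',
     'price':'mean',
     'value':'sum'}

out = df1.groupby(['code']).agg(g).reset_index()
# weight is each code's share of the combined value, so it is derived after aggregating
out['weight'] = out['value'] / out['value'].sum()
out

  code portfolio   portfolioname        date  quantity  price  value    weight
0  ABC     port3  portfoliothree  04/12/2020         5    1.5    7.5  0.641026
1  XYZ     port3  portfoliothree  04/12/2020        21    0.2    4.2  0.358974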
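
As noted in the comments, the 'portfolio' lambda reads only the last character of each name. A hedged sketch of one way to lift that limit for the numeric ids is to parse the whole trailing number with a regular expression. The helper name combine_port_ids is hypothetical, and it assumes every name ends in digits, as in 'port1' and 'port12'. The word-based 'portfolioname' column is left out, since it would need a number-to-words mapping beyond int2word.

```python
import re

def combine_port_ids(names):
    """Sum the trailing integers of names such as 'port5' and 'port12'."""
    total = 0
    for name in names:
        match = re.search(r'(\d+)$', name)  # all digits at the end of the name
        if match is None:
            raise ValueError(f"no trailing number in {name!r}")
        total += int(match.group(1))
    return f"port{total}"

print(combine_port_ids(['port1', 'port2']))    # port3
print(combine_port_ids(['port5', 'port12']))   # port17
```

Because the function accepts any iterable of strings, it should be usable in place of the 'portfolio' lambda in g, where each group arrives as a Series of names.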
