construct new columns based on some condition from multiple columns and rows in a dataframe

Question

I have a dataframe that looks like this :

leid     run_seq     cp_id     products    currency     amount
101           1       201        A           YEN         345
102           2       201        B           INR         223
101           2       202        A           USD         845
102           3       201        C           USD         345
102           3       203        A           INR         747

Now I want to create another dataframe (or maybe rewrite the existing one) that has the columns current and history along with the existing ones. It would look like this:

leid     run_seq     current                                     History
101           1       {201:{A:{YEN:345}}}                          {}
102           2       {201:{B:{INR:223}}}                          {}
101           2       {202:{A:{USD:845}}}                          {201:{A:{YEN:345}}}
102           3       {201:{C:{USD:345}},203:{A:{INR:747}}}        {201:{B:{INR:223}}}

To give context and explain the problem: run_seq can be treated as a date. If run_seq = 1, it is the first day, so there can be no history for leid = 101, hence the empty dictionary. The current entry refers to the transactions on that particular run_seq.

For example, if leid 101 makes two transactions on run_seq 1 with two different cp_ids, current would be {201:{A:{YEN:345}}, 202:{B:{USD:INR}}}. If both transactions have the same cp_id but different products, it would be {201:{A:{YEN:345},B:{USD:828}}}. If they have the same cp_id and product but different currencies, it would be {201:{A:{YEN:345, USD:734}}}. If they have the same cp_id, product, and currency, the amounts are added: {201:{A:{YEN:345, YEN:734}}} becomes {201:{A:{YEN:1079}}}.
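The nesting rules above can be sketched in plain Python. This is only an illustration of the rules as described, not code from the question; `build_current` is a hypothetical helper name, and each transaction is assumed to arrive as a (cp_id, product, currency, amount) tuple.

```python
def build_current(rows):
    """Nest (cp_id, product, currency, amount) tuples as {cp_id: {product: {currency: total}}}."""
    current = {}
    for cp_id, product, currency, amount in rows:
        # Create the cp_id and product levels on first sight, then reuse them.
        by_currency = current.setdefault(cp_id, {}).setdefault(product, {})
        # Amounts for the same cp_id, product and currency are added together.
        by_currency[currency] = by_currency.get(currency, 0) + amount
    return current

# Same cp_id, product and currency: the amounts are summed.
same_currency = build_current([(201, "A", "YEN", 345), (201, "A", "YEN", 734)])
# Same cp_id, different products: two product keys under one cp_id.
two_products = build_current([(201, "A", "YEN", 345), (201, "B", "USD", 828)])

print(same_currency)  # {201: {'A': {'YEN': 1079}}}
print(two_products)   # {201: {'A': {'YEN': 345}, 'B': {'USD': 828}}}
```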

History for a particular leid at a given run_seq would be the combination of all the nested dictionaries from all previous run_seq values. For example, if run_seq = 5, history would be the combination of the nested dicts for run_seq = 1, 2, 3, 4 for that particular leid.
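One way to read "combination of all previous run_seq" is a running merge over the current dicts of a single leid, ordered by run_seq. The sketch below assumes that colliding amounts should be summed, as in the current column; `merge_nested` and `histories` are hypothetical names, not part of the question.

```python
import copy

def merge_nested(base, extra):
    """Return a new dict combining two nested dicts, summing numeric leaves that collide."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict):
            merged[key] = merge_nested(merged.get(key, {}), value)
        else:
            merged[key] = merged.get(key, 0) + value
    return merged

def histories(currents):
    """For currents ordered by run_seq, return the history that precedes each one."""
    result, running = [], {}
    for current in currents:
        result.append(running)
        # merge_nested returns a fresh dict, so entries already stored are not modified.
        running = merge_nested(running, current)
    return result

# Current dicts of leid 102 from the question, for run_seq 2 and 3.
currents_102 = [
    {201: {"B": {"INR": 223}}},
    {201: {"C": {"USD": 345}}, 203: {"A": {"INR": 747}}},
]
print(histories(currents_102))  # [{}, {201: {'B': {'INR': 223}}}]
```

The first entry is always an empty dict, which matches the "no history on the first day" case.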

Note that there should be only one unique leid on a particular run_seq in the output.

I have tried everything I can think of, but I am not able to come up with complete code. In fact, I cannot work out where to start.

Is it possible to add more examples from the "For example" paragraph to the DataFrame, so that it forms a minimal, complete, and verifiable example?
– jezrael, Commented Jan 4, 2020 at 9:02

N. Dani · Accepted Answer · 2020-01-04 14:29:08Z

1

I used pandas' apply function together with a custom function applied to a groupby object

(credit for the custom groupby function: https://medium.com/@sean.turner026/applying-custom-functions-to-groupby-objects-in-pandas-61af58955569 )

I also modified your input a little to show some possible outcomes.

The code is shown below.

# defined the table copied from your question

table = """
leid     run_seq     cp_id     products    currency     amount
101           1       201        A           YEN         345
102           1       201        A           IDR         900
102           2       201        B           INR         223
101           2       202        A           USD         845
102           3       201        C           USD         345
"""

import pandas as pd
import numpy as np

with open("stackoverflow.csv", "w") as f:
    f.write(table)

df = pd.read_csv("stackoverflow.csv", sep=r"\s+")  # delim_whitespace is deprecated in recent pandas
df = df.sort_values(by = ["leid", "run_seq"]).reset_index(drop = True)
# assigned using pandas apply in axis = 1
df["current"] = df.apply(lambda x: {x["cp_id"]: {x["products"]: {x["currency"]: x["amount"]}}}, axis = 1)


# defining a function to merge dictionaries
def Merge(dict1, dict2): 
    res = {**dict1, **dict2} 
    return res 

# defining a customised cumulative function dictionary
def cumsumdict(data):

    current_dict = [{}]

    for i in range(1, data.shape[0]):
        cp_id = list(data["current"].iloc[i-1])[0]
        product = list(data["current"].iloc[i-1][cp_id])[0]
        currency = list(data["current"].iloc[i-1][cp_id][product])[0]
        if cp_id in current_dict[-1]:
            # merge cp_id using dictionary merge if exist in previous trx
            cp_merger = Merge(current_dict[-1][cp_id], data["current"].iloc[i-1][cp_id])
            appender = current_dict[-1]
            appender[cp_id] = cp_merger
            if product in current_dict[-1][cp_id]:
                # merge products using dictionary merge if exist in previous trx
                product_merger = Merge(current_dict[-1][cp_id][product], data["current"].iloc[i-1][cp_id][product])
                appender = current_dict[-1]
                appender[cp_id][product] = product_merger
                if currency in current_dict[-1][cp_id][product]:
                    # sum the currency value 
                    currency_merger = current_dict[-1][cp_id][product][currency] + data["current"].iloc[i-1][cp_id][product][currency]
                    appender = current_dict[-1]
                    appender[cp_id][product][currency] = currency_merger



        else:
            appender = Merge(current_dict[-1], data["current"].iloc[i-1])

        current_dict.append(appender)

    data["history"] = current_dict

    return data

df = df.groupby(["leid"]).apply(cumsumdict)
df = df[["leid", "run_seq", "current", "history"]]
print(df)

The code above produces:

  leid  run_seq                     current  \
0   101        1  {201: {'A': {'YEN': 345}}}   
3   101        2  {202: {'A': {'USD': 845}}}   
1   102        1  {201: {'A': {'IDR': 900}}}   
2   102        2  {201: {'B': {'INR': 223}}}   
4   102        3  {201: {'C': {'USD': 345}}}   

                                         history  
0                                             {}  
3                     {201: {'A': {'YEN': 345}}}  
1                                             {}  
2  {201: {'A': {'IDR': 900}, 'B': {'INR': 446}}}  
4  {201: {'A': {'IDR': 900}, 'B': {'INR': 446}}}
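A note on the output above: for leid 102 the history at run_seq 2 already includes product B, and INR shows 446 although the only INR amount in the input is 223. That pattern is what one would expect if `appender = current_dict[-1]` binds a second name to the same dictionary instead of copying it, so an earlier history entry changes when a later one is built. The sketch below is not the answer's code; it only illustrates that aliasing effect and one possible way around it with `copy.deepcopy`, using made-up variable names.

```python
import copy

first = {201: {"A": {"IDR": 900}}}

# Aliasing: `alias` and `aliased[-1]` are the same object, so the stored entry changes too.
aliased = [first]
alias = aliased[-1]
alias[201]["B"] = {"INR": 223}
aliased.append(alias)
print(aliased[0] is aliased[1])  # True: one dict stored twice

# Copying: take a deep copy before modifying, so earlier entries keep their values.
copied = [{201: {"A": {"IDR": 900}}}]
snapshot = copy.deepcopy(copied[-1])
snapshot[201]["B"] = {"INR": 223}
copied.append(snapshot)
print(copied[0])  # {201: {'A': {'IDR': 900}}}
print(copied[1])  # {201: {'A': {'IDR': 900}, 'B': {'INR': 223}}}
```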

edited Jan 4, 2020 at 14:29

answered Jan 4, 2020 at 13:12

N. Dani


12 Comments

Akash Dubey Over a year ago

It gives the following error - KeyError: ('amount', 'occurred at index 0')

N. Dani Over a year ago

@AkashDubey in pandas apply, you need to add axis = 1

N. Dani Over a year ago

where df.apply(lambda x: ... x["amount"].., axis = 1)

Akash Dubey Over a year ago

Yes. I just copy pasted your code. It throws the same error.

Akash Dubey Over a year ago

Sorry. My fault. Working now. The column names were different. Thanks a ton.
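Both causes raised in this thread lead to a KeyError inside `df.apply`: leaving out `axis = 1`, and using a column label that does not exist. The small sketch below reproduces each case on a toy frame; the frame and the misspelled label "amt" are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"currency": ["YEN", "USD"], "amount": [345, 845]})

# axis=1: the lambda receives one row, so column labels are valid keys.
per_row = df.apply(lambda row: {row["currency"]: row["amount"]}, axis=1)
print(per_row.tolist())  # [{'YEN': 345}, {'USD': 845}]

# Default axis=0: the lambda receives one column, indexed by the row labels 0 and 1.
try:
    df.apply(lambda col: col["amount"])
    default_axis_fails = False
except KeyError:
    default_axis_fails = True

# axis=1 with a label the frame does not have ("amt" is a deliberate mismatch).
try:
    df.apply(lambda row: row["amt"], axis=1)
    wrong_name_fails = False
except KeyError:
    wrong_name_fails = True

print(default_axis_fails, wrong_name_fails)  # True True
```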


kantal · Accepted Answer · 2020-01-05 10:57:32Z

0

Here is my solution; note, however, that 'history' contains lists of dicts rather than single dicts.

import pandas as pd, numpy as np
import io

# My test data:
text="""leid  run_seq  cp_id products currency  amount
0      101.0      1.0  201.0        A      YEN   345.0
1      102.0      2.0  201.0        B      INR   223.0
2      101.0      2.0  202.0        A      USD   845.0
3      102.0      3.0  201.0        C      USD   345.0
4      101.0      1.0  201.0        A      YEN   100.0
5      101.0      1.0  203.0        B     EURO   200.0
6      101.0      1.0  203.0        C      AUD   300.0"""

df= pd.read_csv(io.StringIO(text),sep=r"\s+",engine="python").sort_values(["leid","run_seq"])
G= df.groupby(["leid","run_seq"],sort=False)

def mkdict(grp):
    # Out: {201:{A:{YEN:345}}}
    d_cpid={}
    for r in grp.itertuples():
        d_prod= d_cpid.setdefault(r.cp_id, {} )     # {201:{}
        d_curr= d_prod.setdefault(r.products,{})    # {201:{A:{}
        d_curr[r.currency]= d_curr.get(r.currency,0)+r.amount   # {201:{A:{YEN:

    return d_cpid

rslt= G.apply(lambda grp: mkdict(grp))
rslt= rslt.reset_index().rename(columns={0:"current"})

L=[]
G1= rslt.groupby("leid")
for key,grp in G1:
    L.append([])
    lv= grp["current"].values
    for i in range(1,len(lv)):
        L.append(lv[:i])

rslt["history"]= L

EDIT: Next try

import pandas as pd, numpy as np
import io

# My test data
text="""leid  run_seq  cp_id products currency  amount
0      101.0      1.0  201.0        A      YEN   345.0
1      102.0      2.0  201.0        B      INR   223.0
2      101.0      2.0  202.0        A      USD   845.0
3      102.0      3.0  201.0        C      USD   345.0
4      101.0      1.0  201.0        A      YEN   100.0
5      101.0      1.0  203.0        B      EUR   200.0
6      101.0      1.0  203.0        C      AUD   300.0
7      101.0      3.0  204.0        D      INR   400.0
8      101.0      2.0  203.0        B      EUR   155.0
"""

df= pd.read_csv(io.StringIO(text),sep=r"\s+",engine="python").sort_values(["leid","run_seq"])
G= df.groupby(["leid","run_seq"],sort=False)

# This function takes one group (a DataFrame) and returns a list of tuples:
def mklist(grp):
    return [ (r.cp_id,r.products,r.currency,r.amount) for r in grp.itertuples()]

# It builds a nested dictionary from a list of tuples:
def mkdict(lt):

    # Out: {201:{A:{YEN:345}}, ...}
    d_cpid={}
    for cpid,prod,curr,amnt in lt:
        d_prod= d_cpid.setdefault(cpid, {})    # {201:{}
        d_curr= d_prod.setdefault(prod,{})      # {201:{A:{}
        d_curr[curr]= d_curr.get(curr,0)+amnt   # {201:{A:{YEN:

    return d_cpid

rslt= G.apply(lambda grp: mklist(grp) )
rslt= rslt.reset_index().rename(columns={0:"current"})

L=[]
G1= rslt.groupby("leid")
for key,grp in G1:
    L.append([])
    lv= grp["current"].values
    for i in range(1,len(lv)):
        L.append( [t for l in lv[:i] for t in l] )

rslt["history"]= [ mkdict(l) for l in L ]
rslt["current"]= [ mkdict(l) for l in rslt.current.values ]
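For reference, here is one way to check what current and history should look like on the question's original five rows. It is a sketch, not kantal's code: it reuses the tuples-then-dicts idea but iterates over the groups directly instead of calling `groupby.apply`. The names `to_nested`, `seen`, and `records` are chosen for illustration only.

```python
import pandas as pd

# The five rows from the question.
df = pd.DataFrame({
    "leid":     [101, 102, 101, 102, 102],
    "run_seq":  [1, 2, 2, 3, 3],
    "cp_id":    [201, 201, 202, 201, 203],
    "products": ["A", "B", "A", "C", "A"],
    "currency": ["YEN", "INR", "USD", "USD", "INR"],
    "amount":   [345, 223, 845, 345, 747],
})

def to_nested(rows):
    """Build {cp_id: {product: {currency: total}}} from (cp_id, product, currency, amount) tuples."""
    nested = {}
    for cp_id, product, currency, amount in rows:
        by_currency = nested.setdefault(cp_id, {}).setdefault(product, {})
        by_currency[currency] = by_currency.get(currency, 0) + amount
    return nested

records = []
seen = {}  # leid -> tuples from all earlier run_seq values of that leid
# Groups are visited in sorted (leid, run_seq) order, so history accumulates correctly.
for (leid, run_seq), grp in df.groupby(["leid", "run_seq"], sort=True):
    rows = list(zip(grp["cp_id"], grp["products"], grp["currency"], grp["amount"]))
    earlier = seen.setdefault(leid, [])
    records.append({
        "leid": leid,
        "run_seq": run_seq,
        "current": to_nested(rows),   # built from this run_seq only
        "history": to_nested(earlier),  # built from every earlier run_seq
    })
    earlier.extend(rows)

result = pd.DataFrame(records)
print(result)
```

Because each history is rebuilt from the raw tuples, no dictionary is shared between rows, and there is exactly one row per leid and run_seq pair.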

edited Jan 5, 2020 at 10:57

answered Jan 4, 2020 at 17:21

kantal


9 Comments

Akash Dubey Over a year ago

Is there any way to keep these as dictionaries rather than lists? This is of utmost importance to me.

Akash Dubey Over a year ago

Also, the code isn't working as it should. The history for a particular leid, run_seq pair is the merged dictionary of all the current dicts for that leid over all previous run_seq values. So for leid = 102 and run_seq = 4, the history would be current_run_seq_1 + current_run_seq_2 + current_run_seq_3 of leid = 102.

Akash Dubey Over a year ago

The current column is working fine but the history column is all messed up.

kantal Over a year ago

@AkashDubey See the edited code above. The first step is not to create dictionaries, but to create lists of tuples that are then converted into dicts.

Akash Dubey Over a year ago

This addresses the list to dict part. But still doesn't make the history part right. The history is completely messed up.
