I have a dataframe that looks like this:

leid     run_seq     cp_id     products    currency     amount
101           1       201        A           YEN         345
102           2       201        B           INR         223
101           2       202        A           USD         845
102           3       201        C           USD         345
102           3       203        A           INR         747

Now I want to create another dataframe (or maybe rewrite the existing one) that has the columns current and history alongside the existing ones. It would look like this:

leid     run_seq     current                                     History
101           1       {201:{A:{YEN:345}}}                          {}
102           2       {201:{B:{INR:223}}}                          {}
101           2       {202:{A:{USD:845}}}                          {201:{A:{YEN:345}}}
102           3       {201:{C:{USD:345}},203:{A:{INR:747}}}        {201:{B:{INR:223}}}

To give context and explain the problem: run_seq can be treated as a date. If run_seq = 1, it is the first day, so there can be no history for leid = 101, hence the empty dictionary. The current entry refers to the entry on that particular run_seq.

For example: if leid 101 makes two transactions on run_seq 1 with two different cp_ids, then current would be {201:{A:{YEN:345}}, 202:{B:{USD:INR}}}. If the cp_id is the same for a given leid and run_seq but the products differ, then current would be {201:{A:{YEN:345},B:{USD:828}}}. If the cp_id and the product are the same but the currencies differ, then current would be {201:{A:{YEN:345, USD:734}}}. If the cp_id, product and currency are all the same for a given leid and run_seq, then the amounts are added, i.e. {201:{A:{YEN:345, YEN:734}}} becomes {201:{A:{YEN:1079}}}.
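The nesting rules above can be sketched in plain Python. This is only an illustration, not a full solution: the helper name `build_current` and the tuple layout `(cp_id, product, currency, amount)` are assumptions made for the example, and the sample values are the ones quoted in the paragraph above.

```python
def build_current(rows):
    """Nest (cp_id, product, currency, amount) rows into
    {cp_id: {product: {currency: amount}}}, summing repeated keys."""
    current = {}
    for cp_id, product, currency, amount in rows:
        # setdefault creates the inner dictionaries the first time a key is seen
        by_currency = current.setdefault(cp_id, {}).setdefault(product, {})
        by_currency[currency] = by_currency.get(currency, 0) + amount
    return current


# Hypothetical transactions for one leid on one run_seq
rows = [
    (201, "A", "YEN", 345),
    (201, "A", "YEN", 734),  # same cp_id, product and currency: amounts add up
    (201, "A", "USD", 734),  # same cp_id and product, different currency
    (201, "B", "USD", 828),  # same cp_id, different product
]
print(build_current(rows))
# {201: {'A': {'YEN': 1079, 'USD': 734}, 'B': {'USD': 828}}}
```

Using `setdefault` at each level keeps the function short, and the same loop covers all four cases described above without special-casing any of them.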

History for a particular leid at a given run_seq would be the combination of all the nested dictionaries from all previous run_seq values. For example, if run_seq = 5, history would be the combination of the nested dicts for run_seq = 1, 2, 3, 4 for that particular leid.
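One way to read "combination" is a running merge of the earlier current dictionaries. The sketch below assumes that amounts sharing the same cp_id, product and currency across runs are summed in history, which the description does not state explicitly. The names `merge_into` and `histories` are hypothetical.

```python
import copy


def merge_into(target, source):
    """Add every amount in `source` into `target`, in place."""
    for cp_id, products in source.items():
        for product, currencies in products.items():
            bucket = target.setdefault(cp_id, {}).setdefault(product, {})
            for currency, amount in currencies.items():
                bucket[currency] = bucket.get(currency, 0) + amount


def histories(currents):
    """`currents` holds the current dicts of one leid, ordered by run_seq.
    Returns one history dict per entry, built from the earlier entries only."""
    running = {}
    out = []
    for current in currents:
        out.append(copy.deepcopy(running))  # snapshot taken before this run
        merge_into(running, current)
    return out


# current dicts of leid 102 for run_seq 2 and 3, taken from the tables above
currents_102 = [
    {201: {"B": {"INR": 223}}},
    {201: {"C": {"USD": 345}}, 203: {"A": {"INR": 747}}},
]
print(histories(currents_102))
# [{}, {201: {'B': {'INR': 223}}}]
```

The `deepcopy` matters here: without it, every history entry would point at the same dictionary and later runs would silently change earlier rows.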

Note that each leid should appear only once per run_seq in the output.
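To make the one-row-per-(leid, run_seq) requirement concrete, here is a rough pandas sketch under the same assumptions as the description above. The names `nest`, `past`, `records` and `result` are made up for the example, and the data is the sample table from the top of the question.

```python
import pandas as pd

df = pd.DataFrame({
    "leid":     [101, 102, 101, 102, 102],
    "run_seq":  [1, 2, 2, 3, 3],
    "cp_id":    [201, 201, 202, 201, 203],
    "products": ["A", "B", "A", "C", "A"],
    "currency": ["YEN", "INR", "USD", "USD", "INR"],
    "amount":   [345, 223, 845, 345, 747],
})


def nest(rows):
    """Turn (cp_id, product, currency, amount) tuples into a nested dict,
    summing amounts that share all three keys."""
    out = {}
    for cp_id, product, currency, amount in rows:
        bucket = out.setdefault(cp_id, {}).setdefault(product, {})
        bucket[currency] = bucket.get(currency, 0) + amount
    return out


records = []
past = {}  # leid -> row tuples seen on earlier run_seq values
# groupby sorts its keys, so each leid is visited in run_seq order
for (leid, run_seq), grp in df.groupby(["leid", "run_seq"]):
    cols = grp[["cp_id", "products", "currency", "amount"]]
    rows = list(cols.itertuples(index=False, name=None))
    seen = past.setdefault(leid, [])
    records.append({"leid": leid, "run_seq": run_seq,
                    "current": nest(rows), "history": nest(seen)})
    seen.extend(rows)  # these rows become history for later runs

result = pd.DataFrame(records)
# each (leid, run_seq) pair appears exactly once
assert not result.duplicated(["leid", "run_seq"]).any()
print(result)
```

Grouping on both keys is what guarantees a single output row per pair. Keeping the raw row tuples in `past` and nesting them on demand avoids having to merge dictionaries at all.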

I have tried everything, but I am not able to come up with complete code. More to the point, I cannot think of where to start.

2 Answers

I used pandas' apply function and applied a custom function to pandas groupby groups

(credit for the custom groupby function: https://medium.com/@sean.turner026/applying-custom-functions-to-groupby-objects-in-pandas-61af58955569 )

I also modified your input a little to show some possible outcomes.

The code is shown below:

import io

import pandas as pd

# defined the table copied from your question

table = """
leid     run_seq     cp_id     products    currency     amount
101           1       201        A           YEN         345
102           1       201        A           IDR         900
102           2       201        B           INR         223
101           2       202        A           USD         845
102           3       201        C           USD         345
"""

# read the whitespace-separated table straight from the string
# (sep=r"\s+" replaces the deprecated delim_whitespace=True)
df = pd.read_csv(io.StringIO(table), sep=r"\s+")
df = df.sort_values(by = ["leid", "run_seq"]).reset_index(drop = True)
# build the nested "current" dictionary row by row using pandas apply with axis = 1
df["current"] = df.apply(lambda x: {x["cp_id"]: {x["products"]: {x["currency"]: x["amount"]}}}, axis = 1)

import copy

# defining a function to merge two nested dictionaries of the form
# {cp_id: {product: {currency: amount}}}, summing amounts that share all three keys
def Merge(dict1, dict2):
    res = copy.deepcopy(dict1)  # copy, so earlier history rows are never mutated
    for cp_id, products in dict2.items():
        for product, currencies in products.items():
            bucket = res.setdefault(cp_id, {}).setdefault(product, {})
            for currency, amount in currencies.items():
                bucket[currency] = bucket.get(currency, 0) + amount
    return res

# defining a customised cumulative dictionary function
def cumsumdict(data):

    # the first row of each leid has no history
    current_dict = [{}]

    for i in range(1, data.shape[0]):
        # history of row i = history of row i-1 merged with current of row i-1
        current_dict.append(Merge(current_dict[-1], data["current"].iloc[i-1]))

    data["history"] = current_dict

    return data

# run the function on each leid group and stitch the groups back together
df = pd.concat([cumsumdict(group.copy()) for _, group in df.groupby("leid")])
df = df[["leid", "run_seq", "current", "history"]]
print(df)

The code above prints:

   leid  run_seq                     current  \
0   101        1  {201: {'A': {'YEN': 345}}}   
1   101        2  {202: {'A': {'USD': 845}}}   
2   102        1  {201: {'A': {'IDR': 900}}}   
3   102        2  {201: {'B': {'INR': 223}}}   
4   102        3  {201: {'C': {'USD': 345}}}   

                                         history  
0                                             {}  
1                     {201: {'A': {'YEN': 345}}}  
2                                             {}  
3                     {201: {'A': {'IDR': 900}}}  
4  {201: {'A': {'IDR': 900}, 'B': {'INR': 223}}}  

12 Comments

It gives the following error: KeyError: ('amount', 'occurred at index 0')
@AkashDubey In pandas apply, you need to add axis = 1,
as in df.apply(lambda x: ... x["amount"] ..., axis = 1)
Yes. I just copy-pasted your code. It throws the same error.
Sorry, my fault. It is working now. The column names were different. Thanks a ton.

Here is my solution; however, 'history' contains lists of dicts rather than a single dict.

import io

import pandas as pd

# My test data:
text="""leid  run_seq  cp_id products currency  amount
0      101.0      1.0  201.0        A      YEN   345.0
1      102.0      2.0  201.0        B      INR   223.0
2      101.0      2.0  202.0        A      USD   845.0
3      102.0      3.0  201.0        C      USD   345.0
4      101.0      1.0  201.0        A      YEN   100.0
5      101.0      1.0  203.0        B     EURO   200.0
6      101.0      1.0  203.0        C      AUD   300.0"""

df= pd.read_csv(io.StringIO(text),sep=r"\s+",engine="python").sort_values(["leid","run_seq"])
G= df.groupby(["leid","run_seq"],sort=False)

def mkdict(grp):
    # Out: {201:{A:{YEN:345}}}
    d_cpid={}
    for r in grp.itertuples():
        d_prod= d_cpid.setdefault(r.cp_id, {} )     # {201:{}
        d_curr= d_prod.setdefault(r.products,{})    # {201:{A:{}
        d_curr[r.currency]= d_curr.get(r.currency,0)+r.amount   # {201:{A:{YEN:

    return d_cpid

rslt= G.apply(lambda grp: mkdict(grp))
rslt= rslt.reset_index().rename(columns={0:"current"})

L=[]
G1= rslt.groupby("leid")
for key,grp in G1:
    L.append([])
    lv= grp["current"].values
    for i in range(1,len(lv)):
        L.append(lv[:i])

rslt["history"]= L

EDIT: Next try

import io

import pandas as pd

# My test data
text="""leid  run_seq  cp_id products currency  amount
0      101.0      1.0  201.0        A      YEN   345.0
1      102.0      2.0  201.0        B      INR   223.0
2      101.0      2.0  202.0        A      USD   845.0
3      102.0      3.0  201.0        C      USD   345.0
4      101.0      1.0  201.0        A      YEN   100.0
5      101.0      1.0  203.0        B      EUR   200.0
6      101.0      1.0  203.0        C      AUD   300.0
7      101.0      3.0  204.0        D      INR   400.0
8      101.0      2.0  203.0        B      EUR   155.0
"""

df= pd.read_csv(io.StringIO(text),sep=r"\s+",engine="python").sort_values(["leid","run_seq"])
G= df.groupby(["leid","run_seq"],sort=False)

# This function takes one group (a DataFrame) and returns a list of tuples:
def mklist(grp):
    return [ (r.cp_id,r.products,r.currency,r.amount) for r in grp.itertuples()]

# It makes a nested dictionary from a list of tuples:
def mkdict(lt):

    # Out: {201:{A:{YEN:345}}, ...}
    d_cpid={}
    for cpid,prod,curr,amnt in lt:
        d_prod= d_cpid.setdefault(cpid, {})    # {201:{}
        d_curr= d_prod.setdefault(prod,{})      # {201:{A:{}
        d_curr[curr]= d_curr.get(curr,0)+amnt   # {201:{A:{YEN:

    return d_cpid

rslt= G.apply(lambda grp: mklist(grp) )
rslt= rslt.reset_index().rename(columns={0:"current"})

L=[]
G1= rslt.groupby("leid")
for key,grp in G1:
    L.append([])
    lv= grp["current"].values
    for i in range(1,len(lv)):
        L.append( [t for l in lv[:i] for t in l] )

rslt["history"]= [ mkdict(l) for l in L ]
rslt["current"]= [ mkdict(l) for l in rslt.current.values ]

9 Comments

Is there any way to keep these as dictionaries instead of lists? This is of utmost importance to me.
Also, the code isn't working as it should. The history for a particular leid, run_seq pair is the merged dictionary of all the current dicts for that particular leid over all the previous run_seq values. So, for leid = 102 and run_seq = 4, the history would be current_run_seq_1 + current_run_seq_2 + current_run_seq_3 of leid = 102.
The current column is working fine, but the history column is all messed up.
@AkashDubey See the edited code above. The first step is not to create dictionaries, but to create lists of tuples that are then converted into dicts.
This addresses the list-to-dict part, but it still doesn't make the history part right. The history is completely messed up.