1

I have two sets of data frames, each split from a bigger data frame. For example:

import pandas as pd, numpy as np

np.random.seed([3,1415])
ind1 = ['A_p','B_p','C_p','D_p','E_p','F_p','N_p','M_p','O_p','Q_p']
col1 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df1  = pd.DataFrame(np.random.randint(10, size=(10, 7)), columns=col1,index=ind1)
ind2 = ['G_l','I_l','J_l','K_l','L_l','M_l','R_l','N_l']
col2 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df2  = pd.DataFrame(np.random.randint(20, size=(8, 7)), columns=col2,index=ind2)

# Split the dataframes into two parts 
pc_1,pc_2   = np.array_split(df1, 2)
lnc_1,lnc_2 = np.array_split(df2, 2)

Now I need to concatenate each split data frame from df1 (pc_1, pc_2) with each split data frame from df2 (lnc_1, lnc_2). Currently, I am doing it as follows:

# concatenate each split of df1 with each split of df2

pc1_lnc_1 = pd.concat([pc_1, lnc_1])
pc1_lnc_2 = pd.concat([pc_1, lnc_2])
pc2_lnc1  = pd.concat([pc_2, lnc_1])
pc2_lnc2  = pd.concat([pc_2, lnc_2])

On every concatenated data frame I need to run a correlation analysis function, for example,

correlation(pc1_lnc_1)

And I wanted to save the results separately, for example,

pc1_lnc1 = correlation(pc1_lnc_1)
pc1_lnc2 = correlation(pc1_lnc_2)
......

pc1_lnc1.to_csv(output, sep='\t')

The question is whether there is a way to automate the concatenation above with some sort of loop, rather than writing out a line for every pair. Currently I also call the correlation function separately on every concatenated data frame, and I have a pretty long list of split data frames.

3 Answers 3

3

You can loop over the split dataframes:

for pc in np.array_split(df1, 2):
    for lnc in np.array_split(df2, 2):
        print(correlation(pd.concat([pc, lnc])))

1 Comment

Thanks for the answer. I want to save the output of each concatenated data frame separately. Inside the for loop it waits for all the split data frames, so it takes forever to print the output. That is the reason I split the data frames into smaller ones. I have updated the question.
1

Here is another thought,

def correlation(data):
    # do some complex operation..
    return data

# {"pc_1" : split_1, "pc_2" : split_2}
pc = {f"pc_{i + 1}": v for i, v in enumerate(np.array_split(df1, 2))}
lc = {f"lc_{i + 1}": v for i, v in enumerate(np.array_split(df2, 2))}

for pc_k, pc_v in pc.items():
    for lc_k, lc_v in lc.items():
        # (pc_1, lc_1), (pc_1, lc_2) ..
        correlation(pd.concat([pc_v, lc_v])). \
            to_csv(f"{pc_k}_{lc_k}.csv", sep="\t", index=False)

# will create csv like pc_1_lc_1.csv, pc_1_lc_2.csv.. in the current working dir

3 Comments

The output is only printing the headings.
Thanks, it's printing the output now!
I have a question: is there a way to run each of the split data frames in parallel, rather than one after another? Currently, on the big data frame, it takes quite a lot of time to print the output.
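On the parallel question in the last comment, one possible approach is to hand each pair to a worker pool from the standard library's concurrent.futures. This is only a minimal sketch under stated assumptions: correlation here is a placeholder (a row-by-row Pearson correlation), and split_frame and run_pair are hypothetical helpers that are not part of the answer above. It uses threads, which only help if the real correlation spends its time in code that releases the GIL. For CPU-bound pure-Python work, ProcessPoolExecutor offers the same map interface but needs the worker function defined at module level and the pool created under an `if __name__ == "__main__":` guard.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

import numpy as np
import pandas as pd


def correlation(data):
    # Placeholder for the real analysis (assumption): correlate rows with each other.
    return data.T.corr()


def split_frame(df, n):
    # Hypothetical helper: split df into n row-wise chunks by positional slicing.
    bounds = np.linspace(0, len(df), n + 1).astype(int)
    return [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:])]


def run_pair(job):
    # job is ((pc_name, pc_frame), (lnc_name, lnc_frame)).
    (pc_k, pc_v), (lnc_k, lnc_v) = job
    return f"{pc_k}_{lnc_k}", correlation(pd.concat([pc_v, lnc_v]))


# Small random frames standing in for df1 and df2 from the question.
rng = np.random.default_rng(0)
cols = ['sap1', 'luf', 'tur', 'sul', 'sul2', 'bmw', 'aud']
df1 = pd.DataFrame(rng.integers(10, size=(10, 7)), columns=cols,
                   index=[f"p{i}" for i in range(10)])
df2 = pd.DataFrame(rng.integers(20, size=(8, 7)), columns=cols,
                   index=[f"l{i}" for i in range(8)])

pc = {f"pc_{i + 1}": v for i, v in enumerate(split_frame(df1, 2))}
lnc = {f"lnc_{i + 1}": v for i, v in enumerate(split_frame(df2, 2))}

# Every (pc split, lnc split) combination becomes one independent job.
jobs = list(product(pc.items(), lnc.items()))

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map yields (name, result) tuples in the same order as jobs.
    results = dict(pool.map(run_pair, jobs))
```

Each entry of results can then be written out on its own, for example with results[name].to_csv(f"{name}.csv", sep="\t").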
0

If you don't have your individual dataframes in an array (and assuming you have a nontrivial number of dataframes), the easiest way (with minimal code modification) would be to throw an eval in with a loop.

Something like

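# n is the number of splits; the variables are named pc_1..pc_n and lnc_1..lnc_n
for counter in range(1, n + 1):
    for counter2 in range(1, n + 1):
        exec("pc{}_lnc{}=correlation(pd.concat([pc_{},lnc_{}]))".format(counter,counter2,counter,counter2))

        # filename should be different for every pair, otherwise each write overwrites the last
        eval("pc{}_lnc{}.to_csv(filename,sep='\t')".format(counter,counter2))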

The standard disclaimer around eval does still apply (don't do it because it's lazy programming practice and unsafe inputs could cause all kinds of problems in your code).

See here for more details about why eval is bad

Edit: updated the answer for the updated question.
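Given the disclaimer about eval, a possible alternative that avoids building variable names as strings is to register the existing frames in dictionaries once and keep the results in a dictionary keyed by the pair. This is a rough sketch, not the method of the answer above: correlation is a stand-in, the four small frames are made-up sample data, and pc_frames, lnc_frames, results and out_dir are hypothetical names.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def correlation(data):
    # Stand-in for the real analysis (assumption): correlate rows with each other.
    return data.T.corr()


# Made-up sample frames playing the role of the already-split variables.
rng = np.random.default_rng(1)
cols = ['sap1', 'luf', 'tur']
pc_1 = pd.DataFrame(rng.random((3, 3)), columns=cols, index=['A_p', 'B_p', 'C_p'])
pc_2 = pd.DataFrame(rng.random((2, 3)), columns=cols, index=['D_p', 'E_p'])
lnc_1 = pd.DataFrame(rng.random((2, 3)), columns=cols, index=['G_l', 'I_l'])
lnc_2 = pd.DataFrame(rng.random((2, 3)), columns=cols, index=['J_l', 'K_l'])

# Register the existing variables once, keyed by their number.
pc_frames = {1: pc_1, 2: pc_2}
lnc_frames = {1: lnc_1, 2: lnc_2}

out_dir = Path(tempfile.mkdtemp())  # hypothetical output location
results = {}
for i, pc in pc_frames.items():
    for j, lnc in lnc_frames.items():
        # Each result is stored separately under its (i, j) key and written to its own file.
        results[(i, j)] = correlation(pd.concat([pc, lnc]))
        results[(i, j)].to_csv(out_dir / f"pc{i}_lnc{j}.tsv", sep="\t")
```

Looking up results[(1, 2)] then plays the role that the generated variable pc1_lnc2 would have played.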

4 Comments

Thanks for the help. I want to save the output separately, just as mentioned in the question. I do not see how that is possible with this loop.
@zhqiat If eval is bad why should it be even recommended ?? In above case eval is redundant.
@Sushanth Most times eval leads to all types of bugs, however it is included in the language for a reason (meaning occasionally it's an answer to the problem as written)
@zhqiat, what is range(0,n) in the script?
