
I have a DataFrame like this:

D_1  D_2   D_3    D_4
Boy                 
Boy  play       
Boy  play  car      
Boy  play  chess    
Boy  play  online 

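Now I would like to add three more columns, L_2, L_3 and L_4, that concatenate the values of the D_ columns level by level (L_2 joins D_1 and D_2, L_3 joins D_1 through D_3, L_4 joins D_1 through D_4), using "emp" wherever a cell is empty, so that the resulting DataFrame looks like this: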

D_1   D_2   D_3  D_4  L_2       L_3           L_4
Boy                   Boy|emp   Boy|emp|emp   Boy|emp|emp|emp
Boy   play            Boy|play  Boy|play|emp  Boy|play|emp|emp
Boy   play  car       Boy|play  Boy|play|car  Boy|play|car|emp
Girl                  Girl|emp  Girl|emp|emp  Girl|emp|emp|emp

My SQL solution looks like this:

select *
    , concat(D_1,"|",ifnull(D_2, "emp")) as L_2  
    , concat(D_1,"|",ifnull(D_2, "emp"), "|", ifnull(D_3, "emp")) as L_3  
    , concat(D_1,"|",ifnull(D_2, "emp"), "|", ifnull(D_3, "emp"), "|", ifnull(D_4, "emp")) as L_4  
from abc

Can anyone guide me on how to convert this to Python? Thanks in advance!

  • Why would you want this? Commented Jun 11, 2021 at 13:13
  • Because I have a Python script that cleans the file and pushes it to BigQuery. I want to avoid using SQL and get the updated data directly from the Python script. Commented Jun 11, 2021 at 13:53

2 Answers

2

You can generalize the code to any number of columns like this:

for i in range(1, len(df.columns)):
    df['L_' + str(i+1)] = df[df.columns[:i+1]].fillna('emp').agg('|'.join, axis=1)

Output:

>>> print(df)
   D_1   D_2     D_3 D_4       L_2              L_3                  L_4
0  Boy                     Boy|emp      Boy|emp|emp      Boy|emp|emp|emp
1  Boy  play              Boy|play     Boy|play|emp     Boy|play|emp|emp
2  Boy  play     car      Boy|play     Boy|play|car     Boy|play|car|emp
3  Boy  play   chess      Boy|play   Boy|play|chess   Boy|play|chess|emp
4  Boy  play  online      Boy|play  Boy|play|online  Boy|play|online|emp

The whole code:

import pandas as pd
from io import StringIO

txt = '''D_1  D_2   D_3    D_4
Boy                 
Boy  play       
Boy  play  car      
Boy  play  chess    
Boy  play  online
'''

df = pd.read_csv(StringIO(txt), header=0, skipinitialspace=True, sep=r'\s+')

for i in range(1, len(df.columns)):
    df['L_' + str(i+1)] = df[df.columns[:i+1]].fillna('emp').agg('|'.join, axis=1)

df = df.fillna('')

print(df)
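
For comparison, the SQL from the question can also be translated almost literally: keep a running string that starts at D_1 and append "|" plus the next column (with "emp" for missing values) at each level. This is only a minimal sketch, assuming the empty cells are NaN/None and the columns are named D_1 to D_4 as in the question; `running` is an illustrative name, not something from the original post.

```python
import pandas as pd

# Sample data shaped like the expected result in the question;
# empty cells are assumed to be None/NaN here.
df = pd.DataFrame({
    "D_1": ["Boy", "Boy", "Boy", "Girl"],
    "D_2": [None, "play", "play", None],
    "D_3": [None, None, "car", None],
    "D_4": [None, None, None, None],
})

# Equivalent of concat(D_1, "|", ifnull(D_2, "emp"), ...):
# each level extends the previous level's string by one more column.
running = df["D_1"].astype(str)
for level in range(2, 5):
    running = running + "|" + df[f"D_{level}"].fillna("emp").astype(str)
    df[f"L_{level}"] = running

print(df[["L_2", "L_3", "L_4"]])
```

Because each level reuses the previous level's string, every D_ column is read only once, and the original D_ columns are never overwritten with "emp".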

2 Comments

Thanks, quick fix, but if you look at the DataFrame now, the original columns contain the extra 'emp' as well.
@sdave I've edited so that you don't get the 'emp' in the original DataFrame
2

Replace "" with "emp" using DataFrame.replace(), then iterate over the columns and build each L_ column by joining the values of the leading columns with '|'.join().

df = pd.DataFrame({"D_1":["Boy","Boy","Boy","Girl"],"D_2":["","play","play",""],"D_3":["","","car",""],"D_4":[""]*4})
temp = df.replace([''],'emp')
for c in range(1,len(temp.columns)):
    df[f'L_{c+1}'] = temp[temp.columns[:c+1]].astype(str).apply(lambda x: '|'.join(x), axis=1)
print(df)

    D_1  D_2    D_3   D_4     L_2           L_3              L_4
0   Boy                     Boy|emp     Boy|emp|emp     Boy|emp|emp|emp
1   Boy  play               Boy|play    Boy|play|emp    Boy|play|emp|emp
2   Boy  play   car         Boy|play    Boy|play|car    Boy|play|car|emp
3   Girl                    Girl|emp    Girl|emp|emp    Girl|emp|emp|emp

1 Comment

Thanks, I used your previous solution, the one you had before editing, as I wanted to define which columns to use. In my real DataFrame I have many more columns that I don't want to include here, so your previous solution worked well for me :)
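
When the DataFrame has other columns that should not take part, the same loop can be limited to an explicit list of column names. The following is a rough sketch rather than either answerer's exact code: `add_level_columns`, its parameters, and the extra `id` column are hypothetical names chosen for illustration. It treats both NaN and "" as empty and leaves the original columns unchanged.

```python
import pandas as pd

def add_level_columns(df, cols, placeholder="emp", sep="|"):
    """Return a copy of df with L_2..L_n built only from the listed columns."""
    out = df.copy()
    # Fill a separate frame so the original columns keep their blanks.
    filled = out[cols].fillna(placeholder).replace("", placeholder).astype(str)
    for n in range(2, len(cols) + 1):
        # Join the first n listed columns row by row.
        out[f"L_{n}"] = filled[cols[:n]].agg(sep.join, axis=1)
    return out

# Hypothetical data with an extra column ("id") that must be ignored.
df = pd.DataFrame({
    "id": [10, 11, 12],
    "D_1": ["Boy", "Boy", "Girl"],
    "D_2": ["", "play", None],
    "D_3": ["", "car", ""],
})

result = add_level_columns(df, ["D_1", "D_2", "D_3"])
print(result)
```

Passing the column list explicitly also fixes the order in which the levels are joined, independent of the column order in the DataFrame.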
