Looping over columns and column values for subsetting pandas dataframe

Question

I have a dataframe as follows df:

 ID color   finish  duration
    A1  black   smooth  12
    A2  white   matte   8
    A3  blue    smooth  20
    A4  green   matte   10
    B1  black   smooth  12
    B2  white   matte   8
    B3  blue    smooth   
    B4  green       10
    C1  black   smooth   
    C2  white   matte   8
    C3  blue    smooth   
    C4  green       10

I want to generate subsets of this dataframe based on certain conditions. For example, color= black, finish = smooth, duration = 12, I get the following dataframe.

ID  color   finish  duration    score
A1  black   smooth  12  1
B1  black   smooth  12  1

color= blue, finish = smooth, duration = 20, I get the following dataframe.

ID  color   finish  duration    score
A3  blue    smooth  20  1
B3  blue    smooth      0.666667
C3  blue    smooth      0.666667

Score is calculated as number of columns populated/total number of columns. I want to loop this in pandas dataframe. Following code is working for me for 2 columns.

list2 = list(df['color'].unique())
list3 = list(df['finish'].unique())

df_final = pd.DataFrame()


for i in range(len(list2)):
   for j in range(len(list3)):
       print 'Current Attribute Value:',list2[i],list3[j]

       gbl["df_"+list2[i]] = df[df.color == list2[i]]
       gbl["df_" + list2[i] + list3[j]] =          
       gbl["df_"+list2[i]].loc[gbl["df_"+list2[i]].finish == list3[j]]
       gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
       df_final = df_final.append(gbl["df_" + list2[i] + list3[j]], ignore_index=True)

However, I am not able to loop this over column names. What I would like to do is,

lista = ['color','finish']

df_final = pd.DataFrame()
for a in range(len(lista)):
  for i in range(len(list2)):
    for j in range(len(list3)):
       print 'Current Attribute Value:',lista[a],list2[i],lista[a+1],list3[j]
       gbl["df_"+list2[i]] = df[df.lista[a] == list2[i]]
       gbl["df_" + list2[i] + list3[j]] = gbl["df_"+list2[i]].loc[gbl["df_"+list2[i]].lista[a+1] == list3[j]]
       gbl["df_" + list2[i] + list3[j]]['dfattribval'] = list2[i] + list3[j]
       df_final = df_final.append(gbl["df_" + list2[i] + list3[j]], ignore_index=True)

I get the obvious error -

AttributeError: 'DataFrame' object has no attribute 'lista'.

Anyone would know how to loop over column names and values. Thanks much in advance!

There are too many inconsistencies with this question. Do you mean rows when you say columns? number of columns populated/total number of columns. When you show color= blue, finish = smooth, duration = 20, you show 3 rows, two of which don't have duration 20. I'm lost as to how you need this problem solved. — piRSquared
– piRSquared, Commented Jul 20, 2016 at 18:02
Also, what is the true end result: separate dfs or one final appended df? A simple groupby() of color and finish can achieve your score without needing to separate dfs. — Parfait
– Parfait, Commented Jul 20, 2016 at 18:11
@piRSquared, On your first question, i am looking for number of columns or more specifically the cell populated. For example, in the second table duration is not populated for second and third row. Hence, the score is 2 / 3. On your second question, if you could imagine this as filtering in excel, i first filter by color, then by finish and finally by duration. Then i get the desired output. — Anu
– Anu, Commented Jul 21, 2016 at 2:34
Forgive me, not sure if you answered my question, but what is the desired output? Only one df or multiple dfs? You can group by those various dimensions without splitting. Please edit to show us. — Parfait
– Parfait, Commented Jul 21, 2016 at 12:17

Parfait · Accepted Answer · 2016-07-22 03:33:24Z

Not quite sure of your needs, but consider permuting your lists with a list comprehension to avoid the nested loops and use a dictionary of data frames. Possibly the scorecalc() apply function can be adjusted to fit your needs:

colorlist = list(df['color'].unique())
finishlist = list(df['finish'].unique())
durationlist = list(df['duration'].unique())

# ALL COMBINATIONS BETWEEN LISTS
allList = [(c,f, d) for c in colorlist for f in finishlist for d in durationlist]

def scorecalc(row):    
    row['score'] = row['duration'].count()
    return(row)

dfList = []; dfDict = {}
for i in allList:    
    # SUBSET DFS
    tempdf = df[(df['color'] == i[0]) & (df['finish']==i[1]) & (df['duration']==i[2])]

    if len(tempdf) > 0:  # FOR NON-EMPTY DFS
        print('Current Attribute Value:', i[0], i[1], i[2])
        tempdf = tempdf.groupby(['color','finish']).apply(scorecalc)        
        tempdf['score'] = tempdf['score'] / len(tempdf)
        print(tempdf)

        key = str(i[0]) + str(i[1]) + str(i[2])
        dfDict[key] = tempdf    # DICTIONARY OF DFS (USE pd.DataFrame(list(...)) FOR FINAL)
        dfList.append(tempdf)   # LIST OF DFS (USE pd.concat() FOR FINAL DF)

# Current Attribute Value: black smooth 12.0
#   ID  color  finish  duration  score
#0  A1  black  smooth      12.0    1.0
#4  B1  black  smooth      12.0    1.0
#Current Attribute Value: white matte 8.0
#   ID  color finish  duration  score
#1  A2  white  matte       8.0    1.0
#5  B2  white  matte       8.0    1.0
#9  C2  white  matte       8.0    1.0
#Current Attribute Value: blue smooth 20.0
#   ID color  finish  duration  score
#2  A3  blue  smooth      20.0    1.0
#Current Attribute Value: green matte 10.0
#   ID  color finish  duration  score
#3  A4  green  matte      10.0    1.0

Collectives™ on Stack Overflow

Looping over columns and column values for subsetting pandas dataframe

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related