Aggregate list column in DataFrame with custom function

Question

Task

I would like to aggregate my DataFrame with a custom function

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,1,1,2,2], 'b': [[(1,2,3),(4,5),(6,)],[(7,8),(9,10)],np.nan,[(11,12),(13,)],np.nan], 'c': [1,2,3,4,5]})

   a                          b  c
0  1  [(1, 2, 3), (4, 5), (6,)]  1
1  1          [(7, 8), (9, 10)]  2
2  1                        NaN  3
3  2          [(11, 12), (13,)]  4
4  2                        NaN  5

such that the lists in column b are concatenated per group. The result should be

pd.DataFrame({'a': [1,2], 'b': [[(1,2,3),(4,5),(6,),(7,8),(9,10)],[(11,12),(13,)]], 'c': [6,9]})

   a                                           b  c
0  1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
1  2                           [(11, 12), (13,)]  9

Attempted Solution

I was going with

def mylistaggregator(l):
    return [item for sublist in l.tolist() for item in sublist]

df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})

but get

TypeError: 'float' object is not iterable

and am not sure what the solution would be. I also tinkered with a lambda, but did not get anywhere.

Additional information

Running

types = []
for i in df.b:
    types.append(str(type(i)))
np.unique(types)

for my actual dataset returns

array(["<class 'float'>", "<class 'list'>"], 
      dtype='<U15')
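
The same check can be written more compactly by mapping `type` over the column. This is a minimal sketch, assuming the sample frame from the question (rebuilt here so the snippet runs on its own); the name `type_counts` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (np.nan marks the missing lists).
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [[(1, 2, 3), (4, 5), (6,)], [(7, 8), (9, 10)], np.nan,
          [(11, 12), (13,)], np.nan],
    'c': [1, 2, 3, 4, 5],
})

# Count how many cells of each Python type column b holds.
type_counts = df['b'].map(type).value_counts()
print(type_counts)

# NaN is an ordinary float, which is why iterating over it fails.
print(isinstance(np.nan, float))
```

Seeing `float` next to `list` in the counts is the hint that some cells are NaN rather than lists.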

How is that a bad question? It has a MWE and everything, and I could not find the solution on the web.
– Make42, Commented Jun 16, 2017 at 12:33
Usually, that error implies that there are null values in the column. Null values in pandas are represented as floats. Try df = df.fillna([]) so the null values can be processed the same way as the non-null values.
– greg_data, Commented Jun 16, 2017 at 12:40
@user2583933: TypeError: "value" parameter must be a scalar or dict, but you passed a "list"
– Make42, Commented Jun 16, 2017 at 12:45

jezrael · Accepted Answer · 2017-06-16 12:39:20Z

You need to filter out the NaNs:

def mylistaggregator(l):
    # keep only the list cells; NaN (a float) is skipped
    return [item for sublist in l.tolist() if isinstance(sublist, list) for item in sublist]

Or:

def mylistaggregator(l):
    # skip float cells (NaN) and flatten everything else
    return [item for subl in l.tolist() if not isinstance(subl, float) for item in subl]

df1 = df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})

print(df1)
                                            b  c
a                                               
1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
2                           [(11, 12), (13,)]  9
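
For reference, here is a self-contained version of the filtering approach that runs end to end. It rebuilds the sample frame from the question and adds `reset_index()` so that `a` comes back as a regular column, as in the desired result. The function name `flatten_lists` is made up for this sketch.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [[(1, 2, 3), (4, 5), (6,)], [(7, 8), (9, 10)], np.nan,
          [(11, 12), (13,)], np.nan],
    'c': [1, 2, 3, 4, 5],
})


def flatten_lists(values):
    """Concatenate the list cells of one group, skipping NaN cells."""
    return [item for cell in values if isinstance(cell, list) for item in cell]


result = (
    df.groupby('a', sort=False)
      .agg({'b': flatten_lists, 'c': 'sum'})
      .reset_index()  # turn the group key back into a column
)
print(result)
```

Checking for `list` explicitly, rather than excluding `float`, also skips any other stray non-list value that might sit in the column.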

Another solution is to replace the NaNs with []:

def mylistaggregator(l):
    return [item for sublist in l.tolist() for item in sublist]

# one empty list per row; the data length has to match the index
s = pd.Series([[] for _ in range(len(df))], index=df.index)
df['b'] = df['b'].combine_first(s)
#or
#df['b'] = df['b'].fillna(s)

df1 = df. \
    groupby('a', sort=False). \
    agg({'b': mylistaggregator,
         'c': 'sum'})

print(df1)
                                            b  c
a                                               
1  [(1, 2, 3), (4, 5), (6,), (7, 8), (9, 10)]  6
2                           [(11, 12), (13,)]  9
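
A variation on the same idea, offered as a sketch rather than as part of the answer above: normalise each cell with `apply` so that every row holds a list, then flatten each group with `itertools.chain.from_iterable`. The helper names `as_list` and `clean` are hypothetical.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2],
    'b': [[(1, 2, 3), (4, 5), (6,)], [(7, 8), (9, 10)], np.nan,
          [(11, 12), (13,)], np.nan],
    'c': [1, 2, 3, 4, 5],
})


def as_list(cell):
    """Return the cell unchanged if it is a list, otherwise an empty list."""
    return cell if isinstance(cell, list) else []


# assign() returns a copy, so the original frame keeps its NaN cells.
clean = df.assign(b=df['b'].apply(as_list))

result = clean.groupby('a', sort=False).agg(
    {'b': lambda cells: list(chain.from_iterable(cells)), 'c': 'sum'}
)
print(result)
```

Because the cleanup happens before grouping, the aggregation function itself no longer needs a type check.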


9 Comments

Make42 Over a year ago

The solutions work on the example dataset, but on my real dataset I get TypeError: '<' not supported between instances of 'str' and 'tuple'. I added some additional information about the column's data types.

jezrael Over a year ago

Hmm, hard to answer, because I have no data that reproduces the error. Does the last solution work?

Make42 Over a year ago

No, none of the solutions work. Any idea how to debug? np.unique(df.b) is

array([[], [(8338, 8339)], [(8338, 8339, 8340)],
       [(8338, 8339, 8340, 8341)], [(8338, 8339, 8340, 8341, 8343)],
       [(8339, 8340)], [(8339, 8340, 8341)], [(8339, 8340, 8341, 8343)],
       [(8340, 8341)], [(8340, 8341, 8343)], [(8341, 8343)]], dtype=object)

after fillna.

jezrael Over a year ago

One idea: instead of groupby('a', sort=False), use groupby(df['a'].astype(str), sort=False); then a helper column is not necessary.

jezrael Over a year ago

No, because it is a list, and pandas automatically tries to convert it to a one-item Series. Only fillna or combine_first with another Series full of [] works. Pandas can handle this, but things get complicated with lists or other nested structures.
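
The difference between the two fill values can be reproduced on a small Series. The sketch below assumes a made-up column `b` with two missing cells; `empties` and `list_fill_rejected` are illustrative names, not part of the thread.

```python
import numpy as np
import pandas as pd

# A small object column with two missing cells.
b = pd.Series([[(1, 2, 3)], np.nan, [(4, 5)], np.nan])

# A bare list is rejected as a fill value.
try:
    b.fillna([])
    list_fill_rejected = False
except TypeError as exc:
    list_fill_rejected = True
    print(exc)

# A Series holding one empty list per row, on the same index, is accepted.
empties = pd.Series([[] for _ in range(len(b))], index=b.index)
filled = b.fillna(empties)
print(filled.tolist())
```

Building `empties` with a comprehension gives every row its own list object, so mutating one filled cell later does not affect the others.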
