
I have the following data in multiple columns:

    col1             col2                       col3
123456     ['mary','ralph','bob']     ['bob','sam']
456789     ['george','fred','susie']  ['ralph','mary','bob']
789123     ['mary','bob']             ['bob']

I eventually need a value_counts on each column. To get the values out of the lists, I am trying explode. I can get the values into their columns post-explode, no problem. But, as those of you who know explode are aware, my value_counts would then be inflated because of the repetition of values that explode causes when it is applied to multiple columns.

Explode yields this for example:

  col1     col2     col3
123456     mary     bob
123456     mary     sam
123456     mary     george
123456     ralph    bob
123456     ralph    sam
123456     ralph    george
...etc.

Obviously, this throws off an accurate value_counts per column, which is what I need. I have tried looping explode over each column and then, after each column's explode, matching col1 against the exploded column and removing duplicates, but that doesn't work. Always enjoying not being the smartest guy in the room (more to learn), I sent this question to you pandas gurus exploding with ideas (see what I did there?). Thanks.

Expected output so that I can value_counts all columns except the col1 would be this:

123456    mary     bob
123456   ralph     sam
123456     bob  george
456789  george   ralph
456789    fred    mary
456789   susie     bob
789123    mary     bob
789123     bob  george
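For reference, the inflation described above can be reproduced by chaining explode calls, which forms a cross product of the lists in each row (a minimal sketch using one row of the sample data, with the stray quotes corrected):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [123456],
    'col2': [['mary', 'ralph', 'bob']],
    'col3': [['bob', 'sam', 'george']],
})

# Exploding one column at a time crosses every col2 value
# with every col3 value: 3 x 3 = 9 rows instead of 3.
inflated = df.explode('col2').explode('col3')
print(len(inflated))  # 9
```

Each of the three col2 names now appears three times, which is exactly the repetition that throws off value_counts.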
  • You do not need to explode Commented Nov 12, 2020 at 14:04
  • I am going to try the response below right now, but @DaniMesejo in place of explode, what is an option then? This is some learnin' going on here. Thanks. Commented Nov 12, 2020 at 14:05

6 Answers


If you want the value_counts of the elements inside the lists, you first need to flatten the column and then take the value_counts, for example:

import pandas as pd
from itertools import chain

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

print(pd.Series(chain.from_iterable(df['col2'])).value_counts())

Output

mary      2
bob       2
susie     1
george    1
fred      1
ralph     1
dtype: int64

The above result is the value_counts for col2 of your example.
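The same idea extends to every list column; one way (a sketch beyond what this answer shows) is to build a value_counts Series per column, collect them into one frame, and sum across rows for an overall total:

```python
import pandas as pd
from itertools import chain

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# value_counts per list column, side by side in one frame
counts = pd.DataFrame({
    col: pd.Series(chain.from_iterable(df[col])).value_counts()
    for col in ['col2', 'col3']
}).fillna(0).astype(int)

# overall total across both columns, per name
counts['total'] = counts.sum(axis=1)
print(counts)
```

This is essentially the merge-and-sum approach described in the comments below, done in one pass.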


Comments

This option works for an individual column. Now, all I need to do is grab this series for each column and merge those series together to get the overall total for each of those value_counts values for the entire df.
Why not chain.from_iterable(df["col2"]) instead of unpacking the column into chain?
@CameronRiddell Nice catch! Fixed.
Here is what I did. I used @DaniMesejo answer with itertools and then created a df for every value_counts series from every column, put those dfs in a list, then merged all the dfs in the list and created a rightmost column summing across each row. That worked perfectly for what I needed.

IIUC you can use apply instead of looping and exploding:

print (df.set_index("col1").apply(pd.Series.explode))

          col2    col3
col1                  
123456    mary     bob
123456   ralph     sam
123456     bob  george
456789  george   ralph
456789    fred    mary
456789   susie     bob
789123    mary     bob
789123     bob  george
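Once the frame is exploded this way, the per-column counts the question asks for follow directly (a sketch, assuming the even-length sample data; the column-wise value_counts step is not from the answer itself):

```python
import pandas as pd

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# explode each column once, keeping col1 as the index
exploded = df.set_index('col1').apply(pd.Series.explode)

# one value_counts per column, side by side
counts = exploded.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts)
```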

For uneven lists:

s = df.set_index("col1").agg("sum").to_frame().explode(0)

print (s.groupby(level=0)[0].apply(pd.Series.value_counts))

col2  mary      2
      bob       2
      george    1
      susie     1
      ralph     1
      john      1
      fred      1
col3  bob       3
      george    2
      sam       1
      ralph     1
      mary      1
Name: 0, dtype: int64

Or:

s = df.set_index("col1").agg("sum").to_frame().explode(0)

print (s.reset_index().groupby(["index", 0]).size().unstack(0))

0      bob  fred  george  mary  ralph  sam  susie
index                                            
col2   2.0   1.0     1.0   2.0    1.0  NaN    1.0
col3   3.0   NaN     2.0   1.0    1.0  1.0    NaN

Comments

What happens if one of the lists in the columns has a different size? Imagine for example you have ['mary', 'bob'], ['bob']?
That has to depend on what output OP wants from an uneven unnesting.
And, my data does contain columns with different list lengths. The above response yields this error: 'ValueError: cannot reindex from a duplicate axis'
Please add expected output. Does your result need to be grouped by "col1"?

You can try:

(df.melt('col1').explode('value')   # melt col2 and col3 into one column and explode
   .groupby(['variable', 'value'])  # group by melted column name and value
   .count()['col1']                 # count occurrences
   .unstack(0, fill_value=0))       # reshape to show counts per col2 and col3 by name

Output:

variable  col2  col3
value               
bob          2     3
fred         1     0
george       1     0
mary         2     1
ralph        1     1
sam          0     1
susie        1     0
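If an overall total across both columns is also wanted, the same melted frame can be counted once without the groupby/unstack step (a sketch, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# melt col2/col3 into one long column of names and count it once
total = df.melt('col1').explode('value')['value'].value_counts()
print(total)
```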

Comments


We can use stack to explode each list and then create a surrogate index using cumcount:

# if the cells are not real lists you'll need literal_eval
from ast import literal_eval

s = df.set_index('col1').stack().map(literal_eval).explode().to_frame()
df1 = s.set_index(s.groupby(level=[0,1]).cumcount(),append=True).unstack(1).droplevel(0,1)

print(df1)
            col2    col3
col1                    
123456 0    mary     bob
       1   ralph     sam
       2     bob  george
456789 0  george   ralph
       1    fred    mary
       2   susie     bob
789123 0    mary     bob
       1     bob  george
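The literal_eval step matters when the columns hold string representations of lists rather than lists themselves, as often happens after reading from CSV. A sketch with hypothetical string-typed data (one row shown):

```python
import pandas as pd
from ast import literal_eval

# cells arrive as strings, e.g. from pd.read_csv
df = pd.DataFrame(data=[
    [123456, "['mary', 'ralph', 'bob']", "['bob', 'sam', 'george']"],
], columns=['col1', 'col2', 'col3'])

# stack to a long Series, parse each string into a real list, then explode
s = df.set_index('col1').stack().map(literal_eval).explode()
print(s)
```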

Comments


You can apply a function that takes each column, flattens it, and returns the value_counts of that column. Then replace the NaN values with 0 and cast the returned frame to integers to tidy up the output:

import pandas as pd
from pandas.core.common import flatten

def nested_valuecounts(series):
    flattened = list(flatten(series))
    return pd.Series.value_counts(flattened)

out = df[["col2", "col3"]].apply(nested_valuecounts).fillna(0).astype(int)

print(out)
        col2  col3
bob        1     3
fred       1     0
george     1     2
mary       2     1
ralph      1     1
sam        0     1
susie      1     0

Comments


You can combine the columns before exploding:

df['col4'] = df['col2'] + df['col3']
df.drop(columns=['col2', 'col3'], inplace=True)

and then explode on 'col4'
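Concretely, the combined column can then be exploded once and counted (a sketch; note this yields overall totals across both columns, not per-column counts):

```python
import pandas as pd

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# list concatenation happens element-wise because both cells are lists
df['col4'] = df['col2'] + df['col3']
counts = df.drop(columns=['col2', 'col3']).explode('col4')['col4'].value_counts()
print(counts)
```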

Comments
