
I have the following data in multiple columns:

    col1             col2                       col3
123456     ['mary','ralph','bob']     ['bob','sam']
456789     ['george','fred','susie']  ['ralph','mary','bob']
789123     ['mary','bob']             ['bob']

I eventually need a value_counts on each column. To get the values out of the lists, I am trying explode. I can get the values into their columns post-explode, no problem. But, as those of you who know explode are aware, my value_counts would then be inflated because of the repetition of values that explode causes when it is applied to multiple columns.

Explode yields this for example:

  col1     col2     col3
123456     mary     bob
123456     mary     sam
123456     mary     george
123456     ralph    bob
123456     ralph    sam
123456     ralph    george
...etc.

Obviously, this throws off an accurate value_counts per column, which is what I need. I have tried looping explode over each column and then, after each column's explode, matching col1 against the exploded column and removing duplicates, but that doesn't work. Always enjoying not being the smartest guy in the room (more to learn), I sent this question to you pandas gurus exploding with ideas (see what I did there?). Thanks.

Expected output so that I can value_counts all columns except the col1 would be this:

123456    mary     bob
123456   ralph     sam
123456     bob  george
456789  george   ralph
456789    fred    mary
456789   susie     bob
789123    mary     bob
789123     bob  george
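For reference, the inflation described above can be reproduced by chaining explode calls, which forms a cross product of the lists in each row (a minimal sketch using one row of the sample data, with the stray quotes corrected):

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [123456],
    'col2': [['mary', 'ralph', 'bob']],
    'col3': [['bob', 'sam', 'george']],
})

# Exploding one column at a time crosses every col2 value
# with every col3 value: 3 x 3 = 9 rows instead of 3.
inflated = df.explode('col2').explode('col3')
print(len(inflated))  # 9
```

Each of the three col2 names now appears three times, which is exactly the repetition that throws off value_counts.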
  • You do not need to explode Commented Nov 12, 2020 at 14:04
  • I am going to try the response below right now, but @DaniMesejo in place of explode, what is an option then? This is some learnin' going on here. Thanks. Commented Nov 12, 2020 at 14:05

6 Answers


If you want the value_counts of the elements inside the lists, you first need to flatten the column and then take the value_counts, for example:

import pandas as pd
from itertools import chain

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

print(pd.Series(chain.from_iterable(df['col2'])).value_counts())

Output

mary      2
bob       2
susie     1
george    1
fred      1
ralph     1
dtype: int64

The above result is the value_counts for col2 of your example.
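The same idea extends to every list column; one way (a sketch beyond what this answer shows) is to build a value_counts Series per column, collect them into one frame, and sum across rows for an overall total:

```python
import pandas as pd
from itertools import chain

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# value_counts per list column, side by side in one frame
counts = pd.DataFrame({
    col: pd.Series(chain.from_iterable(df[col])).value_counts()
    for col in ['col2', 'col3']
}).fillna(0).astype(int)

# overall total across both columns, per name
counts['total'] = counts.sum(axis=1)
print(counts)
```

This is essentially the merge-and-sum approach described in the comments below, done in one pass.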


Comments

This option works for an individual column. Now, all I need to do is grab this series for each column and merge those series together to get the overall total for each of those value_counts values for the entire df.
Why not chain.from_iterable(df["col2"]) instead of unpacking the column into chain?
@CameronRiddell Nice catch! Fixed.
Here is what I did. I used @DaniMesejo answer with itertools and then created a df for every value_counts series from every column, put those dfs in a list, then merged all the dfs in the list and created a rightmost column summing across each row. That worked perfectly for what I needed.

IIUC you can use apply instead of looping and exploding:

print (df.set_index("col1").apply(pd.Series.explode))

          col2    col3
col1                  
123456    mary     bob
123456   ralph     sam
123456     bob  george
456789  george   ralph
456789    fred    mary
456789   susie     bob
789123    mary     bob
789123     bob  george
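Once the frame is exploded this way, the per-column counts the question asks for follow directly (a sketch, assuming the even-length sample data; the column-wise value_counts step is not from the answer itself):

```python
import pandas as pd

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# explode each column once, keeping col1 as the index
exploded = df.set_index('col1').apply(pd.Series.explode)

# one value_counts per column, side by side
counts = exploded.apply(pd.Series.value_counts).fillna(0).astype(int)
print(counts)
```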

For uneven lists:

s = df.set_index("col1").agg("sum").to_frame().explode(0)

print (s.groupby(level=0)[0].apply(pd.Series.value_counts))

col2  mary      2
      bob       2
      george    1
      susie     1
      ralph     1
      john      1
      fred      1
col3  bob       3
      george    2
      sam       1
      ralph     1
      mary      1
Name: 0, dtype: int64

Or:

s = df.set_index("col1").agg("sum").to_frame().explode(0)

print (s.reset_index().groupby(["index", 0]).size().unstack(0))

0      bob  fred  george  mary  ralph  sam  susie
index                                            
col2   2.0   1.0     1.0   2.0    1.0  NaN    1.0
col3   3.0   NaN     2.0   1.0    1.0  1.0    NaN

Comments

What happens if one of the lists in the columns has a different size? Imagine for example you have ['mary', 'bob'], ['bob']?
That has to depend on what output OP wants from an uneven unnesting.
And, my data does contain columns with different list lengths. The above response yields this error: 'ValueError: cannot reindex from a duplicate axis'
Please add expected output. Does your result need to be grouped by "col1"?

You can try:

(df.melt('col1').explode('value')   # melt col2 and col3 into one column and explode
   .groupby(['variable', 'value'])  # group by melted column name and value
   .count()['col1']                 # count occurrences
   .unstack(0, fill_value=0))       # reshape to show counts per col2 and col3 by name

Output:

variable  col2  col3
value               
bob          2     3
fred         1     0
george       1     0
mary         2     1
ralph        1     1
sam          0     1
susie        1     0
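If an overall total across both columns is also wanted, the same melted frame can be counted once without the groupby/unstack step (a sketch, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# melt col2/col3 into one long column of names and count it once
total = df.melt('col1').explode('value')['value'].value_counts()
print(total)
```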

Comments


We can use stack to explode each list and then create a surrogate index using cumcount:

# if the cells are not real lists you'll need literal_eval
from ast import literal_eval

s = df.set_index('col1').stack().map(literal_eval).explode().to_frame()
df1 = s.set_index(s.groupby(level=[0,1]).cumcount(),append=True).unstack(1).droplevel(0,1)

print(df1)
            col2    col3
col1                    
123456 0    mary     bob
       1   ralph     sam
       2     bob  george
456789 0  george   ralph
       1    fred    mary
       2   susie     bob
789123 0    mary     bob
       1     bob  george
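The literal_eval step matters when the columns hold string representations of lists rather than lists themselves, as often happens after reading from CSV. A sketch with hypothetical string-typed data (one row shown):

```python
import pandas as pd
from ast import literal_eval

# cells arrive as strings, e.g. from pd.read_csv
df = pd.DataFrame(data=[
    [123456, "['mary', 'ralph', 'bob']", "['bob', 'sam', 'george']"],
], columns=['col1', 'col2', 'col3'])

# stack to a long Series, parse each string into a real list, then explode
s = df.set_index('col1').stack().map(literal_eval).explode()
print(s)
```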

Comments


You can apply a function that takes each column, flattens it, and returns the value_counts of that column. Then replace the NaN values with 0 and cast the returned frame to integers to tidy up the output:

import pandas as pd
from pandas.core.common import flatten

def nested_valuecounts(series):
    flattened = list(flatten(series))
    return pd.Series.value_counts(flattened)

out = df[["col2", "col3"]].apply(nested_valuecounts).fillna(0).astype(int)

print(out)
        col2  col3
bob        1     3
fred       1     0
george     1     2
mary       2     1
ralph      1     1
sam        0     1
susie      1     0

Comments


You can combine the columns before exploding:

df['col4'] = df['col2'] + df['col3']
df.drop(columns=['col2', 'col3'], inplace=True)

and then explode on 'col4'
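Concretely, the combined column can then be exploded once and counted (a sketch; note this yields overall totals across both columns, not per-column counts):

```python
import pandas as pd

df = pd.DataFrame(data=[
    [123456, ['mary', 'ralph', 'bob'], ['bob', 'sam', 'george']],
    [456789, ['george', 'fred', 'susie'], ['ralph', 'mary', 'bob']],
    [789123, ['mary', 'bob'], ['bob', 'george']]
], columns=['col1', 'col2', 'col3'])

# list concatenation happens element-wise because both cells are lists
df['col4'] = df['col2'] + df['col3']
counts = df.drop(columns=['col2', 'col3']).explode('col4')['col4'].value_counts()
print(counts)
```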

Comments
