
Given a list of animals, like:

animals = ['cat', 'dog', 'hamster', 'dolphin']

and a pandas dataframe, df:

id    animals
1     dog,cat
2     dog
3     cat,dolphin
4     cat,dog
5     hamster,dolphin 

I want to get a new dataframe showing the occurrences of each animal, something like:

animal    ids
cat       1,3,4
dog       1,2,4
hamster   5        
dolphin   3,5

I know I could build this with a loop, but my list has over 80,000 words and the dataframe has over 1 million rows, so a loop would take too long. Is there an easier, faster way to get this result using the dataframe?

3 Answers


Let us try get_dummies, then dot:

df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
Out[307]: 
cat        1,3,4
dog        1,2,4
dolphin      3,5
hamster        5
dtype: object

If you want the result to follow the order of the given list, add reindex:

df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
Out[308]: 
cat        1,3,4
dog        1,2,4
hamster        5
dolphin      3,5
dtype: object
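To see why this works, here is the same trick broken into its two steps (a minimal sketch, assuming pandas; the variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'animals': ['dog,cat', 'dog', 'cat,dolphin',
                               'cat,dog', 'hamster,dolphin']})

# Step 1: one-hot matrix -- one column per animal, one row per id.
dummies = df.animals.str.get_dummies(',')

# Step 2: transpose and dot with the stringified ids (plus a trailing
# comma). Because 1 * '3,' == '3,' and 0 * '3,' == '' in Python, the
# matrix product concatenates exactly the matching id strings; the
# final .str[:-1] drops the trailing comma.
ids = df.id.astype(str) + ','
result = dummies.T.dot(ids).str[:-1]
print(result)
```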

1 Comment

Thank you Ben for your answer, I appreciate it. I've selected Andy's answer just because it's faster.

A NumPy-based one, for performance -

import numpy as np
import pandas as pd

# `animals` (the global word list) is used as the lookup vocabulary
def list_occ(df):
    id_col = 'id'
    item_col = 'animals'
    
    sidx = np.argsort(animals)
    s = [i.split(',') for i in df[item_col]]
    d = np.concatenate(s)
    
    p = sidx[np.searchsorted(animals, d, sorter=sidx)]
    C = np.bincount(p, minlength=len(animals))
    
    l = list(map(len,s))
    r = np.repeat(np.arange(len(l)), l)
    v = df[id_col].values[r[np.lexsort((r,p))]]
    
    out = pd.DataFrame({'ids':np.split(v, C[:-1].cumsum())}, index=animals)
    return out

Sample run -

In [41]: df
Out[41]: 
  id          animals
0  1          dog,cat
1  2              dog
2  3      cat,dolphin
3  4          cat,dog
4  5  hamster,dolphin

In [42]: animals
Out[42]: ['cat', 'dog', 'hamster', 'dolphin']

In [43]: list_occ(df)
Out[43]: 
               ids
cat      [1, 3, 4]
dog      [1, 2, 4]
hamster        [5]
dolphin     [3, 5]
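The core trick above is mapping each token to its position in the (unsorted) animals list via searchsorted plus a sorter. Isolated, it looks like this (a minimal sketch, assuming NumPy; the sample tokens are mine):

```python
import numpy as np

animals = ['cat', 'dog', 'hamster', 'dolphin']   # not sorted
tokens = np.array(['dog', 'cat', 'cat', 'dolphin'])

# sidx is the permutation that sorts `animals`; searchsorted looks the
# tokens up in the sorted copy, and indexing into sidx maps those
# positions back to the original list order.
sidx = np.argsort(animals)
idx = sidx[np.searchsorted(animals, tokens, sorter=sidx)]
print(idx)  # [1 0 0 3] -- positions of each token in `animals`
```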

Benchmarking

Using the given sample and simply scaling up the number of items.

# Setup
N = 100 # scale factor
s = [i.split(',') for i in df['animals']]
df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
df_big['id'] = range(1, len(df_big)+1)
animals = np.unique(np.concatenate(df_big.animals)).tolist()
df_big['animals'] = [','.join(i) for i in df_big.animals]
df = df_big

Timings -

# @BEN_YO's soln-1
In [10]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
163 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @BEN_YO's soln-2
In [11]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
166 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @Andy L.'s soln
%timeit (df.astype(str).assign(animals=df.animals.str.split(',')).explode('animals').groupby('animals').id.agg(','.join).reset_index())
13.4 ms ± 74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit list_occ(df)
2.81 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

3 Comments

Hands down, great answer, and great timing! Thank you very much! One thing though: when I tried your answer in a notebook with my dataset, it caused the kernel to restart. I tried 5 times, and it's the same every time.
@AhmetCetin That could be because we have millions of rows. Just to confirm that, can you run it with a reduced dataset, say df.iloc[:100], then df.iloc[:1000] and so on, until you see the point at which we start seeing that error?
Hi Divakar, I'll do this test, but I'm out at the moment. I've selected Andy's solution as it handles the big dataset without a problem, and it's fast enough really. I really do appreciate your answer; I'd say it's definitely one of the best answers (complete with comparisons and performance tests) I've seen on SO. Thanks a million.

Use str.split, explode and agg with ','.join:

df_final = (df.astype(str).assign(animals=df.animals.str.split(','))
                          .explode('animals').groupby('animals').id.agg(','.join)
                          .reset_index())

Out[155]:
   animals     id
0      cat  1,3,4
1      dog  1,2,4
2  dolphin    3,5
3  hamster      5
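A self-contained version of the same pipeline (a sketch assuming pandas ≥ 0.25 for explode; the sample data and the extra str.strip step, which handles stray spaces around the commas, are mine):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'animals': ['dog,cat', 'dog', 'cat, dolphin',
                               'cat,dog', 'hamster, dolphin']})

df_final = (df.astype(str)                               # ids -> strings for joining
              .assign(animals=df.animals.str.split(','))
              .explode('animals')                        # one row per (id, animal)
              .assign(animals=lambda d: d.animals.str.strip())  # trim stray spaces
              .groupby('animals').id.agg(','.join)
              .reset_index())
print(df_final)
```

Note that str.strip has to come after explode: before it, the column holds lists, not strings, so the .str accessor would return NaN for every row.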

3 Comments

This works really well, only one problem I noticed in my dataset. How can I trim spaces (both left and right) in the animals column after splitting? I tried: df_final = (df.astype(str).assign(animals=df.animals.str.split(',').str.strip()) .explode('animals').groupby('animals').id.agg(','.join) .reset_index()) but it returned an empty dataframe.
Never mind about the comment above, I handled it before running your solution. Divakar's solution does look faster on a small dataset, but it crashes when applied to a big one. I have a dataset of about 1.5 million rows, and your solution handles it without a problem and it's surely fast enough. Thanks a million for your help.
Hi Andy, if you have some time, can you check my other question here? stackoverflow.com/questions/63535547/… It's kind of a continuation of this one; there are two answers, but both of them crash because of memory, as my dataset is quite big. As your solution here worked like a charm, I thought you might suggest something, only if it won't be a problem.
