
Given a list of animals, like:

animals = ['cat', 'dog', 'hamster', 'dolphin']

and a pandas dataframe, df:

id    animals
1     dog,cat
2     dog
3     cat,dolphin
4     cat,dog
5     hamster,dolphin 

I want to get a new dataframe showing the occurrences of each animal, something like:

animal    ids
cat       1,3,4
dog       1,2,4
hamster   5        
dolphin   3,5

I know I could build this with a loop, but my list has over 80,000 words and the dataframe has over 1 million rows, so a loop would take too long. Is there an easier, faster way to get this result using the dataframe?

3 Answers


Let us try get_dummies, then dot:

df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
Out[307]: 
cat        1,3,4
dog        1,2,4
dolphin      3,5
hamster        5
dtype: object

If you want the result to follow the order of the given list, add reindex:

df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
Out[308]: 
cat        1,3,4
dog        1,2,4
hamster        5
dolphin      3,5
dtype: object
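To see why this works, here is the same trick broken into its two steps (a minimal sketch, assuming pandas; the variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'animals': ['dog,cat', 'dog', 'cat,dolphin',
                               'cat,dog', 'hamster,dolphin']})

# Step 1: one-hot matrix -- one column per animal, one row per id.
dummies = df.animals.str.get_dummies(',')

# Step 2: transpose and dot with the stringified ids (plus a trailing
# comma). Because 1 * '3,' == '3,' and 0 * '3,' == '' in Python, the
# matrix product concatenates exactly the matching id strings; the
# final .str[:-1] drops the trailing comma.
ids = df.id.astype(str) + ','
result = dummies.T.dot(ids).str[:-1]
print(result)
```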

1 Comment

Thank you Ben for your answer, I appreciate it. I've selected Andy's answer just because it's faster.

A NumPy-based one, for performance -

import numpy as np
import pandas as pd

# `animals` (the global word list) is used as the lookup vocabulary
def list_occ(df):
    id_col = 'id'
    item_col = 'animals'
    
    sidx = np.argsort(animals)
    s = [i.split(',') for i in df[item_col]]
    d = np.concatenate(s)
    
    p = sidx[np.searchsorted(animals, d, sorter=sidx)]
    C = np.bincount(p, minlength=len(animals))
    
    l = list(map(len,s))
    r = np.repeat(np.arange(len(l)), l)
    v = df[id_col].values[r[np.lexsort((r,p))]]
    
    out = pd.DataFrame({'ids':np.split(v, C[:-1].cumsum())}, index=animals)
    return out

Sample run -

In [41]: df
Out[41]: 
  id          animals
0  1          dog,cat
1  2              dog
2  3      cat,dolphin
3  4          cat,dog
4  5  hamster,dolphin

In [42]: animals
Out[42]: ['cat', 'dog', 'hamster', 'dolphin']

In [43]: list_occ(df)
Out[43]: 
               ids
cat      [1, 3, 4]
dog      [1, 2, 4]
hamster        [5]
dolphin     [3, 5]
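The core trick above is mapping each token to its position in the (unsorted) animals list via searchsorted plus a sorter. Isolated, it looks like this (a minimal sketch, assuming NumPy; the sample tokens are mine):

```python
import numpy as np

animals = ['cat', 'dog', 'hamster', 'dolphin']   # not sorted
tokens = np.array(['dog', 'cat', 'cat', 'dolphin'])

# sidx is the permutation that sorts `animals`; searchsorted looks the
# tokens up in the sorted copy, and indexing into sidx maps those
# positions back to the original list order.
sidx = np.argsort(animals)
idx = sidx[np.searchsorted(animals, tokens, sorter=sidx)]
print(idx)  # [1 0 0 3] -- positions of each token in `animals`
```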

Benchmarking

Using the given sample and simply scaling up the number of items.

# Setup
N = 100 # scale factor
s = [i.split(',') for i in df['animals']]
df_big = pd.DataFrame({'animals':[[j+str(ID) for j in i] for i in s for ID in range(1,N+1)]})
df_big['id'] = range(1, len(df_big)+1)
animals = np.unique(np.concatenate(df_big.animals)).tolist()
df_big['animals'] = [','.join(i) for i in df_big.animals]
df = df_big

Timings -

# @BEN_YO's soln-1
In [10]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1]
163 ms ± 2.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @BEN_YO's soln-2
In [11]: %timeit df.animals.str.get_dummies(',').T.dot(df.id.astype(str)+',').str[:-1].reindex(animals)
166 ms ± 4.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @Andy L.'s soln
%timeit (df.astype(str).assign(animals=df.animals.str.split(',')).explode('animals').groupby('animals').id.agg(','.join).reset_index())
13.4 ms ± 74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit list_occ(df)
2.81 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

3 Comments

Hands down, great answer, and great timing! Thank you very much! One thing though: when I tried your answer in a notebook with my dataset, it caused the kernel to restart. I tried 5 times, and it's the same every time.
@AhmetCetin That could be because we have millions of rows. Just to confirm that, can you run it with a reduced dataset, say df.iloc[:100], then df.iloc[:1000] and so on, until you see the point at which we start seeing that error?
Hi Divakar, I'll do this test, but I'm out at the moment. I've selected Andy's solution as it handles the big dataset without a problem, and it's fast enough really. I really do appreciate your answer; I'd say it's definitely one of the best answers (complete with comparisons and performance tests) I've seen on SO. Thanks a million.

Use str.split, explode and agg with ','.join:

df_final = (df.astype(str).assign(animals=df.animals.str.split(','))
                          .explode('animals').groupby('animals').id.agg(','.join)
                          .reset_index())

Out[155]:
   animals     id
0      cat  1,3,4
1      dog  1,2,4
2  dolphin    3,5
3  hamster      5
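A self-contained version of the same pipeline (a sketch assuming pandas ≥ 0.25 for explode; the sample data and the extra str.strip step, which handles stray spaces around the commas, are mine):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'animals': ['dog,cat', 'dog', 'cat, dolphin',
                               'cat,dog', 'hamster, dolphin']})

df_final = (df.astype(str)                               # ids -> strings for joining
              .assign(animals=df.animals.str.split(','))
              .explode('animals')                        # one row per (id, animal)
              .assign(animals=lambda d: d.animals.str.strip())  # trim stray spaces
              .groupby('animals').id.agg(','.join)
              .reset_index())
print(df_final)
```

Note that str.strip has to come after explode: before it, the column holds lists, not strings, so the .str accessor would return NaN for every row.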

3 Comments

This works really well, only one problem I noticed in my dataset. How can I trim spaces (both left and right) in the animals column after splitting? I tried: df_final = (df.astype(str).assign(animals=df.animals.str.split(',').str.strip()) .explode('animals').groupby('animals').id.agg(','.join) .reset_index()) but it returned an empty dataframe.
Never mind about the comment above, I handled it before running your solution. Divakar's solution does look faster on a small dataset, but it crashes when applied to a big one. I have a dataset of about 1.5 million rows, and your solution handles it without a problem and it's surely fast enough. Thanks a million for your help.
Hi Andy, if you have some time, can you check my other question here? stackoverflow.com/questions/63535547/… It's kind of a continuation of this one; there are two answers, but both of them crash because of memory, as my dataset is quite big. As your solution here worked like a charm, I thought you might suggest something, only if it won't be a problem.
