0

I have a dataframe in which each row represents one transaction and lists the items in that transaction. Here is what my dataframe looks like:

itemList
A,B,C
B,F
G,A
...

I want to find the frequency of each item (how many times it appears in the transactions). I have defined a dictionary and am trying to update its values as shown below:

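item_counts = {}  # avoid naming this "dict", which would shadow the built-in
def update(itemList):
    # Update the count of each item in the dictionary
    for item in itemList.split(','):
        item_counts[item] = item_counts.get(item, 0) + 1

df.itemList.apply(update)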

As the apply function gets executed for multiple rows at the same time, multiple rows try to update the values in the dictionary at the same time, and that is causing an issue. How can I make sure that multiple updates to the dictionary do not cause any issues?

  • Why do you think multiple rows try .. at the same time? apply is just a for loop. Commented Mar 11, 2020 at 20:19
  • As per this article, please provide a reproducible sample. By this I mean: a sample dataset we can copy/paste, the output of what you are getting, and a sample of what you want to have as output. Commented Mar 11, 2020 at 20:22
  • You don't need a lambda expression anymore. df.itemList.apply(update). Commented Mar 11, 2020 at 20:28

2 Answers

1

I think you only need Series.str.get_dummies:

df['itemList'].str.get_dummies(',').sum().to_dict()
#{'A': 2, 'B': 2, 'C': 1, 'F': 1, 'G': 1}

If there are more columns use:

df.stack().str.get_dummies(',').sum().to_dict()

if you want to count for each row:

df['itemList'].str.get_dummies(',').to_dict('index')
#{0: {'A': 1, 'B': 1, 'C': 1, 'F': 0, 'G': 0},
# 1: {'A': 0, 'B': 1, 'C': 0, 'F': 1, 'G': 0},
# 2: {'A': 1, 'B': 0, 'C': 0, 'F': 0, 'G': 1}}
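One thing worth keeping in mind with the indicator approach: as far as I can tell, get_dummies only records whether an item is present in a row, so an item repeated within one transaction is still counted once. The sketch below uses made-up sample data (the second row repeats 'B' on purpose) to contrast that with a total-occurrence count built from split and explode; the variable names are just for illustration.

```python
import pandas as pd

# Hypothetical sample data; the second row repeats 'B' on purpose.
df = pd.DataFrame({'itemList': ['A,B,C', 'B,B,F', 'G,A']})

# One 0/1 indicator column per item, one row per transaction.
indicators = df['itemList'].str.get_dummies(',')

# Number of transactions that contain each item.
transactions_per_item = indicators.sum().to_dict()

# Total number of occurrences, counting repeats within a row.
occurrences_per_item = (
    df['itemList'].str.split(',').explode().value_counts().to_dict()
)

print(transactions_per_item)
print(occurrences_per_item)
```

If items never repeat within a transaction, both counts agree and either approach should do.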

As @Quang Hoang said in the comments, apply simply applies the function to each row / column using a loop, one after another.
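To illustrate the point about sequential execution, here is a minimal sketch of the dictionary-updating idea from the question. It assumes the three sample rows shown in the question, and the names item_counts and update are only illustrative. I use a plain for loop instead of apply because the function is called purely for its side effect, but the rows are processed one at a time either way.

```python
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({'itemList': ['A,B,C', 'B,F', 'G,A']})

item_counts = {}

def update(item_list):
    # Split one transaction string and bump the count of each item.
    for item in item_list.split(','):
        item_counts[item] = item_counts.get(item, 0) + 1

# Rows are visited strictly one after another, so nothing updates
# the dictionary concurrently.
for row in df['itemList']:
    update(row)

print(item_counts)
```

So if the dictionary ends up with wrong values, the cause is more likely in the body of update than in any kind of parallelism.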

0

You might be better off relying on native Python here.

import pandas as pd
df = pd.DataFrame({'itemlist':['a,b,c', 'b,f', 'g,a', 'd,g,f,d,s,a,v', 'e,w,d,f,g,h', 's,d,f,e,r,t', 'e,d,f,g,r,r','s,d,f']})

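Here is a solution using Counter. Note that it counts characters after removing the commas, so it only works when every item name is a single character:

from collections import Counter
df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()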

Some timing comparisons:

%timeit df['itemlist'].str.split(',', expand = True).stack().value_counts().to_dict()
2.64 ms ± 99.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df['itemlist'].str.get_dummies(',').sum().to_dict()
3.22 ms ± 68.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

from collections import Counter
%timeit df['itemlist'].str.replace(',','').apply(lambda x: Counter(x)).sum()
778 µs ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
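If the item names can be longer than one character, a variant that splits on the comma instead of stripping it might be safer. This is only a sketch in plain Python: the list rows is made-up data standing in for something like df['itemlist'].tolist(), and I have not timed it against the versions above.

```python
from collections import Counter

# Hypothetical multi-character items; stands in for df['itemlist'].tolist().
rows = ['apple,bread', 'bread,milk', 'milk,apple,apple']

counts = Counter()
for row in rows:
    # Counter.update adds one count per element of the split list.
    counts.update(row.split(','))

print(dict(counts))
```

Unlike the indicator-based approach, this counts every occurrence, including repeats within a single row.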
