1

Is there a better way to do what the code below does in a (slow!) loop?

Using an input DataFrame, I want to convert it to a list of products each user has consumed. But this list will be up to the millions long and this seems quite inefficient (unless I use cython). Any ideas how to make this more python-happy? Thanks!

a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})

print "Input Dataframe:\n", a
print '\nDesired Output:'

# Build desired output:
uniqIDs = a.user_id.unique()

for id in uniqIDs:

    prod_list = list(a[a.user_id == id].prod_id.values)        

    s = id + '\t'
    for x in prod_list:
        s += x + '\t'

    print s # This will get saved to a TAB DELIMITED file

Gives this output (which is exactly what I desire):

Input Dataframe:
  prod_id user_id
0      p1       a
1      p2       a
2      p1       b
3      p2       c
4      p3       c
5      p7       c

Desired Output:
a   p1  p2  
b   p1  
c   p2  p3  p7

1 Answer 1

3

You could use groupby:

>>> a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})
>>> a
  prod_id user_id
0      p1       a
1      p2       a
2      p1       b
3      p2       c
4      p3       c
5      p7       c
>>> a.groupby("user_id")["prod_id"].unique()
user_id
a              [p1, p2]
b                  [p1]
c          [p2, p3, p7]
dtype: object

Hmm.

Getting the output in the exact format you want is surprisingly troublesome, because I can't see any way to override pandas' desire to escape things. IOW, it's easy to .apply('\t'.join) to the resulting series manually, but it's hard to save it using \t as a separator too.

So here's another approach:

>>> df = pd.DataFrame({k: g.reset_index(drop=True) 
                       for k,g in a.groupby("user_id")["prod_id"]}).T
>>> df.to_csv("prod.csv", sep="\t", header=False)
>>> !cat prod.csv
a   p1  p2  
b   p1      
c   p2  p3  p7

If you really want, you could strip the extra tabs at the end.

Sign up to request clarification or add additional context in comments.

2 Comments

Looks good @DSM. Is there an efficient way to remove the brackets and just end up with a tab-delimited output?
@zbinsd Checkout the str attribute of Series objects.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.