Python - From Pandas to Sparse Output Format

Question

Is there a better way to do what the code below does in a (slow!) loop?

Using an input DataFrame, I want to convert it to a list of products each user has consumed. But this list will be up to the millions long and this seems quite inefficient (unless I use cython). Any ideas how to make this more python-happy? Thanks!

a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})

print "Input Dataframe:\n", a
print '\nDesired Output:'

# Build desired output:
uniqIDs = a.user_id.unique()

for id in uniqIDs:

    prod_list = list(a[a.user_id == id].prod_id.values)        

    s = id + '\t'
    for x in prod_list:
        s += x + '\t'

    print s # This will get saved to a TAB DELIMITED file

Gives this output (which is exactly what I desire):

Input Dataframe:
  prod_id user_id
0      p1       a
1      p2       a
2      p1       b
3      p2       c
4      p3       c
5      p7       c

Desired Output:
a   p1  p2  
b   p1  
c   p2  p3  p7

DSM · Accepted Answer · 2013-10-10 19:43:30Z

3

You could use groupby:

>>> a = pd.DataFrame({'user_id':['a', 'a', 'b', 'c', 'c', 'c'], 'prod_id':['p1', 'p2', 'p1', 'p2', 'p3', 'p7']})
>>> a
  prod_id user_id
0      p1       a
1      p2       a
2      p1       b
3      p2       c
4      p3       c
5      p7       c
>>> a.groupby("user_id")["prod_id"].unique()
user_id
a              [p1, p2]
b                  [p1]
c          [p2, p3, p7]
dtype: object

Hmm.

Getting the output in the exact format you want is surprisingly troublesome, because I can't see any way to override pandas' desire to escape things. IOW, it's easy to .apply('\t'.join) to the resulting series manually, but it's hard to save it using \t as a separator too.

So here's another approach:

>>> df = pd.DataFrame({k: g.reset_index(drop=True) 
                       for k,g in a.groupby("user_id")["prod_id"]}).T
>>> df.to_csv("prod.csv", sep="\t", header=False)
>>> !cat prod.csv
a   p1  p2  
b   p1      
c   p2  p3  p7

If you really want, you could strip the extra tabs at the end.

edited Oct 10, 2013 at 19:43

answered Oct 10, 2013 at 19:20

DSM

355k67 gold badges606 silver badges504 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

zbinsd Over a year ago

Looks good @DSM. Is there an efficient way to remove the brackets and just end up with a tab-delimited output?

Phillip Cloud Over a year ago

@zbinsd Checkout the str attribute of Series objects.

Collectives™ on Stack Overflow

Python - From Pandas to Sparse Output Format

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related