
In a previous thread (Pandas: reshaping data), a brilliant response was given to the following problem. The goal is to reshape a pandas Series containing lists into a pandas DataFrame in the following way:

In [9]: s = Series([list('ABC'),list('DEF'),list('ABEF')])

In [10]: s
Out[10]: 
0       [A, B, C]
1       [D, E, F]
2    [A, B, E, F]
dtype: object

should be shaped into this:

Out[11]: 
   A  B  C  D  E  F
0  1  1  1  0  0  0
1  0  0  0  1  1  1
2  1  1  0  0  1  1

That is, a DataFrame is created in which every unique element across the lists becomes a column. Every list in the Series becomes a row, with a 1 in each column corresponding to an element of that list and a 0 otherwise. The wording may be cumbersome, but hopefully the example above is clear.

The brilliant response by user Jeff (https://stackoverflow.com/users/644898/jeff) was to write this simple yet powerful line of code:

In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)

That turns Out[10] into Out[11].
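For reference, the one-liner can be reproduced on the toy series from the question (a minimal, self-contained sketch; `astype(int)` is only added here to match the integer display in Out[11]):

```python
import pandas as pd

s = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
out = s.apply(lambda x: pd.Series(1, index=x)).fillna(0).astype(int)
print(out)
```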

That line of code served me extremely well. However, I am now running into memory issues with a Series of roughly 50K lists and about 100K distinct elements across all the lists. My machine has 16 GB of memory. Before resorting to a bigger machine, I would like to think of a more efficient implementation of the function above.
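Some back-of-the-envelope arithmetic shows why the dense result cannot fit: a 50K × 100K DataFrame of float64 values (the dtype produced by `fillna(0)`) needs well over 16 GB.

```python
rows, cols = 50_000, 100_000
dense_gb = rows * cols * 8 / 1e9  # float64 takes 8 bytes per value
print(dense_gb)  # 40.0 GB, far beyond 16 GB of RAM
```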

Does anyone know how to re-implement the above line:

In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)

to make it more efficient, in terms of memory usage?

  • How is the original Series generated? Your best bet is to avoid ever having lists stored in it. Commented Oct 23, 2015 at 3:24
  • Good question. Unfortunately, the series is generated through scraping; not much I can do about it, since I inherited the dataset from the client. Commented Oct 23, 2015 at 3:49

2 Answers


You could try breaking your dataframe into chunks and writing to a file as you go, something like this:

chunksize = 10000

def make_indicators(df):
    return df.apply(lambda x: Series(1, index=x)).fillna(0)

with open('out.csv', 'w') as out:
    out.write(df.iloc[[]].to_csv())  # write the header
    for _, chunk in df.groupby(np.arange(len(df)) // chunksize):
        out.write(make_indicators(chunk).to_csv(header=None))
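As a sanity check, the chunked approach can be run in-memory on the toy series from the question (a minimal sketch; the tiny `chunksize` is only for illustration). One caveat: each chunk can produce a different set of columns, so the pieces must be realigned when combined, which also complicates writing raw CSV chunks:

```python
import numpy as np
import pandas as pd

s = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
chunksize = 2  # tiny chunk size just for the toy example

parts = [
    chunk.apply(lambda x: pd.Series(1, index=x)).fillna(0)
    for _, chunk in s.groupby(np.arange(len(s)) // chunksize)
]
# Chunks may have different column sets, so refill the gaps after concat
result = pd.concat(parts).fillna(0).astype(int)
print(result)
```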



If memory use is the issue, it seems like a sparse matrix solution would be better. Pandas doesn't really have sparse matrix support, but you could use scipy.sparse like this:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'), list('DEF'), list('ABEF')])

# Map every element to a column index; `cols` holds the unique elements
cols, ind = np.unique(np.concatenate(data), return_inverse=True)
indptr = np.cumsum([0] + list(map(len, data)))  # row boundaries in CSR format
vals = np.ones_like(ind)
M = csr_matrix((vals, ind, indptr))

This sparse matrix now contains the same data as the pandas solution, but the zeros are not explicitly stored. We can confirm this by converting the sparse matrix to a dataframe:

>>> pd.DataFrame(M.toarray(), columns=cols)
   A  B  C  D  E  F
0  1  1  1  0  0  0
1  0  0  0  1  1  1
2  1  1  0  0  1  1

Depending on what you're doing with the data from here, having it in a sparse form may help solve your problem without using excessive memory.
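For example, many aggregations can be done directly on the sparse matrix. A small sketch (repeating the construction above so it runs standalone) that counts how often each element occurs, without ever materializing the dense table:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
cols, ind = np.unique(np.concatenate(data), return_inverse=True)
indptr = np.cumsum([0] + list(map(len, data)))
M = csr_matrix((np.ones_like(ind), ind, indptr))

# Column sums give per-element occurrence counts while staying sparse
counts = np.asarray(M.sum(axis=0)).ravel()
print(dict(zip(cols, counts)))
```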


jakevdp: awesome, awesome response. What used to take HOURS now takes seconds. Nothing is more powerful than a good algorithm!
