
In a previous thread (Pandas: reshaping data), a brilliant response was given to the following problem. The goal is to reshape a pandas Series containing lists into a pandas DataFrame in the following way:

In [9]: s = Series([list('ABC'),list('DEF'),list('ABEF')])

In [10]: s
Out[10]: 
0       [A, B, C]
1       [D, E, F]
2    [A, B, E, F]
dtype: object

should be shaped into this:

Out[11]: 
   A  B  C  D  E  F
0  1  1  1  0  0  0
1  0  0  0  1  1  1
2  1  1  0  0  1  1

That is, a DataFrame is created in which every unique element across the lists becomes a column. Every list in the Series becomes a row, with a 1 in each column corresponding to an element of that list and a 0 otherwise. The wording may be cumbersome, but hopefully the example above is clear.

The brilliant response by user Jeff (https://stackoverflow.com/users/644898/jeff) was to write this simple yet powerful line of code:

In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)

That turns Out[10] into Out[11].
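For reference, the one-liner can be reproduced on the toy series from the question (a minimal, self-contained sketch; `astype(int)` is only added here to match the integer display in Out[11]):

```python
import pandas as pd

s = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
out = s.apply(lambda x: pd.Series(1, index=x)).fillna(0).astype(int)
print(out)
```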

That line of code served me extremely well. However, I am now running into memory issues with a Series of roughly 50K lists and about 100K distinct elements across all the lists. My machine has 16 GB of memory. Before resorting to a bigger machine, I would like to think of a more efficient implementation of the function above.
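Some back-of-the-envelope arithmetic shows why the dense result cannot fit: a 50K × 100K DataFrame of float64 values (the dtype produced by `fillna(0)`) needs well over 16 GB.

```python
rows, cols = 50_000, 100_000
dense_gb = rows * cols * 8 / 1e9  # float64 takes 8 bytes per value
print(dense_gb)  # 40.0 GB, far beyond 16 GB of RAM
```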

Does anyone know how to re-implement the above line:

In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)

to make it more efficient, in terms of memory usage?

  • How is the original Series generated? Your best bet is to avoid ever having lists stored in it. Commented Oct 23, 2015 at 3:24
  • Good question. Unfortunately, the series is generated through scraping; not much I can do about it, since I inherited the dataset from the client. Commented Oct 23, 2015 at 3:49

2 Answers


You could try breaking your dataframe into chunks and writing to a file as you go, something like this:

chunksize = 10000

def make_indicators(df):
    return df.apply(lambda x: Series(1, index=x)).fillna(0)

with open('out.csv', 'w') as out:
    out.write(df.iloc[[]].to_csv())  # write the header
    for _, chunk in df.groupby(np.arange(len(df)) // chunksize):
        out.write(make_indicators(chunk).to_csv(header=None))
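As a sanity check, the chunked approach can be run in-memory on the toy series from the question (a minimal sketch; the tiny `chunksize` is only for illustration). One caveat: each chunk can produce a different set of columns, so the pieces must be realigned when combined, which also complicates writing raw CSV chunks:

```python
import numpy as np
import pandas as pd

s = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
chunksize = 2  # tiny chunk size just for the toy example

parts = [
    chunk.apply(lambda x: pd.Series(1, index=x)).fillna(0)
    for _, chunk in s.groupby(np.arange(len(s)) // chunksize)
]
# Chunks may have different column sets, so refill the gaps after concat
result = pd.concat(parts).fillna(0).astype(int)
print(result)
```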



If memory use is the issue, it seems like a sparse matrix solution would be better. Pandas doesn't really have sparse matrix support, but you could use scipy.sparse like this:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'), list('DEF'), list('ABEF')])

# Map every element to a column index; `cols` holds the unique elements
cols, ind = np.unique(np.concatenate(data), return_inverse=True)
indptr = np.cumsum([0] + list(map(len, data)))  # row boundaries in CSR format
vals = np.ones_like(ind)
M = csr_matrix((vals, ind, indptr))

This sparse matrix now contains the same data as the pandas solution, but the zeros are not explicitly stored. We can confirm this by converting the sparse matrix to a dataframe:

>>> pd.DataFrame(M.toarray(), columns=cols)
   A  B  C  D  E  F
0  1  1  1  0  0  0
1  0  0  0  1  1  1
2  1  1  0  0  1  1

Depending on what you're doing with the data from here, having it in a sparse form may help solve your problem without using excessive memory.
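For example, many aggregations can be done directly on the sparse matrix. A small sketch (repeating the construction above so it runs standalone) that counts how often each element occurs, without ever materializing the dense table:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

data = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
cols, ind = np.unique(np.concatenate(data), return_inverse=True)
indptr = np.cumsum([0] + list(map(len, data)))
M = csr_matrix((np.ones_like(ind), ind, indptr))

# Column sums give per-element occurrence counts while staying sparse
counts = np.asarray(M.sum(axis=0)).ravel()
print(dict(zip(cols, counts)))
```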


jakevdp: awesome, awesome response. What used to take HOURS now takes seconds. Nothing is more powerful than a good algorithm!
