In a previous thread, a brilliant response was given to the following problem (Pandas: reshaping data). The goal is to reshape a pandas Series containing lists into a pandas DataFrame in the following way:
In [8]: from pandas import Series
In [9]: s = Series([list('ABC'),list('DEF'),list('ABEF')])
In [10]: s
Out[10]:
0       [A, B, C]
1       [D, E, F]
2    [A, B, E, F]
dtype: object
should be shaped into this:
Out[11]:
   A  B  C  D  E  F
0  1  1  1  0  0  0
1  0  0  0  1  1  1
2  1  1  0  0  1  1
That is, a dataframe is created where every element in the lists of the series becomes a column. For every element in the series, a row in the dataframe is created. For every element in the lists, a 1 is assigned to the corresponding dataframe column (and 0 otherwise). I know that the wording may be cumbersome, but hopefully the example above is clear.
The brilliant response by user Jeff (https://stackoverflow.com/users/644898/jeff) was to write this simple yet powerful line of code:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
That turns Out[10] into Out[11].
That line of code has served me extremely well. However, I am now running into memory issues with a Series of roughly 50K rows whose lists contain about 100K distinct elements in total. My machine has 16 GB of memory. Before resorting to a bigger machine, I would like to find a more memory-efficient implementation of the line above.
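A rough back-of-the-envelope estimate suggests why memory runs out. The sketch below assumes the result is held as a dense float64 frame (which is what `fillna(0)` on NaN-padded data typically produces) and, purely for illustration, an average of 10 elements per list. That average is a hypothetical figure, not something measured on the real data.

```python
# Back-of-the-envelope memory estimate for the reshaped result.
n_rows = 50_000      # rows in the Series (from the question)
n_cols = 100_000     # distinct list elements overall (from the question)

# Dense storage: one 8-byte float64 per cell.
dense_gb = n_rows * n_cols * 8 / 1e9

# Sparse (CSR-style) storage: only the 1s are stored.
avg_list_len = 10    # ASSUMPTION: hypothetical average list length
nnz = n_rows * avg_list_len
# 1-byte value + 4-byte column index per stored entry, plus 4-byte row pointers.
sparse_mb = (nnz * (1 + 4) + (n_rows + 1) * 4) / 1e6

print(f"dense:  ~{dense_gb:.0f} GB")
print(f"sparse: ~{sparse_mb:.1f} MB")
```

Under these assumptions the dense frame alone would be well beyond 16 GB, before counting the intermediate per-row Series objects that `apply` creates.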
Does anyone know how to re-implement the above line:
In [11]: s.apply(lambda x: Series(1,index=x)).fillna(0)
to make it more efficient, in terms of memory usage?
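One possible direction, sketched under assumptions rather than offered as a definitive answer, is to build the indicator matrix directly in a sparse format so that only the 1s are stored. The helper name `lists_to_sparse_indicator` is hypothetical, and the sketch assumes SciPy is installed and that every list element is hashable and sortable.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def lists_to_sparse_indicator(s):
    """Return (csr_matrix, labels): a 1 wherever a row's list contains the label.

    Hypothetical helper; assumes `s` is a Series whose values are lists.
    """
    # Number of elements in each list, in row order.
    lengths = np.fromiter((len(lst) for lst in s), dtype=np.int64, count=len(s))
    # Row position of every flattened element, e.g. [0, 0, 0, 1, 1, 1, ...].
    rows = np.repeat(np.arange(len(s)), lengths)
    # Flatten all lists and map each distinct label to an integer column code.
    flat = pd.Series([item for lst in s for item in lst], dtype=object)
    cols, labels = pd.factorize(flat, sort=True)
    data = np.ones(len(rows), dtype=np.int8)
    mat = sparse.csr_matrix((data, (rows, cols)), shape=(len(s), len(labels)))
    # Duplicates within one list are summed during construction; reset to 1.
    mat.data[:] = 1
    return mat, labels


s = pd.Series([list('ABC'), list('DEF'), list('ABEF')])
mat, labels = lists_to_sparse_indicator(s)

# Optional: wrap the matrix in a DataFrame with sparse columns.
df = pd.DataFrame.sparse.from_spmatrix(mat, index=s.index, columns=labels)
print(df)
```

Working with the SciPy matrix and the `labels` index directly may be preferable at 100K columns, since the DataFrame wrapper holds one sparse array per column and could be slow to construct at that width.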