I have a pandas DataFrame in which one column contains arrays. I'd like to "flatten" it by repeating the values of the other columns for each element of the arrays.

I managed to do it by iterating over every row and building a temporary list of values, but this is "pure Python" and slow.

Is there a way to do this in pandas/NumPy? In other words, I'm trying to improve the flatten function in the example below.

Thanks a lot.

import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

def flatten(df):
    tmp = []
    def backend(r):
        x = r['x']
        y = r['y']
        zz = r['z']
        for z in zz:
            tmp.append({'x': x, 'y': y, 'z': z})
    df.apply(backend, axis=1)
    return pd.DataFrame(tmp)

print(flatten(toConvert).to_string(index=False))

Which gives:

x   y    z
1  10  101
1  10  102
1  10  103
2  20  201
2  20  202

2 Answers

Here's a NumPy-based solution. On Python 3, map returns an iterator, so it is wrapped in list before being passed to repeat:

np.column_stack((toConvert[['x','y']].values.\
     repeat(list(map(len,toConvert.z)),axis=0),np.hstack(toConvert.z)))

Sample run:

In [78]: toConvert
Out[78]: 
   x   y                z
0  1  10  (101, 102, 103)
1  2  20       (201, 202)

In [79]: np.column_stack((toConvert[['x','y']].values.\
    ...:      repeat(list(map(len,toConvert.z)),axis=0),np.hstack(toConvert.z)))
Out[79]: 
array([[  1,  10, 101],
       [  1,  10, 102],
       [  1,  10, 103],
       [  2,  20, 201],
       [  2,  20, 202]])
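
If a DataFrame is wanted rather than a bare array, the result can be wrapped back up. This is a minimal sketch, assuming Python 3 and a pandas version that provides to_numpy (0.24 or later), and assuming all values share one numeric dtype. The names lens, arr and flat are only for illustration, and np.concatenate is swapped in for np.hstack here:

```python
import numpy as np
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

# Number of elements in each tuple of z
lens = toConvert['z'].str.len().to_numpy()

# Repeat each (x, y) row once per element of z,
# then append the flattened z values as a third column
arr = np.column_stack((
    toConvert[['x', 'y']].to_numpy().repeat(lens, axis=0),
    np.concatenate(toConvert['z'].tolist()),
))

# Wrap the array back into a DataFrame with the original column names
flat = pd.DataFrame(arr, columns=['x', 'y', 'z'])
print(flat.to_string(index=False))
```

Note that column_stack produces a single-dtype array, so this approach fits best when x, y and z are all numeric.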

1 Comment

It can also be done using Pandas' DataFrame.apply(...) instead of Python's map(...): np.column_stack((toConvert[['x', 'y']].values.repeat(toConvert['z'].apply(len), axis=0), np.hstack(toConvert['z'])))

Use numpy.repeat with str.len to build columns x and y, and flatten z with itertools.chain.from_iterable:

import pandas as pd
import numpy as np
from itertools import chain

df = pd.DataFrame({
        "x": np.repeat(toConvert.x.values, toConvert.z.str.len()),
        "y": np.repeat(toConvert.y.values, toConvert.z.str.len()),
        "z": list(chain.from_iterable(toConvert.z))})

print(df)
   x   y    z
0  1  10  101
1  1  10  102
2  1  10  103
3  2  20  201
4  2  20  202
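
On newer pandas there may be a shorter route: DataFrame.explode, which was added in pandas 0.25. It is a different technique from the numpy.repeat approach above. The sketch below assumes pandas 0.25 or later, and the name exploded is only for illustration:

```python
import pandas as pd

toConvert = pd.DataFrame({
    'x': [1, 2],
    'y': [10, 20],
    'z': [(101, 102, 103), (201, 202)]
})

# explode() emits one row per element of z and repeats x and y;
# the original index labels are repeated as well, so reset the index
exploded = toConvert.explode('z').reset_index(drop=True)

# explode() leaves z with object dtype; cast it back to integers
exploded['z'] = exploded['z'].astype('int64')
print(exploded)
```

Whether this is faster than the numpy.repeat version depends on the data, so it is worth timing both on a realistic frame.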
