
I have input which is formatted like so [notice there will be more than just a and b]:

import numpy as np

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2)),
    "c": np.zeros((1000, 7, 8, 2,)),
    "d": np.zeros((1000, 6,)),
}

I want a DataFrame with 1000 rows and 3*4 + 5*2 named columns [say, a_0_0 ... a_2_3, and so on for the other keys] to contain the data that resides inside inp.

Unfortunately, this is not trivial, and searching Google, the docs, or Stack Overflow gives unrelated answers.

What is the standard way to do this?


I have tried creating a DataFrame from each of a and b, then stacking them, then renaming the columns, but it seems like overkill for something that should be much easier.


EDIT:

I am sorry, I should have mentioned that the arrays can have more than 3 dimensions; their sizes are arbitrary.

2 Answers


You can use np.ndarray.reshape and then np.column_stack here.

import numpy as np
import pandas as pd

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2)),
    "c": np.zeros((1000, 7, 8, 2,)),
    "d": np.zeros((1000, 6,)),
}

arrs = [arr.reshape(1000, -1) for arr in inp.values()]
out = np.column_stack(arrs)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
out.shape
(1000, 140) # 3*4 + 5*2 + 7*8*2 + 6 = 140

For the column names you can use itertools.product and itertools.chain.from_iterable. The names come out in the right order because product varies the last index fastest, which matches the C-order flattening that reshape uses by default.

from itertools import product, chain

# (name, trailing shape) pairs, e.g. ('a', (3, 4))
shapes = [(name, arr.shape[1:]) for name, arr in inp.items()]

def col_names(val):
    *prefix, shape = val
    # one run of string indices per trailing dimension
    names = [map(str, range(i)) for i in shape]
    # cartesian product yields e.g. ('a', '0', '0') -> 'a_0_0'
    return map('_'.join, product(prefix, *names))

cols = [*chain.from_iterable(col_names(val) for val in shapes)]
len(cols) # 140
cols
['a_0_0',
 'a_0_1',
 'a_0_2',
 ...
 'a_2_2',
 'a_2_3',
 'b_0_0',
 'b_0_1',
 ...
 'b_4_1',
 'c_0_0_0',
 ...
 'c_6_6_1',
 'c_6_7_0',
 'c_6_7_1',
 'd_0',
 ...
 'd_5']

Now use cols as the columns of your DataFrame.

pd.DataFrame(out, columns=cols)
     a_0_0  a_0_1  a_0_2  a_0_3  a_1_0  a_1_1  ...  d_0  d_1  d_2  d_3  d_4  d_5
0      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
1      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
2      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
3      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
4      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
..     ...    ...    ...    ...    ...    ...  ...  ...  ...  ...  ...  ...  ...
995    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
996    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
997    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
998    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
999    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0

[1000 rows x 140 columns]
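
If you want the whole thing as one reusable step, here is a minimal sketch combining both parts of this answer (reshape plus product-generated names). The helper name dict_to_frame is mine, not a pandas or NumPy API:

import numpy as np
import pandas as pd
from itertools import product

def dict_to_frame(inp):
    # Flatten each (n, ...) array to (n, k) and name the columns key_i_j_...
    blocks, cols = [], []
    for name, arr in inp.items():
        # flatten everything after the first (row) axis, C-order by default
        blocks.append(arr.reshape(arr.shape[0], -1))
        # one name per flattened cell, in the same C-order: 'a_0_0', 'a_0_1', ...
        idx_ranges = [map(str, range(d)) for d in arr.shape[1:]]
        cols.extend('_'.join((name, *idx)) for idx in product(*idx_ranges))
    return pd.DataFrame(np.column_stack(blocks), columns=cols)

# using the inp dict defined above
dict_to_frame(inp).shape  # (1000, 140)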

Comments

Nice solution! What about the column names?
@Gulzar Should the names be like a1 to a12 and b13 to b22?
a_0_0..a_2_3 ... b_0_0...b_4_1 ... and so on for the other names.
Please see the edit to the question; I am sorry, I should have stated that sooner.
The product trick is the neatest I've seen in months.

You can create two DataFrames and then pd.concat them:

import numpy as np
import pandas as pd

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2))
}

df1 = pd.DataFrame([{'a_{}_{}'.format(i1, i2): v2 for i1, v1 in enumerate(row) for i2, v2 in enumerate(v1)} for row in inp['a']])
df2 = pd.DataFrame([{'b_{}_{}'.format(i1, i2): v2 for i1, v1 in enumerate(row) for i2, v2 in enumerate(v1)} for row in inp['b']])

print( pd.concat([df1, df2], axis=1) )

Prints:

     a_0_0  a_0_1  a_0_2  a_0_3  a_1_0  ...  b_2_1  b_3_0  b_3_1  b_4_0  b_4_1
0      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
1      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
2      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
3      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
4      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
..     ...    ...    ...    ...    ...  ...    ...    ...    ...    ...    ...
995    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
996    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
997    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
998    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
999    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0

[1000 rows x 22 columns]

EDIT: To handle an arbitrary number of keys:

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2))
}

dfs = []
for k, v in inp.items():
    dfs.append( pd.DataFrame([{'{}_{}_{}'.format(k, i1, i2): v2 for i1, v1 in enumerate(row) for i2, v2 in enumerate(v1)} for row in v])  )

print( pd.concat(dfs, axis=1) )
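
Note that the dict comprehension above indexes exactly two trailing axes, so it covers 3-D arrays only. As a hedged sketch, the same per-key idea generalized to arbitrary dimensionality (which the edited question asks for) could look like this, iterating itertools.product over each trailing shape; the variable names here are mine:

import numpy as np
import pandas as pd
from itertools import product

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2)),
    "c": np.zeros((1000, 7, 8, 2)),
    "d": np.zeros((1000, 6)),
}

dfs = []
for k, v in inp.items():
    # every combination of trailing indices, e.g. (0, 0, 0) ... (6, 7, 1) for "c"
    idx = list(product(*(range(d) for d in v.shape[1:])))
    dfs.append(pd.DataFrame(
        [{'_'.join([k, *map(str, i)]): row[i] for i in idx} for row in v]
    ))

print( pd.concat(dfs, axis=1) )  # 1000 rows x 140 columns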

1 Comment

Please notice there are more than just a and b; there can be an arbitrary number of such entries in inp.
