
I have input which is formatted like so [notice there will be more than just a and b]:

import numpy as np

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2)),
    "c": np.zeros((1000, 7, 8, 2,)),
    "d": np.zeros((1000, 6,)),
}

I want a DataFrame with 1000 rows and 3*4 + 5*2 named columns [say, a_0_0 ... a_2_3, and so on for the other keys] to contain the data that resides inside inp.

Unfortunately, this is not trivial, and searching Google, the docs, or Stack Overflow gives unrelated answers.

What is the standard way to do this?


I have tried creating a DataFrame from each of a and b, then stacking them, then renaming the columns, but it seems like overkill for something that should be much easier.


EDIT:

I am sorry, I should have mentioned that the arrays can have more than 3 dimensions; their sizes are arbitrary.

2 Answers


You can use np.ndarray.reshape and then np.column_stack here.

import numpy as np
import pandas as pd

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2)),
    "c": np.zeros((1000, 7, 8, 2,)),
    "d": np.zeros((1000, 6,)),
}

arrs = [arr.reshape(1000, -1) for arr in inp.values()]
out = np.column_stack(arrs)

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
out.shape
(1000, 140) # 3*4 + 5*2 + 7*8*2 + 6 = 140

For the column names you can use itertools.product and itertools.chain.from_iterable. The names come out in the right order because product varies the last index fastest, which matches the C-order flattening that reshape uses by default.

from itertools import product, chain

# (name, trailing shape) pairs, e.g. ('a', (3, 4))
shapes = [(name, arr.shape[1:]) for name, arr in inp.items()]

def col_names(val):
    *prefix, shape = val
    # one run of string indices per trailing dimension
    names = [map(str, range(i)) for i in shape]
    # cartesian product yields e.g. ('a', '0', '0') -> 'a_0_0'
    return map('_'.join, product(prefix, *names))

cols = [*chain.from_iterable(col_names(val) for val in shapes)]
len(cols) # 140
cols
['a_0_0',
 'a_0_1',
 'a_0_2',
 ...
 'a_2_2',
 'a_2_3',
 'b_0_0',
 'b_0_1',
 ...
 'b_4_1',
 'c_0_0_0',
 ...
 'c_6_6_1',
 'c_6_7_0',
 'c_6_7_1',
 'd_0',
 ...
 'd_5']

Now use cols as the columns of your DataFrame.

pd.DataFrame(out, columns=cols)
     a_0_0  a_0_1  a_0_2  a_0_3  a_1_0  a_1_1  ...  d_0  d_1  d_2  d_3  d_4  d_5
0      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
1      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
2      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
3      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
4      0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
..     ...    ...    ...    ...    ...    ...  ...  ...  ...  ...  ...  ...  ...
995    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
996    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
997    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
998    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0
999    0.0    0.0    0.0    0.0    0.0    0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0

[1000 rows x 140 columns]
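
If you want the whole thing as one reusable step, here is a minimal sketch combining both parts of this answer (reshape plus product-generated names). The helper name dict_to_frame is mine, not a pandas or NumPy API:

import numpy as np
import pandas as pd
from itertools import product

def dict_to_frame(inp):
    # Flatten each (n, ...) array to (n, k) and name the columns key_i_j_...
    blocks, cols = [], []
    for name, arr in inp.items():
        # flatten everything after the first (row) axis, C-order by default
        blocks.append(arr.reshape(arr.shape[0], -1))
        # one name per flattened cell, in the same C-order: 'a_0_0', 'a_0_1', ...
        idx_ranges = [map(str, range(d)) for d in arr.shape[1:]]
        cols.extend('_'.join((name, *idx)) for idx in product(*idx_ranges))
    return pd.DataFrame(np.column_stack(blocks), columns=cols)

# using the inp dict defined above
dict_to_frame(inp).shape  # (1000, 140)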

Comments

Nice solution! What about the column names?
@Gulzar Should the names be like a1 to a12 and b13 to b22?
a_0_0..a_2_3 ... b_0_0...b_4_1 ... and so on for the other names.
Please see the edit to the question; I am sorry, I should have stated that sooner.
The product trick is the neatest I've seen in months.

You can create two DataFrames and then pd.concat them:

import numpy as np
import pandas as pd

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2))
}

df1 = pd.DataFrame([{'a_{}_{}'.format(i1, i2): v2 for i1, v1 in enumerate(row) for i2, v2 in enumerate(v1)} for row in inp['a']])
df2 = pd.DataFrame([{'b_{}_{}'.format(i1, i2): v2 for i1, v1 in enumerate(row) for i2, v2 in enumerate(v1)} for row in inp['b']])

print( pd.concat([df1, df2], axis=1) )

Prints:

     a_0_0  a_0_1  a_0_2  a_0_3  a_1_0  ...  b_2_1  b_3_0  b_3_1  b_4_0  b_4_1
0      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
1      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
2      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
3      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
4      0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
..     ...    ...    ...    ...    ...  ...    ...    ...    ...    ...    ...
995    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
996    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
997    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
998    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0
999    0.0    0.0    0.0    0.0    0.0  ...    0.0    0.0    0.0    0.0    0.0

[1000 rows x 22 columns]

EDIT: To handle an arbitrary number of keys:

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2))
}

dfs = []
for k, v in inp.items():
    dfs.append( pd.DataFrame([{'{}_{}_{}'.format(k, i1, i2): v2 for i1, v1 in enumerate(row) for i2, v2 in enumerate(v1)} for row in v])  )

print( pd.concat(dfs, axis=1) )
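
Note that the dict comprehension above indexes exactly two trailing axes, so it covers 3-D arrays only. As a hedged sketch, the same per-key idea generalized to arbitrary dimensionality (which the edited question asks for) could look like this, iterating itertools.product over each trailing shape; the variable names here are mine:

import numpy as np
import pandas as pd
from itertools import product

inp = {
    "a": np.zeros((1000, 3, 4)),
    "b": np.zeros((1000, 5, 2)),
    "c": np.zeros((1000, 7, 8, 2)),
    "d": np.zeros((1000, 6)),
}

dfs = []
for k, v in inp.items():
    # every combination of trailing indices, e.g. (0, 0, 0) ... (6, 7, 1) for "c"
    idx = list(product(*(range(d) for d in v.shape[1:])))
    dfs.append(pd.DataFrame(
        [{'_'.join([k, *map(str, i)]): row[i] for i in idx} for row in v]
    ))

print( pd.concat(dfs, axis=1) )  # 1000 rows x 140 columns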

1 Comment

Please notice there are more than just a and b; there can be an arbitrary number of such entries in inp.
