I have a dataset in the following format. It has 48 columns and about 200,000 rows.

slot1,slot2,slot3,slot4,slot5,slot6...,slot45,slot46,slot47,slot48
1,2,3,4,5,6,7,......,45,46,47,48
3.5,5.2,2,5.6,...............

I want to reshape this dataset into something like the following, where N is less than 48 (maybe 24, 12, etc.). The column headers don't matter. For example, when N = 4:

slotNew1,slotNew2,slotNew3,slotNew4
1,2,3,4
5,6,7,8
......
45,46,47,48
3.5,5.2,2,5.6
............

I can read row by row, split each row, and append the pieces to a new dataframe, but that is very inefficient. Is there a faster way to do this?

  • Is each row a joined string, or already split into cells? Commented Aug 12, 2019 at 2:02
  • Already split into cells :) I'm not splitting any cells. Commented Aug 12, 2019 at 2:04
  • And is N always a factor of ncols? Commented Aug 12, 2019 at 2:06
  • Hmm, it is not a must, but I can assume N is a factor of 48. Commented Aug 12, 2019 at 2:07

2 Answers


You may try this:

N = 4
df_new = pd.DataFrame(df_original.values.reshape(-1, N))
df_new.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]

The code extracts the data into a numpy.ndarray, reshapes it, and creates a new DataFrame with the desired dimensions. It requires that the total number of values be divisible by N.

Example:

import numpy as np
import pandas as pd

df0 = pd.DataFrame(np.arange(48 * 3).reshape(-1, 48))
df0.columns = ['slot{:}'.format(i + 1) for i in range(48)]
print(df0)
#    slot1  slot2  slot3  slot4   ...    slot45  slot46  slot47  slot48
# 0      0      1      2      3   ...        44      45      46      47
# 1     48     49     50     51   ...        92      93      94      95
# 2     96     97     98     99   ...       140     141     142     143
# 
# [3 rows x 48 columns]

N = 4
df = pd.DataFrame(df0.values.reshape(-1, N))
df.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
print(df.head())
#    slotNew1  slotNew2  slotNew3  slotNew4
# 0         0         1         2         3
# 1         4         5         6         7
# 2         8         9        10        11
# 3        12        13        14        15
# 4        16        17        18        19

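Another approach, using stack and pivot:

N = 4
# Long format: one row per (original row, column name, value)
df1 = df0.stack().reset_index()
# Slot number parsed from the column name ('slot5' -> 5), split into new row i and column j
df1['i'] = df1['level_1'].str.replace('slot', '').astype(int) // N
df1['j'] = df1['level_1'].str.replace('slot', '').astype(int) % N
# Slots that are multiples of N belong to the previous new row; also offset by the original row
df1['i'] -= (df1['j'] == 0) - df1['level_0'] * 48 / N
# Those same slots go in the last new column (j = N instead of 0)
df1['j'] += (df1['j'] == 0) * N
df1['j'] = 'slotNew' + df1['j'].astype(str)
df1 = df1[['i', 'j', 0]]
df = df1.pivot(index='i', columns='j', values=0)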

1 Comment

It was my mistake. I didn't remove unwanted columns before reshaping. When I remove the unwanted columns your solution works. Thanks (y)
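
The pitfall mentioned in this comment can be avoided by selecting only the slot columns before reshaping. The following is a minimal sketch, not part of the original answer: the extra `id` column and the sample data are hypothetical, and it assumes the slot columns are named `slot1` to `slot48` as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 48 slot columns plus one unrelated 'id' column.
slot_cols = ['slot{}'.format(i + 1) for i in range(48)]
df_original = pd.DataFrame(np.arange(48 * 2).reshape(-1, 48), columns=slot_cols)
df_original.insert(0, 'id', ['a', 'b'])

N = 4
# Keep only the slot columns so that each row contributes exactly 48 values.
values = df_original[slot_cols].to_numpy()
if values.shape[1] % N != 0:
    raise ValueError('N must divide the number of slot columns')

df_new = pd.DataFrame(values.reshape(-1, N),
                      columns=['slotNew{}'.format(i + 1) for i in range(N)])
print(df_new.shape)
```

The explicit divisibility check fails early with a readable message instead of letting reshape raise, or silently misalign rows when only the total element count happens to be divisible by N.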

Use pandas.Series.explode after splitting each row into chunks. Given df:

import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(1, 49)], columns=['slot%s' % i for i in range(1, 49)])
print(df)

   slot1  slot2  slot3  slot4  slot5  slot6  slot7  slot8  slot9  slot10  ...  \
0      1      2      3      4      5      6      7      8      9      10  ...   

   slot39  slot40  slot41  slot42  slot43  slot44  slot45  slot46  slot47  \
0      39      40      41      42      43      44      45      46      47   

   slot48  
0      48  

Using chunks to divide:

def chunks(l, n):
    """Yield successive n-sized chunks from l.
    Source: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n_items = len(l)
    if n_items % n:
        n_pads = n - n_items % n
    else:
        n_pads = 0
    l = l + [np.nan for _ in range(n_pads)] 
    for i in range(0, len(l), n):
        yield l[i:i + n]

N = 4
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)

Output:

     0   1   2   3
0    1   2   3   4
1    5   6   7   8
2    9  10  11  12
3   13  14  15  16
4   17  18  19  20
...

An advantage of this approach over numpy.reshape is that it can handle the case where N is not a factor of the number of columns:

N = 7
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), 1).explode()))
print(new_df)

Output:

    0   1   2   3   4   5     6
0   1   2   3   4   5   6   7.0
1   8   9  10  11  12  13  14.0
2  15  16  17  18  19  20  21.0
3  22  23  24  25  26  27  28.0
4  29  30  31  32  33  34  35.0
5  36  37  38  39  40  41  42.0
6  43  44  45  46  47  48   NaN
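
The same padding idea can also be combined with a NumPy reshape, which may be faster on a large frame than apply and explode. This is a sketch under assumptions rather than part of the original answer: `reshape_with_padding` is a hypothetical helper, and it converts all values to float so that NaN can be stored.

```python
import numpy as np
import pandas as pd

def reshape_with_padding(df, n):
    """Hypothetical helper: split each row into n-wide chunks, padding with NaN."""
    values = df.to_numpy(dtype=float)  # float dtype so NaN padding is representable
    n_pads = -values.shape[1] % n      # columns needed to reach a multiple of n
    # Pad only on the right side of the column axis.
    padded = np.pad(values, ((0, 0), (0, n_pads)), constant_values=np.nan)
    return pd.DataFrame(padded.reshape(-1, n))

df = pd.DataFrame([np.arange(1, 49)], columns=['slot%s' % i for i in range(1, 49)])
new_df = reshape_with_padding(df, 7)
print(new_df.shape)
```

Unlike the explode version, every column here ends up as float, not only the column that receives the padding.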

1 Comment

I marked kitman's answer as accepted since it is direct when N is a factor of 48. But your answer is valid even when N is not a factor. Thanks :)
