Merge Multiple Columns As New Rows in Pandas Dataframe

Question

I am trying to unroll a list within a column to add more rows for the purpose of feeding it into a swarmplot.

Right now, I build a dictionary of lists:

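import numpy as np

# store the list of metrics for each classifier id
clf_aucs = dict()
for clf_id in ["clf0", "clf1", "clf2", "clf3", "clf4"]:
    _list = np.arange(0, 500, dtype=float)  # build dummy list of floats
    clf_aucs[clf_id] = _list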

This dictionary has, say, 5 keys, each mapped to a list of 500 floats. I then create a dataframe from it:

clf_aucs_df = pd.DataFrame(clf_aucs).transpose()
clf_aucs_df = clf_aucs_df.reset_index()
display(clf_aucs_df.head())
print(clf_aucs_df.shape)

The result would look like:

index   0   1   2   3   4   5   6   7   8   ... 490 491 492 493 494 495 496 497 498 499
0   clf0    0.432609    0.398760    0.292517    0.411905    0.385375    0.390023    0.364286    0.364035    0.450000    ... 0.477273    0.355372    0.378000    0.386667    0.396104    0.395085    0.426667    0.461957    0.402746    0.445238
1   clf1    0.432900    0.231602    0.416149    0.365217    0.414286    0.461039    0.325217    0.357143    0.447826    ... 0.402893    0.323913    0.420949    0.434783    0.372294    0.360417    0.410208    0.420949    0.392857    0.343685
2   clf2    0.322314    0.400000    0.409524    0.405797    0.466942    0.383399    0.478261    0.405896    0.432892    ... 0.371542    0.494318    0.493750    0.415238    0.414079    0.400433    0.402778    0.493478    0.478261    0.458498
3   clf3    0.509921    0.579051    0.545455    0.658103    0.576560    0.500000    0.515810    0.505682    0.525880    ... 0.590909    0.553360    0.409938    0.462585    0.584348    0.575397    0.472332    0.513834    0.587500    0.612500
4   clf4    0.474206    0.490451    0.479437    0.593750    0.545455    0.580357    0.484127    0.596273    0.537549    ... 0.665909    0.545351    0.609375    0.556277    0.531522    0.511905    0.583851    0.543478    0.513889    0.583333
5 rows × 501 columns

My question is: how can I merge columns 0-499 so that the new dataframe is 2500 rows x 2 columns, with the id column and a single numerical column?

Other attempts:

Tried different ways of creating a dataframe from a list of lists.
Looked at merge/join, but these seem to be geared towards combining separate dataframes and adding columns.

Clade · Accepted Answer · 2019-08-15 21:57:40Z

I believe what you are looking for is pd.melt:

import numpy as np
import pandas as pd

# recreate DataFrame from example
clf_aucs = dict()
for id_ in range(5):
    clf_aucs[f"clf{id_}"] = np.random.uniform(size=(500, ))
clf_aucs_df = pd.DataFrame(clf_aucs).T.reset_index().rename(
    columns={"index": "ID"})

# melt DataFrame
clf_aucs_df = pd.melt(clf_aucs_df, id_vars="ID", value_name="Numerical_Column")

# drop what were the column names prior to reshaping the DataFrame
clf_aucs_df.drop(columns="variable", inplace=True)

# sort first on ID and then on Numerical_Column
clf_aucs_df.sort_values(["ID", "Numerical_Column"], inplace=True)

# reindex from 0
clf_aucs_df.reset_index(drop=True, inplace=True)

The input was:

     ID         0         1         2  ...       496       497       498       499
0  clf0  0.647251  0.976586  0.675573  ...  0.911264  0.983211  0.685464  0.519285
1  clf1  0.034560  0.340834  0.443456  ...  0.412356  0.968721  0.833882  0.634775
2  clf2  0.723530  0.087285  0.014977  ...  0.563904  0.962543  0.860245  0.679423
3  clf3  0.863781  0.609096  0.214915  ...  0.382548  0.798677  0.196336  0.673109
4  clf4  0.185867  0.006018  0.635887  ...  0.622308  0.802546  0.771671  0.536761

and the output is:

        ID  Numerical_Column
0     clf0          0.000779
1     clf0          0.001084
2     clf0          0.001478
3     clf0          0.004019
4     clf0          0.004034
...    ...               ...
2495  clf4          0.996943
2496  clf4          0.998093
2497  clf4          0.998384
2498  clf4          0.999620
2499  clf4          0.999668
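
As a possible shortcut, and assuming the dictionary maps each ID to an equal-length sequence as in the question, melt can also be applied without the transpose, because the dictionary keys are then already the column names. This is only a minimal sketch: the names rng and long_df are illustrative, and the column names ID and Numerical_Column are reused from the code above.

```python
import numpy as np
import pandas as pd

# dummy data standing in for the question's dictionary (5 keys x 500 floats)
rng = np.random.default_rng(0)
clf_aucs = {f"clf{i}": rng.uniform(size=500) for i in range(5)}

# the columns are the dictionary keys, so melting with no id_vars
# yields one row per (key, value) pair
long_df = pd.DataFrame(clf_aucs).melt(var_name="ID", value_name="Numerical_Column")

print(long_df.shape)  # (2500, 2)
```

Unlike the sorted version, this keeps the original order of values within each ID. The long frame could then be handed to a plotting call such as seaborn.swarmplot(data=long_df, x="ID", y="Numerical_Column").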

Marat · Accepted Answer · 2019-08-16 01:33:59Z

One-liner (data_dict here stands for the dictionary of lists, e.g. clf_aucs in the question):

pd.DataFrame(data_dict).T.stack().reset_index().drop(columns=['level_1'])

How it works, step by step:

>>> import pandas as pd
>>> data = {'clf0': [1,2,3,4], 'clf1': [5,6,7,8]}
>>> df = pd.DataFrame(data)
>>> df
   clf0  clf1
0     1     5
1     2     6
2     3     7
3     4     8
>>> df.T.stack().reset_index()
  level_0  level_1  0
0    clf0        0  1
1    clf0        1  2
2    clf0        2  3
3    clf0        3  4
4    clf1        0  5
5    clf1        1  6
6    clf1        2  7
7    clf1        3  8
>>> # the former index is now 'level_1'; the values are in the column named 0
>>> df.T.stack().reset_index().drop(columns=['level_1'])
  level_0  0
0    clf0  1
1    clf0  2
2    clf0  3
3    clf0  4
4    clf1  5
5    clf1  6
6    clf1  7
7    clf1  8
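
The leftover column names level_0 and 0 can be avoided by naming things during the reset. The following is a sketch of one way to do that with the same toy data; the names long_df, ID and Numerical_Column are assumptions about the desired output, not part of the one-liner above.

```python
import pandas as pd

data = {'clf0': [1, 2, 3, 4], 'clf1': [5, 6, 7, 8]}

long_df = (
    pd.DataFrame(data).T
    .stack()                              # Series indexed by (key, position)
    .reset_index(level=1, drop=True)      # discard the position level
    .rename_axis("ID")                    # name the remaining index level
    .reset_index(name="Numerical_Column") # back to a two-column DataFrame
)

print(long_df.columns.tolist())  # ['ID', 'Numerical_Column']
```

This produces the same rows as the step-by-step version, only with descriptive column names instead of level_0 and 0.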
