
I am trying to unroll the lists stored in a dataframe into additional rows, so that I can feed the result into a swarmplot.

Right now, I build a dictionary of lists:

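import numpy as np

# store the list of metrics for each classifier
clf_aucs = dict()
for i in range(5):
    _list = np.random.uniform(size=500)  # build dummy list of 500 floats
    clf_aucs[f"clf{i}"] = _list  # avoid using the builtin `id` as a key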

This dictionary has, say, 5 keys, each mapping to a list of 500 floats. I then create a dataframe from it:

import pandas as pd

clf_aucs_df = pd.DataFrame(clf_aucs).transpose()
clf_aucs_df = clf_aucs_df.reset_index()
display(clf_aucs_df.head())  # display() is provided by Jupyter/IPython
print(clf_aucs_df.shape)

The result would look like:

index   0   1   2   3   4   5   6   7   8   ... 490 491 492 493 494 495 496 497 498 499
0   clf0    0.432609    0.398760    0.292517    0.411905    0.385375    0.390023    0.364286    0.364035    0.450000    ... 0.477273    0.355372    0.378000    0.386667    0.396104    0.395085    0.426667    0.461957    0.402746    0.445238
1   clf1    0.432900    0.231602    0.416149    0.365217    0.414286    0.461039    0.325217    0.357143    0.447826    ... 0.402893    0.323913    0.420949    0.434783    0.372294    0.360417    0.410208    0.420949    0.392857    0.343685
2   clf2    0.322314    0.400000    0.409524    0.405797    0.466942    0.383399    0.478261    0.405896    0.432892    ... 0.371542    0.494318    0.493750    0.415238    0.414079    0.400433    0.402778    0.493478    0.478261    0.458498
3   clf3    0.509921    0.579051    0.545455    0.658103    0.576560    0.500000    0.515810    0.505682    0.525880    ... 0.590909    0.553360    0.409938    0.462585    0.584348    0.575397    0.472332    0.513834    0.587500    0.612500
4   clf4    0.474206    0.490451    0.479437    0.593750    0.545455    0.580357    0.484127    0.596273    0.537549    ... 0.665909    0.545351    0.609375    0.556277    0.531522    0.511905    0.583851    0.543478    0.513889    0.583333
5 rows × 501 columns

My question is: how can I merge columns 0-499 into a single column, so that the new dataframe has 2500 rows x 2 columns (the id column and one numerical column)?

Other attempts:

  1. Tried different ways of creating a dataframe from a list of lists.
  2. Looked at merge/join, but these seem geared towards combining separate dataframes and adding columns.

2 Answers

3

I believe what you are looking for is pd.melt:

import numpy as np
import pandas as pd

# recreate DataFrame from example
clf_aucs = dict()
for id_ in range(5):
    clf_aucs[f"clf{id_}"] = np.random.uniform(size=(500, ))
clf_aucs_df = pd.DataFrame(clf_aucs).T.reset_index().rename(
    columns={"index": "ID"})

# melt DataFrame
clf_aucs_df = pd.melt(clf_aucs_df, id_vars="ID", value_name="Numerical_Column")

# drop what were the column names prior to reshaping the DataFrame
clf_aucs_df.drop(columns="variable", inplace=True)

# sort first on ID and then on Numerical_Column
clf_aucs_df.sort_values(["ID", "Numerical_Column"], inplace=True)

# reindex from 0
clf_aucs_df.reset_index(drop=True, inplace=True)

The input was:

     ID         0         1         2  ...       496       497       498       499
0  clf0  0.647251  0.976586  0.675573  ...  0.911264  0.983211  0.685464  0.519285
1  clf1  0.034560  0.340834  0.443456  ...  0.412356  0.968721  0.833882  0.634775
2  clf2  0.723530  0.087285  0.014977  ...  0.563904  0.962543  0.860245  0.679423
3  clf3  0.863781  0.609096  0.214915  ...  0.382548  0.798677  0.196336  0.673109
4  clf4  0.185867  0.006018  0.635887  ...  0.622308  0.802546  0.771671  0.536761

and the output is:

        ID  Numerical_Column
0     clf0          0.000779
1     clf0          0.001084
2     clf0          0.001478
3     clf0          0.004019
4     clf0          0.004034
...    ...               ...
2495  clf4          0.996943
2496  clf4          0.998093
2497  clf4          0.998384
2498  clf4          0.999620
2499  clf4          0.999668
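
A shorter route, offered as a sketch rather than a replacement for the steps above: when the dictionary is passed to pd.DataFrame, each key already becomes a column, so that frame can be melted directly without the transpose and reset_index. The column names ID and Numerical_Column follow the example above; the seeded random data and the name long_df are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Dummy data (assumption): 5 classifiers with 500 random scores each.
rng = np.random.default_rng(0)
clf_aucs = {f"clf{i}": rng.uniform(size=500) for i in range(5)}

# Each dictionary key becomes a column. Melting with no id_vars unpivots
# every column, so var_name receives the former column name (the classifier id).
long_df = pd.DataFrame(clf_aucs).melt(var_name="ID", value_name="Numerical_Column")

print(long_df.shape)  # (2500, 2)
```

If seaborn is installed, a long-form frame like this should be usable as sns.swarmplot(data=long_df, x="ID", y="Numerical_Column").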

2

One liner:

pd.DataFrame(data_dict).T.stack().reset_index().drop(columns=['level_1'])

How it works, step by step:

>>> data = {'clf0': [1,2,3,4], 'clf1': [5,6,7,8]}
>>> df = pd.DataFrame(data)
>>> df
   clf0  clf1
0     1     5
1     2     6
2     3     7
3     4     8
>>> df.T.stack().reset_index()
  level_0  level_1  0
0    clf0        0  1
1    clf0        1  2
2    clf0        2  3
3    clf0        3  4
4    clf1        0  5
5    clf1        1  6
6    clf1        2  7
7    clf1        3  8
>>> # the former index is now in column 'level_1'; the values are in column 0
>>> df.T.stack().reset_index().drop(columns=['level_1'])
  level_0  0
0    clf0  1
1    clf0  2
2    clf0  3
3    clf0  4
4    clf1  5
5    clf1  6
6    clf1  7
7    clf1  8
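
The default column names level_0 and 0 are awkward to refer to when plotting. One possible way to name them within the same chain is sketched below; the names ID, value and tidy are hypothetical choices, not part of the answer above.

```python
import pandas as pd

data = {"clf0": [1, 2, 3, 4], "clf1": [5, 6, 7, 8]}

tidy = (
    pd.DataFrame(data)
    .T.stack()                        # Series indexed by (classifier, position)
    .reset_index(level=1, drop=True)  # discard the position level
    .rename_axis("ID")                # name the remaining index level
    .reset_index(name="value")        # index -> "ID" column, values -> "value"
)
print(tidy)
```

This yields the same rows as the one-liner, with the two columns labelled ID and value instead of level_0 and 0.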
