Pandas read_csv(): column contains dataframe but is read as string

Question

I have a pandas dataframe (df1) consisting of 56 rows and 9 columns, one column ['Matrix'] contains a dataframe (df2) per cell. Saving and loading this df1 turns df2 into a string. However, I need this to be a dataframe.

I came across several solutions (see below) but can't seem to apply these to my situation as I have a dataframe containing a dataframe.

df2 is build up out of two columns ('Mean', 'SD') and consists out of 95k rows per df1 row (sample below (they're not all 0 ;)).

My gut feeling is that my approach itself is suboptimal to start with.

Could someone give me a nudge in the right direction?

' Mean SD\r\n0 0.00 0.000000\r\n1 0.00 0.000000\r\n2 0.00 0.000000\r\n3 0.00 0.000000\r\n4 0.00 0.000000\r\n5 0.00 0.000000\r\n6 0.00 0.000000\r\n7 0.00 0.000000\r\n8 0.00 0.000000\r\n9 ...... 0.000000\r\n95194 0.00 0.000000\r\n95195 0.00 0.000000\r\n95196 0.00 0.000000\r\n95197 0.00 0.000000\r\n95198 0.00 0.000000\r\n95199 0.00 0.000000\r\n\r\n[95200 rows x 2 columns]'

You just need to loop through df1 and convert the cell to a dataframe. I haven't tried it in a while but you can probably load it up as a StringIO object and read that object: stackoverflow.com/questions/22604564/… . Then you can store it in a dictionary with each key referring to the row number or something like that. (You can also keep it df1, but that normally wouldn't make much sense from a tabular pov) — k88
– k88, Commented Jan 18, 2020 at 7:46

Geeocode · Accepted Answer · 2020-01-19 10:02:35Z

You should use json file format as saving your DataFrame.

Consider the following simplified example:

import pandas as pd

df1 = pd.DataFrame([[0,1], [2,3]], columns=["first", "second"])
df2 = pd.DataFrame([[4,5,6], [7,8,9]])
df1["dataf"] = [df2, df2]

print("\nDataFrame df1:")
print("**************")  
print(df1)
print("\ndataf column:")
print("***************")  
print(df1["dataf"])
print("\ndataf column cell:")
print("********************")  
print(df1["dataf"][0])
print("Type of dataf cells:", type(df1["dataf"][0]))

Out:

DataFrame df1:
**************
   first  second                             dataf
0      0       1     0  1  2
0  4  5  6
1  7  8  9
1      2       3     0  1  2
0  4  5  6
1  7  8  9

dataf column:
***************
0       0  1  2
0  4  5  6
1  7  8  9
1       0  1  2
0  4  5  6
1  7  8  9
Name: dataf, dtype: object

dataf column cell:
********************
   0  1  2
0  4  5  6
1  7  8  9
Type of dataf cells: <class 'pandas.core.frame.DataFrame'>

Now save our DataFrame as json using pandas.DataFrame.to_json:

df1.to_json("test.json")

Loading our data back with pandas.read_json:

df1 = pd.read_json("test.json")

print("\nDataFrame df1:")
print("**************")  
print(df1)
print("\ndataf column:")
print("***************")  
print(df1["dataf"])
print("\ndataf column cell:")
print("********************")  
print(df1["dataf"][0])
print("Type of dataf cells:", type(df1["dataf"][0]))

Out:

DataFrame df1:
**************
   first  second                                              dataf
0      0       1  {'0': {'0': 4, '1': 7}, '1': {'0': 5, '1': 8},...
1      2       3  {'0': {'0': 4, '1': 7}, '1': {'0': 5, '1': 8},...

dataf column:
***************
0    {'0': {'0': 4, '1': 7}, '1': {'0': 5, '1': 8},...
1    {'0': {'0': 4, '1': 7}, '1': {'0': 5, '1': 8},...
Name: dataf, dtype: object

dataf column cell:
********************
{'0': {'0': 4, '1': 7}, '1': {'0': 5, '1': 8}, '2': {'0': 6, '1': 9}}
Type of dataf cells: <class 'dict'>

We can simply convert our effected columns to DataFrames using pandas.DataFrame.apply

df1["dataf"] = df1["dataf"].apply(lambda x: pd.DataFrame(x))

print("\nDataFrame df1:")
print("**************")  
print(df1)
print("\ndataf column:")
print("***************")  
print(df1["dataf"])
print("\ndataf column cell:")
print("********************")  
print(df1["dataf"][0])
print("Type of dataf cells:", type(df1["dataf"][0]))

Out:

DataFrame df1:
**************
   first  second                             dataf
0      0       1     0  1  2
0  4  5  6
1  7  8  9
1      2       3     0  1  2
0  4  5  6
1  7  8  9

dataf column:
***************
0       0  1  2
0  4  5  6
1  7  8  9
1       0  1  2
0  4  5  6
1  7  8  9
Name: dataf, dtype: object

dataf column cell:
********************
   0  1  2
0  4  5  6
1  7  8  9
Type of dataf cells: <class 'pandas.core.frame.DataFrame'>

You can see that our core code is so simple as follows:

df1.to_json("test.json")
df1 = pd.read_json("test.json")
df1["dataf"] = df1["dataf"].apply(lambda x: pd.DataFrame(x))

Amazing, thank you! As mentioned, I was doubting my approach with the csv export. This is perfect.

Collectives™ on Stack Overflow

Pandas read_csv(): column contains dataframe but is read as string

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related