
(Editing to clarify my application, sorry for any confusion)

I run an experiment broken up into trials. Each trial can produce invalid or valid data; valid data take the form of a list of numbers, which can be of zero length.

So an invalid trial produces None, while a valid trial can produce [] or [1, 2], etc.

Ideally, I'd like to be able to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted into a frame_table, which I use as a selector to extract rows (trials). I would then like to pull up my data using select_as_multiple.
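Schematically, this is what I'd like to be able to do (a sketch only, assuming both tables had already been appended as frame_tables; the table names and the 'valid' column are placeholders):

import pandas as pd

store = pd.HDFStore('experiment.h5')

# 'trials' is the selector: one row of metadata per trial,
# including a boolean 'valid' column to query on.
# 'data' would be a frame_table holding the per-trial numbers.
result = store.select_as_multiple(
    ['trials', 'data'],        # tables sharing the same row index
    where='valid == True',     # select rows via the trials metadata
    selector='trials')

store.close()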

Right now, I'm saving the data structure as a regular table, since I'm using an object array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable-length nature of the data.

I understand that I can use NaNs and make a (potentially very wide) table whose width is the maximum length of my data arrays, but then I need a separate mechanism to flag invalid trials. A row of all NaNs is ambiguous: does it mean a zero-length valid trial, or an invalid trial?

I think there is no good solution to this using Pandas: the NaN approach leads to potentially extremely wide tables plus an additional column marking trials as valid/invalid.
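To make that concrete, here is a sketch of the NaN-padded layout (all names are illustrative; the 'valid' column is the extra flag I'd need):

import numpy as np
import pandas as pd

trials = [None, [], [1.0, 2.0], [3.0]]   # invalid, zero-length, two normal trials
width = max(len(t) for t in trials if t is not None)

padded = pd.DataFrame(
    [list(t or []) + [np.nan] * (width - len(t or [])) for t in trials],
    dtype=float)
padded['valid'] = [t is not None for t in trials]

# Rows 0 and 1 both come out all-NaN; only 'valid' tells an
# invalid trial apart from a zero-length one.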

If I were using a database, I would make the data a binary blob column. With Pandas, my current working solution is to save the data as an object array in a regular frame, load it all in, and then pull out the relevant rows based on my trials table.

This is slightly inefficient, since I'm reading my whole data table in one go, but it's the most workable/extendable scheme I have come up with.
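In outline, that workaround looks like this (a sketch; the 'valid' column and file layout are placeholders):

import pandas as pd

# Ragged per-trial data (None = invalid trial) saved as an object
# column in a regular (fixed-format) store: it round-trips via
# pickle, but cannot be a frame_table or queried on disk.
pd.DataFrame({'data': [None, [], [1.0, 2.0], [3.0]]}).to_hdf('scratch.h5', 'data')

# The trials table carries the selection metadata.
pd.DataFrame({'valid': [False, True, True, True]}).to_hdf('scratch.h5', 'trials')

# Read the whole data table in one go, then subset in memory.
data = pd.read_hdf('scratch.h5', 'data')
trials = pd.read_hdf('scratch.h5', 'trials')
good = data[trials['valid'].values]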

But I would most enthusiastically welcome a more canonical solution.

Thanks so much for all your time!

EDIT: Adding code (per Jeff's suggestion):

import pandas as pd, numpy

# numpy.empty returns uninitialized memory, which is why the values
# printed below look like garbage; the point is the lengths 1..10.
mydata = [numpy.empty(n) for n in range(1, 11)]

df = pd.DataFrame(mydata)

In [4]: df
Out[4]: 
                                                   0
0                               [1.28822975392e-231]
1           [1.28822975392e-231, -2.31584192385e+77]
2  [1.28822975392e-231, -1.49166823584e-154, 2.12...
3  [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4  [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5  [1.28822975392e-231, 1.49166823584e-154, 1.531...
6  [1.28822975392e-231, -2.68156174706e+154, 2.20...
7  [1.28822975392e-231, -2.68156174706e+154, 2.13...
8  [1.28822975392e-231, -1.3365130604e-315, 2.222...
9  [1.28822975392e-231, -1.33651054067e-315, 2.22...

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0    10  non-null values
dtypes: object(1)

df.to_hdf('test.h5','data')
--> OK

df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype
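
For reference, NaN-padding the same ragged arrays to a common width gives a pure-float frame that does serialize as a table (my own sketch, anticipating the suggestions below):

import numpy as np
import pandas as pd

mydata = [np.random.randn(n) for n in range(1, 11)]
width = max(len(a) for a in mydata)

# NaN-pad each array out to the maximum length -> a single float dtype.
padded = pd.DataFrame(
    [np.concatenate([a, np.full(width - len(a), np.nan)]) for a in mydata])

padded.to_hdf('test.h5', 'data1', table=True)   # now succeeds
# (later pandas versions spell this format='table')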
  • Why don't you share some code that shows the structure of your data, and show df.info()? Commented Aug 29, 2013 at 11:29
  • Maybe @Jeff can offer a way to do this the way you want. But this is an unidiomatic use of a DataFrame. I'd suggest setting it up as a DataFrame with 10 columns; variable-length columns (i.e., containing some NaNs) are easier to handle than variable-length rows. Commented Aug 29, 2013 at 17:06
  • Thanks Dan! I ended up doing that and having a separate 'stop' column indicating the real length of each array (NaNs mean yet another thing in my context). I was a little bummed to find out HDFStore does not support masked arrays. Commented Aug 29, 2013 at 17:32
  • Masked arrays should be transformed to frames directly; please show your code. Commented Aug 29, 2013 at 19:39
  • @DanAllan is exactly on point here: you need to use a DataFrame with a base scalar dtype (e.g. float), not object, which is not efficient at all. Commented Aug 29, 2013 at 19:40

1 Answer


Here's a simple example along the lines of what you have described:

In [16]: import pandas as pd; import numpy as np; from pandas import DataFrame; from numpy.random import randn

In [17]: df = DataFrame(randn(10,10))

In [18]: df.iloc[5:10,7:9] = np.nan

In [19]: df.iloc[7:10,4:9] = np.nan

In [22]: df.iloc[7:10,-1] = np.nan

In [23]: df
Out[23]: 
          0         1         2         3         4         5         6         7         8         9
0 -1.671523  0.277972 -1.217315 -1.390472  0.944464 -0.699266  0.348579  0.635009 -0.330561 -0.121996
1  0.239482 -0.050869  0.488322 -0.668864  0.125534 -0.159154  1.092619 -0.638932 -0.091755  0.291824
2  0.432216 -1.101879  2.082755 -0.500450  0.750278 -1.960032 -0.688064 -0.674892  3.225115  1.035806
3  0.775353 -1.320165 -0.180931  0.342537  2.009530  0.913223  0.581071 -1.111551  1.118720 -0.081520
4 -0.255524  0.143255 -0.230755 -0.306252  0.748510  0.367886 -1.032118  0.232410  1.415674 -0.420789
5 -0.850601  0.273439 -0.272923 -1.248670  0.041129  0.506832  0.878972       NaN       NaN  0.433333
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417  0.504146       NaN       NaN -0.635012
7 -0.241512  0.159100  0.223019 -0.750034       NaN       NaN       NaN       NaN       NaN       NaN
8 -1.511968 -0.391903  0.257445 -1.642250       NaN       NaN       NaN       NaN       NaN       NaN
9 -0.376762  0.977394  0.760578  0.964489       NaN       NaN       NaN       NaN       NaN       NaN

In [24]: df['stop'] = df.apply(lambda x: x.last_valid_index(), 1)

In [25]: df
Out[25]: 
          0         1         2         3         4         5         6         7         8         9  stop
0 -1.671523  0.277972 -1.217315 -1.390472  0.944464 -0.699266  0.348579  0.635009 -0.330561 -0.121996     9
1  0.239482 -0.050869  0.488322 -0.668864  0.125534 -0.159154  1.092619 -0.638932 -0.091755  0.291824     9
2  0.432216 -1.101879  2.082755 -0.500450  0.750278 -1.960032 -0.688064 -0.674892  3.225115  1.035806     9
3  0.775353 -1.320165 -0.180931  0.342537  2.009530  0.913223  0.581071 -1.111551  1.118720 -0.081520     9
4 -0.255524  0.143255 -0.230755 -0.306252  0.748510  0.367886 -1.032118  0.232410  1.415674 -0.420789     9
5 -0.850601  0.273439 -0.272923 -1.248670  0.041129  0.506832  0.878972       NaN       NaN  0.433333     9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417  0.504146       NaN       NaN -0.635012     9
7 -0.241512  0.159100  0.223019 -0.750034       NaN       NaN       NaN       NaN       NaN       NaN     3
8 -1.511968 -0.391903  0.257445 -1.642250       NaN       NaN       NaN       NaN       NaN       NaN     3
9 -0.376762  0.977394  0.760578  0.964489       NaN       NaN       NaN       NaN       NaN       NaN     3
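
The 'stop' column makes it easy to rebuild the ragged rows after reading; a minimal sketch (the slicing assumes 'stop' holds the label of the last valid column, per last_valid_index above):

# Recover the variable-length rows: keep columns 0..stop inclusive.
values = df.drop('stop', axis=1).values
ragged = [row[:int(stop) + 1].tolist() for row, stop in zip(values, df['stop'])]

A row with no valid entries at all would get stop = NaN from last_valid_index, so for the invalid-vs-empty distinction in the question a separate validity flag (or a sentinel like -1) is still needed.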

Note that in 0.12 you should use table=True rather than fmt='t' (this API is in the process of changing):

In [26]: df.to_hdf('test.h5','df',mode='w',fmt='t')

In [27]: pd.read_hdf('test.h5','df')
Out[27]: 
          0         1         2         3         4         5         6         7         8         9  stop
0 -1.671523  0.277972 -1.217315 -1.390472  0.944464 -0.699266  0.348579  0.635009 -0.330561 -0.121996     9
1  0.239482 -0.050869  0.488322 -0.668864  0.125534 -0.159154  1.092619 -0.638932 -0.091755  0.291824     9
2  0.432216 -1.101879  2.082755 -0.500450  0.750278 -1.960032 -0.688064 -0.674892  3.225115  1.035806     9
3  0.775353 -1.320165 -0.180931  0.342537  2.009530  0.913223  0.581071 -1.111551  1.118720 -0.081520     9
4 -0.255524  0.143255 -0.230755 -0.306252  0.748510  0.367886 -1.032118  0.232410  1.415674 -0.420789     9
5 -0.850601  0.273439 -0.272923 -1.248670  0.041129  0.506832  0.878972       NaN       NaN  0.433333     9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417  0.504146       NaN       NaN -0.635012     9
7 -0.241512  0.159100  0.223019 -0.750034       NaN       NaN       NaN       NaN       NaN       NaN     3
8 -1.511968 -0.391903  0.257445 -1.642250       NaN       NaN       NaN       NaN       NaN       NaN     3
9 -0.376762  0.977394  0.760578  0.964489       NaN       NaN       NaN       NaN       NaN       NaN     3
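
If 'stop' is also declared as a data column when writing, the table can be filtered on disk rather than after a full read; a sketch (string where-queries need a more recent pandas than 0.12):

# data_columns makes 'stop' queryable in a where clause.
df.to_hdf('test.h5', 'df', mode='w', table=True, data_columns=['stop'])

pd.read_hdf('test.h5', 'df', where='stop < 9')   # just the truncated rows (7-9)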

2 Comments

Thanks for the answer @Jeff, but the saved frame_table treats the missing values as NaNs. Perhaps I'm misusing the masked-array framework, but in my application I have both missing data and empty data: a row can be missing entirely (NaN) or have a variable number of data points (including zero). This is why a simple NaN is not working for me.
It's simple enough to have another column that tracks the actual length of each row, then. Otherwise you should just store these things separately, as they really are separate. Pandas aligns on the axes; if you are not using that, then you should keep your data some other way.
