(Editing to clarify my application, sorry for any confusion)
I run an experiment broken up into trials. Each trial can produce invalid data or valid data. When there is valid data the data take the form of a list of numbers which can be of zero length.
So an invalid trial produces None, and a valid trial can produce [], [1,2], and so on.
Ideally, I'd like to be able to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted into a frame_table and which I use as a selector to extract rows (trials). I would then like to pull up my data using select_as_multiple.
Right now, I'm saving the data structure in the regular (non-table) format, since it is an object array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable-length nature of the data.
I understand that I can use NaNs and make a (potentially very wide) table whose max width is the maximum length of my data array, but then I need a different mechanism to flag invalid trials. A row with all NaNs is confusing - does it mean that I had a zero length data trial or did I have an invalid trial?
I think there is no good solution to this using Pandas. The NaN solution leads me to potentially extremely wide tables and an additional column marking valid/invalid trials.
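To make the ambiguity concrete, here is a minimal sketch of the NaN-padded layout with a separate validity column. The sample trials and the names `raw_trials`, `pad`, `wide` and `valid` are made up for illustration; this is not code from my actual experiment.

```python
import numpy as np
import pandas as pd

# Hypothetical trial results: None = invalid trial, a list (possibly empty) = valid trial.
raw_trials = [None, [], [1.0, 2.0], [3.0]]

# Width of the table is the longest valid trial.
max_len = max(len(t) for t in raw_trials if t is not None)

def pad(trial, width):
    """Return the trial's values padded with NaN up to `width` entries."""
    values = [] if trial is None else list(trial)
    return values + [np.nan] * (width - len(values))

wide = pd.DataFrame(
    [pad(t, max_len) for t in raw_trials],
    columns=[f"v{i}" for i in range(max_len)],
)

# Without this extra column, invalid and zero-length trials look identical.
wide["valid"] = [t is not None for t in raw_trials]

# Rows whose value columns are entirely NaN: both the invalid and the empty trial.
all_nan = wide.drop(columns="valid").isna().all(axis=1)
```

Here rows 0 (invalid) and 1 (valid but empty) are both all-NaN in the value columns, and only the `valid` column tells them apart.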
If I used a database, I would make the data a binary blob column. With Pandas, my current working solution is to save data as an object array in a regular frame, load it all in, and then pull out the relevant indexes based on my trials table.
This is slightly inefficient, since I'm reading my whole data table in one go, but it's the most workable/extendable scheme I have come up with.
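Roughly, the workaround looks like the sketch below. The `condition` column, the sample values and the variable names are hypothetical, and the round-trip through the HDF5 file (saving without table=True and reading the whole frame back) is left out so that only the selection step is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical per-trial metadata; this is the table I use as a selector.
trials = pd.DataFrame({"condition": ["a", "b", "a", "b"]})

# One object cell per trial: None for an invalid trial,
# a float array (possibly of zero length) for a valid one.
raw = [None, [], [1.0, 2.0], [3.0]]
cells = np.empty(len(raw), dtype=object)
for i, t in enumerate(raw):
    cells[i] = None if t is None else np.asarray(t, dtype=float)

data = pd.DataFrame({"values": cells}, index=trials.index)

# The whole data frame is already in memory; pick rows via the trials table.
wanted = trials.index[trials["condition"] == "a"]
selected = data.loc[wanted, "values"]

# Invalid trials are still distinguishable from empty ones.
valid = [v for v in selected if v is not None]
```

With this layout an invalid trial stays None and an empty trial stays a zero-length array, which is the distinction the NaN-padded table loses.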
But I welcome most enthusiastically a more canonical solution.
Thanks so much for all your time!
EDIT: Adding code (Jeff's suggestion)
import pandas as pd, numpy
mydata = [numpy.empty(n) for n in range(1,11)]
df = pd.DataFrame(mydata)
In [4]: df
Out[4]:
0
0 [1.28822975392e-231]
1 [1.28822975392e-231, -2.31584192385e+77]
2 [1.28822975392e-231, -1.49166823584e-154, 2.12...
3 [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4 [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5 [1.28822975392e-231, 1.49166823584e-154, 1.531...
6 [1.28822975392e-231, -2.68156174706e+154, 2.20...
7 [1.28822975392e-231, -2.68156174706e+154, 2.13...
8 [1.28822975392e-231, -1.3365130604e-315, 2.222...
9 [1.28822975392e-231, -1.33651054067e-315, 2.22...
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0 10 non-null values
dtypes: object(1)
df.to_hdf('test.h5','data')
--> OK
df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype