I loop over a list of rasters. From each raster I extract an array of data, and I want to use the raster's basename (a date) as an index for that array. For this I use a pandas DataFrame MultiIndex. The array, with its new index set, is then appended to an HDFStore, and the loop moves on to the raster for the next date.

Code snippets:

root, ext = os.path.splitext(raster)
name = int(decimal.Decimal(os.path.basename(root)))

array = ma.MaskedArray.compressed(raster)
arr2df = pd.DataFrame(pd.Series(data = array), columns=['rastervalue'])
arr2df['timestamp'] = pd.Series(name,index=arr2df.index)
arr2df.set_index('timestamp')
store.append('rastervalue',arr2df)

The DataFrame seems to be OK (by the way, how can I retrieve a MultiIndex?).

>>> arr2df
<class 'pandas.core.frame.DataFrame'>
  MultiIndex: 123901 entries, (0, 20060101) to (123900, 20060101)
  Data columns (total 1 columns):
  rastervalue    123901  non-null values
  dtypes:        int32(1)

But when I check the HDFStore, my MultiIndex seems to have disappeared and been replaced by "values_block_1":

>>> store.root.rastervalue.table.read
<bound method Table.read of /rastervalue/table (Table(12626172,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2)}
  byteorder := 'little'
  chunkshape := (3276,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}>

>>> store.root.rastervalue.table.read(field="values_block_1")
array([[20060101],
       [20060101],
       [20060101],
       ...,
       [ 20060914],
       [ 20060914],
       [ 20060914]], dtype=int64)

From reading the documentation I can't figure out how to store or change a MultiIndex in an HDFStore correctly. Any suggestions? Eventually I would like to query the table like this:

 store.select('rastervalue', [ pd.Term('index', '=', '20060101')])
  • Your use of the MaskedArray might be doing funny things with the index. Can you provide a reproducible example and/or show some of the frame you are trying to store (df.head(10) or something similar)? Commented Jun 25, 2013 at 11:35
  • I just noticed that the result of your set_index call is not assigned to anything; this is NOT an in-place operation (unless you pass inplace=True). Commented Jun 25, 2013 at 13:33
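
To illustrate the point about set_index made in the comment above, here is a minimal sketch. The three-element array is a hypothetical stand-in for the compressed raster values, and the level name "ivalue" is borrowed from the answer rather than from the question's code. It assumes that the intended result is a two-level index made of the row number and the date, which is what `append=True` produces.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the compressed raster values.
values = np.array([5, 7, 9], dtype="int32")
name = 20060101  # date parsed from the raster's file name

arr2df = pd.DataFrame({"rastervalue": values})
arr2df["timestamp"] = name

# set_index returns a new frame; without assignment the result is discarded.
arr2df.set_index("timestamp")
unchanged = list(arr2df.columns)  # "timestamp" is still an ordinary column

# Assign the result, and pass append=True to keep the existing row index
# as the first level, which yields a two-level MultiIndex.
indexed = arr2df.set_index("timestamp", append=True)
indexed = indexed.rename_axis(["ivalue", "timestamp"])
```

The MultiIndex itself is available as `indexed.index`, and a single level can be read with `indexed.index.get_level_values("timestamp")`.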

1 Answer

Here is a working example.

In [43]: df = DataFrame(dict(ivalue = range(123901), date = 20060101, 
              value = Series([1]*123901,dtype='int32'))).set_index(['ivalue','date'])

In [44]: df
Out[44]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 123901 entries, (0, 20060101) to (123900, 20060101)
Data columns (total 1 columns):
value    123901  non-null values
dtypes: int32(1)

In [45]: df.head()
Out[45]: 
                 value
ivalue date           
0      20060101      1
1      20060101      1
2      20060101      1
3      20060101      1
4      20060101      1

In [46]: store = pd.HDFStore('test.h5',mode='w')

In [48]: store.append('df',df)

In [49]: store
Out[49]: 
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df            frame_table  (typ->appendable_multi,nrows->123901,ncols->3,indexers->[index],dc->[date,ivalue])

In [50]: store.get_storer('df')
Out[50]: frame_table  (typ->appendable_multi,nrows->123901,ncols->3,indexers->[index],dc->[date,ivalue])

In [51]: store.get_storer('df').attrs
Out[51]: 
/df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['date', 'ivalue'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {'index': {}},
    levels := ['ivalue', 'date'],
    nan_rep := 'nan',
    non_index_axes := [(1, ['ivalue', 'date', 'value'])],
    pandas_type := u'frame_table',
    pandas_version := '0.10.1',
    table_type := u'appendable_multiframe',
    values_cols := ['values_block_0', 'date', 'ivalue']]

In [52]: store.get_storer('df').table
Out[52]: 
/df/table (Table(123901,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "ivalue": Int64Col(shape=(), dflt=0, pos=3)}
  byteorder := 'little'
  chunkshape := (2340,)
  autoIndex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "ivalue": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

6 Comments

I see, my mistake was already in setting the multi-index on the DataFrame. It makes sense then that the HDFStore doesn't like the input. Thanks for explaining that set_index is not an in-place operation; I hadn't realized that. By the way, I no longer need it, since your working example works like a charm. I hope one day to be able to answer questions on SO like you do :). Keep it up.
No problem, glad it worked out. You might find the cookbook useful: pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore
What does is_CSI=False mean?
OK, Completely Sorted Index. So does this mean that indexing is slower than it could be because this is set to False?
Read this section: pytables.github.io/usersguide/optimization.html; it is rarely necessary to create a CSI, and it is pretty time-consuming to do so (and worse, it can only be done for one index). But if you want to experiment (and have lots of time), it might pay off.