
I am trying to query a multi-index table in a pandas HDFStore, but the query fails when it combines an index level and a data column in the where clause. This only happens when data_columns=True. Is this expected, and is there a way to avoid it without explicitly listing the data_columns?

See the following example; it seems the index levels are not recognized as valid query references:

import pandas as pd
import numpy as np

file_path = 'D:\\test_store.h5'
np.random.seed(1234)
pd.set_option('display.max_rows',4)
# simulate some data
index = pd.MultiIndex.from_product([np.arange(10000,10200),
                                    pd.date_range('19800101',periods=500)],
                                   names=['id','date'])
df = pd.DataFrame(dict(id2=np.random.randint(0, 1000, size=len(index)),
                       w=np.random.randn(len(index))),
                  index=index).reset_index().set_index(['id', 'date'])

# store the data
store = pd.HDFStore(file_path,mode='a',complib='blosc', complevel=9)
store.append('df_dc_None', df, data_columns=None)
store.append('df_dc_explicit', df, data_columns=['id2', 'w'])
store.append('df_dc_True', df, data_columns=True)
store.close()

# query the data
start = '19810201'
print(pd.read_hdf(file_path,'df_dc_None', where='date>start & id=10000'))
print(pd.read_hdf(file_path,'df_dc_True', where='id2>500'))
print(pd.read_hdf(file_path,'df_dc_explicit', where='date>start & id2>500'))
try:
    print(pd.read_hdf(file_path,'df_dc_True', where='date>start & id2>500'))
except ValueError as err:
    print(err)

1 Answer

It's an interesting question, indeed!

I can't explain the following difference: the index columns ('id' and 'date') are indexed when using data_columns=None (the default, per the HDFStore.append docstring), but they are not when using data_columns=True:

In [114]: store.get_storer('df_dc_None').table
Out[114]:
/df_dc_None/table (Table(100000,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": Float64Col(shape=(1,), dflt=0.0, pos=2),
  "date": Int64Col(shape=(), dflt=0, pos=3),
  "id": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (1820,)
  autoindex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [115]: store.get_storer('df_dc_True').table
Out[115]:
/df_dc_True/table (Table(100000,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "id2": Int32Col(shape=(), dflt=0, pos=3),
  "w": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (1820,)
  autoindex := True
  colindexes := {
    "w": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id2": Index(6, medium, shuffle, zlib(1)).is_csi=False}

NOTE: pay attention to colindexes in the output above.
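If you only want to compare which columns are PyTables-indexed, without scrolling through the full table repr, a small sketch like this (not part of the original session; it reuses file_path from the question and reads the colindexes dict shown above) prints just the indexed column names:

import pandas as pd

# Print the PyTables column indexes for each of the three tables stored above;
# Table.colindexes is the same mapping shown in the repr output.
with pd.HDFStore(file_path) as store:
    for key in ('df_dc_None', 'df_dc_explicit', 'df_dc_True'):
        print(key, sorted(store.get_storer(key).table.colindexes))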

But using the following simple hack we can "fix" this:

In [116]: store.append('df_dc_all', df, data_columns=df.head(1).reset_index().columns)

In [118]: store.get_storer('df_dc_all').table
Out[118]:
/df_dc_all/table (Table(100000,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "id": Int64Col(shape=(), dflt=0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "id2": Int32Col(shape=(), dflt=0, pos=3),
  "w": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (1820,)
  autoindex := True
  colindexes := {
    "w": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id2": Index(6, medium, shuffle, zlib(1)).is_csi=False}

check:

In [119]: pd.read_hdf(file_path,'df_dc_all', where='date>start & id2>500')
Out[119]:
                  id2         w
id    date
10000 1981-02-02  935  0.245637
      1981-02-04  994  0.291287
...               ...       ...
10199 1981-05-11  680 -0.370745
      1981-05-12  812 -0.880742

[10121 rows x 2 columns]
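If you prefer not to build the column list via head(1)/reset_index(), the same workaround can be spelled more explicitly (a sketch, reusing df and store from the session above; 'df_dc_all2' is just an illustrative key):

# Equivalent to the hack above: pass the index level names plus the
# regular columns explicitly as data_columns.
cols = list(df.index.names) + list(df.columns)   # ['id', 'date', 'id2', 'w']
store.append('df_dc_all2', df, data_columns=cols)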

4 Comments

Thanks for the hack; I suspect it's not supposed to work like this. I raised an issue with pandas and will see what they say. I guess we can close this one here.
@MMCM_, yep, it's interesting what the Pandas core team will say about this; I'll monitor your issue on GitHub...
seems like they'll fix it.
Fixed for pandas version >= 0.19.2.
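For reference, on pandas >= 0.19.2 (where, per the comment above, this was fixed) the table just needs to be (re)written with data_columns=True for the combined query to work; a minimal sketch, assuming that version and reusing df and file_path from the question ('df_dc_True_fixed' is just an illustrative key):

import pandas as pd

# Assumes pandas >= 0.19.2: with the fix, data_columns=True also makes the
# MultiIndex levels ('id', 'date') queryable, so no explicit list is needed.
with pd.HDFStore(file_path, mode='a', complib='blosc', complevel=9) as store:
    store.append('df_dc_True_fixed', df, data_columns=True)

start = '19810201'
print(pd.read_hdf(file_path, 'df_dc_True_fixed', where='date>start & id2>500'))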
