
I am trying to query a multi-index table in a pandas HDFStore, but the query fails when it combines an index level and a data column in the where clause. This only happens when data_columns=True. Is this expected, and is there a way to avoid it without explicitly listing the data_columns?

See the following example; it seems the index levels are not recognized as valid query references:

import pandas as pd
import numpy as np

file_path = 'D:\\test_store.h5'
np.random.seed(1234)
pd.set_option('display.max_rows',4)
# simulate some data
index = pd.MultiIndex.from_product([np.arange(10000,10200),
                                    pd.date_range('19800101',periods=500)],
                                   names=['id','date'])
df = pd.DataFrame(dict(id2=np.random.randint(0, 1000, size=len(index)),
                       w=np.random.randn(len(index))),
                  index=index).reset_index().set_index(['id', 'date'])

# store the data
store = pd.HDFStore(file_path,mode='a',complib='blosc', complevel=9)
store.append('df_dc_None', df, data_columns=None)
store.append('df_dc_explicit', df, data_columns=['id2', 'w'])
store.append('df_dc_True', df, data_columns=True)
store.close()

# query the data
start = '19810201'
print(pd.read_hdf(file_path,'df_dc_None', where='date>start & id=10000'))
print(pd.read_hdf(file_path,'df_dc_True', where='id2>500'))
print(pd.read_hdf(file_path,'df_dc_explicit', where='date>start & id2>500'))
try:
    print(pd.read_hdf(file_path,'df_dc_True', where='date>start & id2>500'))
except ValueError as err:
    print(err)

1 Answer

It's an interesting question, indeed!

I can't explain the following difference: the index columns ('id' and 'date') are indexed when using data_columns=None (the default, per the HDFStore.append docstring), but they are not when using data_columns=True:

In [114]: store.get_storer('df_dc_None').table
Out[114]:
/df_dc_None/table (Table(100000,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int32Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": Float64Col(shape=(1,), dflt=0.0, pos=2),
  "date": Int64Col(shape=(), dflt=0, pos=3),
  "id": Int64Col(shape=(), dflt=0, pos=4)}
  byteorder := 'little'
  chunkshape := (1820,)
  autoindex := True
  colindexes := {
    "date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [115]: store.get_storer('df_dc_True').table
Out[115]:
/df_dc_True/table (Table(100000,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Int64Col(shape=(1,), dflt=0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "id2": Int32Col(shape=(), dflt=0, pos=3),
  "w": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (1820,)
  autoindex := True
  colindexes := {
    "w": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id2": Index(6, medium, shuffle, zlib(1)).is_csi=False}

NOTE: pay attention to colindexes in the output above.
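If you only want to compare which columns are PyTables-indexed, without scrolling through the full table repr, a small sketch like this (not part of the original session; it reuses file_path from the question and reads the colindexes dict shown above) prints just the indexed column names:

import pandas as pd

# Print the PyTables column indexes for each of the three tables stored above;
# Table.colindexes is the same mapping shown in the repr output.
with pd.HDFStore(file_path) as store:
    for key in ('df_dc_None', 'df_dc_explicit', 'df_dc_True'):
        print(key, sorted(store.get_storer(key).table.colindexes))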

But using the following simple hack we can "fix" this:

In [116]: store.append('df_dc_all', df, data_columns=df.head(1).reset_index().columns)

In [118]: store.get_storer('df_dc_all').table
Out[118]:
/df_dc_all/table (Table(100000,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "id": Int64Col(shape=(), dflt=0, pos=1),
  "date": Int64Col(shape=(), dflt=0, pos=2),
  "id2": Int32Col(shape=(), dflt=0, pos=3),
  "w": Float64Col(shape=(), dflt=0.0, pos=4)}
  byteorder := 'little'
  chunkshape := (1820,)
  autoindex := True
  colindexes := {
    "w": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "id2": Index(6, medium, shuffle, zlib(1)).is_csi=False}

check:

In [119]: pd.read_hdf(file_path,'df_dc_all', where='date>start & id2>500')
Out[119]:
                  id2         w
id    date
10000 1981-02-02  935  0.245637
      1981-02-04  994  0.291287
...               ...       ...
10199 1981-05-11  680 -0.370745
      1981-05-12  812 -0.880742

[10121 rows x 2 columns]
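If you prefer not to build the column list via head(1)/reset_index(), the same workaround can be spelled more explicitly (a sketch, reusing df and store from the session above; 'df_dc_all2' is just an illustrative key):

# Equivalent to the hack above: pass the index level names plus the
# regular columns explicitly as data_columns.
cols = list(df.index.names) + list(df.columns)   # ['id', 'date', 'id2', 'w']
store.append('df_dc_all2', df, data_columns=cols)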

4 Comments

Thanks for the hack; I suspect it's not supposed to work like this. I raised an issue with pandas and will see what they say. I guess we can close this one here.
@MMCM_, yep, it's interesting what the Pandas core team will say about this; I'll monitor your issue on GitHub...
seems like they'll fix it.
Fixed for pandas version >= 0.19.2.
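For reference, on pandas >= 0.19.2 (where, per the comment above, this was fixed) the table just needs to be (re)written with data_columns=True for the combined query to work; a minimal sketch, assuming that version and reusing df and file_path from the question ('df_dc_True_fixed' is just an illustrative key):

import pandas as pd

# Assumes pandas >= 0.19.2: with the fix, data_columns=True also makes the
# MultiIndex levels ('id', 'date') queryable, so no explicit list is needed.
with pd.HDFStore(file_path, mode='a', complib='blosc', complevel=9) as store:
    store.append('df_dc_True_fixed', df, data_columns=True)

start = '19810201'
print(pd.read_hdf(file_path, 'df_dc_True_fixed', where='date>start & id2>500'))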
