11

I have a huge HDF5 file, I want to load part of it in a pandas DataFrame to perform some operations, but I am interested in filtering some rows.

I can explain better with an example:

Original HDF5 file would look something like:

A    B    C    D
1    0    34   11
2    0    32   15
3    1    35   22
4    1    34   15
5    1    31   9
1    0    34   15
2    1    29   11
3    0    34   15
4    1    12   14
5    0    34   15
1    0    32   13
2    1    34   15
etc  etc  etc  etc

What I am trying to do is to load this, exactly as it is, to a pandas Dataframe but only where A==1 or 3 or 4

Until now I can just load the whole HDF5 using:

store = pd.HDFStore('Resutls2015_10_21.h5')
df = pd.DataFrame(store['results_table'])

I do not see how to include a where condition here.

2 Answers 2

11

The hdf5 file must be written in table format (as opposed to fixed format) in order to be queryable with pd.read_hdf's where argument.

Furthermore, A must be declared as a data_column:

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'],
          format='table')

or, to specify all columns as (queryable) data columns:

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=True,
          format='table')

Then you could use

pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]')

to select rows where the value column A is 1, 3 or 4. For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],
    'B': [0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    'C': [34, 32, 35, 34, 31, 34, 29, 34, 12, 34, 32, 34],
    'D': [11, 15, 22, 15, 9, 15, 11, 15, 14, 15, 13, 15]})

df.to_hdf('/tmp/out.h5', 'results_table', mode='w', data_columns=['A'],
          format='table')

print(pd.read_hdf('/tmp/out.h5', 'results_table', where='A in [1,3,4]'))

yields

    A  B   C   D
0   1  0  34  11
2   3  1  35  22
3   4  1  34  15
5   1  0  34  15
7   3  0  34  15
8   4  1  12  14
10  1  0  32  13

If you have a very long list of values, vals, then you could use string formatting to compose the right where argument:

where='A in {}'.format(vals)
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks unutbu, just some comments on this good answer. I understand that, at the begining of your answer you write the df to h5 in table format. However, the input to my script is a h5 already saved, how can I know if it is in the correct format or not?
If your h5 file is not in table format, then using pd.read_hdf with the where parameter will raise TypeError: cannot pass a where specification when reading from a Fixed format.... If the h5 file is in table format by A was not specified as a data_column, then you would get ValueError: The passed where expression: A in [1,3,4] contains an invalid variable reference....
I don't know of a quick/easy way to convert an h5 file from fixed to table format, or to add data_columns. As far as I know, you would have to read the entire h5 file into a DataFrame (or to do so in chunks using the chunksize parameter) and then to write it out or append to a different h5 file in table format.
1

You can do this using pandas.read_hdf (here), with the optional parameter of where.
For example: read_hdf('store_tl.h5', 'table', where = ['index>2'])

2 Comments

Thanks Dean, is it possible to include more complex conditions? like for example, if my A column has values from 1 to 100 and I want to select something random like [1,3,11,16,27,33,34,44,41,55,68,70,77,81,90]...I am asking that because in the main question I wanted to make it easy, but in my real case the where condition would need to be more complex
Found the documentation for the Expr you need to pass to where: nullege.com/codes/search/… (This might be helpful, though I don't see an option for filtering by containment)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.