6

To manage the amount of RAM I consume in doing an analysis, I have a large dataset stored in hdf5 (.h5) and I need to query this dataset efficiently using Pandas.

The data set contains user performance data for a suite of apps. I only want to pull a few fields out of the 40 possible, and then filter the resulting dataframe to only those users who are using a one of a few apps that interest me.

# list of apps I want to analyze
apps = ['a','d','f']

# Users.h5 contains only one field_table called 'df'
store = pd.HDFStore('Users.h5')

# the following query works fine
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1'])

# the following pseudo-query fails
df = store.select('df',columns=['account','metric1','metric2'],where=['Month==10','IsMessager==1', 'app in apps'])

I realize that the string 'app in apps' is not what I want. This is simply a SQL-like representation of what I hope to achieve. I cant seem to pass a list of strings in any way that I try, but there must be a way.

For now I am simply running the query without this parameter and then I filter out the apps I don't want in a subsequent step thusly

df = df[df['app'].isin(apps)]

But this is much less efficient since ALL of the apps need to first be loaded into memory before I can remove them. In some cases, this is big problem because I don't have enough memory to support the whole unfiltered df.

1 Answer 1

14

You are pretty close.

In [1]: df = DataFrame({'A' : ['foo','foo','bar','bar','baz'],
                        'B' : [1,2,1,2,1], 
                        'C' : np.random.randn(5) })

In [2]: df
Out[2]: 
     A  B         C
0  foo  1 -0.909708
1  foo  2  1.321838
2  bar  1  0.368994
3  bar  2 -0.058657
4  baz  1 -1.159151

[5 rows x 3 columns]

Write the store as a table (note that in 0.12 you will use table=True, rather than format='table'). Remember to specify the data_columns that you want to query when creating the table (or you can do data_columns=True)

In [3]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=['A','B'])

In [4]: pd.read_hdf('test.h5','df')
Out[4]: 
     A  B         C
0  foo  1 -0.909708
1  foo  2  1.321838
2  bar  1  0.368994
3  bar  2 -0.058657
4  baz  1 -1.159151

[5 rows x 3 columns]

Syntax in master/0.13, isin is accomplished via query_column=list_of_values. This is presented as a string to where.

In [8]: pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
Out[8]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]

Syntax in 0.12, this must be a list (which ands the conditions).

In [11]: pd.read_hdf('test.h5','df',where=[pd.Term('A','=',["foo","bar"]),'B=1'])
Out[11]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.