I read a random subset of rows from an HDF table with the following Python/Pandas code:
hdf_store = pd.HDFStore('path_to_data.h5')
total_rows = hdf_store.get_storer('hdf_table_name').nrows
num_rows = int(total_rows * .25)
row_indices = np.random.randint(0,rows_indices,size=num_rows)
my_df = pd.read_hdf(hdf_store, 'hdf_table_name', where=pd.Index(row_indices))
Later in the program, I would like to pull the rest of the rows of data from the HDF5 table. But the following threw errors:
rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
where=pd.Index(not in (row_indices)))
rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
where=not pd.Index(row_indices))
Is there a way to pull HDF rows by records not in a list of indices?
Because the table is larger than my RAM, I would like to avoid pulling all the rows from HDF upfront (even in chunks) and then splitting it to hold both tables at once. I could map the indices to another column, and subset on rows not in the mapped value of that column. But that would presumably be a lot slower than querying on the index directly.