1

I read a random subset of rows from an HDF table with the following Python/Pandas code:

hdf_store = pd.HDFStore('path_to_data.h5')
total_rows = hdf_store.get_storer('hdf_table_name').nrows

num_rows = int(total_rows * .25)
row_indices = np.random.randint(0,rows_indices,size=num_rows)

my_df = pd.read_hdf(hdf_store, 'hdf_table_name', where=pd.Index(row_indices))

Later in the program, I would like to pull the rest of the rows of data from the HDF5 table. But the following threw errors:

rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
   where=pd.Index(not in (row_indices)))

rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
   where=not pd.Index(row_indices))

Is there a way to pull HDF rows by records not in a list of indices?

Because the table is larger than my RAM, I would like to avoid pulling all the rows from HDF upfront (even in chunks) and then splitting it to hold both tables at once. I could map the indices to another column, and subset on rows not in the mapped value of that column. But that would presumably be a lot slower than querying on the index directly.

1 Answer 1

1

You can use Index.difference method.

Demo:

# randomly select 25% of index elements (without duplicates `replace=False`)
sample_idx = np.random.choice(np.arange(total_rows), total_rows//4, replace=False)

# select remaining index elements
rest_idx = pd.Index(np.arange(total_rows)).difference(sample_idx)

# get rest rows by index
rest = store.select('hdf_table_name', where=rest_idx)

PS optionally you may want to select rest of rows in chunks...

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.