I'm processing a large number of files in Python and need to write the output (one DataFrame for each input file) to HDF5 directly.
I am wondering what the best way is to write a pandas DataFrame from my script to HDF5 directly in a fast way. I am not sure if any Python module like hdf5 or hadoopy can do this. Any help in this regard will be appreciated.
- matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2 – Nehal J Wani, Aug 12, 2016 at 10:50
- Nickil suggested an edit to change HDFS to HDF5 (and then answered based on this), but both HDFS and HDF5 seem to make sense in the context of your question... which did you mean? – Foon, Aug 12, 2016 at 11:43
1 Answer
It's difficult to give you a good answer to this rather generic question.
It's not clear how you are going to use (read) your HDF5 files - do you want to select data conditionally (using the where parameter)?
First of all, you need to open a store object:
import pandas as pd

store = pd.HDFStore('/path/to/filename.h5')
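As a side note, here is a minimal sketch (using the same hypothetical path as above) showing that the compression settings can also be passed once when opening the store, with a with block handling the close for you:

# complevel/complib set at store level apply to everything written to this store;
# the `with` block closes the store automatically
with pd.HDFStore('/path/to/filename.h5', complevel=5, complib='blosc') as store:
    # append DataFrames here exactly as in the loop below; no explicit close() needed
    ...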
Now you can write (or append) to the store (I'm using blosc compression here - it's pretty fast and efficient). Besides that, I will use the data_columns parameter to specify the columns that must be indexed, so you can use these columns in the where parameter later when you read your HDF5 file:
for f in files:
    # read or process each file in/into a separate `df`
    store.append('df_identifier_AKA_key', df,
                 data_columns=[list_of_indexed_cols],
                 complevel=5, complib='blosc')
store.close()
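To show why indexing those columns pays off, here is a minimal read-back sketch; some_indexed_col is a hypothetical column name assumed to be in your data_columns list:

# later: pull back only the rows matching a condition on an indexed column,
# instead of loading the whole table into memory
store = pd.HDFStore('/path/to/filename.h5')
subset = store.select('df_identifier_AKA_key', where='some_indexed_col > 100')
store.close()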