I have a Python class SceneGenerator that has several member functions for preprocessing and a generator function generate_data(). The basic structure is like this:
class SceneGenerator(object):
    def __init__(self, some_params):
        # some inits
        pass

    def generate_data(self):
        """
        Generator. Yield data X and labels y after some preprocessing.
        """
        while True:
            # opening files, selecting data
            X, y = self.preprocess(some_params, filenames, ...)
            yield X, y
I used the member function sceneGenerator.generate_data() with Keras' model.fit_generator() to read the data from disk, preprocess it, and yield it. In Keras this runs on multiple CPU workers if the workers parameter of model.fit_generator() is set to a value greater than 1.
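For context, this is roughly what that Keras setup looks like (the model, steps_per_epoch, epochs and worker count below are placeholders, not values from my actual code):

sceneGenerator = SceneGenerator(some_params)
model.fit_generator(sceneGenerator.generate_data(),
                    steps_per_epoch=1000,       # placeholder value
                    epochs=10,                  # placeholder value
                    workers=4,                  # > 1 enables parallel data loading
                    use_multiprocessing=False)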
I now want to use the same SceneGenerator class in TensorFlow. My current approach is this:
sceneGenerator = SceneGenerator(some_params)
for X, y in sceneGenerator.generate_data():
    feed_dict = {ops['data']: X,
                 ops['labels']: y,
                 ops['is_training_pl']: True}
    _, loss, prediction = sess.run([optimization_op, loss_op, pred_op],
                                   feed_dict=feed_dict)
This, however, is slow and does not use multiple threads. I found the tf.data.Dataset API and some documentation, but I am failing to implement it for my case.
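For reference, this is roughly what I imagine wrapping the generator with tf.data.Dataset.from_generator would look like. This is only a sketch: the dtypes, shapes and prefetch size are assumptions and would have to match whatever preprocess() actually yields.

import tensorflow as tf

sceneGenerator = SceneGenerator(some_params)

# Wrap the existing generator; output_types/output_shapes are assumptions.
dataset = tf.data.Dataset.from_generator(
    sceneGenerator.generate_data,
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape([None, 128]), tf.TensorShape([None])))
dataset = dataset.prefetch(4)            # overlap data preparation with training

iterator = dataset.make_one_shot_iterator()
X_batch, y_batch = iterator.get_next()   # build the model on these tensors
                                         # instead of feeding placeholders

As far as I understand, from_generator alone still drives the Python generator from a single thread, which is why the question about parallelism remains (see Edit 2 below).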
Edit: Note that I do not work with images, so the image-loading mechanisms based on file paths etc. do not apply here.
My SceneGenerator loads data from HDF5 files, and not complete datasets but, depending on the initialization parameters, only parts of a dataset. I would love to keep the generator function as it is, and to learn how this generator can be used directly as input for TensorFlow and run on multiple CPU threads. Rewriting the data from the HDF5 files to CSV is not a good option because it would duplicate a lot of data.
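To illustrate the kind of access pattern I mean, here is a rough sketch of partial reads from an HDF5 file with h5py; the file name, dataset keys and slice bounds are made up, the real selection depends on the initialization parameters:

import h5py

# Sketch of a SceneGenerator method; self.start / self.end are hypothetical
# attributes derived from the initialization parameters.
def generate_data(self):
    while True:
        with h5py.File('scenes.h5', 'r') as f:
            X = f['features'][self.start:self.end]   # h5py reads only this slice from disk
            y = f['labels'][self.start:self.end]
        X, y = self.preprocess(X, y)
        yield X, y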
Edit 2: I think something similar to this could help: parallelising tf.data.Dataset.from_generator
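Along the lines of that link, this is roughly the pattern I have in mind: several generator instances, one per shard, interleaved by tf.data. The shard argument to SceneGenerator, the dtypes and the shard count are assumptions, and num_parallel_calls on interleave needs a newer TF version (older TF 1.x releases had tf.contrib.data.parallel_interleave instead):

import tensorflow as tf

NUM_SHARDS = 4  # hypothetical number of parallel generator instances

# Assumption: SceneGenerator can be told to yield only its shard of the data.
generators = [SceneGenerator(some_params, shard=i) for i in range(NUM_SHARDS)]

def make_shard_dataset(i):
    # i arrives as a tensor from Dataset.range; args=(i,) hands it to the
    # Python callable as a plain value at runtime.
    return tf.data.Dataset.from_generator(
        lambda idx: generators[idx].generate_data(),
        output_types=(tf.float32, tf.int32),
        args=(i,))

dataset = (tf.data.Dataset.range(NUM_SHARDS)
           .interleave(make_shard_dataset,
                       cycle_length=NUM_SHARDS,
                       num_parallel_calls=NUM_SHARDS)
           .prefetch(8))

I am not sure how much true parallelism this gives, though, since the Python generators still share the GIL unless most of their time is spent in I/O or numpy.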