I have a Python class SceneGenerator that has several member functions for preprocessing and a generator function generate_data(). The basic structure is like this:
class SceneGenerator(object):
    def __init__(self, some_params):
        # some inits
        pass

    def generate_data(self):
        """
        Generator. Yield data X and labels y after some preprocessing.
        """
        while True:
            # opening files, selecting data
            X, y = self.preprocess(some_params, filenames, ...)
            yield X, y
I used the member function sceneGenerator.generate_data() with Keras' model.fit_generator() to read the data from disk, preprocess it, and yield it. In Keras this runs on multiple CPU workers if the workers parameter of model.fit_generator() is set to a value greater than 1.
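For context, this is roughly what that Keras setup looks like (the model, steps_per_epoch, epochs and worker count below are placeholders, not values from my actual code):

sceneGenerator = SceneGenerator(some_params)
model.fit_generator(sceneGenerator.generate_data(),
                    steps_per_epoch=1000,       # placeholder value
                    epochs=10,                  # placeholder value
                    workers=4,                  # > 1 enables parallel data loading
                    use_multiprocessing=False)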
I now want to use the same SceneGenerator class in TensorFlow. My current approach is this:
sceneGenerator = SceneGenerator(some_params)
for X, y in sceneGenerator.generate_data():
    feed_dict = {ops['data']: X,
                 ops['labels']: y,
                 ops['is_training_pl']: True}
    _, loss, prediction = sess.run([optimization_op, loss_op, pred_op],
                                   feed_dict=feed_dict)
This, however, is slow and does not use multiple threads. I found the tf.data.Dataset API and some documentation, but I am failing to implement it for my case.
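For reference, this is roughly what I imagine wrapping the generator with tf.data.Dataset.from_generator would look like. This is only a sketch: the dtypes, shapes and prefetch size are assumptions and would have to match whatever preprocess() actually yields.

import tensorflow as tf

sceneGenerator = SceneGenerator(some_params)

# Wrap the existing generator; output_types/output_shapes are assumptions.
dataset = tf.data.Dataset.from_generator(
    sceneGenerator.generate_data,
    output_types=(tf.float32, tf.int32),
    output_shapes=(tf.TensorShape([None, 128]), tf.TensorShape([None])))
dataset = dataset.prefetch(4)            # overlap data preparation with training

iterator = dataset.make_one_shot_iterator()
X_batch, y_batch = iterator.get_next()   # build the model on these tensors
                                         # instead of feeding placeholders

As far as I understand, from_generator alone still drives the Python generator from a single thread, which is why the question about parallelism remains (see Edit 2 below).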
Edit: Note that I do not work with images, so the image-loading mechanisms based on file paths etc. do not apply here.
My SceneGenerator loads data from HDF5 files, and not complete datasets but, depending on the initialization parameters, only parts of a dataset. I would love to keep the generator function as it is, and to learn how this generator can be used directly as input for TensorFlow and run on multiple CPU threads. Rewriting the data from the HDF5 files to CSV is not a good option because it would duplicate a lot of data.
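To illustrate the kind of access pattern I mean, here is a rough sketch of partial reads from an HDF5 file with h5py; the file name, dataset keys and slice bounds are made up, the real selection depends on the initialization parameters:

import h5py

# Sketch of a SceneGenerator method; self.start / self.end are hypothetical
# attributes derived from the initialization parameters.
def generate_data(self):
    while True:
        with h5py.File('scenes.h5', 'r') as f:
            X = f['features'][self.start:self.end]   # h5py reads only this slice from disk
            y = f['labels'][self.start:self.end]
        X, y = self.preprocess(X, y)
        yield X, y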
Edit 2: I think something similar to this could help: parallelising tf.data.Dataset.from_generator
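Along the lines of that link, this is roughly the pattern I have in mind: several generator instances, one per shard, interleaved by tf.data. The shard argument to SceneGenerator, the dtypes and the shard count are assumptions, and num_parallel_calls on interleave needs a newer TF version (older TF 1.x releases had tf.contrib.data.parallel_interleave instead):

import tensorflow as tf

NUM_SHARDS = 4  # hypothetical number of parallel generator instances

# Assumption: SceneGenerator can be told to yield only its shard of the data.
generators = [SceneGenerator(some_params, shard=i) for i in range(NUM_SHARDS)]

def make_shard_dataset(i):
    # i arrives as a tensor from Dataset.range; args=(i,) hands it to the
    # Python callable as a plain value at runtime.
    return tf.data.Dataset.from_generator(
        lambda idx: generators[idx].generate_data(),
        output_types=(tf.float32, tf.int32),
        args=(i,))

dataset = (tf.data.Dataset.range(NUM_SHARDS)
           .interleave(make_shard_dataset,
                       cycle_length=NUM_SHARDS,
                       num_parallel_calls=NUM_SHARDS)
           .prefetch(8))

I am not sure how much true parallelism this gives, though, since the Python generators still share the GIL unless most of their time is spent in I/O or numpy.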