Memory leak using variable len feature in tensorflow

Question

We have a TensorFlow application in which we feed data via queues in batches of 250. After switching from FixedLenFeature to VarLenFeature, we started seeing a memory leak during training: memory usage increases constantly. We train our models on GPU machines.

This is the decode code:

@staticmethod
def decode(serialized_example):
    features = tf.parse_example(
        serialized_example,
        # Defaults are not specified since both keys are required.
        features={
            # target_features
            RECS: tf.VarLenFeature(tf.float32),
            CLICK: tf.FixedLenFeature([], tf.float32)
        })
    return features

Then we convert the sparse tensor to a dense one using:

tf.identity(tf.sparse_tensor_to_dense(tensor), name=key)

and then we loop over the batches via TensorFlow queues.

This is the create queue code:

@staticmethod
def create_queue(tensors, capacity, shuffle=False, min_after_dequeue=None, seed=None,
                 enqueue_many=False, shapes=None, shared_name=None, name=None):
    tensor_list = _as_tensor_list(tensors)
    with ops.name_scope(name, "shuffle_batch_queue", list(tensor_list)):
        tensor_list = _validate(tensor_list)

        tensor_list, sparse_info = _store_sparse_tensors(
            tensor_list, enqueue_many, tf.constant(True))
        map_op = [x.map_op for x in sparse_info]
        types = _dtypes([tensor_list])
        shapes = _shapes([tensor_list], shapes, enqueue_many)

        queue = data_flow_ops.RandomShuffleQueue(
            capacity=capacity, min_after_dequeue=min_after_dequeue, seed=seed,
            dtypes=types, shapes=shapes, shared_name=shared_name)

    return queue, sparse_info, map_op

And the enqueue operation is:

@staticmethod
def enqueue(queue, tensors, num_threads, enqueue_many=False, name=None, map_op=None):
    tensor_list = _as_tensor_list(tensors)
    with ops.name_scope(name, "shuffle_batch_enqueue", list(tensor_list)):
        tensor_list = _validate(tensor_list)
        tensor_list, sparse_info = _store_sparse_tensors(
            tensor_list, enqueue_many, tf.constant(True), map_op)
        _enqueue(queue, tensor_list, num_threads, enqueue_many, tf.constant(True))
    return queue, sparse_info

Eugene Brevdo · Accepted Answer · 2017-09-26 17:20:38Z

Can you provide a minimal example? For instance, do you still see the memory leak if you just run the example parsing over and over via multiple session.run calls, without any queues?

The reason I ask is that _store_sparse_tensors is private to that file for a reason: if you misuse it, you will hit a memory leak, so all callers of this function must be very careful to use it correctly. Every sparse tensor stored via _store_sparse_tensors must be restored via _restore_sparse_tensors. If it is not, you will leak memory.
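The store/restore pairing described above can be illustrated with a toy model in plain Python. This is only a sketch of the bookkeeping rule, not TensorFlow code: `ToySparseStore` and its methods are hypothetical names, and it assumes stored values are held in a container until they are read back, which is how the answer describes the behavior rather than a statement about TensorFlow's internals.

```python
class ToySparseStore:
    """Toy stand-in for a handle-keyed container (hypothetical, not TensorFlow)."""

    def __init__(self):
        self._entries = {}
        self._next_handle = 0

    def store(self, value):
        # Keep the value alive and hand back a small handle to it.
        handle = self._next_handle
        self._next_handle += 1
        self._entries[handle] = value
        return handle

    def restore(self, handle):
        # Reading a value back also releases it from the container.
        return self._entries.pop(handle)

    def __len__(self):
        return len(self._entries)


# Every store is matched by a restore: the container stays empty.
balanced = ToySparseStore()
for step in range(1000):
    handle = balanced.store([float(step)] * 10)
    balanced.restore(handle)

# Stores are never restored: the container grows with every step.
leaky = ToySparseStore()
for step in range(1000):
    leaky.store([float(step)] * 10)

print(len(balanced), len(leaky))
```

In this model, the unmatched `store` calls are what accumulate. The analogous situation in the question would be sparse tensors that are stored but never restored.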

I'm considering a DT_VARIANT storage format to replace this wrapper, but for now I'd recommend against using these functions yourself. Instead, you can probably do what you want using the new tf.contrib.data (soon to be tf.data) libraries!
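A minimal sketch of what the tf.data route might look like, assuming a TensorFlow 2.x installation with eager execution. It uses the newer names `tf.data`, `tf.io.parse_example`, and `tf.sparse.to_dense` rather than the `tf.parse_example` and `tf.contrib.data` names from this thread. The feature keys `"recs"` and `"click"` and the in-memory sample data are assumptions, since the question does not show the values of RECS and CLICK.

```python
import tensorflow as tf

RECS = "recs"    # hypothetical key; the question does not show its value
CLICK = "click"  # hypothetical key


def make_example(recs, click):
    """Serialize one tf.train.Example with a variable-length float feature."""
    feature = {
        RECS: tf.train.Feature(float_list=tf.train.FloatList(value=recs)),
        CLICK: tf.train.Feature(float_list=tf.train.FloatList(value=[click])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def decode(serialized_batch):
    """Parse a batch of serialized examples and densify the sparse feature."""
    parsed = tf.io.parse_example(
        serialized_batch,
        features={
            RECS: tf.io.VarLenFeature(tf.float32),
            CLICK: tf.io.FixedLenFeature([], tf.float32),
        })
    # Convert to dense inside the map function; missing entries are padded with 0.
    parsed[RECS] = tf.sparse.to_dense(parsed[RECS])
    return parsed


serialized = [make_example([1.0, 2.0, 3.0], 1.0), make_example([4.0], 0.0)]
dataset = tf.data.Dataset.from_tensor_slices(serialized).batch(2).map(decode)

batch = next(iter(dataset))
print(batch[RECS].numpy())
print(batch[CLICK].numpy())
```

The sparse-to-dense conversion happens inside `decode`, so the dataset only ever yields dense tensors and no private queue helpers are involved.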


8 Comments

ofer-a Over a year ago

Before we used VarLenFeature there was no memory leak. The only change was that I didn't pass map_op in the enqueue method, i.e.: tensor_list, sparse_info = _store_sparse_tensors(tensor_list, enqueue_many, tf.constant(True))

Eugene Brevdo Over a year ago

Right. Can you tell me if you have a memory leak if you are just parsing the VarLenFeature -- not using _store_sparse_tensors?

ofer-a Over a year ago

Not using _store_sparse_tensors, or using it without the map_op from the previous call, does not work. I think I'll try your suggestion to use tf.contrib.data. Do you know if it supports multithreaded processing? I have multiple HDFS files with TFRecords that I want to feed in parallel.

Eugene Brevdo Over a year ago

Yes, there are arguments to the .map function to enable parallelism. See also the interleave and prefetch methods in the TF nightlies.
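As a rough sketch of the parallel reading mentioned above, assuming TensorFlow 2.x with eager execution: the example writes two small temporary TFRecord files, interleaves them, parses records with a parallel `map`, and prefetches. The file names, the `"recs"` key, and the parallelism values are all hypothetical, and local temporary files stand in for the HDFS files in the question.

```python
import os
import tempfile

import tensorflow as tf


def write_records(path, values):
    """Write one single-value Example per entry to a TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for v in values:
            feature = {"recs": tf.train.Feature(
                float_list=tf.train.FloatList(value=[v]))}
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


def parse_one(serialized):
    """Parse one record and return its variable-length feature as dense."""
    parsed = tf.io.parse_single_example(
        serialized, {"recs": tf.io.VarLenFeature(tf.float32)})
    return tf.sparse.to_dense(parsed["recs"])


tmp_dir = tempfile.mkdtemp()
paths = [os.path.join(tmp_dir, "part-%d.tfrecord" % i) for i in range(2)]
write_records(paths[0], [1.0, 2.0])
write_records(paths[1], [3.0, 4.0, 5.0])

dataset = (
    tf.data.Dataset.from_tensor_slices(paths)
    # Read from both files concurrently, alternating between them.
    .interleave(tf.data.TFRecordDataset, cycle_length=2, num_parallel_calls=2)
    # Parse records on two parallel calls.
    .map(parse_one, num_parallel_calls=2)
    # Prepare upcoming elements while the current one is consumed.
    .prefetch(4)
)

values = sorted(float(t.numpy()[0]) for t in dataset)
print(values)
```

The values are sorted before printing because interleaving changes the order in which records from different files arrive.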

ofer-a Over a year ago

I tried tf.contrib.data, but it doesn't seem to support SparseTensors, at least with TFRecordDataset: the map operation fails with the exception "TypeError: Failed to convert object of type <class 'tensorflow.python.framework.sparse_tensor.SparseTensor'> to Tensor". Do you know if it should work?
