
I'm trying to use NumPy arrays within a graph, feeding in the data using a Dataset.

I've read through this, but can't quite make sense of how I should feed placeholder arrays within a Dataset.

If we take a simple example, I start with:

import numpy as np
import tensorflow as tf

A = np.arange(4)
B = np.arange(10, 14)

a = tf.placeholder(tf.float32, [None])
b = tf.placeholder(tf.float32, [None])
c = tf.add(a, b)

with tf.Session() as sess:
    for i in range(10):
        x = sess.run(c, feed_dict={a: A, b: B})
        print(i, x)

Then I attempt to modify it to use a Dataset as follows:

A = np.arange(4)
B = np.arange(10, 14)

a = tf.placeholder(tf.int32, A.shape)
b = tf.placeholder(tf.int32, B.shape)
c = tf.add(a, b)

dataset = tf.data.Dataset.from_tensors((a, b))

iterator = dataset.make_initializable_iterator()

with tf.Session() as sess3:
    sess3.run(tf.global_variables_initializer())
    sess3.run(iterator.initializer, feed_dict={a: A, b: B})

    for i in range(10):
        x = sess3.run(c)
        print(i, x)

If I run this I get 'InvalidArgumentError: You must feed a value for placeholder tensor ...'

The code up to the for loop mimics the example here, but I don't see how I can then use the placeholders a and b without supplying a feed_dict to every call to sess3.run(c) (which would be expensive). I suspect I have to use the iterator somehow, but I don't understand how.

Update

It appears I oversimplified when picking the example. What I am really trying to do is use Datasets when training a neural network, or similar.

For a more sensible question, how would I go about using Datasets to feed the placeholders in the code below (though imagine X and Y_true are much longer...)? The documentation takes me to the point where the loop starts, and then I'm not sure.

X = np.arange(8.).reshape(4, 2)
Y_true = np.array([0, 0, 1, 1])

x = tf.placeholder(tf.float32, [None, 2], name='x')
y_true = tf.placeholder(tf.float32, [None], name='y_true')

w = tf.Variable(np.random.randn(2, 1), name='w', dtype=tf.float32)

y = tf.squeeze(tf.matmul(x, w), name='y')

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
                                labels=y_true, logits=y),
                                name='x_entropy')

# set optimiser
optimiser = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    for i in range(100):
        _, loss_out = sess.run([optimiser, loss], feed_dict={x: X, y_true: Y_true})
        print(i, loss_out)

Trying the following only gets me an InvalidArgumentError:

X = np.arange(8.).reshape(4, 2)
Y_true = np.array([0, 0, 1, 1])

x = tf.placeholder(tf.float32, [None, 2], name='x')
y_true = tf.placeholder(tf.float32, [None], name='y_true')

dataset = tf.data.Dataset.from_tensor_slices((x, y_true))
iterator = dataset.make_initializable_iterator()

w = tf.Variable(np.random.randn(2, 1), name='w', dtype=tf.float32)

y = tf.squeeze(tf.matmul(x, w), name='y')

loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
                                labels=y_true, logits=y),
                                name='x_entropy')

# set optimiser
optimiser = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    sess.run(iterator.initializer, feed_dict={x: X,
                                              y_true: Y_true})

    for i in range(100):
        _, loss_out = sess.run([optimiser, loss])
        print(i, loss_out)
  • What do you expect the result of sess3.run(c) to be? The dataset only contains a single element, so even if you used iterator.get_next(), the loop would only perform one iteration before signaling that there are no more elements.
  • This is probably not the clearest example. I was trying to show the simplest example I could, but apparently lost the meaning along the way; I'll edit the question.

2 Answers


Use iterator.get_next() to get elements from the Dataset:

next_element = iterator.get_next()

then initialize the iterator:

sess.run(iterator.initializer, feed_dict={a: A, b: B})

and finally get the values from the Dataset:

value = sess.run(next_element)

EDIT:

The code above just returns the elements from the Dataset. The Dataset API is intended to serve features and labels to an input_fn, so all additional preprocessing computations should be performed within the Dataset API. If you want to add the elements together, you should define a function that is applied to them, like:

def add_fn(exp1, exp2):
  return tf.add(exp1, exp2)

and then you can map this function over your Dataset:

dataset = dataset.map(add_fn)

Complete code example:

import numpy as np
import tensorflow as tf

A = np.arange(4)
B = np.arange(10, 14)
a = tf.placeholder(tf.int32, A.shape)
b = tf.placeholder(tf.int32, B.shape)
#c = tf.add(a, b)
def add_fn(exp1, exp2):
  return tf.add(exp1, exp2)
dataset = tf.data.Dataset.from_tensors((a, b))
dataset = dataset.map(add_fn)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
  sess.run(iterator.initializer, feed_dict={a: A, b: B})
  # the dataset contains just one element
  x = sess.run(next_element)
  print(x)
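
For reference, this prints the single summed element, [10 12 14 16]; a second sess.run(next_element) would raise tf.errors.OutOfRangeError, because the dataset holds only one element.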



The problem in your more complicated example is that you use the same tf.placeholder() nodes both as the input to Dataset.from_tensor_slices() (which is correct) and as the input to the network itself (which causes the InvalidArgumentError). Instead, as J.E.K points out in their answer, you should use iterator.get_next() as the input to your network, as follows (note that I added a couple of other fixes to make the code run as-is):

import numpy as np
import tensorflow as tf

X = np.arange(8.).reshape(4, 2)
Y_true = np.array([0, 0, 1, 1])

x = tf.placeholder(tf.float32, [None, 2], name='x')
y_true = tf.placeholder(tf.float32, [None], name='y_true')

dataset = tf.data.Dataset.from_tensor_slices((x, y_true))

# You will need to repeat the input (which has 4 elements) to be able to take
# 100 steps.
dataset = dataset.repeat()

iterator = dataset.make_initializable_iterator()

# Use `iterator.get_next()` to create tensors that will consume values from the
# dataset.
x_next, y_true_next = iterator.get_next()

w = tf.Variable(np.random.randn(2, 1), name='w', dtype=tf.float32)

# The `x_next` tensor is a vector (i.e. a row of `X`), so you will need to
# convert it to a matrix, or apply batching in the dataset, to make it work
# with `tf.matmul()`. (A batching sketch follows this example.)
x_next = tf.expand_dims(x_next, 0)

y = tf.squeeze(tf.matmul(x_next, w), name='y')  # Use `x_next` here.

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(
        labels=y_true_next, logits=y),  # Use `y_true_next` here.
    name='x_entropy')

# set optimiser
optimiser = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    sess.run(iterator.initializer, feed_dict={x: X,
                                              y_true: Y_true})

    for i in range(100):
        _, loss_out = sess.run([optimiser, loss])
        print(i, loss_out)
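
As a footnote, here is a minimal sketch of the batching alternative mentioned in the comment above: batching the dataset yields matrix-shaped elements directly, so the tf.expand_dims() call is no longer needed. The batch size of 4 is an arbitrary choice for this toy data; the rest uses the same TF 1.x API as above.

import numpy as np
import tensorflow as tf

X = np.arange(8.).reshape(4, 2)
Y_true = np.array([0, 0, 1, 1])

x = tf.placeholder(tf.float32, [None, 2], name='x')
y_true = tf.placeholder(tf.float32, [None], name='y_true')

dataset = tf.data.Dataset.from_tensor_slices((x, y_true))
dataset = dataset.repeat()
# Batch 4 rows at a time, so each element is already a [4, 2] matrix.
dataset = dataset.batch(4)

iterator = dataset.make_initializable_iterator()
x_next, y_true_next = iterator.get_next()  # x_next: [4, 2], y_true_next: [4]

w = tf.Variable(np.random.randn(2, 1), name='w', dtype=tf.float32)

# No tf.expand_dims() needed: `x_next` is already a matrix.
y = tf.squeeze(tf.matmul(x_next, w), name='y')

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_true_next, logits=y),
    name='x_entropy')

optimiser = tf.train.AdamOptimizer().minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(iterator.initializer, feed_dict={x: X, y_true: Y_true})

    for i in range(100):
        _, loss_out = sess.run([optimiser, loss])
        print(i, loss_out)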

1 Comment

Great, I think that has got me over the conceptual 'hump' so that I now have some idea of how Dataset works, thanks!
