I am trying to understand the behavior of `Dataset.batch`. Here is the code I have used to try to set up an iterator over batched data from a `Dataset` built on NumPy arrays.
## experiment with a numpy dataset
import numpy as np
import tensorflow as tf  # TF 1.x API
sample_size = 100000
ncols = 15
batch_size = 1000
xarr = np.ones([sample_size, ncols]) * [i for i in range(ncols)]
xarr = xarr + np.random.normal(scale = 0.5, size = xarr.shape)
yarr = np.sum(xarr, axis = 1)
self.x_placeholder = tf.placeholder(xarr.dtype, [None, ncols])
self.y_placeholder = tf.placeholder(yarr.dtype, [None, 1])
dataset = tf.data.Dataset.from_tensor_slices((self.x_placeholder, self.y_placeholder))
dataset.batch(batch_size)
self.iterator = dataset.make_initializable_iterator()
X, y = self.iterator.get_next()
However, when I check the shapes of `X` and `y`, I get:
(Pdb) X.shape
TensorShape([Dimension(15)])
(Pdb) y.shape
TensorShape([Dimension(1)])
This is confusing to me, because the batch size does not appear to have been taken into account. It also causes problems downstream when building a model, because I expect `X` and `y` to be two-dimensional, with the first dimension being the number of examples in the batch.
Question: Why are the outputs of the iterator one-dimensional? How should I batch properly?
Here is what I have tried:
- The shapes of `X` and `y` are the same regardless of whether I apply the `batch` function to the dataset.
- Changing the shape I feed into the placeholders (say, by replacing `None` with `batch_size`) does not change the behavior either.
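For reference, here is a minimal, TensorFlow-free sketch that rebuilds the arrays from the snippet above and prints their shapes. One thing it highlights (which may or may not be related to the main problem) is that `yarr` comes out one-dimensional, while the `y` placeholder is declared as `[None, 1]`:

```python
import numpy as np

# Rebuild the arrays exactly as in the snippet above
sample_size = 100000
ncols = 15
xarr = np.ones([sample_size, ncols]) * [i for i in range(ncols)]
xarr = xarr + np.random.normal(scale=0.5, size=xarr.shape)
yarr = np.sum(xarr, axis=1)  # reduces along the column axis

print(xarr.shape)  # (100000, 15)
print(yarr.shape)  # (100000,) -- 1-D, not (100000, 1)
```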
Thanks for suggestions/corrections, etc.