
I have a nested numpy array that I want to feed to an RNN model in TensorFlow/Keras. Predictions will be made at the person level, so that's the first dimension of the array. Each person has one or more events (2nd dimension), and each event has one or more codes (3rd dimension). In other words, dimensions 2 and 3 have varying lengths.

For the first version of the training code, I loaded all the data in memory and sliced/padded the mini-batches as needed during training. The slicing/padding is done by a Keras Sequence class that processes numpy arrays.
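For context, that class looks roughly like this (a simplified sketch, not the exact code; the class name and the label handling are illustrative):

import numpy as np
from tensorflow.keras.utils import Sequence

class PaddedPersonBatches(Sequence):
    # Sketch: slices the in-memory nested array into mini-batches and
    # zero-pads dimensions 2 (events) and 3 (codes) to the batch maximum.
    def __init__(self, codes, labels, batch_size):
        self.codes = codes
        self.labels = labels  # illustrative; actual targets differ
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.codes) / self.batch_size))

    def __getitem__(self, idx):
        people = self.codes[idx * self.batch_size:(idx + 1) * self.batch_size]
        n_events = max(len(person) for person in people)
        n_codes = max(len(event) for person in people for event in person)
        x = np.zeros((len(people), n_events, n_codes), dtype=np.int64)
        for i, person in enumerate(people):
            for j, event in enumerate(person):
                x[i, j, :len(event)] = event
        y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        return x, y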

Now I have way more data to train the model so I cannot load it all in memory. The plan is to save it into several TFRecord files and then load/pad them in small batches as needed during training.
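The writing side of that plan might look something like this (a sketch under assumptions: tf.train.SequenceExample as the container, one example per person, placeholder file name):

import tensorflow as tf

def person_to_sequence_example(person):
    # One SequenceExample per person: each event's codes become one entry
    # in an int64 feature list, so both ragged dimensions survive serialization.
    codes_list = tf.train.FeatureList(feature=[
        tf.train.Feature(int64_list=tf.train.Int64List(value=event))
        for event in person])
    return tf.train.SequenceExample(
        feature_lists=tf.train.FeatureLists(feature_list={'codes': codes_list}))

with tf.io.TFRecordWriter('people-00000.tfrecord') as writer:
    for person in codes:
        writer.write(person_to_sequence_example(person).SerializeToString())

Reading back would presumably go through tf.io.parse_single_sequence_example, which is where the loading/padding question below comes in.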

I am using TensorFlow 1.14.0 with Python 3.6.

Since the inner dimensions are of varying length, I have tried to use tf.data.Dataset.from_generator.

Question: how to fix the minimal example below (if possible)?

import numpy as np
import tensorflow as tf

tf.enable_eager_execution()  # TF 1.x: needed for the for-loop iteration below

codes = np.array([np.array([np.array([527,  38, 734]),
                            np.array([  4, 935])]),
                  np.array([np.array([810])]),
                  np.array([np.array([315, 802])]),
                  np.array([np.array([317,  29, 861]),
                            np.array([906]),
                            np.array([439, 655, 893, 130])])])

codes_dataset = tf.data.Dataset.from_generator(lambda: codes, (tf.int64, tf.int64))

print(codes_dataset)
# <DatasetV1Adapter shapes: (<unknown>, <unknown>), types: (tf.int64, tf.int64)>

for value in codes_dataset:
    print(value)

codes_dataset is created but the for loop errors out:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-129-d4ed489ff27f> in <module>()
----> 1 for value in codes_dataset:
      2     print(value)

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in __next__(self)
    584 
    585   def __next__(self):  # For Python 3 compatibility
--> 586     return self.next()
    587 
    588   def _next_internal(self):

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in next(self)
    621     """
    622     try:
--> 623       return self._next_internal()
    624     except errors.OutOfRangeError:
    625       raise StopIteration

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py in _next_internal(self)
    613             self._iterator_resource,
    614             output_types=self._flat_output_types,
--> 615             output_shapes=self._flat_output_shapes)
    616 
    617       return self._structure._from_compatible_tensor_list(ret)  # pylint: disable=protected-access

/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py in iterator_get_next_sync(iterator, output_types, output_shapes, name)
   2118       else:
   2119         message = e.message
-> 2120       _six.raise_from(_core._status_to_exception(e.code, message), None)
   2121   # Add nodes to the TensorFlow graph.
   2122   if not isinstance(output_types, (list, tuple)):

/opt/tools/python/anaconda3/lib/python3.6/site-packages/six.py in raise_from(value, from_value)

InvalidArgumentError: TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was (tf.int64, tf.int64), but the yielded element was [array([527,  38, 734]) array([  4, 935])].
Traceback (most recent call last):

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 520, in generator_py_func
    flattened_values = nest.flatten_up_to(output_types, values)

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/util/nest.py", line 398, in flatten_up_to
    assert_shallow_structure(shallow_tree, input_tree)

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/util/nest.py", line 301, in assert_shallow_structure
    "Input has type: %s." % type(input_tree))

TypeError: If shallow structure is a sequence, input must also be a sequence. Input has type: <class 'numpy.ndarray'>.


During handling of the above exception, another exception occurred:


Traceback (most recent call last):

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 209, in __call__
    ret = func(*args)

  File "/opt/tools/python/anaconda3/lib/python3.6/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 525, in generator_py_func
    "element was %s." % (output_types, values))

TypeError: `generator` yielded an element that did not match the expected structure. The expected structure was (tf.int64, tf.int64), but the yielded element was [array([527,  38, 734]) array([  4, 935])].


     [[{{node PyFunc}}]] [Op:IteratorGetNextSync]
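If I read the error right, output_types=(tf.int64, tf.int64) declares each element to be a 2-tuple of tensors, while the generator actually yields a single object array per person; and even with a single tf.int64, the ragged inner arrays cannot be converted to one dense tensor. One direction I'm considering (an untested sketch, not necessarily the right fix): pad the inner codes dimension per person inside the generator, yield one dense 2-D component per person, and let padded_batch pad across people:

def person_generator():
    for person in codes:
        # Pad each person's events to that person's longest code list,
        # so each yielded element is a dense 2-D int64 array.
        max_codes = max(len(event) for event in person)
        dense = np.zeros((len(person), max_codes), dtype=np.int64)
        for i, event in enumerate(person):
            dense[i, :len(event)] = event
        yield dense

codes_dataset = tf.data.Dataset.from_generator(
    person_generator,
    output_types=tf.int64,  # one component, not the tuple (tf.int64, tf.int64)
    output_shapes=tf.TensorShape([None, None]))

padded = codes_dataset.padded_batch(2, padded_shapes=tf.TensorShape([None, None]))

for value in padded:
    print(value)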

  • Why do you think using a generator will make variable-size arrays more acceptable? Is there something in the tensorflow docs about that? Commented Jul 24, 2019 at 0:34
  • I decided to try from_generator because of Derek Murray's responses/comments: stackoverflow.com/questions/47580716/… stackoverflow.com/questions/46511328/… Commented Jul 24, 2019 at 2:51
  • Wouldn't it be easier to detect a pattern if all samples had the same shape or number of features? Commented Jul 24, 2019 at 7:03
  • I can't load all the data in memory so it would not be easy to separate them by shape. The fact that the size is variable in 2 dimensions doesn't help either. There are a lot of different shapes and some will have 1 or very few samples. Commented Jul 24, 2019 at 14:38
