1

I was doing some data science work when I get stuck with this issue, I'm trying to create a model for a supervised task, where both Input and Output are of variable length.

Here is an example on how the Input and Output look like:

Input[0]: [2.42, 0.43, -5.2, -54.9]
Output[0]: ['class1', 'class3', 'class12']

The problem is that in the dataset that I have, the Inputs and Outputs are of variable length, so in order to be able to use the entire dataset, I need to find a way of encoding this data.

First I encoded the Outputs classes and added a padding to equal all Outputs length in the dataset, (let's say of length=(6,)):

Encoded_output[0]: [1, 3, 12, 0, 0, 0]

But I can't figure out a way of encoding the Input, because, as the original Input data are floats, I cannot create an encoding and add padding. I don't know what other options I have and I would like to hear how would you solve this.

1 Answer 1

2

A way to solve this is to:

  1. Find out what is the max length that the variable data can be.
  2. Find out what the true length of each training instance is.

From these two things you can create a mask and have your network compute zero gradients for the stuff you want to ignore.

Example:

import tensorflow as tf

# pretend the longest data instance we will have is 5
MAX_SEQ_LEN = 5

# batch of indices indicating true length of each instance
idxs = tf.constant([3, 2, 1])

# batch of variable-length data
rt = tf.ragged.constant(
  [
    [0.234, 0.123, 0.654],
    [0.111, 0.222],
    [0.777],
  ], dtype=tf.float32)

t = rt.to_tensor(shape=[3, MAX_SEQ_LEN])
print(t)
# tf.Tensor(
# [[0.234 0.123 0.654 0.    0.   ]
#  [0.111 0.222 0.    0.    0.   ]
#  [0.777 0.    0.    0.    0.   ]], shape=(3, 5), dtype=float32)

# use indices to create a boolean mask. We can use this mask in
# layers in our network to ignore gradients
mask = tf.sequence_mask(idxs, MAX_SEQ_LEN)
print(mask)
# <tf.Tensor: shape=(3, 5), dtype=bool, numpy=
# array([[ True,  True,  True, False, False],
#        [ True,  True, False, False, False],
#        [ True, False, False, False, False]])>

This use case commonly occurs in RNNs. You can see there is a mask option in the call() method where you can pass a binary mask for variable length time-series data.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.