How to Encode Data of Variable Input Length?

Question

I was doing some data science work when I get stuck with this issue, I'm trying to create a model for a supervised task, where both Input and Output are of variable length.

Here is an example on how the Input and Output look like:

Input[0]: [2.42, 0.43, -5.2, -54.9]
Output[0]: ['class1', 'class3', 'class12']

The problem is that in the dataset that I have, the Inputs and Outputs are of variable length, so in order to be able to use the entire dataset, I need to find a way of encoding this data.

First I encoded the Outputs classes and added a padding to equal all Outputs length in the dataset, (let's say of length=(6,)):

Encoded_output[0]: [1, 3, 12, 0, 0, 0]

But I can't figure out a way of encoding the Input, because, as the original Input data are floats, I cannot create an encoding and add padding. I don't know what other options I have and I would like to hear how would you solve this.

o-90 · Accepted Answer · 2021-02-21 19:47:52Z

A way to solve this is to:

Find out what is the max length that the variable data can be.
Find out what the true length of each training instance is.

From these two things you can create a mask and have your network compute zero gradients for the stuff you want to ignore.

Example:

import tensorflow as tf

# pretend the longest data instance we will have is 5
MAX_SEQ_LEN = 5

# batch of indices indicating true length of each instance
idxs = tf.constant([3, 2, 1])

# batch of variable-length data
rt = tf.ragged.constant(
  [
    [0.234, 0.123, 0.654],
    [0.111, 0.222],
    [0.777],
  ], dtype=tf.float32)

t = rt.to_tensor(shape=[3, MAX_SEQ_LEN])
print(t)
# tf.Tensor(
# [[0.234 0.123 0.654 0.    0.   ]
#  [0.111 0.222 0.    0.    0.   ]
#  [0.777 0.    0.    0.    0.   ]], shape=(3, 5), dtype=float32)

# use indices to create a boolean mask. We can use this mask in
# layers in our network to ignore gradients
mask = tf.sequence_mask(idxs, MAX_SEQ_LEN)
print(mask)
# <tf.Tensor: shape=(3, 5), dtype=bool, numpy=
# array([[ True,  True,  True, False, False],
#        [ True,  True, False, False, False],
#        [ True, False, False, False, False]])>

This use case commonly occurs in RNNs. You can see there is a mask option in the call() method where you can pass a binary mask for variable length time-series data.

Collectives™ on Stack Overflow

How to Encode Data of Variable Input Length?

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related