
I'm interested in implementing a LinkNet-based encoder-decoder structure for semantic segmentation on a custom dataset, and I'm trying to introduce ConvLSTM layers between the encoder and decoder. As expected, the output of the encoder is a 4-dim tensor (batch_size, channels, height, width), while the ConvLSTM layers expect a 5-dim input (batch_size, sequence_length, channels, height, width). How do I convert this 4-dim tensor to a 5-dim tensor without any loss of information? I initially thought of splitting the batch_size to accommodate the sequence_length, but that might be a problem since I'm dealing with video frames.

I'm considering using sequences of four or five frames for training, i.e. the semantic segmentation map of frame t is determined from the information of the previous three to four frames, so a sequence_length of 4 or 5 should do.

How do I introduce the sequence length? Is it during pre-processing or right after the encoder structure?

Most importantly, HOW TO DO IT?

1 Answer


You can't. ConvLSTM expects a sequence, which is exactly the dimension you are missing. LinkNet takes only a single image as input, so you can't really use ConvLSTM inside LinkNet.


2 Comments

They use sequences of frames: arxiv.org/pdf/1905.01058.pdf
If I understand correctly, you have to use the ConvLSTM as both encoder and decoder
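One way to do what the question and comments describe, sketched in PyTorch under the assumption that frames are grouped into fixed-length clips during preprocessing: flatten the time axis into the batch axis so the 2D encoder sees individual frames, then restore the sequence axis before the ConvLSTM. The encoder here is a hypothetical stand-in (a single strided conv), not the actual LinkNet encoder.

```python
import torch

# Hypothetical sizes: 2 clips of 4 frames each, 3-channel 64x64 frames.
batch_size, seq_len, ch, h, w = 2, 4, 3, 64, 64

# Preprocessing produces clips shaped (B, T, C, H, W).
clips = torch.randn(batch_size, seq_len, ch, h, w)

# Flatten time into the batch axis so a 2D encoder can process
# every frame independently: (B*T, C, H, W).
frames = clips.view(batch_size * seq_len, ch, h, w)

# Stand-in for the LinkNet encoder (a strided conv halving spatial size).
encoder = torch.nn.Conv2d(ch, 16, kernel_size=3, stride=2, padding=1)
features = encoder(frames)  # (B*T, 16, 32, 32)

# Restore the sequence axis for the ConvLSTM: (B, T, C', H', W').
features = features.view(batch_size, seq_len, *features.shape[1:])
print(features.shape)  # torch.Size([2, 4, 16, 32, 32])
```

Since `view` only reinterprets the memory layout, no information is lost in either direction; the answer to "pre-processing or after the encoder" is both: sequences are formed during preprocessing, and the 5-dim shape is restored right after the encoder.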
