
I'm trying to create an embedding from two distinct input sequences: for each observation, take a sequence of integer symbols plus a time series vector and produce a single embedding vector. A standard approach with one input seems to be to build an autoencoder, use the data as both input and output, and extract a hidden layer's output as the embedding.
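
For example, a minimal single-input sketch of that idea (the layer sizes and names here are just illustrative):

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

seq_len, latent_dim = 50, 20

# Autoencoder: reproduce the input, with a narrow bottleneck in the middle
ae_input = Input(shape=(seq_len,), name='ae_input')
bottleneck = Dense(latent_dim, activation='relu', name='bottleneck')(ae_input)
ae_output = Dense(seq_len, activation='linear')(bottleneck)

autoencoder = Model(inputs=ae_input, outputs=ae_output)
autoencoder.compile(optimizer='adam', loss='mse')

X = np.random.rand(1000, seq_len).astype('float32')  # dummy data
autoencoder.fit(X, X, epochs=1, batch_size=32)

# The embedding is the bottleneck layer's output
encoder = Model(inputs=ae_input, outputs=autoencoder.get_layer('bottleneck').output)
embeddings = encoder.predict(X)  # shape (1000, latent_dim)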

I'm using Keras, and it seems like I'm almost there. Input 1 has shape (1000000, 50) (a million integer sequences of length 50). Input 2 has shape (1000000, 50, 1).
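
For concreteness, dummy arrays with those shapes could be built like this (the sizes and vocabulary below are illustrative, not my actual data):

import numpy as np

n_obs = 1000000
max_seq_length = 50
num_unique_event_symbols = 100  # illustrative vocabulary size

# Input 1: integer symbol sequences, shape (1000000, 50); 0 is reserved for padding
X1 = np.random.randint(1, num_unique_event_symbols, size=(n_obs, max_seq_length))

# Input 2: one time value per timestep, shape (1000000, 50, 1)
X2 = np.random.rand(n_obs, max_seq_length, 1).astype('float32')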

Below is my Keras code.

import numpy as np
import keras
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Masking, Dense, Lambda, Flatten, TimeDistributed

# max_seq_length, embedding_length, num_unique_event_symbols, rms_prop,
# X1 and X2 are defined elsewhere.

##########################################

# Input 1: event type sequences
input_1a = Input(shape =(max_seq_length,), dtype = 'int32', name = 'first_input')

# Input 1: Embedding layer
input_1b = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(input_1a)

# Input 1: LSTM 
input_1c = LSTM(10, return_sequences = True)(input_1b)


##########################################

# Input 2: unix time (minutes) vectors
input_2a = Input(shape=(max_seq_length,1), dtype='float32', name='second_input')

# Input 2: Masking 
input_2b = Masking(mask_value = 99999999.0)(input_2a)

# Input 2: LSTM 
input_2c = LSTM(10, return_sequences = True)(input_2b)


##########################################

# Concatenation layer here
x = keras.layers.concatenate([input_1c, input_2c])
x2 = Dense(40, activation='relu')(x)
x3 = Dense(20, activation='relu', name = "journey_embeddings")(x2)

##########################################

# Re-create the inputs

# Decoder branch 1: reconstruct the length-50 integer sequences (X1)
xl = Lambda(lambda x: x, output_shape=lambda s: s)(x3)  # identity pass-through
xf = Flatten()(xl)
xf1 = Dense(20, activation='relu')(xf)
xf2 = Dense(50, activation='relu')(xf1)

# Decoder branch 2: reconstruct the (50, 1) time series (X2)
xd = Dense(20, activation='relu')(x3)
xd2 = TimeDistributed(Dense(1, activation='linear'))(xd)


##########################################

## Compile and fit the model
model = Model(inputs=[input_1a, input_2a], outputs=[xf2,xd2])
model.compile(optimizer = rms_prop, loss = 'mse')
print(model.summary())
np.random.seed(21)
model.fit([X1,X2], [X1,X2], epochs=1, batch_size=200)

Once I run this, I extract the "journey_embeddings" hidden layer output like this:

layer_name = 'journey_embeddings'
intermediate_layer_model = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict([X1,X2])

However, the shape of intermediate_output is (1000000, 50, 20). I'd like one embedding vector of length 20 per observation. How can I get an output of shape (1000000, 20)?

2 Answers


You use return_sequences=True in your LSTMs, so they return another time series rather than encoding the sequence into a single vector. That is why the output has shape (..., 50, 20): the LSTM emits its hidden state at every timestep. Since you want to encode all 50 timesteps into one vector, you shouldn't return sequences.
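
A quick way to see the difference (sizes here are arbitrary):

from keras.layers import Input, LSTM
from keras import backend as K

inp = Input(shape=(50, 1))

seq_out = LSTM(10, return_sequences=True)(inp)   # hidden state at every timestep
vec_out = LSTM(10, return_sequences=False)(inp)  # hidden state at the last timestep only

print(K.int_shape(seq_out))  # (None, 50, 10)
print(K.int_shape(vec_out))  # (None, 10)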


2 Comments

That makes sense. The problem is that when I change return_sequences to False and remove the Flatten layer, the output reconstructing X2 is now 2D, but it needs to have shape (1000000, 50, 1). Is there a way to do that?
You can use a Reshape layer to add the extra dimension.

Thanks to @nuric, the following code works:

##########################################

# Same imports and variables as before, plus the Reshape layer
from keras.layers import Reshape

# Input 1: event type sequences
input_1a = Input(shape =(max_seq_length,), dtype = 'int32', name = 'first_input')

# Input 1: Embedding layer
input_1b = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(input_1a)

# Input 1: LSTM 
input_1c = LSTM(10, return_sequences = False)(input_1b)


##########################################

# Input 2: unix time (minutes) vectors
input_2a = Input(shape=(max_seq_length,1), dtype='float32', name='second_input')

# Input 2: Masking 
input_2b = Masking(mask_value = 99999999.0)(input_2a)

# Input 2: LSTM 
input_2c = LSTM(10, return_sequences = False)(input_2b)


##########################################

# Concatenation layer here
x = keras.layers.concatenate([input_1c, input_2c])
x2 = Dense(40, activation='relu')(x)
x3 = Dense(20, activation='relu', name = "journey_embeddings")(x2)

##########################################

# An arbitrary number of dense hidden layers here

# Decoder branch 1: reconstruct the length-50 integer sequences (X1)
xf1 = Dense(20, activation='relu')(x3)
xf2 = Dense(50, activation='relu')(xf1)

# Decoder branch 2: Dense plus Reshape to recover the (50, 1) time series (X2)
xd = Dense(50, activation='relu')(x3)
xd2 = Reshape((50, 1))(xd)


##########################################

## Compile and fit the model
model = Model(inputs=[input_1a, input_2a], outputs=[xf2,xd2])
model.compile(optimizer = rms_prop, loss = 'mse')
print(model.summary())
np.random.seed(21)
model.fit([X1,X2], [X1,X2], epochs=1, batch_size=200)
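
With return_sequences=False, the extraction code from the question now gives one 20-dimensional vector per observation:

layer_name = 'journey_embeddings'
intermediate_layer_model = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict([X1, X2])
print(intermediate_output.shape)  # (1000000, 20)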

2 Comments

I'm curious how this general approach wound up working for you? Was the middle layer of the model "informative" enough to cluster on?
Hi Dylan, if there is decent signal in the inputs, it seemed to work reasonably well.
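
For anyone following up on the clustering question, a minimal sketch of clustering those embeddings (scikit-learn's KMeans here is just one option and is not part of the original setup):

from sklearn.cluster import KMeans

# intermediate_output has shape (n_obs, 20) once return_sequences=False is used
kmeans = KMeans(n_clusters=10, random_state=21)  # cluster count is arbitrary
cluster_labels = kmeans.fit_predict(intermediate_output)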
