I'm trying to create a single embedding from two distinct input sequences: for each observation, the model takes in a sequence of integer symbols and a time-series vector and produces one embedding vector. With a single input, the standard approach seems to be an autoencoder: use the data as both input and target, train the model, and extract a hidden layer's output as the embedding.
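For a single input, that pattern looks roughly like this (a minimal sketch with made-up layer sizes, not my actual model):

from keras.layers import Input, Dense
from keras.models import Model

# Toy single-input autoencoder: train input -> input, then read the
# bottleneck layer's output as the embedding
inp = Input(shape=(50,))
code = Dense(20, activation='relu', name='embedding')(inp)  # bottleneck
out = Dense(50, activation='linear')(code)                  # reconstruction

autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X, X, epochs=..., batch_size=...)
encoder = Model(inp, autoencoder.get_layer('embedding').output)
# embeddings = encoder.predict(X)  # shape (n_samples, 20)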
I'm using Keras, and it seems like I'm almost there. Input 1 has shape (1000000, 50) (a million integer sequences of length 50). Input 2 has shape (1000000, 50, 1) (a million time-series vectors of length 50).
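For reproducibility, the two inputs can be stubbed with random placeholders of the same shapes and dtypes (these are not my real values):

import numpy as np

# Random stand-ins matching the real shapes/dtypes; symbols start at 1
# because 0 is reserved for masking (mask_zero=True below)
X1 = np.random.randint(1, 100, size=(1000000, 50))      # integer symbol sequences
X2 = np.random.rand(1000000, 50, 1).astype('float32')   # time-series vectors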
Below is my Keras code.
import numpy as np
import keras
from keras.layers import (Input, Embedding, LSTM, Masking, Dense,
                          Lambda, Flatten, TimeDistributed)
from keras.models import Model
from keras.optimizers import RMSprop

# Placeholder hyperparameters: max_seq_length matches my data; the
# other two values are stand-ins
max_seq_length = 50
embedding_length = 10
num_unique_event_symbols = 100
rms_prop = RMSprop()

##########################################
# Input 1: event type sequences
input_1a = Input(shape=(max_seq_length,), dtype='int32', name='first_input')
# Input 1: Embedding layer
input_1b = Embedding(output_dim=embedding_length, input_dim=num_unique_event_symbols, input_length=max_seq_length, mask_zero=True)(input_1a)
# Input 1: LSTM
input_1c = LSTM(10, return_sequences=True)(input_1b)
##########################################
# Input 2: unix time (minutes) vectors
input_2a = Input(shape=(max_seq_length,1), dtype='float32', name='second_input')
# Input 2: Masking
input_2b = Masking(mask_value=99999999.0)(input_2a)
# Input 2: LSTM
input_2c = LSTM(10, return_sequences=True)(input_2b)
##########################################
# Concatenation layer here
x = keras.layers.concatenate([input_1c, input_2c])
x2 = Dense(40, activation='relu')(x)
x3 = Dense(20, activation='relu', name='journey_embeddings')(x2)
##########################################
# Re-create the inputs
# Identity Lambda: strips the sequence mask so Flatten can be applied
xl = Lambda(lambda x: x, output_shape=lambda s: s)(x3)
# Branch 1: flatten and reconstruct the 50 integer symbols of input 1
xf = Flatten()(xl)
xf1 = Dense(20, activation='relu')(xf)
xf2 = Dense(50, activation='relu')(xf1)
# Branch 2: reconstruct the (50, 1) time series of input 2, one value per timestep
xd = Dense(20, activation='relu')(x3)
xd2 = TimeDistributed(Dense(1, activation='linear'))(xd)
##########################################
## Compile and fit the model
model = Model(inputs=[input_1a, input_2a], outputs=[xf2, xd2])
model.compile(optimizer=rms_prop, loss='mse')
model.summary()
np.random.seed(21)
# Autoencoder-style training: the inputs double as the targets
model.fit([X1, X2], [X1, X2], epochs=1, batch_size=200)
Once I run this, I extract the "journey_embeddings" hidden layer output like this:
layer_name = 'journey_embeddings'
intermediate_layer_model = Model(inputs=model.input, outputs=model.get_layer(layer_name).output)
intermediate_output = intermediate_layer_model.predict([X1,X2])
However, intermediate_output has shape (1000000, 50, 20): both LSTMs use return_sequences=True, so the 50-step time axis is carried through the concatenation, and the Dense layers are then applied per timestep. I'd like a single embedding vector of length 20 per observation. How can I get an output of shape (1000000, 20)?
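For reference, the layer's static shape confirms where the extra axis comes from (this uses the standard Keras layer output_shape attribute, nothing specific to my setup):

print(model.get_layer('journey_embeddings').output_shape)
# -> (None, 50, 20): the 50-step time axis survives the Dense layers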