I am trying to load a pretrained Word2Vec (or GloVe) embedding in my TensorFlow code, but I have trouble understanding the process because I cannot find many examples. My question is not about obtaining and loading the embedding matrix, which I understand, but about looking up the word IDs. Currently I am using the code from https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/. There, the embedding matrix is loaded first (understood). Then a vocabulary processor is used to convert a sentence x to a list of word IDs:
import numpy as np
from tensorflow.contrib import learn

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocab from glove
pretrain = vocab_processor.fit(vocab)
# transform inputs
x = np.array(list(vocab_processor.transform(your_raw_input)))
This works and gives me a list of word IDs, but I do not know whether it is correct. What bothers me most is how the vocabulary processor obtains word IDs that actually match the rows of the embedding matrix I just read (since otherwise the result of the embedding lookup would be wrong). Does the fit step take care of this?
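For reference, my understanding is that the IDs are only correct if the order of the vocabulary passed to fit matches the row order of the embedding matrix. A minimal sketch of the lookup I have in mind (with made-up toy data, names like word_to_id are just mine) would be:

```python
import numpy as np

# Toy example: vocabulary and embedding matrix read from the same GloVe
# file, so row i of `embedding` is the vector for vocab[i].
vocab = ["the", "cat", "sat"]
embedding = np.array([[0.1, 0.2],
                      [0.3, 0.4],
                      [0.5, 0.6]])

# Build the word -> id mapping explicitly, so the ids are guaranteed
# to line up with the embedding rows.
word_to_id = {word: i for i, word in enumerate(vocab)}

sentence = "the cat sat".split()
ids = [word_to_id[w] for w in sentence]  # word ids into the vocab
vectors = embedding[ids]                 # looks up the matching rows
```

Is this alignment what VocabularyProcessor.fit guarantees internally?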
Or, if there is another way, how do you do this lookup?
Thanks! Oliver