
I am trying to load a pretrained Word2Vec (or GloVe) embedding in my TensorFlow code, but I have trouble understanding it, as I cannot find many examples. The question is not about getting and loading the embedding matrix, which I understand, but about looking up the word IDs. Currently I am using the code from https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/. There, first the embedding matrix is loaded (understood). Then a vocabulary processor is used to convert a sentence x into a list of word IDs:

import numpy as np
from tensorflow.contrib import learn

# processor that maps words to ids and pads each sentence to max_document_length
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocab from glove (vocab is the word list read from the embedding file)
pretrain = vocab_processor.fit(vocab)
# transform inputs: each raw sentence becomes a fixed-length row of word ids
x = np.array(list(vocab_processor.transform(your_raw_input)))

This works and gives me a list of word IDs, but I do not know whether it is correct. What bothers me most is how the vocabulary processor gets the correct word IDs for the embedding I just read (since otherwise the result of the embedding lookup would be wrong). Does the fit step do this?

Or is there another way? How do you do this lookup?

Thanks! Oliver

1 Answer


Yes, the fit step tells the vocab_processor the index of each word (starting from 1) in the vocab array. transform just reverses this lookup: it produces the indices from the words and uses 0 to pad the output to max_document_length.

You can see that in a short example here:

import numpy as np
from tensorflow.contrib import learn

vocab_processor = learn.preprocessing.VocabularyProcessor(5)  # max_document_length = 5
vocab = ['a', 'b', 'c', 'd', 'e']
pretrain = vocab_processor.fit(vocab)

pretrain == vocab_processor
# True (fit returns the processor itself)

np.array(list(pretrain.transform(['a b c', 'b c d', 'a e', 'a b c d e'])))

# array([[1, 2, 3, 0, 0],
#        [2, 3, 4, 0, 0],
#        [1, 5, 0, 0, 0],
#        [1, 2, 3, 4, 5]])
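
To close the loop on your original question: since fit assigns the ids 1..N in the order of the vocab list, the pretrained matrix just has to keep its rows in that same order, with an extra zero row at index 0 for the padding id. Here is a minimal sketch of the lookup, assuming TensorFlow 1.x; the random embedding_vectors is a hypothetical stand-in for the matrix you actually loaded:

import numpy as np
import tensorflow as tf

# stand-in for the pretrained matrix: one row per word in vocab,
# in the same order that was passed to vocab_processor.fit(vocab)
vocab = ['a', 'b', 'c', 'd', 'e']
dim = 50
embedding_vectors = np.random.rand(len(vocab), dim).astype(np.float32)

# fit() numbers words starting at 1 and transform() pads with 0,
# so prepend a zero row: row 0 = padding, row i+1 = vector of vocab[i]
embedding_matrix = np.vstack([np.zeros((1, dim), dtype=np.float32),
                              embedding_vectors])

word_ids = tf.placeholder(tf.int32, shape=[None, 5])     # output of transform
embeddings = tf.constant(embedding_matrix)               # tf.Variable if you want to fine-tune
embedded = tf.nn.embedding_lookup(embeddings, word_ids)  # shape [batch, 5, dim]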

1 Comment

OK, after reading your post twice, I got the idea. When reading an embedding matrix from word2vec or GloVe, there is a data matrix and a list of words (the vocab in this case); just using vocab_processor to fit this vocab will do the trick.
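
For reference, a minimal sketch of that reading step, assuming the standard GloVe text format (one word followed by its vector per line; the file name below is just an example):

import numpy as np

def load_glove(path):
    # each line: word v1 v2 ... vd
    vocab, vectors = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vocab.append(parts[0])
            vectors.append([float(v) for v in parts[1:]])
    return vocab, np.array(vectors, dtype=np.float32)

# vocab, embedding_vectors = load_glove('glove.6B.50d.txt')  # example file name
# pretrain = vocab_processor.fit(vocab)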
