
I am trying to load a pretrained Word2Vec (or GloVe) embedding in my TensorFlow code, but I have trouble understanding it, as I cannot find many examples. The question is not about getting and loading the embedding matrix, which I understand, but about looking up the word IDs. Currently I am using the code from https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/. There, first the embedding matrix is loaded (understood). Then a vocabulary processor is used to convert a sentence x into a list of word IDs:

import numpy as np
from tensorflow.contrib import learn

# processor that maps words to ids and pads each sentence to max_document_length
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocab from glove (vocab is the word list read from the embedding file)
pretrain = vocab_processor.fit(vocab)
# transform inputs: each raw sentence becomes a fixed-length row of word ids
x = np.array(list(vocab_processor.transform(your_raw_input)))

This works and gives me a list of word IDs, but I do not know whether it is correct. What bothers me most is how the vocabulary processor gets the correct word IDs for the embedding I just read (since otherwise the result of the embedding lookup would be wrong). Does the fit step do this?

Or is there another way? How do you do this lookup?

Thanks! Oliver

1 Answer


Yes, the fit step tells the vocab_processor the index of each word (starting from 1) in the vocab array. transform just reverses this lookup: it produces the indices from the words and uses 0 to pad the output to max_document_length.

You can see that in a short example here:

import numpy as np
from tensorflow.contrib import learn

vocab_processor = learn.preprocessing.VocabularyProcessor(5)  # max_document_length = 5
vocab = ['a', 'b', 'c', 'd', 'e']
pretrain = vocab_processor.fit(vocab)

pretrain == vocab_processor
# True (fit returns the processor itself)

np.array(list(pretrain.transform(['a b c', 'b c d', 'a e', 'a b c d e'])))

# array([[1, 2, 3, 0, 0],
#        [2, 3, 4, 0, 0],
#        [1, 5, 0, 0, 0],
#        [1, 2, 3, 4, 5]])
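
To close the loop on your original question: since fit assigns the ids 1..N in the order of the vocab list, the pretrained matrix just has to keep its rows in that same order, with an extra zero row at index 0 for the padding id. Here is a minimal sketch of the lookup, assuming TensorFlow 1.x; the random embedding_vectors is a hypothetical stand-in for the matrix you actually loaded:

import numpy as np
import tensorflow as tf

# stand-in for the pretrained matrix: one row per word in vocab,
# in the same order that was passed to vocab_processor.fit(vocab)
vocab = ['a', 'b', 'c', 'd', 'e']
dim = 50
embedding_vectors = np.random.rand(len(vocab), dim).astype(np.float32)

# fit() numbers words starting at 1 and transform() pads with 0,
# so prepend a zero row: row 0 = padding, row i+1 = vector of vocab[i]
embedding_matrix = np.vstack([np.zeros((1, dim), dtype=np.float32),
                              embedding_vectors])

word_ids = tf.placeholder(tf.int32, shape=[None, 5])     # output of transform
embeddings = tf.constant(embedding_matrix)               # tf.Variable if you want to fine-tune
embedded = tf.nn.embedding_lookup(embeddings, word_ids)  # shape [batch, 5, dim]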

1 Comment

OK, after reading your post twice, I got the idea. When reading an embedding matrix from word2vec or GloVe, there is a data matrix and a list of words (the vocab in this case); just using vocab_processor to fit this vocab will do the trick.
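
For reference, a minimal sketch of that reading step, assuming the standard GloVe text format (one word followed by its vector per line; the file name below is just an example):

import numpy as np

def load_glove(path):
    # each line: word v1 v2 ... vd
    vocab, vectors = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vocab.append(parts[0])
            vectors.append([float(v) for v in parts[1:]])
    return vocab, np.array(vectors, dtype=np.float32)

# vocab, embedding_vectors = load_glove('glove.6B.50d.txt')  # example file name
# pretrain = vocab_processor.fit(vocab)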
