The TensorFlow word2vec tutorial here refers to their basic implementation, which you can find on GitHub here, where the TensorFlow authors implement word2vec embedding training/evaluation with the skip-gram model.
My question is about the actual generation of (target, context) pairs in the generate_batch() function.
On this line, the TensorFlow authors randomly sample nearby target indices around the "center" word index within the sliding window of words.
However, they also keep a data structure targets_to_avoid, to which they first add the "center" word itself (which of course we don't want to sample as a context) and then ALSO every other window index once it has been sampled.
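For reference, the batch-generation loop looks roughly like this, paraphrased from the tutorial as I understand it (I've passed data and data_index explicitly instead of the module-level globals word2vec_basic.py uses, so exact details may differ slightly):

```python
import collections
import random

import numpy as np


def generate_batch(data, data_index, batch_size, num_skips, skip_window):
    """Simplified sketch of skip-gram batch generation, paraphrased from word2vec_basic.py.

    data        -- list of word ids for the whole corpus
    data_index  -- current position in the corpus
    num_skips   -- how many (target, context) pairs to draw per center word
    skip_window -- how many words to consider on each side of the center word
    """
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window

    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window ... center ... skip_window ]

    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size // num_skips):
        target = skip_window              # start at the center of the buffer
        targets_to_avoid = [skip_window]  # never pair the center word with itself
        for j in range(num_skips):
            # Rejection-sample a window index that hasn't been used yet.
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]   # center word
            labels[i * num_skips + j, 0] = buffer[target]    # sampled context word
        buffer.append(data[data_index])  # slide the window by one word
        data_index = (data_index + 1) % len(data)

    return batch, labels, data_index
```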
My questions are as follows:
1. Why sample from this sliding window around the word at all? Why not just have a loop and use every word in the window rather than sampling (a sketch of what I mean is below this list)? It seems strange that they would worry about performance/memory in word2vec_basic.py (their "basic" implementation).
2. Whatever the answer to 1) is, why are they both sampling and keeping track of what they've selected with targets_to_avoid? If they wanted truly random sampling, they'd use selection with replacement, and if they wanted to ensure they got all the options, they should have just used a loop and gotten them all in the first place!
3. Does the built-in tf.models.embedding.gen_word2vec work this way too? If so, where can I find the source code? (I couldn't find the .py file in the GitHub repo.)
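For question 1, this is the hypothetical alternative I had in mind: enumerate every position in the window instead of sampling (names and structure mirror my sketch above; this is my own code, not code from the tutorial):

```python
import collections

import numpy as np


def generate_batch_all_contexts(data, data_index, batch_size, skip_window):
    """Hypothetical variant: use every word in the window as a context,
    so num_skips is implicitly 2 * skip_window and no sampling is needed."""
    num_skips = 2 * skip_window
    assert batch_size % num_skips == 0

    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1

    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    for i in range(batch_size // num_skips):
        k = 0
        for offset in range(span):
            if offset == skip_window:
                continue  # never pair the center word with itself
            batch[i * num_skips + k] = buffer[skip_window]  # center word
            labels[i * num_skips + k, 0] = buffer[offset]   # every context word
            k += 1
        buffer.append(data[data_index])  # slide the window by one word
        data_index = (data_index + 1) % len(data)

    return batch, labels, data_index
```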
Thanks!
