
I was reading the BERT paper and am not clear on the inputs to the transformer encoder and decoder.

For learning masked language model (Cloze task), the paper says that 15% of the tokens are masked and the network is trained to predict the masked tokens. Since this is the case, what are the inputs to the transformer encoder and decoder?

BERT input representation (from the paper)

Is the input to the transformer encoder this input representation (see image above)? If so, what is the decoder input?

Further, how is the output loss computed? Is a softmax taken only at the masked positions? And if so, is the same linear layer used for all masked tokens?

1 Answer


Ah, but you see, BERT does not include a Transformer decoder. It is only the encoder part, with a classifier added on top.

For masked word prediction, the classifier acts as a decoder of sorts, trying to reconstruct the true identities of the masked words. Non-masked tokens are not included in the classification task and do not affect the loss.
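To make this concrete, here is a minimal pure-Python sketch (not BERT's actual implementation) of computing cross-entropy loss only at the masked positions, with one shared linear projection applied at every masked position. The names `masked_lm_loss`, `hidden_states`, etc. are my own, chosen for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def masked_lm_loss(hidden_states, mask_positions, true_ids, W, b):
    """Cross-entropy loss computed ONLY at the masked positions.

    The same projection (W, b) is applied at every masked position,
    i.e. a single shared linear layer maps each hidden vector to
    vocabulary logits; non-masked positions contribute nothing.
    """
    total = 0.0
    for pos in mask_positions:
        h = hidden_states[pos]
        # shared linear layer: logits[v] = h . W[v] + b[v]
        logits = [sum(hi * wi for hi, wi in zip(h, W[v])) + b[v]
                  for v in range(len(W))]
        probs = softmax(logits)
        total -= math.log(probs[true_ids[pos]])
    return total / len(mask_positions)
```

In real implementations this is usually done by feeding all positions through the classifier and masking out the loss at non-masked positions (e.g. with an ignore index), but the effect is the same: gradients flow only from the masked tokens.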

BERT is also trained on predicting whether, given a pair of sentences, the second really does follow the first (next sentence prediction).
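The paper constructs these pairs so that half are genuine consecutive sentences and half have a random second sentence. A rough sketch of that data construction (my own illustrative code, not from the BERT repo):

```python
import random

def make_nsp_pairs(sentences, rng=None):
    """Build next-sentence-prediction examples: for each sentence,
    keep its true successor 50% of the time (label "IsNext") and
    substitute a randomly drawn sentence otherwise (label "NotNext").

    Note: a production pipeline would also avoid accidentally
    drawing the true successor as the "random" sentence.
    """
    rng = rng or random.Random(0)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs
```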

I do not remember how the two losses are weighted.

I hope this draws a clearer picture.


3 Comments

I found the text in the paper! Thanks! Adding it here for reference: "Model Architecture BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.1 Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”"
I have a follow-up: how is the classifier setup when there are multiple words masked? Is the same linear layer used for all the masked words? Or are there "parallel" linear layers which each classify one masked word?
Please ask your new question... well... as a new question ;) This will improve usability for the community. Feel free to post the link to that question in the comments to this answer.
