
I was reading the BERT paper and am not clear on the inputs to the transformer encoder and decoder.

For learning masked language model (Cloze task), the paper says that 15% of the tokens are masked and the network is trained to predict the masked tokens. Since this is the case, what are the inputs to the transformer encoder and decoder?

BERT input representation (from the paper)

Is the input to the transformer encoder this input representation (see image above)? If so, what is the decoder input?

Further, how is the output loss computed? Is a softmax taken only at the masked positions? And if so, is the same linear layer used for all masked tokens?

1 Answer


Ah, but you see, BERT does not include a Transformer decoder. It is only the encoder part, with a classifier added on top.

For masked word prediction, the classifier acts as a decoder of sorts, trying to reconstruct the true identities of the masked words. Non-masked tokens are not included in the classification task and do not affect the loss.
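To make this concrete, here is a minimal pure-Python sketch (not BERT's actual implementation) of computing cross-entropy loss only at the masked positions, with one shared linear projection applied at every masked position. The names `masked_lm_loss`, `hidden_states`, etc. are my own, chosen for illustration:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def masked_lm_loss(hidden_states, mask_positions, true_ids, W, b):
    """Cross-entropy loss computed ONLY at the masked positions.

    The same projection (W, b) is applied at every masked position,
    i.e. a single shared linear layer maps each hidden vector to
    vocabulary logits; non-masked positions contribute nothing.
    """
    total = 0.0
    for pos in mask_positions:
        h = hidden_states[pos]
        # shared linear layer: logits[v] = h . W[v] + b[v]
        logits = [sum(hi * wi for hi, wi in zip(h, W[v])) + b[v]
                  for v in range(len(W))]
        probs = softmax(logits)
        total -= math.log(probs[true_ids[pos]])
    return total / len(mask_positions)
```

In real implementations this is usually done by feeding all positions through the classifier and masking out the loss at non-masked positions (e.g. with an ignore index), but the effect is the same: gradients flow only from the masked tokens.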

BERT is also trained on predicting whether, given a pair of sentences, the second really does follow the first (next sentence prediction).
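The paper constructs these pairs so that half are genuine consecutive sentences and half have a random second sentence. A rough sketch of that data construction (my own illustrative code, not from the BERT repo):

```python
import random

def make_nsp_pairs(sentences, rng=None):
    """Build next-sentence-prediction examples: for each sentence,
    keep its true successor 50% of the time (label "IsNext") and
    substitute a randomly drawn sentence otherwise (label "NotNext").

    Note: a production pipeline would also avoid accidentally
    drawing the true successor as the "random" sentence.
    """
    rng = rng or random.Random(0)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs
```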

I do not remember how the two losses are weighted.

I hope this draws a clearer picture.


3 Comments

I found the text in the paper! Thanks! Adding it here for reference: "Model Architecture BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017) and released in the tensor2tensor library.1 Because the use of Transformers has become common and our implementation is almost identical to the original, we will omit an exhaustive background description of the model architecture and refer readers to Vaswani et al. (2017) as well as excellent guides such as “The Annotated Transformer.”"
I have a follow-up: how is the classifier setup when there are multiple words masked? Is the same linear layer used for all the masked words? Or are there "parallel" linear layers which each classify one masked word?
Please ask your new question... well... as a new question ;) This will improve usability for the community. Feel free to post the link to that question in the comments to this answer.
