I was reading the BERT paper and am unclear about the inputs to the transformer encoder and decoder.
For learning the masked language model (the Cloze task), the paper says that 15% of the tokens are masked and the network is trained to predict the masked tokens. Given this setup, what are the inputs to the transformer encoder and decoder?
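To make the setup concrete, here is my rough understanding of the masking step as the paper describes it (select ~15% of positions, then replace with `[MASK]` 80% of the time, a random token 10%, and leave unchanged 10%). The tiny vocabulary and token names here are made up for illustration:

```python
import random

random.seed(1)
MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Corrupt a token sequence for masked-LM training.

    Returns the corrupted sequence plus a dict mapping each selected
    position to its original token (the prediction target).
    """
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok  # model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK                  # 80%: [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(VOCAB)  # 10%: random token
            # else 10%: keep the original token unchanged
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens)
```

If this is right, the corrupted sequence (plus position/segment embeddings) is what the encoder sees, and only the positions in `targets` contribute to the loss.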
Is the input to the transformer encoder this input representation (see image above)? If so, what is the decoder input?
Further, how is the output loss computed? Is it a softmax over only the masked positions? And if so, is the same linear layer used for all masked tokens?
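Here is a sketch of what I think the loss looks like: one shared output projection from the hidden size to the vocabulary, with cross-entropy computed only at the masked positions. The sizes and random weights are placeholders, not the paper's values:

```python
import math
import random

random.seed(0)
VOCAB_SIZE, HIDDEN = 6, 4  # toy sizes, not BERT's

# One shared projection (hidden -> vocab), reused at every masked position.
W = [[random.uniform(-0.1, 0.1) for _ in range(VOCAB_SIZE)]
     for _ in range(HIDDEN)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mlm_loss(hidden_states, labels, masked_positions):
    """Mean cross-entropy over the masked positions only."""
    total = 0.0
    for pos in masked_positions:
        h = hidden_states[pos]
        # Project the encoder output at this position to vocabulary logits.
        logits = [sum(h[i] * W[i][v] for i in range(HIDDEN))
                  for v in range(VOCAB_SIZE)]
        probs = softmax(logits)
        total += -math.log(probs[labels[pos]])
    return total / len(masked_positions)

# Fake encoder outputs for a 5-token sequence; positions 1 and 3 were masked.
hidden_states = [[random.uniform(-1, 1) for _ in range(HIDDEN)]
                 for _ in range(5)]
labels = [0, 3, 2, 5, 1]  # original token ids
loss = mlm_loss(hidden_states, labels, masked_positions=[1, 3])
```

Is this the right picture, i.e. the unmasked positions simply contribute nothing to the loss?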
