I'm trying to denoise text using a T5 model, following the Hugging Face docs:
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
loss.item()
But I can't figure out how to get the actual text that corresponds to the masked input. The docs only show how to get the loss, and just mention that "the forward function automatically creates the correct decoder_input_ids".
I tried the following:
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits
tokenizer.batch_decode(logits.argmax(-1))
But the output doesn't make sense:
['<extra_id_0> park park<extra_id_1> the<extra_id_2> park']
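(My current guess is that passing labels makes the model run with teacher forcing, so argmax over the logits gives per-position predictions conditioned on the gold labels rather than a free-running generation, and that span filling should instead go through model.generate(). The decoded generate() output would then contain the fills sandwiched between <extra_id_N> sentinels. Here is a sketch of pulling the spans out of such a decoded string; the helper name and the sample string are mine, not from the docs:)

```python
import re

def extract_spans(decoded: str) -> dict:
    """Split a decoded T5 output on <extra_id_N> sentinel tokens.

    Returns {sentinel: filled_text}; any text before the first
    sentinel (e.g. a leading <pad>) is discarded.
    """
    # Split while keeping the sentinels, so parts alternates
    # prefix, sentinel, text, sentinel, text, ...
    parts = re.split(r"(<extra_id_\d+>)", decoded)
    spans = {}
    for i in range(1, len(parts) - 1, 2):
        spans[parts[i]] = parts[i + 1].strip()
    return spans

# A hand-written example of the kind of string that
# tokenizer.decode(model.generate(input_ids)[0]) would produce;
# the real model output will differ.
sample = "<pad> <extra_id_0> cute dog <extra_id_1> the <extra_id_2> park"
print(extract_spans(sample))
# → {'<extra_id_0>': 'cute dog', '<extra_id_1>': 'the', '<extra_id_2>': 'park'}
```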
I don't care about the loss, and I don't have labels in my setting anyway. I just have text with masked tokens that I need to fill:
my_masked_text = [
    "The kid went to the [MASK]",
    "The dog likes [MASK] and also [MASK]"
]
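(Since T5 wasn't pretrained with a [MASK] token but with numbered sentinel tokens, <extra_id_0> through <extra_id_99>, I assume I'd first need to rewrite each [MASK] as a sentinel, numbered per example. A sketch of that conversion; the function name is mine:)

```python
import re

def masks_to_sentinels(text: str) -> str:
    """Replace each [MASK] with T5's numbered sentinel tokens."""
    counter = iter(range(100))  # the T5 vocab defines <extra_id_0> .. <extra_id_99>
    return re.sub(r"\[MASK\]", lambda m: f"<extra_id_{next(counter)}>", text)

my_masked_text = [
    "The kid went to the [MASK]",
    "The dog likes [MASK] and also [MASK]",
]
converted = [masks_to_sentinels(t) for t in my_masked_text]
print(converted)
# → ['The kid went to the <extra_id_0>',
#    'The dog likes <extra_id_0> and also <extra_id_1>']
```

Each converted string could then be tokenized and passed to model.generate() as in the doc snippet above, but I'm not sure this is the intended approach.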