I'm trying to denoise text using a T5 model following the Huggingface doc:

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
loss.item()

But I can't figure out how to get the actual text that corresponds to the masked input. They only show how to get the loss and mention

the forward function automatically creates the correct decoder_input_ids

I tried the following:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits
tokenizer.batch_decode(logits.argmax(-1))

But the output doesn't make sense:

['<extra_id_0> park park<extra_id_1> the<extra_id_2> park']

I don't care about the loss, and I don't have labels in my setting. I just have text with masked tokens that I need to fill:

my_masked_text = [
"The kid went to the [MASK]",
"The dog likes [MASK] and also [MASK]"
]

1 Answer

In the docs for T5 (§ Inference), there is an example of what you're looking for.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)

The result, sequences, is the following:

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

This is to be interpreted as follows:

'<pad>'                   # Marks beginning of output sequence 
'<extra_id_0> park offers'# <- model prediction for first blank
'<extra_id_1> the'        # <- model prediction for second blank
'<extra_id_2> park.</s>'  # ignore (there was no third blank)

So the model filled in the blanks as

"The park offers walks in the park"

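If you want just the predicted fills rather than the raw decoded string, you can split the output on the sentinel tokens. A minimal sketch (the helper name extract_fills is my own, not from the docs), assuming the decoded string looks like the example above:

```python
import re

def extract_fills(decoded: str) -> list:
    """Return the predicted text for each blank, in sentinel order."""
    # Strip the special tokens that wrap the generated sequence.
    decoded = decoded.replace("<pad>", "").replace("</s>", "")
    # Split on <extra_id_N> sentinels; the chunk before <extra_id_0> is empty.
    parts = re.split(r"<extra_id_\d+>", decoded)
    return [p.strip() for p in parts[1:]]

print(extract_fills("<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>"))
# → ['park offers', 'the', 'park.']
```

You'd then take the first fill for the first blank, the second for the second, and ignore any trailing fills past the number of blanks in your input.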

For your examples, that means you'd do something like the following (I haven't tested this, but it should work modulo typos):

my_masked_text = [
  "The kid went to the <extra_id_0>.",
  "The dog likes <extra_id_0> and also <extra_id_1>."
]

inputs = tokenizer(
  my_masked_text,    # tokenizer will encode each string in your list
  padding="longest", # need to pad if encoded strings have different lengths
  return_tensors="pt", 
)

sequence_ids = model.generate(
  input_ids=inputs["input_ids"],
  attention_mask=inputs["attention_mask"]
)
sequences = tokenizer.batch_decode(sequence_ids)

sequences should then be a list of decoded predictions like the example above.
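Note that T5 expects <extra_id_0>, <extra_id_1>, … sentinels rather than [MASK], so your original strings need rewriting first. A small untested sketch of that conversion (the helper name to_t5_sentinels is my own):

```python
import re
from itertools import count

def to_t5_sentinels(text: str) -> str:
    """Replace each [MASK] with the next numbered T5 sentinel token."""
    counter = count()  # 0, 1, 2, ... per string
    return re.sub(r"\[MASK\]", lambda m: f"<extra_id_{next(counter)}>", text)

print(to_t5_sentinels("The dog likes [MASK] and also [MASK]"))
# → The dog likes <extra_id_0> and also <extra_id_1>
```

Numbering restarts for each string, which matches how the sentinels are used in the examples above.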
