I'm trying to denoise text using a T5 model following the Huggingface doc:

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids

# the forward function automatically creates the correct decoder_input_ids
loss = model(input_ids=input_ids, labels=labels).loss
loss.item()

But I can't figure out how to get the actual text that corresponds to the masked input. They only show how to get the loss and mention

the forward function automatically creates the correct decoder_input_ids

I tried the following:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits
tokenizer.batch_decode(logits.argmax(-1))

But the output doesn't make sense:

['<extra_id_0> park park<extra_id_1> the<extra_id_2> park']

I don't care about the loss, and I don't have labels in my setting. I just have text with masked tokens that I need to fill:

my_masked_text = [
"The kid went to the [MASK]",
"The dog likes [MASK] and also [MASK]"
]

1 Answer

In the docs for T5 (§ Inference), there is an example of what you're looking for.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids

sequence_ids = model.generate(input_ids)
sequences = tokenizer.batch_decode(sequence_ids)

The result, sequences, is the following:

['<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>']

This is to be interpreted as follows:

'<pad>'                   # Marks beginning of output sequence 
'<extra_id_0> park offers'# <- model prediction for first blank
'<extra_id_1> the'        # <- model prediction for second blank
'<extra_id_2> park.</s>'  # ignore (there was no third blank)

So the model filled in the blanks as

"The park offers walks in the park"

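If you want just the predicted fills rather than the raw decoded string, you can split the output on the sentinel tokens. A minimal sketch (the helper name extract_fills is my own, not from the docs), assuming the decoded string looks like the example above:

```python
import re

def extract_fills(decoded: str) -> list:
    """Return the predicted text for each blank, in sentinel order."""
    # Strip the special tokens that wrap the generated sequence.
    decoded = decoded.replace("<pad>", "").replace("</s>", "")
    # Split on <extra_id_N> sentinels; the chunk before <extra_id_0> is empty.
    parts = re.split(r"<extra_id_\d+>", decoded)
    return [p.strip() for p in parts[1:]]

print(extract_fills("<pad><extra_id_0> park offers<extra_id_1> the<extra_id_2> park.</s>"))
# → ['park offers', 'the', 'park.']
```

You'd then take the first fill for the first blank, the second for the second, and ignore any trailing fills past the number of blanks in your input.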

For your examples, that means you'd do something like the following (I haven't tested this, but it should work modulo typos):

my_masked_text = [
  "The kid went to the <extra_id_0>.",
  "The dog likes <extra_id_0> and also <extra_id_1>."
]

inputs = tokenizer(
  my_masked_text,    # tokenizer will encode each string in your list
  padding="longest", # need to pad if encoded strings have different lengths
  return_tensors="pt", 
)

sequence_ids = model.generate(
  input_ids=inputs["input_ids"],
  attention_mask=inputs["attention_mask"]
)
sequences = tokenizer.batch_decode(sequence_ids)

sequences should then be a list of decoded predictions like the example above.
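Note that T5 expects <extra_id_0>, <extra_id_1>, … sentinels rather than [MASK], so your original strings need rewriting first. A small untested sketch of that conversion (the helper name to_t5_sentinels is my own):

```python
import re
from itertools import count

def to_t5_sentinels(text: str) -> str:
    """Replace each [MASK] with the next numbered T5 sentinel token."""
    counter = count()  # 0, 1, 2, ... per string
    return re.sub(r"\[MASK\]", lambda m: f"<extra_id_{next(counter)}>", text)

print(to_t5_sentinels("The dog likes [MASK] and also [MASK]"))
# → The dog likes <extra_id_0> and also <extra_id_1>
```

Numbering restarts for each string, which matches how the sentinels are used in the examples above.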
