I’m just starting to explore the Hugging Face library and have a question related to Text2Text models.
Suppose I have `model1` (a Text2Text model, e.g. BART) pre-trained on a masked language modeling task, so it has learned syntactic structure under the tokenization strategy of `tokenizer1`.
Now I want to fine-tune `model1` on the same style of masked-language-modeling input, but decode the outputs into a different format using a separate tokenizer (`tokenizer2`).
Is this possible? The approach I had in mind involves sequential text generation:
- The original `model1` generates text.
- A fine-tuned `model2` continues the generation based on `model1`'s output.
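To make the question concrete, here is a minimal sketch of the two-stage idea I have in mind: `model1` generates intermediate text, and `model2` (fine-tuned, paired with `tokenizer2`) re-encodes that text and continues generation. The checkpoint names in the commented wiring are placeholders, not a confirmed recipe.

```python
def generate_with(model, tokenizer, text, **gen_kwargs):
    """Encode `text`, run `model.generate`, decode with the same tokenizer."""
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, **gen_kwargs)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

def chain(text, model1, tokenizer1, model2, tokenizer2):
    # Stage 1: pre-trained model1 produces intermediate text via tokenizer1.
    intermediate = generate_with(model1, tokenizer1, text)
    # Stage 2: fine-tuned model2 re-encodes that text with tokenizer2 and
    # generates the final output in the target format.
    return generate_with(model2, tokenizer2, intermediate)

# Example wiring (hypothetical checkpoints/paths):
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# model1 = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
# tokenizer1 = AutoTokenizer.from_pretrained("facebook/bart-base")
# model2 = AutoModelForSeq2SeqLM.from_pretrained("path/to/fine-tuned-model2")
# tokenizer2 = AutoTokenizer.from_pretrained("path/to/tokenizer2")
# print(chain("some masked input ...", model1, tokenizer1, model2, tokenizer2))
```

Does this kind of chaining make sense, or is there a more idiomatic way to decode into a different format with a second tokenizer?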
Apologies if this is something trivial. Any comment or suggestion on specific tutorials is really appreciated!