
In the Tokenizer documentation from Hugging Face, the __call__ function accepts List[List[str]] and says:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

Things run normally if I run:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

but if I try to emulate batches of sentences:

 test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 test = [test, test]
 tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
 tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")

I get:

Traceback (most recent call last):
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/modify_scores.py", line 53, in <module>
    tokenized_test = tokenizer(text=test, padding="max_length", is_split_into_words=False, truncation=True, return_tensors="pt")
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2548, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2634, in _call_one
    return self.batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2825, in batch_encode_plus
    return self._batch_encode_plus(
  File "/Users/lucazeve/Coding/WxCC_Sentiment_Analysis/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 428, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Is the documentation wrong? I just need a way to tokenize and predict using batches; it shouldn't be that hard.

Is it something to do with the is_split_into_words argument?


Contextualizing

I will feed the tokenized output into a sentiment scoring model (the one defined in the code snippets). I am facing OOM problems during prediction, so I need to feed the data to the model in batches.

The documentation (quoted above) states that I can feed List[List[str]] to the tokenizer, which does not seem to be the case. The question remains the same: how do I tokenize batches of sentences?

Note: the tokenization itself doesn't have to happen in batches (although that would yield batches of input_ids/attention_masks and solve my problem), since I could then run the model on one batch at a time like this:

with torch.no_grad():
    logits = model(**tokenized_test).logits

  • Please add to the question a little more detail on the inputs and the expected outputs. If I'm understanding it correctly, you are trying to run the tokenizer on a list of strings, and inside the list of strings there are some multi-word expressions? Commented Jun 7, 2023 at 11:55
  • No, I don't know why you are assuming so many things and changing my question so many times. The question is clear: I need to tokenize my dataset (a collection of sentences) into batches. That is all. Please stop changing my question. Commented Jun 7, 2023 at 13:23
  • What is the NLP task you're working on? Which model are you using eventually? And is it for classification/similarity? What does your data look like before feeding it to the tokenizer? Having that information will help us help you better. Commented Jun 7, 2023 at 13:23
  • It's because the question is ambiguous and I'm trying to get more clarification, otherwise we'll all be guessing. Please fill in the information asked in the comment above. Commented Jun 7, 2023 at 13:23
  • 1. Different tokenizers work differently in Hugging Face (unlike non-pretrained NLP models). 2. The task you are working on determines how the tokenizer function works. 3. Not having an example of the input and expected output will not help us help you. Commented Jun 7, 2023 at 13:26

2 Answers


How to tokenize a list of sentences?

If it's just tokenizing a list of sentences, do this:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]
 
tokenizer(test)

It does the batching automatically:

{'input_ids': [
 [101, 7592, 2023, 2003, 1037, 3231, 102], [101, 2008, 21743, 1037, 2862, 1997, 11746, 102], 
 [101, 2046, 1037, 2862, 1997, 2862, 1997, 11746, 102], 
 [101, 1999, 2344, 2000, 7861, 9869, 1010, 1999, 2023, 2553, 1010, 2048, 14108, 2229, 1997, 1996, 2168, 18798, 13900, 102], 
 [101, 2000, 2022, 19204, 3550, 2011, 1996, 1044, 2546, 19204, 17629, 2005, 1996, 4225, 2944, 102]], 

'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

How to use it with the AutoModelForSequenceClassification?

And to use it with AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english'), do this:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

test = ["hello this is a test", "that transforms a list of sentences", "into a list of list of sentences", "in order to emulate, in this case, two batches of the same lenght", "to be tokenized by the hf tokenizer for the defined model"]

model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

[out]:

SequenceClassifierOutput(loss=None, logits=tensor([[ 1.5094, -1.2056],
        [-3.4114,  3.5229],
        [ 1.8835, -1.6886],
        [ 3.0780, -2.5745],
        [ 2.5383, -2.1984]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
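If you need probabilities and label names instead of raw logits, a small follow-up sketch (assuming the model and tokenizer loaded above) could be:

import torch

with torch.no_grad():
    outputs = model(**tokenizer(test, return_tensors='pt', padding=True, truncation=True))

# turn the logits into probabilities and map the argmax onto the model's label names
probs = torch.softmax(outputs.logits, dim=-1)
labels = [model.config.id2label[i] for i in probs.argmax(dim=-1).tolist()]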

How to use the distilbert-base-uncased-finetuned-sst-2-english model for sentiment classification?

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']
 
classifier(text)

[out]:

[{'label': 'NEGATIVE', 'score': 0.9379092454910278},
 {'label': 'POSITIVE', 'score': 0.9990271329879761},
 {'label': 'NEGATIVE', 'score': 0.9726701378822327},
 {'label': 'NEGATIVE', 'score': 0.9965035915374756},
 {'label': 'NEGATIVE', 'score': 0.9913086891174316}]

What happens when I have OOM issues with the GPU?

If it's the distilbert-base-uncased-finetuned-sst-2-english model, you can just use the CPU; with that you won't face many OOM issues.
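For example, a pipeline can be pinned to the CPU explicitly via the device argument (a minimal sketch; device=-1 selects the CPU):

from transformers import pipeline

# device=-1 keeps the whole pipeline on CPU, so no GPU memory is involved
classifier = pipeline('sentiment-analysis',
                      model='distilbert-base-uncased-finetuned-sst-2-english',
                      device=-1)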

If you need to use a GPU, consider using pipeline(...) for inference; it comes with a batch_size option, e.g.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)


text = ['hello this is a test',
 'that transforms a list of sentences',
 'into a list of list of sentences',
 'in order to emulate, in this case, two batches of the same lenght',
 'to be tokenized by the hf tokenizer for the defined model']

classifier(text, batch_size=2, truncation="only_first")

When you face OOM issues, it is usually not the tokenizer causing the problem, unless you loaded the entire dataset onto the device.

If it is just the model not being able to predict when you feed in the large dataset, consider using pipeline instead of model(**tokenizer(text)).

Take a look at https://huggingface.co/docs/transformers/main_classes/pipelines#pipeline-batching
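If you would rather keep the plain model(**tokenizer(...)) call instead of a pipeline, a minimal manual-batching sketch (just an illustration, assuming the test list, tokenizer, and model defined earlier; the batch size is arbitrary) could look like this:

import torch

batch_size = 2
all_logits = []

model.eval()
with torch.no_grad():
    for i in range(0, len(test), batch_size):
        batch = test[i:i + batch_size]  # a slice of the list of sentences
        enc = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        all_logits.append(model(**enc).logits)

logits = torch.cat(all_logits)  # shape: (num_sentences, num_labels)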


If the question is regarding the is_split_into_words argument, then from the docs:

text (str, List[str], List[List[str]], optional) — The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

And from the code

if is_split_into_words:
    is_batched = isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))
else:
    is_batched = isinstance(text, (list, tuple))

And if we try that to see whether your input is_batched:

text = ["hello", "this", "is a test"]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

False

But when you wrap the tokens in a list,

text = [["hello", "this", "is a test"]]
isinstance(text, (list, tuple)) and text and isinstance(text[0], (list, tuple))

[out]:

True

Therefore, using the tokenizer with is_split_into_words=True to get batch processing working properly would look something like this:

from transformers import AutoTokenizer
from sacremoses import MosesTokenizer

moses = MosesTokenizer()
sentences = ["this is a test", "hello world"]
pretokenized_sents = [moses.tokenize(s) for s in sentences]

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')

tokenizer(
  text=pretokenized_sents, 
  padding="max_length", 
  is_split_into_words=True, 
  truncation=True, 
  return_tensors="pt"
)

[out]:

{'input_ids': tensor([[ 101, 2023, 2003,  ...,    0,    0,    0],
        [ 101, 7592, 2088,  ...,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}

Note: the is_split_into_words argument is not there to process batches of sentences; it is used to specify that your input to the tokenizer is already pre-tokenized.
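As a small illustration of what the pre-tokenized input buys you (assuming the tokenizer and pretokenized_sents from the snippet above): with a fast tokenizer you can map every subword back to the word it came from via word_ids():

enc = tokenizer(text=pretokenized_sents, is_split_into_words=True)

# word_ids() returns, for each subword token, the index of the original word (None for special tokens)
print(enc.tokens(0))    # ['[CLS]', 'this', 'is', 'a', 'test', '[SEP]']
print(enc.word_ids(0))  # [None, 0, 1, 2, 3, None]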


Comments

This argument is used for when you have already pre-tokenized the input. Thus, your input is split into WORDS. Mine isn't. My input is split into sentences, not into words. The value for the argument should be FALSE. You don't have batches in your solution. You basically modify the code to fit a solution that doesn't suit my problem. I need to do batch processing, not just a justification to use the is_split_into_words argument.
Hmmm, I think it'll work too =) Let me edit my answer. BTW, be nice, we're all volunteers trying to help others with our answers.
So you are saying that I need to tokenize twice to do batch processing?
Nope, just once, but it depends on what you want to achieve. If you already pre-tokenized the input, is_split_into_words actually helps you to stitch it back together, not to "activate the batching".
I will edit the question

Use pipelines, but there is a catch.

Because the pipeline wraps all the processing steps, you need to pass the args for each one of them, when needed. For the tokenizer, we define:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline, pipeline

selected_model = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(selected_model)
tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512}
 

The model is straightforward:

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Then finally:

classifier = pipeline("text-classification", model=model, batch_size=32, tokenizer=tokenizer)
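For reference (assuming the classifier above and a list of sentences named text), the unmodified pipeline would return label/score dictionaries rather than logits:

results = classifier(text, **tokenizer_kwargs)
# a list of dicts like {'label': ..., 'score': ...}, one per input sentence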

Specific to my application:

Since I need the logits and not the predicted classes, I will have to modify the pipeline class. The documentation says that in order to create a custom pipeline class, I need to implement four mandatory methods: preprocess, _forward, postprocess, and _sanitize_parameters... OR I can override the postprocess method of TextClassificationPipeline:

class MyPipeline(TextClassificationPipeline):
    def postprocess(self, model_outputs):
        # return the raw logits for each example instead of the default label/score dict
        return model_outputs["logits"][0]

and modify the call:

classifier = pipeline("text-classification", model=model, batch_size=32, tokenizer=tokenizer, pipeline_class=MyPipeline)

logits = classifier(text, **tokenizer_kwargs)
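A small usage note, still assuming the code above: with this postprocess, the pipeline returns one logits tensor per input sentence, so you can stack them back into a single tensor if needed:

import torch

logits = torch.stack(logits)  # shape: (num_sentences, num_labels)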

