I'm trying to figure out how BERT preprocessing works, but I can't find a good explanation. If anybody knows one, I would appreciate a link to a thorough, in-depth explanation. If someone wants to explain it here instead, I would be equally thankful!

My question is: how does BERT mathematically convert a string input into a fixed-size vector of numbers? What are the logical steps it follows?

1 Answer

BERT provides its own tokenizer. Because BERT is a pretrained model that expects input data in a specific format, the following are required:

  • A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
  • A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
  • Tokens that conform with the fixed vocabulary used in BERT
  • The Token IDs for the tokens, from BERT’s tokenizer
  • Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
  • Segment IDs used to distinguish different sentences
  • Positional Embeddings used to show token position within the sequence

from transformers import BertTokenizer

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# An example sentence 
text = "Sentence to embed"

# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"

# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)

# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text) 
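
The snippet above covers the special tokens and token IDs. For the mask IDs and segment IDs from the list, recent versions of the transformers tokenizer can produce everything in a single call. A minimal sketch (the second sentence and the max_length of 16 are arbitrary choices for illustration):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# One call inserts [CLS] and [SEP], pads to a fixed length, and
# returns the token IDs, mask IDs, and segment IDs listed above.
encoded = tokenizer(
    "Sentence to embed",          # sentence A
    "A second sentence",          # optional sentence B
    padding='max_length',         # pad with [PAD] up to max_length
    max_length=16,                # arbitrary fixed length for this example
    truncation=True,
)

print(encoded['input_ids'])       # vocabulary indices, including special tokens
print(encoded['attention_mask'])  # mask IDs: 1 for real tokens, 0 for padding
print(encoded['token_type_ids'])  # segment IDs: 0 for sentence A, 1 for sentence B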

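To answer the fixed-size vector part of the question: inside the model, each token ID indexes a row of an embedding table, and the word, position, and token-type embeddings are summed before passing through the transformer layers, so every token comes out as a 768-dimensional vector (for bert-base models). A rough sketch of where this happens, assuming a recent version of transformers (the attribute names follow the Hugging Face implementation):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

inputs = tokenizer("Sentence to embed", return_tensors='pt')

# BERT's embedding layer sums three lookups per token:
emb = model.embeddings
print(emb.word_embeddings)        # 30522 x 768: one row per vocabulary entry
print(emb.position_embeddings)    # 512 x 768: one row per position in the sequence
print(emb.token_type_embeddings)  # 2 x 768: one row per segment

# The full model then yields one 768-dimensional vector per token.
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
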
Have a look at this excellent tutorial for more details.
