The typical base class you use when working with a tokenizer is `PreTrainedTokenizerBase`. The main method for tokenizers is `__call__`, which is the "method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences." It makes a call to `_call_one`, which calls `batch_encode_plus` or `encode_plus`, which in turn call `_batch_encode_plus` or `_encode_plus` respectively. These methods are not implemented in `PreTrainedTokenizerBase`; they are implemented in the subclasses. For instance, `PreTrainedTokenizerFast` implements `_batch_encode_plus` and makes use of `_batch_encode_plus` in its implementation of `_encode_plus`.
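The dispatch chain above can be sketched with a toy class hierarchy. The method names mirror the real ones, but the bodies here are simplified stand-ins for illustration, not the actual HuggingFace implementation:

```python
# Simplified sketch of the tokenizer dispatch chain. The real methods
# take many more arguments and return BatchEncoding objects.

class ToyTokenizerBase:
    def __call__(self, text):
        return self._call_one(text)

    def _call_one(self, text):
        # A list of strings is treated as a batch; a single string is not.
        if isinstance(text, list):
            return self.batch_encode_plus(text)
        return self.encode_plus(text)

    def batch_encode_plus(self, batch):
        return self._batch_encode_plus(batch)

    def encode_plus(self, text):
        return self._encode_plus(text)

    # Not implemented in the base class -- subclasses must provide these.
    def _batch_encode_plus(self, batch):
        raise NotImplementedError

    def _encode_plus(self, text):
        raise NotImplementedError


class ToyTokenizerFast(ToyTokenizerBase):
    def _batch_encode_plus(self, batch):
        # Stand-in for real tokenization: split on whitespace.
        return [s.split() for s in batch]

    def _encode_plus(self, text):
        # Mirrors PreTrainedTokenizerFast: reuse _batch_encode_plus.
        return self._batch_encode_plus([text])[0]
```

Calling `ToyTokenizerFast()("hello world")` routes through `_call_one` → `encode_plus` → `_encode_plus`, while passing a list routes through the batch path instead.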
Both `_batch_encode_plus` and `_encode_plus` accept the following as arguments:

- `max_length`
- `padding_strategy`
- `truncation_strategy`
- etc.

However, you should not specify `padding_strategy` and `truncation_strategy` directly, as these are computed internally via `_get_padding_truncation_strategies`. Instead, you should specify `padding` and `truncation`, as seen in the arguments to `__call__`.
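The core of that mapping can be sketched as follows. This is a simplified re-implementation for illustration; the real `_get_padding_truncation_strategies` also handles deprecated arguments, `pad_to_multiple_of`, and warnings about incompatible combinations:

```python
def get_padding_truncation_strategies(padding=False, truncation=False):
    """Map the user-facing padding/truncation values to internal
    strategy names (simplified sketch, not the library's code)."""
    if padding is True:
        padding_strategy = "longest"       # pad to the longest sequence in the batch
    elif padding is False:
        padding_strategy = "do_not_pad"
    else:
        padding_strategy = padding         # e.g. "max_length"

    if truncation is True:
        truncation_strategy = "longest_first"
    elif truncation is False:
        truncation_strategy = "do_not_truncate"
    else:
        truncation_strategy = truncation   # e.g. "only_second"

    return padding_strategy, truncation_strategy
```

So `padding=True, truncation=True` resolves to `("longest", "longest_first")`, which is why a plain boolean is usually all you need at call time.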
Using a tokenizer
You can initialize a tokenizer with:

```python
from transformers import AutoTokenizer

model_id = 'google/flan-t5-large'
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
You can then use the tokenizer with:

```python
sentences = [
    "Tell me a joke about Lion.",
    "Tell me a joke about Lion that is funny.",
    "Tell me a joke about Dogs that is funny.",
    "Tell me a joke about Lion that is funny."
]
encoded = tokenizer(sentences, truncation=True, padding=True, max_length=10, return_tensors="pt").to("mps")
```
The expected output, after decoding, looks like:

```
Tell me a joke about Lion.</s><pad>
Tell me a joke about Lion that is</s>
Tell me a joke about Dogs that</s>
Tell me a joke about Lion that is</s>
```
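What happened above can be illustrated on plain token lists. This is a hand-rolled toy, not the library's code: each sequence is cut to `max_length` with the last slot reserved for the `</s>` end-of-sequence token, and shorter sequences are then padded up to the longest one in the batch:

```python
def truncate_and_pad(batch, max_length, eos="</s>", pad="<pad>"):
    """Toy version of truncation=True, padding=True on token lists:
    truncate each sequence to max_length - 1, append eos, then pad
    every sequence to the longest length in the batch."""
    truncated = [seq[:max_length - 1] + [eos] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    return [seq + [pad] * (longest - len(seq)) for seq in truncated]
```

This is why the first sentence above ends in `</s><pad>`: it is shorter than the others, so after truncation it gets padded out to the batch's longest length.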
Note: You need to specify `truncation`, `padding`, `max_length`, and `return_tensors` when you call `tokenizer.__call__()`. You can also pass them as arguments to `__init__`, since HuggingFace allows passing arbitrary values, which are then stored as `self.init_kwargs`, but these are not used when executing `__call__()`.
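This store-but-ignore pattern can be illustrated with a toy class (a sketch of the pattern only, not HuggingFace's code):

```python
class ToyTokenizer:
    def __init__(self, **kwargs):
        # Unrecognized keyword arguments are stored for later use
        # (e.g. serialization), but __call__ does not consult them.
        self.init_kwargs = kwargs

    def __call__(self, text, max_length=None):
        # Only the explicit call-time argument is honored.
        tokens = text.split()
        return tokens[:max_length] if max_length is not None else tokens
```

Passing `max_length=2` at construction time changes nothing at call time; you must pass it to the call itself.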
Working with pairs of sequences
The tokenizer allows you to tokenize a sequence, a batch of sequences, a pair of sequences, or a batch of pairs of sequences.
Why is it common to work with pairs of sequences? In NLP you often need to feed two related texts to a model at once, for example a question and its context in question answering, or a premise and a hypothesis in natural language inference.
You can process a pair of sequences as follows:

```python
sentences = [
    "Tell me a joke about Lion.",
    "Tell me a joke about Lion that is funny.",
]
encoded = tokenizer(*sentences, truncation="only_second", padding=True, max_length=15, return_tensors="pt").to("mps")
```

Note that `*sentences` unpacks the list, so the two strings are passed as the `text` and `text_pair` arguments.
Which, when decoded, results in:

```
Tell me a joke about Lion.</s> Tell me a joke</s>
```
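A toy sketch of how the pair is combined and truncated with `only_second` (simplified; the real tokenizer works on subword ids and model-specific special tokens, but for T5 the layout is first + `</s>` + second + `</s>`):

```python
def encode_pair_only_second(first, second, max_length, eos="</s>"):
    """Join a pair as first + eos + second + eos, truncating only
    the second sequence until the total fits max_length."""
    budget = max_length - len(first) - 2   # reserve room for two eos tokens
    if budget < 0:
        raise ValueError("first sequence alone exceeds max_length")
    return first + [eos] + second[:budget] + [eos]
```

The first sequence is kept whole, and only the tail of the second sequence is dropped, which matches the decoded output above.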
You can process a batch of pairs of sequences as follows:

```python
first_sentences = ["Hello, I'm a science student.", "I love studying planets."]
second_sentences = ["I'm applying for a Ph.D. in Astronomy.", "Saturn is my favorite."]

# Tokenizing a batch of sentence pairs
encoded = tokenizer(first_sentences, second_sentences, padding=True, truncation=True, return_tensors="pt")
```
which results in:

```
Hello, I'm a science student.</s> I'm applying for a Ph.D. in Astronomy.</s>
I love studying planets.</s> Saturn is my favorite.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
```
Truncation strategies for pairs of sequences:
The `truncation` argument controls truncation. It can be a boolean or a string. When dealing with pairs of sequences you can handle truncation in special ways:

- `True` or `'longest_first'`: truncate to a maximum length specified by the `max_length` argument, or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will truncate token by token, removing a token from the longest sequence in the pair until the proper length is reached.
- `'only_second'`: truncate to a maximum length specified by the `max_length` argument, or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the second sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- `'only_first'`: truncate to a maximum length specified by the `max_length` argument, or the maximum length accepted by the model if no `max_length` is provided (`max_length=None`). This will only truncate the first sentence of a pair if a pair of sequences (or a batch of pairs of sequences) is provided.
- `False` or `'do_not_truncate'`: no truncation is applied. This is the default behavior.

Source