LLM: What Is a Token?
In NLP (Natural Language Processing) and LLMs (Large Language Models), a token is a basic unit of text — but it’s not always the same as a word.
Think of a token as a building block that the model uses to understand and generate language.
How text gets split into tokens is determined by something called a tokenizer, and the way it splits can vary depending on the system or model.
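To make that idea concrete, here is a minimal sketch of what a tokenizer looks like as code: a function from text to a list of tokens. The whitespace splitter below is deliberately naive (real LLM tokenizers are far more sophisticated), but the interface has the same shape.

```python
# A deliberately naive tokenizer: split on whitespace.
# Real LLM tokenizers are much smarter, but the idea is the same:
# text goes in, a list of tokens comes out.
def whitespace_tokenize(text: str) -> list[str]:
    return text.split()

print(whitespace_tokenize("Tokens are building blocks."))
# ['Tokens', 'are', 'building', 'blocks.']  (note: punctuation sticks to the word)
```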
🔍 Tokens ≠ Words
Here’s the key idea:
Tokens can be whole words, parts of words, or even punctuation marks.
This is especially true for systems like OpenAI’s GPT models (which use Byte Pair Encoding) or Google’s T5 (which uses SentencePiece), both forms of what's called subword tokenization.
✅ Example: “unbelievable!”
Depending on the tokenizer, this might be split as:
un
believ
able
!
→ 4 tokens
Whereas a simpler, word-based tokenizer might split it as:
unbelievable
!
→ 2 tokens
And yes — the exclamation mark is its own token too!
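You can check splits like these yourself. The sketch below uses OpenAI’s open-source tiktoken library (an assumption here: that it is installed via pip install tiktoken; cl100k_base is just one of its encodings, and the exact subword pieces vary by encoding and model). The regex line mimics the simpler word-based tokenizer for comparison.

```python
import re

import tiktoken  # OpenAI's BPE tokenizer library (pip install tiktoken)

# Subword tokenization: cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("unbelievable!")
print([enc.decode([i]) for i in ids])  # subword pieces; exact split varies by encoding

# Word-based tokenization: whole words and punctuation marks as separate tokens.
print(re.findall(r"\w+|[^\w\s]", "unbelievable!"))  # ['unbelievable', '!']
```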
✨ Why So Granular?
Tokenizing this way helps models:
Understand unfamiliar or rare words by breaking them into known parts (demonstrated in the sketch after this list).
Handle different languages or spellings more effectively.
Compress text into consistent units that are easier to train on.
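Here’s a quick sketch of that first point (again assuming tiktoken is installed; the made-up word is purely illustrative, and the exact pieces you see depend on the encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Even a made-up word the tokenizer has never seen as a whole
# still encodes cleanly: it falls apart into familiar subword pieces,
# so the model never hits an "unknown word" dead end.
for word in ["cat", "unbelievable", "flibbertigibbetish"]:
    pieces = [enc.decode([i]) for i in enc.encode(word)]
    print(f"{word!r} -> {pieces}")
```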
💬 TL;DR
A token is a unit of text used by language models.
It might be a word, part of a word, or even punctuation.
This depends on how the text is broken down by the tokenizer, which is an essential part of how LLMs work.
You don’t need to know all the details of how tokenizers work, but it’s helpful to understand that a 'token' isn’t always the same as a word — it can be a smaller piece of language the system uses behind the scenes.