Tokens

Before AI can understand a single word, it breaks everything into tiny pieces.

Imagine reading a book where you can only see tiny chunks at a time: a word, part of a word, or even just punctuation. Each chunk is a token. Common words are usually a single token, while rarer or longer words get split into smaller pieces.
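
To make that splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer over a tiny made-up vocabulary. Real tokenizers (byte-pair encoding and the like) learn their vocabulary from huge amounts of text; the pieces below are purely illustrative.

```python
# A toy greedy longest-match tokenizer. The vocabulary below is hypothetical;
# real models learn theirs from huge text corpora (e.g. via byte-pair encoding).
VOCAB = {"the", " cat", " sat", " on", " the", " mat", "token", "iz", "ation"}

def tokenize(text: str) -> list[str]:
    """Split text into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: keep it as its own token (real tokenizers
            # fall back to bytes or a special "unknown" token here).
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("the cat sat on the mat"))  # each common word (plus its leading space) -> one token
print(tokenize("tokenization"))            # rarer word -> ['token', 'iz', 'ation']
```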

The tokenizer's job is to convert text into token IDs that the model can process. At its core, a tokenizer is just a pre-built vocabulary of text pieces, each mapped to a unique number.

For example, "The cat sat on the mat" becomes six tokens, each with its own ID:

The → 976, cat → 9059, sat → 10139, on → 402, the → 290, mat → 2450

That is 6 tokens covering 22 characters, or roughly 3.7 characters per token.
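
Under the hood, that vocabulary is little more than a lookup table from text pieces to numbers. The sketch below reuses the IDs from the example above; real tokenizers have their own vocabularies, so the exact pieces and numbers will differ.

```python
# A hypothetical slice of a tokenizer's vocabulary: each text piece has a fixed ID.
# These IDs mirror the example above; a real vocabulary holds tens of thousands of entries.
VOCAB_IDS = {"The": 976, " cat": 9059, " sat": 10139, " on": 402, " the": 290, " mat": 2450}

def encode(pieces: list[str]) -> list[int]:
    """Look up the ID of each already-split piece."""
    return [VOCAB_IDS[piece] for piece in pieces]

pieces = ["The", " cat", " sat", " on", " the", " mat"]
ids = encode(pieces)
text = "".join(pieces)

print(ids)                                                     # [976, 9059, 10139, 402, 290, 2450]
print(len(ids), "tokens,", len(text), "characters")            # 6 tokens, 22 characters
print(round(len(text) / len(ids), 1), "characters per token")  # 3.7 characters per token
```
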
Real-world analogy
Think of tokens like LEGO bricks. You don't build with pre-made structures; you break everything down into individual bricks first. Tokens are the AI's bricks. Every AI has a different set of bricks (its vocabulary) and rules for how they can be combined (tokenization). The model learns to build complex ideas by stacking and arranging these token bricks in different ways.
Jargon decoded

Vocabulary - all tokens the AI knows

Token ID - the unique number assigned to a token in the vocabulary

Tokenization - splitting text into tokens before any processing happens

Context window - the maximum number of tokens an AI can process at once
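
Because the context window is measured in tokens, counting them matters in practice. Here is a minimal sketch of checking a sequence against a hypothetical 8-token window; real models allow thousands to hundreds of thousands of tokens, but the idea is the same.

```python
# Hypothetical 8-token context window, chosen only to keep the example small.
CONTEXT_WINDOW = 8

def fits_in_context(token_ids: list[int], limit: int = CONTEXT_WINDOW) -> bool:
    """True if the whole sequence can be processed in a single pass."""
    return len(token_ids) <= limit

def truncate_to_context(token_ids: list[int], limit: int = CONTEXT_WINDOW) -> list[int]:
    """Keep only the most recent tokens that still fit in the window."""
    return token_ids[-limit:]

ids = [976, 9059, 10139, 402, 290, 2450, 17, 345, 678, 901]  # 10 illustrative token IDs
print(fits_in_context(ids))      # False: 10 tokens don't fit in an 8-token window
print(truncate_to_context(ids))  # only the last 8 IDs survive
```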

Your message is now a sequence of numbered tokens. But the model has no idea what any of those numbers actually mean; token 1234 is just token 1234. Before anything else can happen, every token needs to be converted into something richer: a list of numbers that carries real meaning. That is what embeddings do.