Action guide

How LLMs Tokenize Text

Billing and limits disagree with your word count. Rare names and compounds split into far more tokens than they have words. Tokenization hits cost and quality like a tax you did not see coming.

Get the full guide

Free newsletter unlocks the full guide and subscriber links. Same library working engineers use. No pedigree bingo.

Free. No spam. Unsubscribe anytime.

Why subscribe

Pricing says tokens; intuition says words. Rare strings chew through budget and quality, so you need the tokenizer story before you argue about context limits.

For: Engineers sizing prompts, retrieval, and budget who must explain surprises without sounding like a textbook.

  • A concrete mental model of subword behavior
  • Why rare strings explode length and hurt outputs
  • Better budgeting conversations with finance and PM
  • Examples that break naïve splitting
  • Rules of thumb for edge cases in prod prompts
  • Hooks into latency and cost reasoning
  • Links from tokenizer quirks to failures users actually see

Figure: word-level tokenization examples with colored word boxes, a hard sentence with hyphens and apostrophes, and panels on vocabulary size and the OOV problem.

What you’ll learn

What whitespace (word-level) tokenization pretends to solve, where it breaks (hyphenated and possessive words, rarer surface forms), why a growing word list does not scale, and what out-of-vocabulary (OOV) means: the model never saw that surface string as a single word during training.
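
To make the OOV problem concrete, here is a minimal sketch of naive whitespace tokenization against a tiny fixed word list. The vocabulary, the `<unk>` placeholder, and the function names are assumptions made up for illustration; this is not how any production LLM tokenizer works.

```python
def word_tokenize(text):
    # Naive word-level tokenization: lowercase, then split on whitespace only.
    return text.lower().split()


def encode(tokens, vocab, unk="<unk>"):
    # Any token missing from the fixed word list collapses to one placeholder.
    return [tok if tok in vocab else unk for tok in tokens]


# Hypothetical toy vocabulary. Note that it contains "state", "of", "art",
# and "user", but not the hyphenated or possessive surface forms.
vocab = {"the", "model", "reads", "text", "state", "of", "art", "user", "prompt"}

tokens = word_tokenize("The state-of-the-art model reads the user's prompt")
encoded = encode(tokens, vocab)

print(tokens)
print(encoded)
```

Both "state-of-the-art" and "user's" come out as `<unk>` even though every piece of them is in the word list. Adding each such surface form as its own entry is exactly the growing word list that does not scale.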

When you subscribe to the newsletter, you get access to the full online guide alongside course and issue updates.

Explore the other action guides

Each guide kills one sharp problem. You leave with steps you can type, not inspiration quotes.

Unlock the library

Free subscription. Full guide access. Future drops included. Same files I email to people who ship.
