How artificial intelligence actually works, and how to build with it — from the very first idea to the business of AI and the frontier. In plain language.
This manual is the whole journey in one place — the same lessons, distilled into a reference you can keep. Read it in order: every part stands on the one before it, and the words in mono are defined in the Glossary at the back.
Before we start, one map. The word "AI" gets used for five different things, and they're nested inside each other like Russian dolls. When you use a tool like Claude, you're using all five at once — each one sitting inside the next.
How the machine actually thinks. Four ideas that, stacked together, explain every modern AI system — no math required.
For all of computing history, humans wrote the rules. Machine learning flipped it: show the machine examples, and it works out the rules itself.
Traditional software runs on rules a human wrote by hand — if this, then that. A programmer anticipates every case and spells out exactly what to do. That's perfect for taxes and spreadsheets, and it falls apart the moment a problem gets fuzzy and human.
Take a simple task: is there a dog in this photo? Try to write that as rules. "Has fur" — so does a cat. "Four legs" — so does a table. You could write rules for a century and never capture every dog. Machine learning takes the opposite path: you show it ten thousand photos labelled "dog" and ten thousand labelled "not dog," and it learns the pattern itself — the way a toddler learns "dog" from being shown dogs, not handed a definition.
A toddler learns "dog" by seeing dogs, never a dictionary definition. Same with the machine — pattern from examples, not rules from a programmer.
This is why AI can do the messy-human work old software never could: write, summarize, read a CRM, judge a tone. When a task is fuzzy, reach for learning-from-examples — not hand-written rules.
A neural network is a wall of billions of little knobs. It learns by being wrong, over and over, and nudging each knob a hair toward being less wrong.
Don't picture a brain — picture a giant mixing board with billions of knobs. Each knob is a weight. Information flows in one side, through all the knobs, and an answer comes out the other. At first the knobs are random, so the first answers are garbage. That's expected.
Then the magic: compare the answer to the right answer, measure how wrong it was (the error), and nudge every knob a hair in the direction that would have made it less wrong. That backward "assign blame, then adjust" step is backpropagation. Do it across millions of examples and the knobs settle into settings that just work. The slow walk toward less error is gradient descent — a hiker feeling downhill through fog, one step at a time.
Tuning a guitar by ear: pluck, hear how far off it is, turn the peg a touch, repeat. The "knowledge" lives in the final knob positions — not in any fact written down.
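The "measure the error, nudge the knob" loop can be sketched in a few lines. This is a minimal sketch, assuming a network with a single knob and four made-up examples that follow y = 3x. A real network does the same thing across billions of knobs, using backpropagation to work out each knob's share of the blame.

```python
# Toy "network" with a single knob (weight): prediction = weight * x.
# The made-up examples follow y = 3 * x, so the knob should settle near 3.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]


def train(examples, steps=200, learning_rate=0.01):
    weight = 0.0  # start from an arbitrary (wrong) setting
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        gradient = 0.0
        for x, target in examples:
            error = weight * x - target
            gradient += 2 * error * x
        gradient /= len(examples)
        # Nudge the knob a little in the direction that reduces the error.
        weight -= learning_rate * gradient
    return weight


learned_weight = train(examples)
print(round(learned_weight, 3))
```

Nothing in the code states that the answer is 3. The value emerges in the knob position after repeated small corrections.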
A language model does one thing: predict the next word. Run that in a loop and it writes anything — and to get good at it, it had to learn almost everything.
Generation is recognition, run in a loop. The model scores every possible next token (a word-piece), picks a likely one, sticks it on the end, feeds the whole thing back into itself, and predicts again. One word, add it, ask again, a few hundred times, and out comes an email, a story, a block of code. It is never "writing a sentence." It is only ever predicting the very next word.
Here's the deep part. To predict the next word well, the model is forced to absorb grammar, facts, reasoning, and tone. "The capital of France is ___" requires geography. The intelligence is a side-effect of getting superhuman at one simple game. It works in probabilities; the temperature dial sets how adventurous its pick is — and that same machinery is exactly why it can be confidently, fluently wrong (a hallucination).
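The predict, append, repeat loop looks roughly like this. It is a toy sketch, not a real language model: the hypothetical `NEXT_TOKEN_COUNTS` table stands in for the network, it looks only at the last word instead of the whole context, and it always takes the top pick (what a temperature of zero would do).

```python
# Hypothetical toy "model": for each token, counts of what followed it.
NEXT_TOKEN_COUNTS = {
    "<start>": {"the": 5, "a": 2},
    "the": {"dog": 4, "cat": 3},
    "a": {"dog": 1, "cat": 1},
    "dog": {"barks": 6, "sleeps": 2},
    "cat": {"sleeps": 5, "barks": 1},
    "barks": {"<end>": 1},
    "sleeps": {"<end>": 1},
}


def predict_next(tokens):
    """Return the most frequent follower of the last token (greedy choice)."""
    followers = NEXT_TOKEN_COUNTS[tokens[-1]]
    return max(followers, key=followers.get)


def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        next_token = predict_next(tokens)  # predict one token
        if next_token == "<end>":
            break
        tokens.append(next_token)  # append it, then ask again
    return " ".join(tokens[1:])


print(generate())  # the dog barks
```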
Phone autocomplete, scaled to a staggering degree. The difference is that getting truly good at autocomplete across the whole internet quietly requires understanding the world.
Because it can't not produce a next word, it will bluff when it doesn't know. Half of using AI well is knowing when to trust that fluency and when to verify it.
The 2017 idea that runs all of it: let every word look at every other word at once and decide what matters. That mechanism is the Transformer — the "T" in GPT.
Before 2017, models read text in order, one word at a time, passing a little summary along — and by forty words deep, the start was a blur. The paper "Attention Is All You Need" threw that out for one idea: self-attention. Let every word look at every other word in the passage simultaneously and decide which ones are relevant.
In "The trophy didn't fit in the suitcase because it was too big," attention is what lets "it" shine a spotlight across the sentence and land on "trophy," not "suitcase." Three things fell out of this and changed everything: it's parallel (train on GPUs at massive scale), it handles long-range connections, and it scales — bigger model + more data keeps getting smarter. That machine is the Transformer, sitting under Claude, GPT, and nearly every model you touch.
How to make it work for you. Three tools that turn understanding into leverage — and into things people pay for.
If the output is just the most likely continuation of the input, then whoever controls the input controls the output. Prompting isn't asking — it's setting the stage.
Picture the model as the greatest improv actor alive — infinite range, but total amnesia. Every time the curtain rises it remembers nothing; all it has is the script you hand it, and it instantly becomes that character and plays the scene forward. Your prompt is the script. That's why two people get wildly different results from the same model — one hands it a sticky note, the other a full screenplay.
Four levers follow directly: It's an amnesiac — if it's not on the page, it doesn't exist, so give it the context. Cast the role ("you are a senior estimator…") to aim it at the expert region of what it learned. Show, don't tell — paste one example of "great" and it matches the pattern instantly. Be specific — vagueness gets you the beige, average answer, because that's the safest next word.
A brilliant actor with amnesia. Hand him a thin note, you get generic improv. Hand him a rich script — who he is, the scene, one great line — and you get an Oscar performance.
An assistant that "sounds like you" is just a well-written system prompt: role + context + tone + an example. You're not asking the model — you're casting it.
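As a rough sketch, the four levers can be assembled into one system prompt by a small helper. The function name, the section labels, and the sample values are all hypothetical placeholders, and the layout is one reasonable choice among many.

```python
def build_system_prompt(role, context, tone, example):
    """Assemble role, context, tone, and one example into a system prompt."""
    sections = [
        f"Role: {role}",
        f"Context: {context}",
        f"Tone: {tone}",
        f"Example of a great answer:\n{example}",
    ]
    return "\n\n".join(sections)


# Placeholder values for illustration only.
prompt = build_system_prompt(
    role="You are a senior estimator.",
    context="The client wants a quote for resealing a driveway.",
    tone="Plain, friendly, no jargon.",
    example="Thanks for reaching out. Here is what the job involves.",
)
print(prompt)
```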
A model's knowledge is frozen, and it knows nothing private. RAG fixes that: turn its closed-book exam into an open-book one — slide the right page under its nose right before it answers.
Without help, the model answers from memory and bluffs when it's stumped. RAG — retrieval-augmented generation — changes that. You chop your documents into chunks, store them in a vector database that files them by meaning (using embeddings — meaning as coordinates), and for each question you retrieve the few most relevant chunks and drop them into the context before the model answers.
Now the answer comes from real, current, private text — so the model reads instead of recalls, hallucination drops, and it can cite its source. And because the knowledge lives outside the model, you update it by editing a document, never retraining.
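The chunk, store, retrieve, and drop-into-context steps can be sketched as follows. This is a minimal sketch under loud assumptions: `embed` counts words instead of using a learned embedding model, a plain list stands in for the vector database, and the three chunks are made-up sample text.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for a real embedding: word counts, not learned meaning."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Similarity between two word-count vectors (1.0 = same direction)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up document chunks, stored alongside their vectors.
chunks = [
    "Standard driveway sealing costs 2 dollars per square foot.",
    "Our office is closed on public holidays.",
    "Refunds are issued within 14 days of cancellation.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(question, top_k=1):
    """Return the top_k chunks closest to the question."""
    query = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question):
    """Place the retrieved chunks in the context ahead of the question."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}"


print(build_prompt("How much does driveway sealing cost?"))
```

Updating the knowledge means editing `chunks` and rebuilding `index`. Nothing about the model changes.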
A client chatbot that answers from their real pricing and policies — current, private, and honest — is RAG. It's a product you can build and charge for.
An agent is a brain, hands, and a loop. The model can only output text — so it outputs a request to use a tool; your program runs it for real and feeds the result back. Think, act, observe, repeat.
A model alone can only talk. An agent gives it hands: tools — a calendar, a CRM, an email, a web search. The trick is simple: the model outputs a special piece of text meaning "use the calendar tool now"; the program wrapped around it actually runs the calendar, then pastes the result back into the model's context. The model reads that and decides its next move. That loop — think, act, observe — run until the goal is met, is the agent.
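Here is a minimal sketch of that loop. Everything in it is hypothetical: `fake_model` is a scripted stand-in for a real model, `get_calendar` returns canned text instead of calling a calendar service, and the JSON shape of a tool request is an assumption, not any provider's actual format.

```python
import json


def get_calendar(day):
    """Hypothetical tool: a real one would call a calendar service."""
    return f"{day}: 9am site visit, 2pm client call"


TOOLS = {"get_calendar": get_calendar}


def fake_model(context):
    """Scripted stand-in for a model: asks for a tool once, then answers."""
    if not any(line.startswith("TOOL RESULT") for line in context):
        return json.dumps({"tool": "get_calendar", "args": {"day": "Monday"}})
    return json.dumps({"final": "You have two events on Monday."})


def run_agent(goal, max_steps=5):
    context = [f"GOAL: {goal}"]
    for _ in range(max_steps):
        reply = json.loads(fake_model(context))  # think
        if "final" in reply:
            return reply["final"], context
        result = TOOLS[reply["tool"]](**reply["args"])  # act
        context.append(f"TOOL RESULT: {result}")  # observe
    return "Stopped: step limit reached.", context


answer, transcript = run_agent("Summarize my Monday")
print(answer)
```

The model only ever produces text. The surrounding program is what runs the tool and pastes the result back, and the step limit keeps the loop from running forever.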
This turns a thing that talks into a thing that does: books the appointment, updates the record, sends the follow-up. The difference between a chatbot and a digital employee.
Áine's morning briefing is this loop: think "I need the calendar" → call it → read it → "now the ad numbers" → call them → then write the summary. A brain, using hands, in a loop.
Make it yours, and stay ahead. Customizing models, knowing when they're good, and reading where the whole field is going.
Fine-tuning is more training on your own examples. It changes the model's behavior and style — not its knowledge. Knowledge is RAG's job.
A model is born in two rounds: pretraining on the whole internet makes a general word-predictor, then a lighter round of fine-tuning shapes it into a helpful assistant. You can run that second round yourself: take a trained model and nudge its knobs further on a focused set of example pairs, until it performs a task in a consistent style automatically.
The make-or-break distinction: fine-tuning changes how it behaves, not what it knows. Want it to know today's pricing? That's RAG — facts change, and baked-in facts go stale. Want it to write in your exact voice, or sort every lead the same way, ten thousand times? That's fine-tuning. The smart order is an escalation ladder: prompt first (free, instant), add RAG if it needs knowledge, and fine-tune only when you need rock-solid behavior at scale.
Prompting is instructing a new hire for one task. RAG is handing them the company binder. Fine-tuning is the apprenticeship — practice until the skill is just in their hands.
Estimates that always sound like a 15-year veteran → fine-tune the behavior. The current prices in them → RAG the facts. Two tools, two jobs.
An eval is a systematic taste test: a fixed set of cases plus a way to grade them — so you know whether a change actually helped, instead of guessing on vibes.
The model's output is sampled, not fixed (remember temperature), so the same prompt can shine on Tuesday and flop on Wednesday. Judging it by a single output is tasting one spoonful of soup and declaring the whole pot perfect. An eval fixes that: build a test set of representative cases, and every time you change anything (the prompt, the model, the retrieval) run the whole set and score it. Now you can compare versions with numbers.
You grade three ways: hard checks (did it output valid JSON? the right category?), human rating (the gold standard for taste, but slow), and LLM-as-judge (another model grading the first against a rubric — fast and scalable). And the mindset that makes it powerful: define what "good" looks like before you build, so you have a target to measure against.
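A hard check is the easiest of the three graders to automate. The sketch below assumes a hypothetical lead-sorting task whose output should be a JSON object with a `category` field. The three sample outputs are invented to show one pass and two kinds of failure.

```python
import json


def check_output(output, expected_category):
    """Hard check: output must be a JSON object with the expected category."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and parsed.get("category") == expected_category


# (model output, expected category), invented for illustration.
results = [
    ('{"category": "hot"}', "hot"),  # passes
    ('{"category": "cold"}', "hot"),  # valid JSON, wrong category
    ('Sure! Here is the JSON: {"category": "hot"}', "hot"),  # not valid JSON
]

passed = sum(check_output(output, expected) for output, expected in results)
print(f"{passed} of {len(results)} passed")  # 1 of 3 passed
```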
"It answers correctly 98% of the time on our test set, and we'll know the moment that drops" — that sentence is the difference between a demo and a product someone pays for monthly.
Every "scary new breakthrough" is one of these same ideas, grown. The tools change every few months; the fundamentals barely move. That gap is your relevance moat.
Three directions the field is moving — and each is something you already understand, with a new limb. Multimodal: models that see, hear, and speak — same tokens, same Transformer, just pointed at pixels and sound. Reasoning models: trained to think on a scratchpad before answering — still next-word prediction, just given room to reason. More autonomous agents: the same think-act-observe loop, stretched longer and trusted further — from tool you operate to teammate you delegate to.
The lesson under all of it: you didn't learn this year's vocabulary, you learned the grammar of the field. Every future headline reads as "that's just attention plus scale," or "a sharper eval," or "an agent with longer reach." And as the machines get better at producing, the rare human things — judgment, taste, trust, asking the right question — grow more valuable, not less.
Mastery. The hands-on disciplines that separate playing with AI from building things people pay for.
The model's working memory is the context window — a finite workbench. The real skill isn't clever wording; it's curating exactly what goes on that bench each call.
Every time the model runs, the only thing it knows is the text in its context window, and from that it predicts the next word. So its entire behavior is decided by what's in the window. Most people think the skill is "writing a clever prompt." It isn't — it's deciding what to load into that window in the first place.
Picture the window as a workbench. Finite surface. Your job each call is to lay out exactly the right tools: who the model is, the specific facts it needs, the task, maybe one example. And here's the part that surprises people: more is not better. Dump your whole pantry on the bench and the model does worse, because irrelevant clutter dilutes its attention. A tight brief beats a giant dump. Even placement matters: models tend to attend most to the beginning and end, so material in the middle of a huge context can get lost.
A chef's prep station. Lay out exactly the right ingredients and they cook beautifully. Pile the whole pantry on and they fumble for the salt.
The model's only scratchpad is the text it writes. Make it reason step by step before the final answer, and hard problems get dramatically better.
Recall that the model writes one word at a time, feeding its own output back in as it goes. That means it has no hidden place to think — its only thinking space is the text it actually produces. So if you demand only the answer to a hard question, you force it to blurt: commit to the first word instantly, with no room to work. Ask it to reason out loud first, and those steps become context that guides it to a far better conclusion. You're handing it paper.
This is the same engine behind reasoning models — but you can summon it for free, with one instruction: "think it through step by step," "list the key factors, then recommend." It leans on good context (4.1) and it cuts hallucination, because the model has to earn the conclusion instead of leaping to it. The tradeoff: more words, more time — so use it where thinking actually matters, not for a simple lookup.
"What's 17 × 24?" — answer instantly, out loud, and most people fumble. Hand them paper to show their work and they nail it. The words the model writes are its paper.
Any real judgment call — which lead to chase, how to handle a delicate client reply — tell the model to reason it through before answering, and its judgment noticeably sharpens.
Control the form, not just the content. Hand the model a template to fill, and it stops being a chat box and becomes a reliable software component.
By default the model is chatty and unpredictable — three sentences today, a paragraph tomorrow, a "Sure! Here's what I found" on top. Fine for conversation, useless when a program or a brand standard needs an exact, repeatable form. The fix: tell it precisely what shape you want, and better yet, hand it the shape to fill. It works because the model just follows the groove you set — the most natural continuation is to fill your format.
Three moves make it airtight: be explicit ("exactly three bullets," "only JSON in this shape"); show the skeleton and let it fill the blanks; and forbid the noise ("no preamble, just the result"). Structured output — clean JSON your code can read — is what turns the model into a component you wire into a system. It's exactly how agents work: a tool call is just a precise structured output. Pair it with 4.2 — reason first, then emit the final answer in a marked block — for thoughtful results you can still parse.
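One way to get "reason first, then a parseable answer" is to ask for the final result inside a marked block and have your code read only that block. In this sketch the `<answer>` tag, the field names, and the sample output are all assumptions chosen for illustration.

```python
import json
import re

REQUIRED_FIELDS = {"lead", "priority"}


def extract_answer(model_output):
    """Read the JSON inside <answer>...</answer> and check its fields."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> block found")
    data = json.loads(match.group(1))
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


# Invented example of a model reply: reasoning first, then the marked block.
sample_output = """The lead replied within an hour and asked for a quote,
so intent looks high.
<answer>{"lead": "Acme Co", "priority": "high"}</answer>"""

print(extract_answer(sample_output))
```

If the block is missing or a field is absent, the function raises instead of guessing, so the calling program can retry or flag the case.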
Restore Paver estimates in your exact template and voice, every time. CRM-ready lead summaries in the same fields. That consistency is a feature you can sell.
RAG is only as good as what it retrieves. The model will confidently use whatever you put in front of it — so retrieval quality is the whole game.
The model is a brilliant expert who answers from whatever papers you slide over. Slide the right page, it's gold. Slide the wrong page, it gives a polished, confident, wrong answer and sounds just as sure. It can't tell good context from bad — so the system that fetches the pages matters as much as the genius answering.
The craft: chunk along natural seams (sections, Q&A pairs) — too big is noisy, too small loses meaning. Search by meaning with embeddings, then re-rank or blend in keyword search so the truly relevant chunk rises to the top. And the single most important instruction — the one that buys trust: tell the model to answer only from the retrieved material, and to say "I don't know" if it isn't there. Add citations so every answer is checkable. Retrieval is just context engineering on autopilot.
A genius who'll answer from whatever papers you hand them. Get the librarian wrong, and the genius hands you garbage — with a perfectly straight face.
For a client's customer-facing bot: clean chunking + good retrieval + "answer only from these docs, or admit you don't know." That discipline is what makes it safe to ship.
Building a good agent is about constraint. Manage it like a sharp but green new employee: clear goal, the right few tools, guardrails, and a human gate on anything that bites.
An agent is a model in a loop with tools (2.3) — but every step is still prediction, so errors compound: one wrong step poisons the context for all that follow. A loose agent wanders, loops, grabs the wrong tool, or confidently takes a wrong action — automatically, maybe many times. The more autonomy, the more risk. So the whole craft is one word: constraint. Scope tightly, fail safely.
Concretely: give it a narrow goal, not "run my business." Give it the right few tools — every extra one is another way to go wrong — and describe them clearly, because the model picks tools from the descriptions you wrote (they're context). Put guardrails on it, and above all a human gate on anything irreversible — sending, paying, deleting, publishing. Log every step. And know when not to use an agent at all: if a single prompt or a fixed script does the job, use that.
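The human gate and the step log can be sketched as a thin wrapper around every tool call. The tool names and the `approve` callback are hypothetical. In a real system `approve` would pause and ask a person, and the tool would actually run instead of returning a string.

```python
# Tools whose effects cannot be undone (hypothetical names).
IRREVERSIBLE = {"send_email", "delete_record", "make_payment"}

audit_log = []  # every attempted step is recorded here


def run_tool(name, args, approve):
    """Run a tool call, but ask a human first if it cannot be undone."""
    if name in IRREVERSIBLE and not approve(name, args):
        audit_log.append((name, args, "blocked"))
        return "Blocked: human approval denied."
    audit_log.append((name, args, "ran"))
    return f"{name} executed"  # placeholder for the real tool call


def deny_all(name, args):
    """Stand-in approver that rejects everything, for demonstration."""
    return False


print(run_tool("get_calendar", {"day": "Monday"}, approve=deny_all))
print(run_tool("send_email", {"to": "client"}, approve=deny_all))
```

Reversible tools run at full speed. Only the actions that bite wait for a person.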
A capable new hire — fast, tireless, a little overconfident. A good manager doesn't hand over the keys to everything; they give one clear task, the right access, clear rules, and they check the important work before it ships.
Your weekly human review in the content engine is exactly this instinct — the single most important guardrail. Speed everywhere else; a gate on the things that bite.
Treat your AI like software with a test suite. Build a small set of real examples, run every change against it, and let the score decide — not your gut.
You met evals in 3.2 as an idea: a repeatable test that scores whether the system is any good. In practice it's a discipline. Because the model's output is sampled, not fixed (temperature), judging one output by eye lies to you: you might've just hit a good spoonful of the soup. So you build a golden set: twenty to fifty real examples from actual use, each paired with what a good answer looks like. That little set is your definition of "good," made concrete.
Then you run the whole set on every change — a prompt tweak, a new model, different retrieval — and compare scores. That's regression testing for AI: it catches the change that felt better but quietly broke three other things. Grade pragmatically: hard checks where there's a right answer, an AI judge for tone and quality at scale, human spot-checks as the gold standard. And the move that compounds — every time it fails in the real world, add that case to the set, so the bug can never silently return.
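A regression run can be sketched as scoring two versions of a system against the same golden set. Everything here is invented for illustration: the golden set is three made-up support tickets, and the two "versions" are simple keyword rules standing in for a prompt, model, or retrieval change.

```python
GOLDEN_SET = [
    {"input": "refund request", "expected": "billing"},
    {"input": "site is down", "expected": "technical"},
    {"input": "change my plan", "expected": "billing"},
]


def version_a(text):
    """Hypothetical earlier system: only recognizes 'refund'."""
    return "billing" if "refund" in text else "technical"


def version_b(text):
    """Hypothetical changed system: also recognizes 'plan'."""
    return "billing" if ("refund" in text or "plan" in text) else "technical"


def run_eval(system, golden_set):
    """Return (number passed, list of failing cases) for one version."""
    failures = [c for c in golden_set if system(c["input"]) != c["expected"]]
    return len(golden_set) - len(failures), failures


passed_a, _ = run_eval(version_a, GOLDEN_SET)
passed_b, _ = run_eval(version_b, GOLDEN_SET)
print(passed_a, passed_b)  # 2 3

# A miss from real use becomes a permanent test case.
GOLDEN_SET.append({"input": "invoice is wrong", "expected": "billing"})
passed_b_after, failures_b = run_eval(version_b, GOLDEN_SET)
print(passed_b_after, len(GOLDEN_SET))  # 3 4
```

After the new case is added, version B no longer scores perfectly, which is the point: the set now remembers that failure on every future run.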
Unit tests for your AI. Real engineers don't ship and hope — they keep a test suite that goes red the instant something breaks. Evals are that, for a system that's allowed to be a little random.
Thirty real scenarios for Áine or a client bot. Run them on every change; add every real-world miss. Now you can tell a client a true number — "right 96% of the time, and we catch regressions before you do" — and mean it.
Where the money and the moats actually are — and how to position a business so the technology becomes a tailwind, not a threat.
AI is electricity. The fortunes aren't in building power plants — they're in wiring it into a specific business. That last mile is where small, sharp operators win.
The AI economy has layers. At the bottom, the foundation models — the power plants. Building them costs billions; a winner-take-few war between giants. Not your game. In the middle, tooling and infrastructure — the wiring. And at the top, the application, the last mile: taking a powerful but generic model and making it solve a real, specific problem inside a real business's messy workflow.
That top layer is where you win, because it takes what the giants won't bother with: domain knowledge, last-mile hustle, and trust. The model is a commodity ingredient now — anyone can buy the same one. The value you add is everything around it: the problem solved, in context, for someone who trusts you. Don't build the power plant. Wire the building.
Restore Paver, Áine-as-a-service, the content engine, the Field Guide — every one is you owning the last mile. That's the whole thesis of Rod & Staff, and it's a strong one.
"I use AI" is not a moat — the model is a commodity everyone can buy. Your moat is what compounds around it and can't be copied overnight.
A moat is what stops a competitor copying you and taking your customers. Since the model is the same kitchen everyone now has, the defensible part is never the kitchen — it's the secret recipe, the regulars who trust you, the line out the door, being woven into the neighborhood.
The real moats, and how many you're already building: proprietary data (Restore Paver's job data, your educational library — stronger every job); domain expertise (knowing one industry's real workflow); trust & relationships (the whole Field Guide play); distribution (your content engine, your audience); and integration (once Áine is wired into daily ops, ripping her out hurts). Test every move: does it compound, and could a rival copy it overnight? Build the snowballs, not the sandcastles.
Beware the thin wrapper — a thin layer over the model with nothing proprietary around it has no moat; a competitor or the model provider can absorb it. Deepen on data, trust, and integration until you're un-copyable.
Never sell your hours — AI makes speed your enemy under that model. Price to the value you create, productize it, and make it recurring.
Charging by the hour is poison in the age of AI: the faster you get, the less you make. Flip it — price on the outcome. Nobody wants a man with a drill; they want the shelf on the wall. If Áine books a client $10k of extra appointments a month, the price anchors to that, not to your setup time. A system worth $5k/month to them, cheap for you to run — that gap is your margin.
Then two moves turn it into a business. Productize: turn bespoke work into a repeatable offering with a name, a scope, and a price (the Restore Paver platform, Áine-as-a-service). Make it recurring: a subscription beats a one-off every time — you own a herd instead of hunting every month, and recurring + integration is the stickiest combo there is. One honest requirement, straight from evals: to charge on value, you have to be able to name the value with a real number.
Sell the hole, not the drill. The client never asks how long it took — they're paying for the result, and a result that recurs is worth paying for every month.
The dangerous risks aren't technical — they're business-model risks. And the model getting better is a wave: positioned right it carries you; positioned wrong it crushes you.
Four risks. The thin wrapper — no moat, easily copied or absorbed. "The model ate my feature" — you build a clever product and the next model version does it natively, for free. Provider dependency — locked to one model, you're building on rented land. And commoditization — "we do AI" becomes worthless as everyone gets the same tools.
The defense for all of them is one move: build your value around the model — in your data, trust, integration, and customer relationship — not in a raw model capability the next version will swallow. Stay model-agnostic (architect so you can swap the underlying model), own the relationship, and keep moving up the value chain. Do that, and every model upgrade improves your product for free.
The model improving is a giant wave. Surf it — value around the model — and it carries you. Stand in front of it — your value is a model feature — and the same wave wipes you out.
The durable position is the trusted guide who owns the last mile — and it grows more valuable as AI gets more powerful and more bewildering.
Braid the tier together and it points at one identity: the guide. The position, clean enough for a wall — take the most powerful AI, wire it into a specific business (owning the last mile, the data, and the relationship), and teach the owner to understand it, so they grow with the technology instead of being left behind. You're not selling AI. You're selling guidance, integration, and trust, with AI as the tool.
Here's the strategic gold: a shepherd doesn't sell the pasture or the weather — they guide the flock through terrain it can't cross alone. And the demand for a trusted guide goes up as the terrain gets scarier. The faster and more confusing AI gets, the more valuable a steady hand becomes. Your position strengthens as the world speeds up — rare, and exactly where Rod & Staff already lives.
Lead with generosity (the free Field Guide) to build trust and distribution; convert the ones who'd rather hire it done into recurring, productized services. Relevance, by helping others stay relevant.
The real machinery, in plain language — what's actually happening inside the model when it thinks: meaning as geometry, attention up close, and the full journey from prompt to word.
The model turns every token into a vector — a point on a vast map of meaning. Position is meaning, distance is similarity, and direction is relationship.
Back in 2.2 we said embeddings are meaning as coordinates. Here's the mechanism. Every token becomes a vector (a long list of numbers), which makes it a point in a space of hundreds or thousands of dimensions. Picture a giant map where every word is a dot, placed by what it means. The model wasn't told these positions; it learned them.
Three things are true on that map. Position is meaning — "dog" and "puppy" are neighbors; "dog" and "Tuesday" are across town. Direction is relationship — the step from "man" to "woman" is the same step as "king" to "queen," so you can do arithmetic on meaning: king − man + woman ≈ queen. And closeness is similarity — which is exactly what RAG does under the hood: "store by meaning" is placing chunks on this map; "retrieve the closest" is finding the nearest points. Everything the model does is geometry on these vectors.
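The arithmetic-on-meaning idea can be tried on a tiny hand-made map. The two-dimensional vectors below are invented so that one axis roughly means "royalty" and the other "gender". Real embeddings are learned, have far more dimensions, and satisfy such analogies only approximately.

```python
import math

# Hand-made 2D "embeddings": (royalty, gender). Purely illustrative.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "dog": (-1.0, 0.0),
}


def nearest(point, exclude=()):
    """Return the word whose vector lies closest to the given point."""
    candidates = [word for word in VECTORS if word not in exclude]
    return min(candidates, key=lambda word: math.dist(point, VECTORS[word]))


# king - man + woman, computed coordinate by coordinate.
target = tuple(
    k - m + w
    for k, m, w in zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])
)
result = nearest(target, exclude=("king", "man", "woman"))
print(target, result)  # (1.0, -1.0) queen
```

The same `nearest` lookup is, in spirit, what a vector database does when RAG retrieves the closest chunks.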
Every word gets three roles — query, key, and value. Words match their query against others' keys, then pull in a weighted blend of the matching values. That's attention.
Part I said every word looks at every other word. Here's how. The model gives each word three versions of itself: a query — "what am I looking for?"; a key — a name-tag advertising what it's about; and a value — what it actually contributes. Picture a networking party: each word reads everyone's name-tags (keys) against its own question (query), finds the best matches, and takes notes — writing down each match's message (value), weighted by how good the match was.
So in "…because it was too big," the word "it" sends a query, finds "trophy" has the best-matching key, and pulls in trophy's value, so now it carries the trophy's meaning. The model runs many of these in parallel (multi-head attention), each tracking a different kind of relationship. And a cost: attention compares every word to every other word, so doubling the length roughly quadruples the work. That is the mechanical reason long context is slow and costly, and why curating it (4.1) matters.
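Here is a small sketch of one attention step for the word "it", written in plain Python. The key, value, and query numbers are made up by hand so that "it" matches "trophy". In a real model they are produced by learned weight matrices, and every word runs this step at once.

```python
import math


def softmax(scores):
    """Turn raw match scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score the query against every key, then blend the values by weight."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    blended = [
        sum(weight * value[i] for weight, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, blended


words = ["trophy", "suitcase", "it"]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]  # made-up name-tags
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up messages
query_for_it = [4.0, 0.0]  # made-up question asked by "it"

weights, blended = attend(query_for_it, keys, values)
best_match = words[weights.index(max(weights))]
print(best_match)  # trophy
```

Each word scores itself against every word, so three words mean 3 × 3 = 9 comparisons and six words mean 36. That is the quadratic cost in miniature.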
Tokenize → embed → add position → many layers of attention-and-refine → turn the final vector into odds over the vocabulary → sample the next word. Then loop.
Now assemble the machine. Your text is chopped into tokens; each becomes its vector (6.1); a position stamp is added so order matters ("dog bites man" ≠ "man bites dog"). Then the vectors flow through a deep stack of layers, dozens of them. Each layer does two moves: attention (6.2), where words gather context, then a refine step where each word digests what it gathered. Early layers catch grammar; deeper layers build abstract meaning. That depth of stacked layers is why it's called deep learning.
After the last layer, the model takes the final vector at the most recent position — now soaked in the whole context — and turns it into a score for every word in its vocabulary, which becomes odds. It samples one (temperature sets how adventurous), and that's the next word. Then it appends the word and runs the entire pipeline again for the next one. Every word you've ever gotten from a model went through this — which is why longer outputs and bigger models cost more time.
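The last step, turning scores into odds and sampling, can be sketched like this. The three-word vocabulary and its scores are invented. The conversion itself is the standard softmax, with the scores divided by the temperature first.

```python
import math
import random


def next_token_odds(scores, temperature=1.0):
    """Turn raw vocabulary scores into probabilities, scaled by temperature."""
    scaled = [score / temperature for score in scores.values()]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(scores, exps)}


# Made-up scores for the word after "The capital of France is".
scores = {"Paris": 5.0, "Lyon": 3.0, "pizza": 0.0}

cool = next_token_odds(scores, temperature=0.5)  # cautious
warm = next_token_odds(scores, temperature=2.0)  # adventurous

for label, odds in (("cool", cool), ("warm", warm)):
    print(label, {token: round(p, 3) for token, p in odds.items()})

# Sample one token according to the odds.
picked = random.choices(list(warm), weights=list(warm.values()))[0]
print(picked)
```

A low temperature piles almost all the odds onto the top word. A high temperature spreads them out, so unlikely words get picked more often.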
Where all that capability comes from — the birth of a model: pretraining, instruction tuning, alignment, and how a giant is shrunk into something you can actually run.
Take a blank network, show it a huge slice of the internet, and train it on one self-supervised game — predict the next word — trillions of times. Out comes a raw "base model."
A model is born by training (Part I: guess, measure error, nudge weights). Pretraining is that at staggering scale. Start with random weights — pure gibberish. Then play one game on trillions of words of text: hide the next word, predict it, check against the real one, nudge the weights. The genius is that it needs no human labels — the text labels itself, so it can scale to essentially all the text in the world.
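"The text labels itself" can be shown in a few lines: every position in a sentence yields one training example, with the words so far as the question and the next word as the answer. This sketch splits on spaces for simplicity, whereas real systems use word-piece tokens.

```python
def make_training_pairs(text):
    """Turn raw text into (context so far, next word) training examples."""
    tokens = text.split()
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = make_training_pairs("the cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

One six-word sentence produces five examples with no human labelling, which is why the approach scales to trillions of words.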
To get good at that one game across everything, the model is forced to compress the patterns of language, facts, and reasoning into its weights — compression becomes understanding. It costs tens to hundreds of millions of dollars; only giants do it (the power plant, 5.1). And scaling laws make the payoff predictable: bigger model + more data + more compute → smoothly more capable. The result is a raw base model — brilliant, but feral: just a text-completer, not yet an assistant.
Show the feral base model thousands of "request → ideal response" examples, and it learns the behavior of being a helpful, instruction-following assistant.
The base model knows everything but follows nothing. Instruction tuning (a form of fine-tuning, 3.1) keeps training it — now on a small, curated set of example conversations: a request paired with a great response. The model learns the pattern: "when given an instruction, respond helpfully like this." It's behavior, not knowledge — the knowledge already came from pretraining; this teaches manners and how to apply them.
That's why this phase is tiny next to pretraining — thousands of examples, not trillions of words. It's finishing school for a genius. The labs do it to create the assistant you talk to; and the same mechanism, with your data, is how you specialize a model for your business. Data quality dominates: a few thousand excellent examples beat a million mediocre ones.
A genius who's read everything but mumbles and wanders. A mentor shows them, with worked examples, how to actually help a person — until it's second nature. The brilliance was always there.
There's no answer key for "good," so you let humans rank responses, train a reward model to predict their preferences, and steer the model to chase them. Judging is easier than creating.
Instruction tuning makes the model helpful; alignment makes it good — honest, safe, tasteful. The main technique is RLHF (reinforcement learning from human feedback). For most requests there's no single right answer, so instead of an answer key you generate several responses, have humans rank them (A beats B), train a reward model to predict those preferences, then nudge the model toward what scores high. It learns judgment by absorbing thousands of human comparisons.
This installs the qualities you can't write down — tone, honesty, knowing when to refuse. Cleaner modern methods include DPO (training directly on the preference pairs) and Anthropic's Constitutional AI (the model critiques its own answers against explicit principles). But it's imperfect: a model can learn to look good rather than be good — sycophancy is trained-in approval-seeking. So don't mistake agreement for truth. Verify.
A finished model is too big to run cheaply, so you shrink it — distillation and quantization — into a family of sizes. Picking the smallest that passes your evals is where your margin lives.
Every token a model generates (roughly, each word or word piece) is a full trip through all its layers (6.3), so running a giant model, which is called inference, is slow and costly. Two techniques shrink it. Distillation: a big "teacher" model trains a small "student" to imitate it, so the student punches far above its size. Quantization: store the weights at lower precision, like saving a photo at lower resolution, which shrinks and speeds up the model with little quality loss.
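Quantization is easy to see on a handful of numbers. A minimal sketch, assuming one simple scheme (symmetric 8-bit quantization with a single scale for the whole list); production systems use more refined variants, and the sample weights here are invented:

```python
def quantize_int8(weights):
    """Map floats onto integers in [-127, 127] using one shared scale.

    Assumes a non-empty list of weights.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.5, -0.04]  # made-up sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print("stored integers:", q)
print("worst rounding error:", worst_error)
```

Each weight now fits in one byte instead of four (for 32-bit floats), and the rounding error per weight is at most half of `scale`.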
That's why every lab offers a family — a big flagship and smaller, faster, cheaper siblings. Choosing among them is a builder's skill: match the tool to the job (don't send the giant model to do a simple classification), and use your evals (4.6) to walk down to the cheapest model that still passes. Every task served on a small model instead of the giant — while passing evals — is margin you keep.
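"Walk down to the cheapest model that still passes" can be written as a short loop. A minimal sketch, assuming stub functions in place of real model calls; the model names, costs, golden set, and the 95% pass bar are all hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Candidate:
    name: str
    cost_per_call: float          # hypothetical price, in dollars
    run: Callable[[str], str]     # stand-in for a real model call

NEGATIVE_WORDS = ("terrible", "awful", "bad")

def tiny_model(text: str) -> str:
    return "positive"  # too simple: always guesses the same label

def small_model(text: str) -> str:
    lowered = text.lower()
    return "negative" if any(w in lowered for w in NEGATIVE_WORDS) else "positive"

def large_model(text: str) -> str:
    return small_model(text)  # stub: equally accurate, just pricier

GOLDEN_SET: List[Tuple[str, str]] = [
    ("I love this product", "positive"),
    ("This is terrible", "negative"),
    ("Works great", "positive"),
    ("Awful, would not buy again", "negative"),
]

def pass_rate(candidate: Candidate, golden_set) -> float:
    hits = sum(candidate.run(text) == expected for text, expected in golden_set)
    return hits / len(golden_set)

def cheapest_passing(candidates, golden_set, threshold=0.95) -> Optional[Candidate]:
    """Try models from cheapest to priciest; return the first that passes."""
    for candidate in sorted(candidates, key=lambda c: c.cost_per_call):
        if pass_rate(candidate, golden_set) >= threshold:
            return candidate
    return None

candidates = [
    Candidate("large", 0.0300, large_model),
    Candidate("small", 0.0030, small_model),
    Candidate("tiny", 0.0003, tiny_model),
]
choice = cheapest_passing(candidates, GOLDEN_SET)
print("serve with:", choice.name if choice else "nothing passes yet")
```

The tiny model fails the golden set, the small one passes, and the large one is never needed for this task.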
Prototype on the big model, then drop to the smallest that still passes your golden set. Model selection isn't a nerdy detail — it's a profit decision.
Beyond a single call. How you wire models, tools, memory, and multiple agents into serious products that hold up in the real world.
Serious products aren't one prompt — they're systems. Decompose the job into a pipeline of small, focused steps, each with a clean context, a clear output, and the right-sized model.
The trap is making one giant prompt do a complex, multi-step job. It's brittle, hard to debug, and errors pile up. The shift is to think like an engineer building an assembly line: break the task into focused steps. A content system might be research → draft → critique → revise → format. Each step is its own call — and the builder's craft snaps into place per step: a tight workbench (4.1), a clean shaped output (4.3), the right-sized model (7.4).
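The research → draft → critique → revise → format line can be sketched as a list of small functions run in order. This is a sketch under stated assumptions: every step below is a stub standing in for its own model call, and the step outputs are invented so the example runs on its own.

```python
from typing import Callable, List, Tuple

# Each function is a stand-in for one focused model call.
def research(topic: str) -> str:
    return f"notes on {topic}"

def draft(notes: str) -> str:
    return f"draft based on {notes}"

def critique(text: str) -> str:
    return f"{text} [critique: tighten the intro]"

def revise(text: str) -> str:
    return text.replace(" [critique: tighten the intro]", "") + " (revised)"

def format_output(text: str) -> str:
    return f"# Article\n{text}"

PIPELINE: List[Tuple[str, Callable[[str], str]]] = [
    ("research", research),
    ("draft", draft),
    ("critique", critique),
    ("revise", revise),
    ("format", format_output),
]

def run_pipeline(steps, task: str):
    """Run steps in order, checking each output and logging what ran."""
    log = []
    current = task
    for name, step in steps:
        current = step(current)
        if not current.strip():
            # A check between steps: fail loudly, and say which step broke.
            raise ValueError(f"step '{name}' produced empty output")
        log.append(name)
    return current, log

article, steps_run = run_pipeline(PIPELINE, "solar panels")
print(article)
print("steps run:", steps_run)
```

Because each step is its own function, you can test one in isolation, swap its model, or insert a human gate between any two.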
Decomposing makes everything better: reliability (eval each step), debuggability (you see which step broke), control (insert checks and human gates between steps), and cost (route each step to the cheapest model that passes). Useful patterns: chaining (steps in a line), routing (a first step sends the request down the right path), and the generate-critique-revise loop. The intelligence moves out of the prompt and into the architecture — you're the conductor; the calls are specialists.
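Of the patterns named above, routing is the simplest to show. A minimal sketch, assuming a keyword check in place of the cheap classifier call a real system would likely use; the route labels and handlers are hypothetical:

```python
def classify(request: str) -> str:
    """First step: decide which path the request belongs on.

    A keyword check stands in here for a small, cheap model call.
    """
    text = request.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "technical"
    return "general"

# Each handler stands in for its own focused prompt and right-sized model.
def handle_billing(request: str) -> str:
    return f"[billing specialist] {request}"

def handle_technical(request: str) -> str:
    return f"[technical specialist] {request}"

def handle_general(request: str) -> str:
    return f"[general assistant] {request}"

HANDLERS = {
    "billing": handle_billing,
    "technical": handle_technical,
    "general": handle_general,
}

def route(request: str):
    label = classify(request)
    return label, HANDLERS[label](request)

label, answer = route("I need a refund for last month")
print(label, "->", answer)
```

The router keeps every downstream prompt narrow: the billing path never has to know anything about crashes.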
Your content engine already is this — research, draft, human gate, publish. Áine's briefing is orchestration. This just names the discipline so you can build bigger ones on purpose.
A team of specialized agents, coordinated by an orchestrator that decomposes the goal, delegates to specialists, and synthesizes — powerful, but only when the job truly needs it.
When the steps become autonomous agents (4.5) and you coordinate several, you get a multi-agent system — a manager with specialists. An orchestrator agent takes the big goal, breaks it up, and delegates to worker agents — a researcher, a writer, a critic — who often work in parallel; then it gathers and synthesizes. It's an agency, in software.
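The decompose, delegate, synthesize loop can be sketched with a thread pool standing in for parallel agents. This is a sketch under stated assumptions: the three workers are stub functions rather than real agents, and the way the goal is split is hard-coded where a real orchestrator would ask a model.

```python
from concurrent.futures import ThreadPoolExecutor

# Worker agents: each is a stub for an agent with its own role and context.
def researcher(task: str) -> str:
    return f"researcher: findings on {task}"

def writer(task: str) -> str:
    return f"writer: text for {task}"

def critic(task: str) -> str:
    return f"critic: concerns about {task}"

WORKERS = {"researcher": researcher, "writer": writer, "critic": critic}

def decompose(goal: str):
    """Break the goal into (role, subtask) assignments (hard-coded here)."""
    return [
        ("researcher", f"background for {goal}"),
        ("writer", f"outline for {goal}"),
        ("critic", f"risks in {goal}"),
    ]

def synthesize(goal: str, results) -> str:
    return f"Plan for {goal}:\n" + "\n".join(f"- {r}" for r in results)

def orchestrate(goal: str) -> str:
    assignments = decompose(goal)
    # Delegate: run every worker at once. map() returns results in the
    # same order as the assignments, whichever worker finishes first.
    with ThreadPoolExecutor(max_workers=len(assignments)) as pool:
        results = list(pool.map(lambda a: WORKERS[a[0]](a[1]), assignments))
    return synthesize(goal, results)

report = orchestrate("a product launch")
print(report)
```

Only the orchestrator sees the whole goal; each worker sees just its own subtask, which is what keeps its context clean.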
You reach for it to get parallelism (many agents at once), specialization (each with a focused role and clean context), and cross-checking (agents review or debate each other). But a team costs more, runs slower, and is harder to control — errors can ripple between agents. So escalate one rung at a time: a prompt, then an agent, then a fixed pipeline, then a team — only when the task genuinely demands it. Keep every agent scoped, gated, and logged.
A project team with a lead. One person handles a small job; a big one needs a manager who delegates to specialists and assembles the result. Staff up only when the work requires it.
The model is a stateless amnesiac, so memory is a system you build around it: write important things to an external store, and retrieve the relevant ones back into context when needed.
Every call, the model knows only what's in its context window; close the session and it forgets. So a model never has memory — you build it around the model. Picture a brilliant amnesiac with a notebook: after each conversation they jot down what mattered; before the next, they flip to the relevant page. The memories live in the notebook, not the head — but functionally, they remember.
Short-term memory is the conversation held in the context window. Long-term memory is the notebook: an external store you write to and read from — which is just RAG (4.4) aimed at the system's own past. The mechanism: distill what matters (often summarize it), store it as embeddings, and retrieve the relevant pieces by meaning next time. The craft is choosing what to keep, compressing it, pulling back only what's relevant, and keeping it fresh.
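The notebook can be sketched in a few lines: write notes to a store, then pull back the ones closest in meaning to the current question. One loud assumption: `embed` below is a toy that just counts words, so it matches by shared vocabulary, not true meaning. A real system would call an embedding model there; the class and note contents are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag of lowercase words."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Similarity between two vectors, from 0 (unrelated) to 1 (identical)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class MemoryStore:
    """The notebook: lives outside the model and survives between sessions."""

    def __init__(self):
        self._items = []  # list of (note, vector) pairs

    def write(self, note: str) -> None:
        self._items.append((note, embed(note)))

    def retrieve(self, query: str, k: int = 1):
        """Return the k notes most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]),
                        reverse=True)
        return [note for note, _ in ranked[:k]]

memory = MemoryStore()
memory.write("client budget is 5000 per month")
memory.write("client prefers email over phone calls")
memory.write("launch date moved to march")

recalled = memory.retrieve("what is the client budget")
print("put this back in context:", recalled)
```

Only the retrieved note goes back into the context window; the rest of the notebook stays on the shelf until it is relevant.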
A system that remembers a client's whole history is more personal, more useful, and stickier — memory is a moat. It's how Áine remembers your business, and how this manual's author remembers you between sessions.
A demo works once; a product works every time, for a stranger, at scale. Since the model is non-deterministic, you engineer a trustworthy system around an unreliable part.
The gap between a concept car and a car you'd actually sell is where most AI projects die. AI is uniquely hard to productionize because its core component is non-deterministic — it can give a different or wrong answer, hallucinate, even flatter you. So the discipline is: assume the model will sometimes fail, and build a system that's trustworthy anyway.
The toolkit, each piece a lesson you already own: validate outputs before trusting them (4.3, 4.4); retry and fall back to another model if one fails (5.4); monitor everything and run evals continuously in production (4.6); control cost and latency (cache, route to cheap models, keep context tight); and keep guardrails and a human gate on consequential actions (4.5). The connective tissue is standard protocols like MCP — a "USB port" that lets models plug into tools and data without custom-building every connection.
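Three items from that toolkit (validate, retry, fall back) fit in one small function. A minimal sketch, assuming two stub models: a hypothetical primary that returns chatty text instead of the JSON it was asked for, and a hypothetical fallback that behaves. Real calls would also need timeouts and error handling, which are left out here.

```python
import json

def flaky_primary(prompt: str) -> str:
    return "Sure! Here is your answer: 42"   # stub: not valid JSON

def reliable_fallback(prompt: str) -> str:
    return '{"answer": 42}'                   # stub: well-formed output

def validate(raw: str):
    """Return the parsed output if it has the shape we need, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and "answer" in data:
        return data
    return None

def call_with_fallback(models, prompt: str, retries: int = 2):
    """Try each model in order, retrying before moving to the next one."""
    attempts = []  # a log for monitoring: (model name, attempt, passed?)
    for name, model in models:
        for attempt in range(1, retries + 1):
            result = validate(model(prompt))
            attempts.append((name, attempt, result is not None))
            if result is not None:
                return result, attempts
    raise RuntimeError("all models failed validation")

models = [("primary", flaky_primary), ("fallback", reliable_fallback)]
result, attempts = call_with_fallback(models, "What is 6 times 7? Reply as JSON.")
print(result)
print(attempts)
```

The caller never sees the primary's bad output, and the attempts log is the raw material for the monitoring and evals the toolkit calls for.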
This is how "right 96% of the time, and we catch regressions before you see them" becomes true: the machinery behind the promise that lets you charge a premium and sleep at night.
The cutting edge — and the skill that makes you permanently self-sufficient: learning to read the research yourself, so you can keep up from the source forever. (In progress.)
You don't keep up by reading everything — you keep up with strong fundamentals plus a filter, and by reading papers strategically, in passes.
The fear is "it's moving too fast." The truth: nobody reads everything. You keep up because almost every new breakthrough is a remix of ideas you already own — attention, scaling, retrieval, agents, RLHF. New work is a puzzle piece, and you have the picture on the box. Filter the firehose: let trusted curators surface what matters, skim abstracts, and go deep only on what's genuinely important or relevant to what you're building.
Read a paper like a detective, not a novel — in passes. Pass 1: title, abstract, figures, conclusion — five minutes for the gist; decide if it's worth more. Pass 2: intro, the core idea, the results — what they did and whether it worked, skipping the math. Pass 3, only if you'll build on it: the details. Always ask three questions — what problem, what's the key idea, and did it actually work (and how do they know)? Stay skeptical of claims, and let AI itself help you digest a paper — then verify.
This is the deepest moat of all. Once you can read the source, you never depend on anyone to tell you what's happening in AI — you learn directly, from the frontier, forever.
This is v2, current through Part IX. As we cover each new lesson, it gets added here — new modules, new diagrams, deeper craft. A living field guide.