Cutting LLM costs 60% without dumbing your product down

The demo was cheap. The launch was not. That's the story of almost every AI feature I've shipped — a slick prototype that costs cents, followed by a production bill that scales linearly with success. The good news: most of that spend is avoidable, and trimming it rarely means a worse product. After doing this across a couple of products, I land in roughly the same place: about 60% lower cost for output users can't tell apart.

There's no single trick. It's three habits, stacked.

1. Cache aggressively — most prompts repeat

The cheapest token is the one you never send. In real usage, a surprising share of requests are near-duplicates: the same FAQ phrased three ways, the same document summarized twice, the same system prompt with a tiny variable swapped. A cache keyed on the normalized prompt collapses all of that into one paid call.

Start with an exact-match cache — it's trivial and it works more than you'd think. Then add a semantic cache: embed the incoming prompt, and if it's within a similarity threshold of something you've answered, serve the stored response.

llm/cache.ts

// Semantic cache: reuse an answer if a close-enough prompt exists.
async function cachedComplete(prompt: string) {
  const embedding = await embed(prompt);
  const hit = await vectors.search(embedding, { threshold: 0.96 });
  if (hit) return hit.answer;          // $0 — served from cache

  const answer = await complete(prompt);
  await vectors.put(embedding, answer);
  return answer;
}

One caution: a semantic cache is a quality knob, not just a cost knob. Set the threshold too loose and you'll serve a stale answer to a subtly different question. I keep it conservative (0.95–0.97 cosine) and exclude anything time-sensitive.

2. Route requests to the smallest model that can do the job

Not every request deserves your most expensive model. Classifying intent, extracting a field, formatting JSON, answering a simple lookup — a small or local model nails these for a fraction of the price. Reserve the frontier model for genuinely hard reasoning.

The pattern is a cheap router in front of your models. A tiny classifier (often a small local model via Ollama) decides the tier, and only the hard tier hits the premium API.

llm/router.ts

type Tier = "local" | "small" | "frontier";

const MODELS: Record<Tier, string> = {
  local:    "llama3:8b",        // $0, runs on our box
  small:    "gpt-4o-mini",
  frontier: "gpt-4o",
};

async function route(task: Task): Promise<Tier> {
  if (task.kind === "classify" || task.kind === "extract") return "local";
  if (task.tokens < 800 && !task.needsReasoning) return "small";
  return "frontier";
}

Measure before you route. Log per-task quality (a thumbs signal or an eval set) before downgrading a model. The win is only real if the cheaper tier holds quality — otherwise you've traded dollars for support tickets.

3. Right-size context — you're paying per token, both ways

The third lever is the prompt itself. Teams stuff entire documents into context "just in case," then pay for every token on input and watch latency climb. Retrieval fixes this: pull the few chunks that actually matter and leave the rest on disk.

Trim the system prompt. Long instructions get paid for on every single call. Move stable rules into a fine-tune or a terse, well-tested preamble.
Retrieve, don't dump. Top-k relevant chunks beat the whole document — cheaper and usually more accurate, because the model isn't distracted.
Cap output. Set max_tokens deliberately. "Be concise" in the prompt plus a hard ceiling saves more than people expect.

The model bill is a product decision. Treat tokens like any other resource you'd profile and optimize — because at scale, that's exactly what they are.

Putting it together

Stacked, these compound. Caching removes the repeats. Routing makes the survivors cheaper. Right-sizing shrinks what's left. None of them touches the experience for the user — if anything, retrieval and tighter prompts make answers sharper. The 60% isn't a headline number I reverse-engineered; it's roughly where this lands once the three habits are in place and you've stopped paying frontier prices for refund-policy questions.

If you're building an AI feature and the bill is starting to sting, start with caching — it's the fastest win — then add the router. You can ship both in an afternoon, and your finance team will notice by the end of the month.