Prompt engineering vs fine-tuning: when does each one earn its keep?

There is no universal answer, but there is a pretty consistent decision tree. Here is how we walk clients through it.

By ForthClover Engineering · March 2026 · 8 min read · Fine-Tuning

The first big architectural argument on most AI projects is some version of "do we just prompt this, do we add a retrieval layer, or do we go straight to fine-tuning?" The argument is rarely productive because the three options are not really substitutes — they sit at different points on the cost, control, and capability curve, and the right answer is usually a combination.

What follows is the decision tree we walk new clients through before they spend a quarter training a model they did not need.

Start with prompting. Almost always.

A surprising amount of what gets pitched as a fine-tuning problem is actually a prompting problem in disguise. The team tried two examples in ChatGPT, didn't love the result, and jumped to "we need our own model." Most of the time we can close the gap with disciplined prompt engineering in a week.

What counts as disciplined: a written prompt with a clear role, an explicit task description, well-chosen few-shot examples drawn from the actual data distribution, and an output schema the downstream code can rely on. We almost always pair that with structured outputs (JSON mode, tool calls, or constrained decoding) so the prompt is graded against an objective rubric rather than vibes.
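
To make "graded against an objective rubric" concrete, here is a minimal sketch. Everything in it is illustrative: call_llm stands in for whatever client wrapper you already use, and the ticket-triage fields are invented for the example.

```python
import json

from jsonschema import validate  # pip install jsonschema

# The output schema the downstream code relies on -- fields invented for this example.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "feature_request"]},
        "urgency": {"type": "integer", "minimum": 1, "maximum": 5},
        "summary": {"type": "string"},
    },
    "required": ["category", "urgency", "summary"],
}

SYSTEM_PROMPT = """You are a support-ticket triage assistant.
Reply with a single JSON object matching this schema exactly:
{schema}

Example
Ticket: "I was charged twice this month."
{{"category": "billing", "urgency": 3, "summary": "Duplicate charge on monthly invoice"}}
"""

def triage(ticket_text: str, call_llm) -> dict:
    """call_llm() is a hypothetical stand-in for whatever client wrapper you already have."""
    system = SYSTEM_PROMPT.format(schema=json.dumps(TICKET_SCHEMA, indent=2))
    raw = call_llm(system=system, user=ticket_text)  # ask for JSON mode / constrained decoding if available
    parsed = json.loads(raw)
    validate(parsed, TICKET_SCHEMA)  # raises jsonschema.ValidationError on a bad output
    return parsed
```

The point is that a failed run is now a schema violation you can count, not an opinion.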

When prompting is enough, you should be done. You will not beat the iteration speed and you will not beat the cost.

If the model lacks the knowledge, reach for retrieval

Prompting cannot teach the model something it does not know. If the gap is "this model has never seen our internal policy documents," the answer is retrieval-augmented generation, not fine-tuning. RAG is faster to build, faster to update (you change a document, the answer changes), and dramatically easier to audit because the source passages are right there in the response.
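
The core loop is small enough to sketch. embed() and call_llm() are placeholders for whichever embedding model and LLM client you run, and a real system would pre-compute and index the chunk embeddings rather than embedding them per query.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_with_rag(question: str, chunks: list[str], embed, call_llm, k: int = 4) -> str:
    """embed() and call_llm() are placeholders for your embedding model and LLM client."""
    q_vec = embed(question)
    # Rank the document chunks by similarity to the question and keep the top k.
    top = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)[:k]
    context = "\n\n".join(top)
    prompt = (
        "Answer using only the passages below, and say which passage you relied on.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```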

We reach for fine-tuning over RAG only when the corpus is either too small to retrieve from usefully (a few hundred carefully labelled examples) or so deeply patterned that we want the model to internalise the behaviour rather than re-derive it from passages every time. That second case is rarer than people think.

What fine-tuning is actually good at

Fine-tuning earns its keep on three concrete jobs:

  • Style and format compliance. Producing output in a very specific tone, structure, or terminology that's tedious to keep re-explaining in a long prompt.
  • Latency-sensitive narrow tasks. Distilling a frontier model's behaviour on one specific task into a much smaller, much cheaper, much faster model that you can run at scale (a sketch of that data-generation loop follows this list).
  • Capability gaps prompting can't close. Domain reasoning patterns the base model genuinely does not have — most often in highly specialised vertical workflows like clinical triage, legal drafting in a specific jurisdiction, or trading desk shorthand.
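
The distillation case is mostly a data-collection exercise: label a sample of real task inputs with the frontier model, then train the small model on the pairs. A minimal sketch, assuming frontier_model() is a hypothetical stand-in for the client you already run:

```python
import json

def build_distillation_set(inputs: list[str], frontier_model, out_path: str = "distill.jsonl") -> None:
    """Label real task inputs with the teacher's outputs and write them as
    chat-style JSONL, ready for a supervised fine-tune of the small model.
    frontier_model() is a hypothetical stand-in for your existing client."""
    with open(out_path, "w", encoding="utf-8") as f:
        for text in inputs:
            target = frontier_model(text)  # teacher output on the narrow task
            pair = {"messages": [
                {"role": "user", "content": text},
                {"role": "assistant", "content": target},
            ]}
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```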

What fine-tuning is not good at: teaching the model new facts it can look up. You will train it to confidently hallucinate the wrong fact instead.

The cost shape, honestly

Prompting and RAG are essentially free to iterate on. You change a string, you redeploy a config, you ship the change in an afternoon. Fine-tuning has a real fixed cost — labelled data, training runs, evaluation harness, model versioning — before you ever see a benefit.

The labelled data is usually where the real expense is. A decent supervised fine-tune wants somewhere between 500 and 5,000 high-quality examples for most narrow tasks. That data has to come from somewhere; if you do not already have it, the cost of producing it dwarfs every other line item.

The trade-off makes sense once you are running enough volume that the per-token savings of a smaller fine-tuned model repays the upfront investment within a couple of months. If you cannot draw that line on a napkin, you are probably not ready to fine-tune.
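
The napkin version is a few lines of arithmetic. Every number below is an assumption to swap for your own prices and volumes:

```python
# Napkin break-even check -- every number is an assumption to replace with your own.
upfront_cost = 40_000                  # labelled data + training runs + eval harness, USD
frontier_price_per_1k_tokens = 0.010
tuned_price_per_1k_tokens = 0.001
monthly_1k_token_blocks = 2_000_000    # i.e. roughly 2B tokens/month through this task

monthly_saving = monthly_1k_token_blocks * (frontier_price_per_1k_tokens - tuned_price_per_1k_tokens)
months_to_break_even = upfront_cost / monthly_saving
print(f"Saves ${monthly_saving:,.0f}/month, breaks even in {months_to_break_even:.1f} months")
```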

The combination almost everyone ends up with

For mature production systems, the question is rarely prompting or fine-tuning. It's some version of: a frontier model behind a careful prompt for the hard reasoning steps, a retrieval layer feeding it the proprietary knowledge it doesn't have, and one or two small fine-tuned models in front of or behind that pipeline doing specific narrow jobs (classification, format normalisation, fast first-pass triage) at a fraction of the cost.
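
In code the shape is unglamorous; the three callables below are hypothetical stand-ins for the pieces just described:

```python
def handle_request(query: str, triage, retrieve, frontier) -> str:
    """Sketch of the layered shape only; triage, retrieve and frontier are hypothetical callables."""
    route = triage(query)                      # small fine-tuned model: cheap first-pass routing
    if route == "out_of_scope":
        return "Sorry, that's outside what this assistant covers."
    passages = retrieve(query, k=4)            # retrieval layer: the proprietary knowledge
    return frontier(query, context=passages)   # frontier model behind a careful prompt: the hard reasoning
```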

That layered architecture is what almost every production system we run looks like once you peel back the marketing slide. The trick is to build it in that order — prompting first, retrieval second, fine-tuning only when the first two stop being enough.

Decision ladder

When to climb from prompting to fine-tuning

Start at prompting. Add RAG when grounding in private data is the bottleneck. Reach for fine-tuning only when neither has closed the gap.

  1. Prompting. Try this first. Cheapest, fastest to iterate. Climb when the model needs private data.
  2. + RAG. Add retrieval so answers are grounded in your docs. Climb when a quality, latency, or cost gap remains.
  3. + Fine-tune. Train a smaller model. Cheapest at high volume.

Most production systems we ship live on rungs 1 and 2.

The decision tree, in three lines

When a team asks us which one to reach for, our answer collapses into three checks:

  1. Can a careful prompt with structured output get us 90% of the way? If yes, ship that and move on.
  2. Is the gap a knowledge gap? If yes, build retrieval before you train anything.
  3. Is the gap a behaviour gap that retrieval can't close, and is the volume high enough to justify the engineering? Now fine-tuning is on the table.

Most teams that follow that order ship something useful in weeks instead of quarters, and they end up with a system they can actually evolve.