RAG vs Fine-Tuning: How to Actually Choose

“Should we use RAG or fine-tune?” is one of the most common, and most misframed, questions in applied AI. The two techniques solve different problems, and the right answer is often “neither yet.” Here’s a framework for deciding.

What each technique actually does

It helps to be precise about the mechanism, because the marketing blurs them.

Retrieval-augmented generation (RAG) injects knowledge into the model at inference time. You retrieve relevant documents and paste them into the prompt. The model’s weights never change; you’re just giving it better context.

Fine-tuning changes the model’s behaviour by continuing training on your examples. It’s good at teaching form, tone, and structure — not facts.

That distinction is the whole game:

RAG is for knowledge. Fine-tuning is for behaviour.

If your problem is “the model doesn’t know about our 2026 product catalog,” that’s knowledge — reach for RAG. If your problem is “the model won’t consistently output our exact JSON format / house style,” that’s behaviour — consider fine-tuning (or just better prompting).

The decision framework

Walk these in order. Stop at the first one that fits.

1. Can a better prompt fix it?

Most problems that trigger a “we need to fine-tune” instinct can be solved with a clearer prompt, a few examples (few-shot), and a structured output schema. This is the cheapest, fastest, most maintainable option. Exhaust it first.
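As a rough sketch of what “a few examples and a structured output schema” can look like, the snippet below builds a few-shot message list and checks a model reply against a small schema. The ticket-triage task, the `FEW_SHOT` examples, and the `REQUIRED_KEYS` schema are all invented for illustration, and the API call itself is left out so the logic can be tried offline.

```python
import json

# Hypothetical task: classify support tickets into a fixed JSON shape.
SYSTEM = (
    "Classify the support ticket. Reply with a single JSON object with keys "
    '"category" (one of: billing, bug, other) and "urgent" (true or false).'
)

# Hypothetical few-shot examples: (ticket text, ideal JSON reply).
FEW_SHOT = [
    ("I was charged twice this month.", {"category": "billing", "urgent": True}),
    ("The export button does nothing.", {"category": "bug", "urgent": False}),
]

REQUIRED_KEYS = {"category": str, "urgent": bool}
ALLOWED_CATEGORIES = {"billing", "bug", "other"}


def build_messages(ticket: str) -> list[dict]:
    """System prompt, then each example as a user/assistant turn, then the real input."""
    messages = [{"role": "system", "content": SYSTEM}]
    for text, reply in FEW_SHOT:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": json.dumps(reply)})
    messages.append({"role": "user", "content": ticket})
    return messages


def parse_reply(raw: str) -> dict:
    """Parse the model's reply and raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped key: {key}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data['category']}")
    return data
```

Sending `build_messages(ticket)` as the `messages` argument of a chat call and running each reply through `parse_reply` gives you a measurable failure rate. If that rate is already near zero, there is nothing left for fine-tuning to fix.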

2. Does the model lack facts it needs?

If the gap is information — internal docs, recent events, user-specific data — use RAG. A minimal pipeline:

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return resp.data[0].embedding

# At write time: chunk docs, embed each chunk, store vectors + text.
# At query time: embed the question, find the nearest chunks, stuff them
# into the prompt as context.
def answer(question: str, store) -> str:
    q_vec = embed(question)
    chunks = store.search(q_vec, k=5)        # nearest-neighbour lookup
    context = "\n\n".join(c.text for c in chunks)
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    ).choices[0].message.content
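The `store` object above is left undefined. As one possible stand-in for experimenting, here is a tiny in-memory store that ranks chunks by cosine similarity. The `Chunk` and `InMemoryStore` names are assumptions of this sketch, not part of any library, and a production system would more likely use a dedicated vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Brute-force nearest-neighbour search over every stored chunk."""

    def __init__(self) -> None:
        self.chunks: list[Chunk] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.chunks.append(Chunk(text, vector))

    def search(self, q_vec: list[float], k: int = 5) -> list[Chunk]:
        # Score every chunk against the query and keep the k most similar.
        ranked = sorted(self.chunks, key=lambda c: cosine(q_vec, c.vector), reverse=True)
        return ranked[:k]
```

At write time you would call `store.add(chunk_text, embed(chunk_text))` for each chunk; `answer(question, store)` then works unchanged.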

The hard part isn’t the embedding call — it’s chunking (how you split documents) and retrieval quality (whether you actually surface the right chunks). Spend your time there, not on the model choice.
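One simple way to keep chunk boundaries sane is to split on sentence ends and pack whole sentences up to a size limit. The sketch below does that with a naive regex. The `max_chars` default and the splitting rule are assumptions for illustration; real documents (abbreviations, headings, code) usually need a more careful splitter.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 500) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars becomes its own oversized chunk
    rather than being cut in the middle.
    """
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because chunks only ever end where a sentence ends, each one embeds a complete thought, which tends to matter more for retrieval than the exact size limit.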

Where I got this wrong — Josh: On an early project I was convinced we needed to fine-tune because the answers came back vague. I spent the better part of a week assembling training pairs before I realized the model was fine — my chunker was splitting documents mid-sentence, so retrieval was feeding it garbage context. Fixing the chunk boundaries solved it in an afternoon. Now “fine-tune” isn’t allowed in the conversation until I’ve ruled out prompting and retrieval quality.
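A cheap way to rule out retrieval before anyone says “fine-tune” is to measure it directly. The sketch below computes a hit rate at k over a small hand-labelled set: for each question, does any of the top-k retrieved chunks contain a phrase you know the right chunk contains? The `search` callable and the labelled cases are hypothetical stand-ins for your own retriever and data.

```python
from typing import Callable


def hit_rate_at_k(
    cases: list[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose top-k chunks include the expected phrase.

    cases:  (question, phrase that the correct chunk must contain)
    search: hypothetical retriever returning the text of the top-k chunks
    """
    if not cases:
        return 0.0
    hits = 0
    for question, expected in cases:
        retrieved = search(question, k)
        if any(expected.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(cases)
```

If this number is low, vague answers are a retrieval problem, and changing the model’s weights will not fix them.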

3. Is the behaviour wrong, repeatably?

If, after great prompting, the model still won’t reliably produce the form you need — and you have hundreds of high-quality examples — fine-tuning earns its keep. Signs you’re ready:

You can articulate the desired output precisely.
You have at least a few hundred clean input/output pairs.
The behaviour is stable (you won’t need to change it weekly).
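If those conditions hold, the training data usually ends up as one JSON object per line. The sketch below writes input/output pairs in the chat-style `messages` layout that OpenAI’s fine-tuning endpoint accepts for chat models. The system prompt, the de-duplication rule, and the `to_jsonl` name are assumptions for illustration, and other providers expect different layouts.

```python
import json


def to_jsonl(pairs: list[tuple[str, str]], system_prompt: str) -> str:
    """Turn (input, ideal_output) pairs into chat-format JSONL.

    Blank and exact-duplicate pairs are dropped, since they add no signal.
    """
    seen: set[tuple[str, str]] = set()
    lines: list[str] = []
    for user_text, ideal in pairs:
        key = (user_text.strip(), ideal.strip())
        if not key[0] or not key[1] or key in seen:
            continue
        seen.add(key)
        record = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": key[0]},
                {"role": "assistant", "content": key[1]},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Counting the lines that survive this cleaning is a quick reality check on whether you really have a few hundred usable pairs.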

They compose

This is the part people miss: RAG and fine-tuning aren’t mutually exclusive. A mature system often fine-tunes for format and tone, then uses RAG for facts at inference. The fine-tune makes outputs consistent; retrieval keeps them grounded and current.
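A sketch of how the two might compose: the request below targets a fine-tuned model for format and tone, while the retrieved chunks supply the facts. The model ID is a placeholder rather than a real model, and `build_request` is a hypothetical helper. The dictionary it returns mirrors the keyword arguments of the `client.chat.completions.create` call used earlier.

```python
# Placeholder ID: substitute the ID your fine-tuning job actually returns.
FINE_TUNED_MODEL = "ft:your-fine-tuned-model-id"


def build_request(question: str, chunks: list[str], model: str = FINE_TUNED_MODEL) -> dict:
    """Combine a fine-tuned model (behaviour) with retrieved context (knowledge)."""
    context = "\n\n".join(chunks)
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer using only the context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    }
```

Calling `client.chat.completions.create(**build_request(question, chunks))` would then send it. Nothing about the retrieval side changes when the model is swapped, which is why the two techniques can be adopted independently.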

| Dimension    | RAG                   | Fine-tuning         |
| ------------ | --------------------- | ------------------- |
| Changes      | The prompt context    | The model weights   |
| Best for     | Knowledge / freshness | Behaviour / format  |
| Update cost  | Re-index documents    | Re-train the model  |
| Time to ship | Hours to days         | Days to weeks       |
| Failure mode | Bad retrieval         | Overfitting / drift |

A rule of thumb

Start with prompting. Add RAG when the model needs to know things it doesn’t. Fine-tune only when you need it to consistently act a certain way and you’ve got the data to teach it. Most products never need step three.

Once you’ve got retrieval working, the natural next step is to let a model decide when to retrieve — which is exactly what an agent does. See Build Your First AI Agent in TypeScript for the loop that makes that possible.
