
RAG vs Fine-Tuning: How to Actually Choose

A practical decision framework for when to reach for retrieval-augmented generation, when to fine-tune, and when to do neither — with the trade-offs that matter.

0degrees Team 3 min read

“Should we use RAG or fine-tune?” is one of the most common, and most misframed, questions in applied AI. The two techniques solve different problems, and the right answer is often “neither yet.” Here’s a framework for deciding.

What each technique actually does

It helps to be precise about the mechanism, because the marketing blurs them.

Retrieval-augmented generation (RAG) injects knowledge into the model at inference time. You retrieve relevant documents and paste them into the prompt. The model’s weights never change; you’re just giving it better context.

Fine-tuning changes the model’s behaviour by continuing training on your examples, which updates its weights. It’s good at teaching form, tone, and structure; it is an unreliable way to teach facts.

That distinction is the whole game:

RAG is for knowledge. Fine-tuning is for behaviour.

If your problem is “the model doesn’t know about our 2026 product catalog,” that’s knowledge — reach for RAG. If your problem is “the model won’t consistently output our exact JSON format / house style,” that’s behaviour — consider fine-tuning (or just better prompting).

The decision framework

Walk these in order. Stop at the first one that fits.

1. Can a better prompt fix it?

Most “we need to fine-tune” instincts are solved by a clearer prompt, a few examples (few-shot), and a structured output schema. This is the cheapest, fastest, most maintainable option. Exhaust it first.
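
As a rough illustration of what "clearer prompt, few-shot examples, and a structured output schema" can look like together, here is a minimal sketch. The sentiment task, the `REQUIRED_FIELDS` schema, the example pairs, and the helper names are all assumptions made up for this example; only the message layout (system, then alternating user/assistant examples, then the real input) is the general pattern.

```python
import json

# Hypothetical schema for illustration: required keys and their Python types.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}

# Hypothetical few-shot examples showing the exact output form we want.
FEW_SHOT = [
    ("The checkout flow is so much faster now.",
     {"sentiment": "positive", "confidence": 0.9}),
    ("App crashes every time I open settings.",
     {"sentiment": "negative", "confidence": 0.95}),
]

SYSTEM_PROMPT = (
    "Classify the sentiment of the user's message. "
    "Reply with a single JSON object with exactly these keys: "
    '"sentiment" ("positive", "negative" or "neutral") and '
    '"confidence" (a number from 0 to 1). No other text.'
)


def build_messages(user_text: str) -> list[dict]:
    """Assemble the system prompt, the few-shot pairs, then the real input."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_input, example_output in FEW_SHOT:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": json.dumps(example_output)})
    messages.append({"role": "user", "content": user_text})
    return messages


def parse_reply(raw: str) -> dict:
    """Parse a model reply and check it against REQUIRED_FIELDS.

    Raises ValueError if the reply is not valid JSON or breaks the schema.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        raise ValueError(f"expected keys {sorted(REQUIRED_FIELDS)}")
    for key, expected_type in REQUIRED_FIELDS.items():
        value = data[key]
        # JSON has a single number type, so accept ints where floats are expected.
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected_type):
            raise ValueError(f"{key!r} should be {expected_type.__name__}")
        data[key] = value
    return data
```

If `parse_reply` rejects a meaningful share of replies even after the prompt and examples have been tightened, that is evidence the problem may be behavioural rather than a prompting gap.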

2. Does the model lack facts it needs?

If the gap is information — internal docs, recent events, user-specific data — use RAG. A minimal pipeline:

from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return resp.data[0].embedding

# At write time: chunk docs, embed each chunk, store vectors + text.
# At query time: embed the question, find the nearest chunks, stuff them
# into the prompt as context.
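# `store` is any vector index whose search(vector, k) returns chunks
# that expose a .text attribute.
def answer(question: str, store) -> str: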
    q_vec = embed(question)
    chunks = store.search(q_vec, k=5)        # nearest-neighbour lookup
    context = "\n\n".join(c.text for c in chunks)
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    ).choices[0].message.content
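
The `store` passed to `answer` is left abstract in the snippet above. For small corpora, a brute-force in-memory version is often enough to get started. The sketch below is one possible shape, not a recommendation of a particular vector database; `Chunk`, `InMemoryStore`, and `cosine` are names invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Brute-force vector store: every query scans all chunks (O(n))."""

    def __init__(self) -> None:
        self.chunks: list[Chunk] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.chunks.append(Chunk(text, vector))

    def search(self, query_vector: list[float], k: int = 5) -> list[Chunk]:
        """Return the k chunks most similar to query_vector, best first."""
        ranked = sorted(
            self.chunks,
            key=lambda c: cosine(query_vector, c.vector),
            reverse=True,
        )
        return ranked[:k]
```

At write time you would call `store.add(chunk_text, embed(chunk_text))` for each chunk; once the corpus outgrows a linear scan, the same `search` interface can sit in front of an approximate nearest-neighbour index.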

The hard part isn’t the embedding call — it’s chunking (how you split documents) and retrieval quality (whether you actually surface the right chunks). Spend your time there, not on the model choice.
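
Both concerns can be made concrete with very little code. The sketch below assumes the simplest possible chunking strategy (fixed-size word windows with overlap) and a simple retrieval metric (recall@k over a small hand-labelled set of questions). Real pipelines often split on tokens or on document structure instead; the function names and default sizes here are illustrative assumptions, not tuned values.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("need 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def recall_at_k(retrieved_ids: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of questions whose known-relevant chunk id appears in the
    top k retrieved ids. retrieved_ids[i] is the ranked result list for
    question i; relevant_ids[i] is the chunk id a human marked as correct."""
    if not relevant_ids:
        return 0.0
    hits = sum(
        1 for got, want in zip(retrieved_ids, relevant_ids) if want in got[:k]
    )
    return hits / len(relevant_ids)
```

Tracking recall@k while you vary chunk size and overlap gives you a number to optimise, which is usually more informative than swapping the generation model.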

3. Is the behaviour wrong, repeatably?

If, after great prompting, the model still won’t reliably produce the form you need — and you have hundreds of high-quality examples — fine-tuning earns its keep. Signs you’re ready:

  • You can articulate the desired output precisely.
  • You have at least a few hundred clean input/output pairs.
  • The behaviour is stable (you won’t need to change it weekly).
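
The second sign is easy to check mechanically. The sketch below cleans a list of input/output pairs, counts what survives, and serialises the result as chat-format JSONL (one `{"messages": [...]}` object per line), which is the layout OpenAI's fine-tuning API accepts for chat models; other providers may expect something different. The `MIN_EXAMPLES` value of 200 is an assumed stand-in for "a few hundred", and the helper names are invented for this example.

```python
import json

MIN_EXAMPLES = 200  # assumed stand-in for "a few hundred"; adjust to your case


def clean_pairs(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Strip whitespace, drop pairs with an empty side, remove exact duplicates."""
    seen = set()
    cleaned = []
    for prompt, completion in pairs:
        pair = (prompt.strip(), completion.strip())
        if all(pair) and pair not in seen:
            seen.add(pair)
            cleaned.append(pair)
    return cleaned


def to_jsonl(pairs: list[tuple[str, str]], system_prompt: str) -> str:
    """Serialise pairs as one chat-format JSON object per line."""
    lines = []
    for prompt, completion in pairs:
        record = {
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


def ready_to_fine_tune(pairs: list[tuple[str, str]]) -> bool:
    """True when enough clean, distinct examples remain after filtering."""
    return len(clean_pairs(pairs)) >= MIN_EXAMPLES
```

Counting after deduplication matters: a dataset of several hundred rows that collapses to a few dozen distinct pairs is not yet the dataset the checklist asks for.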

They compose

This is the part people miss: RAG and fine-tuning aren’t mutually exclusive. A mature system often fine-tunes for format and tone, then uses RAG for facts at inference. The fine-tune makes outputs consistent; retrieval keeps them grounded and current.
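
In code, composing the two is mostly a matter of pointing the retrieval prompt at a fine-tuned model. The sketch below only builds the request arguments, so it runs without network access; `FINE_TUNED_MODEL` is a placeholder, not a real model id, and `build_request` is a name invented for this example.

```python
# Placeholder id: the real value comes from your provider's fine-tuning job.
FINE_TUNED_MODEL = "ft:your-model-id"


def build_request(question: str, chunks: list[str], model: str = FINE_TUNED_MODEL) -> dict:
    """Build chat-completion arguments: the fine-tuned model supplies the
    form, the retrieved chunks supply the facts."""
    context = "\n\n".join(chunks)
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer using only the context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQ: {question}"},
        ],
    }
```

With the OpenAI Python client, the result can be passed straight through as `client.chat.completions.create(**build_request(question, chunks))`.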

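| Dimension    | RAG                   | Fine-tuning         |
|--------------|-----------------------|---------------------|
| Changes      | The prompt context    | The model weights   |
| Best for     | Knowledge / freshness | Behaviour / format  |
| Update cost  | Re-index documents    | Re-train the model  |
| Time to ship | Hours to days         | Days to weeks       |
| Failure mode | Bad retrieval         | Overfitting / drift |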

A rule of thumb

Start with prompting. Add RAG when the model needs to know things it doesn’t. Fine-tune only when you need it to consistently act a certain way and you’ve got the data to teach it. Most products never need step three.
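
The whole framework fits in a few lines if you write it as a function. This is only a sketch of the ordering described in this article: the boolean inputs are judgments you have to make yourself, and the `min_examples` default of 200 is an assumed stand-in for "a few hundred".

```python
def choose_approach(
    prompt_fixes_it: bool,
    lacks_facts: bool,
    behaviour_wrong_repeatably: bool,
    clean_examples: int = 0,
    min_examples: int = 200,  # assumed threshold for "a few hundred"
) -> str:
    """Walk the steps in order and return the first one that fits."""
    if prompt_fixes_it:
        return "prompting"
    if lacks_facts:
        return "rag"
    if behaviour_wrong_repeatably and clean_examples >= min_examples:
        return "fine-tuning"
    return "neither yet"
```

For example, `choose_approach(False, False, True, clean_examples=50)` returns `"neither yet"`: the behaviour problem is real, but there is not enough data to teach it.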

Once you’ve got retrieval working, the natural next step is to let a model decide when to retrieve — which is exactly what an agent does. See Build Your First AI Agent in TypeScript for the loop that makes that possible.
