LLM Fine-Tuning vs RAG vs Prompt Engineering: When to Use Each

Summary

Fine-tuning, RAG, and prompt engineering solve different problems. Prompt engineering changes how the LLM responds using instructions — it costs nothing but cannot add new knowledge. RAG (Retrieval-Augmented Generation) gives the LLM access to your documents at inference time — it costs $10,000–$80,000 to build and is the right choice for knowledge-intensive applications (document Q&A, knowledge bases). Fine-tuning trains the model on your domain data to change its behaviour and style — it costs $20,000–$150,000+ and is the right choice when you need the model to adopt a specific tone, follow proprietary formats, or excel at a narrow task. Most business applications benefit from RAG, not fine-tuning.

Key Takeaways

  • Prompt engineering is free but cannot teach an LLM new facts. If your LLM gives wrong answers because it doesn't know your data, prompt engineering will not fix it.

  • RAG is the right choice for 80% of business knowledge applications — it gives the LLM access to your documents without retraining the model.

  • Fine-tuning is the right choice when you need to change the model's behaviour (tone, format, reasoning style) on a narrow task — not when you need to add knowledge.

  • Fine-tuning requires at least 1,000 high-quality labeled examples (3,000–10,000 for reliable results). If you can't produce that, RAG is your only option.

  • Most teams that choose fine-tuning first discover they needed RAG — and spend 3–6 months on fine-tuning before pivoting. Know the difference before you start.

A client came to us last year with a specific complaint. Their chatbot kept giving wrong answers about their product pricing. The model didn't know about a pricing change made four months earlier.

Their first instinct: "We need to fine-tune the model on our pricing data."

We asked one question. "Would the chatbot answer correctly if you pasted the current pricing page into its context?"

They thought about it. "Yes, probably."

"Then you don't need fine-tuning. You need RAG."

They had already scoped a three-month fine-tuning project. We had them in production with a RAG pipeline in six weeks.

This is the most expensive mistake we see. Teams choose the wrong LLM adaptation method for their problem — not because they don't care, but because nobody explained the difference clearly. Fine-tuning, RAG, and prompt engineering are not interchangeable. They solve different problems. Picking the wrong one does not just waste money. It wastes the months you spend building the wrong thing before you figure it out.

This post is the decision framework.


TL;DR

Prompt engineering is free and should always be your first step — but it cannot teach an LLM new facts. RAG gives the LLM access to your documents at inference time and solves 80% of business knowledge problems for $10,000–$80,000. Fine-tuning changes the model's weights to alter its behaviour (tone, format, narrow task performance) — it costs $20,000–$150,000+ and requires 1,000–10,000 labeled examples. The diagnosis test: "Would the model answer correctly if I pasted the relevant document into its context?" If yes, use RAG. If no, fine-tuning might be the answer.


The three adaptation methods at a glance

Before the details, here is the full picture in one table.

| Method | What it changes | Typical cost | Best for |
|---|---|---|---|
| Prompt engineering | How the LLM interprets your instructions | $0 | Behaviour, format, persona, reasoning steps |
| RAG | What data the LLM sees at inference time | $10,000–$80,000 | Knowledge-intensive applications — document Q&A, knowledge bases, customer support |
| Fine-tuning | The model's weights — its internalized patterns | $20,000–$150,000+ | Narrow task performance, domain-specific tone, proprietary output formats |

Three tools. Three distinct problems. None of them does what the others do.


Prompt engineering: what it can and cannot do

Prompt engineering means changing how you communicate with the model — the instructions you give it, the examples you include, the format you ask for, the persona you assign it.

It costs nothing. It should always be your first step.

What prompt engineering can do well:

  • Give the model a persona ("You are a customer support agent for Acme Corp. Be friendly and concise.")

  • Define an output format ("Respond only in valid JSON with these fields: ...")

  • Add reasoning steps ("Think step by step before answering.")

  • Provide a handful of examples (few-shot prompting)

  • Set guardrails ("If the user asks about competitors, say you cannot comment on that.")

These are real capabilities. For many narrow, well-defined tasks, good prompt engineering is all you need.
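
To make this concrete, here is a minimal sketch of those techniques combined in one system prompt. It uses the OpenAI Python SDK purely as an example; the model name, JSON fields, and few-shot content are placeholders, and any chat-capable provider works the same way.

```python
# Persona, output format, guardrails, and a few-shot example expressed purely
# through prompting: no retraining, no retrieval. Field names are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """You are a customer support agent for Acme Corp. Be friendly and concise.
Respond only in valid JSON with the fields: "answer", "confidence", "followup".
If the user asks about competitors, say you cannot comment on that.
Think step by step before answering, but return only the final JSON."""

# Few-shot example: one demonstration of the exact behaviour we want
FEW_SHOT = [
    {"role": "user", "content": "Do you offer annual billing?"},
    {
        "role": "assistant",
        "content": '{"answer": "Yes, annual billing is available on all paid plans.", '
                   '"confidence": "high", "followup": "Want a monthly vs annual comparison?"}',
    },
]

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any chat model works
        messages=[{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
                  {"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```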

What prompt engineering cannot do:

  • Teach the model facts it was never trained on

  • Give it knowledge of your internal documents, policies, or data

  • Fix a model that consistently makes errors on a specific task because it lacks training examples

  • Change deep patterns in how the model generates text

The knowledge cutoff is fixed. If GPT-4 was trained on data through April 2024, it does not know your Q3 2025 pricing update. Telling it to "be accurate" will not change this. Writing a longer system prompt will not change this. Only RAG or retraining can give the model access to new information — and retraining (fine-tuning) does not actually help as much as people think, for reasons we will get to.

The common mistake: A team notices the model does not know their product. Their instinct is fine-tuning — "train it on our data." The correct fix is almost always RAG. Fine-tuning on factual data is unreliable for knowledge injection and expensive to maintain as facts change. RAG gives the model real-time access to your current documents.

Exhaust prompt engineering first. If the model's problem is knowledge — it simply does not know your data — move to RAG.


RAG: the right tool for 80% of business knowledge problems

RAG (Retrieval-Augmented Generation) is a technique that gives an LLM access to external documents at the time it generates a response.

The mechanics are straightforward. Your documents are stored in a vector database as numerical embeddings. When a user asks a question, the system converts the question to an embedding, searches the vector database for the most relevant document chunks, and injects those chunks into the LLM's context window. The model then generates an answer grounded in what it just read — not from memory.

Think of it like handing a research assistant a relevant excerpt right before they answer a question. They are not relying on what they memorized. They are reading the actual source.
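
Here is a minimal sketch of that loop, assuming the OpenAI SDK for embeddings and generation and an in-memory NumPy array standing in for the vector database. In production you would swap in a real vector store (pgvector, Pinecone, Qdrant) and chunks produced from your own documents.

```python
# Retrieve-then-generate in its simplest form: embed the question, find the
# closest document chunks by cosine similarity, and answer from those chunks.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Stand-ins for chunks produced from your actual documents
chunks = [
    "Pro plan: $49 per user per month, billed annually.",
    "Refunds are issued within 14 days of purchase.",
    "Enterprise customers get a dedicated support channel.",
]
chunk_vectors = embed(chunks)

def answer(question: str, k: int = 2) -> str:
    q = embed([question])[0]
    # Cosine similarity against every chunk, then keep the top k
    sims = chunk_vectors @ q / (np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q))
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```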

Problems RAG solves:

  • Customer support bots that answer questions about your current products, policies, and pricing

  • Internal knowledge bases where employees can query HR policies, compliance rules, or SOPs

  • Contract review tools that reference specific clauses from actual agreements

  • Compliance assistants that cite the exact regulation being applied

What RAG cannot solve:

  • Changing how the model writes (its tone, style, or output format)

  • Making the model better at a narrow task it performs poorly on

  • Teaching it domain-specific reasoning patterns it was not trained to use

Cost and timeline:

A simple RAG system — one document type, one use case — costs $10,000–$50,000 and takes 4–8 weeks. A production-grade system with multiple sources, access controls, and evaluation infrastructure costs $50,000–$80,000+ and takes 8–16 weeks.

The ongoing cost is modest: LLM API fees ($50–$500/month) and vector database hosting ($50–$300/month).

When RAG fails:

RAG depends on document quality. If your documents are inconsistent, poorly formatted, outdated, or contradictory, your retrieval will be poor and your answers will be unreliable. The most common cause of RAG underperformance is not the architecture — it is dirty source documents.

The second failure mode is poor chunking. Documents broken into chunks that are too small lose context. Chunks that are too large drown the LLM in irrelevant text. Getting the chunking strategy right for your document type is where experienced teams earn their value.
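
To illustrate the trade-off, here is a basic fixed-size chunker with overlap. The 800-character window and 150-character overlap are starting points rather than recommendations, and many teams chunk on document structure (headings, paragraphs) instead of raw characters.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows.

    The overlap keeps a sentence that straddles a boundary retrievable
    from either neighbouring chunk; larger windows carry more context
    but dilute retrieval precision."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
    return chunks
```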

Read the full guide to building a RAG pipeline if you want the technical details on architecture.


Fine-tuning: what it is, what it costs, and when it is worth it

Fine-tuning means taking a pre-trained model (GPT-4, Llama 3, Mistral) and continuing its training on a smaller dataset of your own examples. The model's weights are updated. Its internalized patterns change.

This is meaningfully different from RAG or prompt engineering. You are not changing what information the model has access to. You are changing how the model thinks and responds at a fundamental level.

The three problems fine-tuning actually solves:

1. Domain-specific output format. If you need the model to consistently produce structured medical transcriptions, legal citations in a specific format, or financial analysis in your firm's proprietary template — fine-tuning internalizes that format so the model applies it reliably without needing long format instructions in every prompt.

2. Style and tone adaptation. If you need a model that writes in formal legal language, matches your brand voice precisely, or communicates like a specialist in your domain — fine-tuning adjusts the model's generation patterns. Prompt engineering can get part of the way there. Fine-tuning goes deeper.

3. Narrow task performance. If you have thousands of labeled examples of a specific task — ICD-10 medical code extraction, sentiment classification for your domain, financial entity recognition — fine-tuning on those examples produces a model that outperforms prompt engineering or general-purpose RAG on that task.

What fine-tuning cannot do:

Fine-tuning does not give the model new knowledge it was not trained on. The training cutoff stays. If you fine-tune GPT-4 on examples from your 2024 policy documents and those policies change in 2025, the fine-tuned model does not know about the change.

This is why fine-tuning is a poor choice for knowledge injection. Companies that fine-tune a model on their internal docs expecting it to "know" their company discover this problem quickly: the model sounds better (it has adapted its tone) but still gets facts wrong (the facts it learned are now stale).

Data requirements:

Fine-tuning needs data. Real, labeled, high-quality data.

  • Minimum: 1,000 examples (input → desired output pairs)

  • Recommended: 3,000–10,000 examples for reliable performance

  • Each example must be representative of the task you want the model to learn

If you cannot produce 1,000 high-quality examples, you cannot fine-tune. Full stop. There is no workaround. If your dataset is not there yet, RAG is your only viable path.
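
For a sense of what those input → output pairs look like in practice, here is one hypothetical example in the chat-style JSONL format that provider fine-tuning APIs (OpenAI's, for instance) accept. The task, field names, and codes are illustrative.

```python
# One training example per line of train.jsonl. A real dataset repeats this
# structure 1,000+ times with consistent, carefully reviewed outputs.
import json

example = {
    "messages": [
        {"role": "system", "content": "Extract ICD-10 codes from the clinical note. Respond as JSON."},
        {"role": "user", "content": "Patient presents with type 2 diabetes and essential hypertension."},
        {"role": "assistant", "content": '{"codes": ["E11.9", "I10"]}'},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```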

Cost:

| Phase | What happens | Time | Cost range |
|---|---|---|---|
| Data preparation | Collect, clean, and label training examples | 4–12 weeks | $10,000–$80,000 |
| Training | Run fine-tuning via API or self-hosted | 1–4 weeks | $5,000–$30,000 |
| Evaluation and iteration | Test, identify gaps, retrain | 4–8 weeks | $5,000–$40,000 |
| **Total** | | 9–24 weeks | $20,000–$150,000+ |

The data preparation phase is almost always the bottleneck. Collecting and labeling 5,000 high-quality examples takes serious time and effort. Teams consistently underestimate this.

If you use a provider API (OpenAI fine-tuning, Anthropic fine-tuning), the training compute itself is fast — often hours, not weeks. What takes weeks is getting the data ready.
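
A sketch of that step, assuming the OpenAI Python SDK and the train.jsonl file prepared earlier; the base model name is a placeholder, so check your provider's list of fine-tunable models.

```python
# Uploading the dataset and starting the job is a couple of API calls.
# The expensive part (preparing train.jsonl) already happened.
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # placeholder base model
)
print(job.id, job.status)  # poll until complete, then use the returned fine-tuned model id
```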


The decision framework: which method for which problem

Here is the test. Work through these questions in order.

Step 1: Is the model's problem a knowledge gap?

Ask: "Would the model answer correctly if I pasted the relevant document into its context window?"

If yes — the problem is knowledge, not behaviour. Use RAG. The model knows how to answer; it just lacks the information.

If no — the problem might be behaviour (how it reasons or writes). Move to Step 2.

Step 2: Is the problem about how the model behaves — its tone, format, or reasoning approach?

Ask: "Is the output wrong because the model doesn't know the information, or because it responds in the wrong way (wrong format, wrong style, wrong reasoning pattern)?"

If wrong format or style — fine-tuning is likely the right tool, provided you have enough data.

If wrong information — go back to RAG.

Step 3: Do you have the data to fine-tune?

Can you produce 1,000+ high-quality labeled input-output examples? Can you maintain and update them as requirements evolve?

If no — fine-tuning is not viable right now. Use RAG + prompt engineering.

If yes — fine-tuning is a viable option.

Step 4: Have you tried prompt engineering first?

Always. If the problem can be solved with better instructions, few-shot examples, or a clearer system prompt, that saves months of build time and thousands of dollars.

The summary decision tree:

Is the problem a knowledge gap (the model doesn't know your data)?
  → YES: Use RAG
  → NO: Is the problem about behaviour (tone, format, narrow task)?
      → YES: Do you have 1,000+ labeled examples?
          → YES: Fine-tuning is viable
          → NO: Use RAG + prompt engineering until you have data
      → NO: Try better prompt engineering first
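
The same tree, written as a short function. The inputs are judgment calls a team makes rather than anything measured automatically; this just makes the logic explicit and easy to argue about.

```python
def choose_adaptation_method(
    knowledge_gap: bool,      # would pasting the right document fix the answer?
    behaviour_issue: bool,    # wrong tone, format, or reasoning pattern?
    labeled_examples: int,    # high-quality input/output pairs you can actually produce
) -> str:
    """Map the decision tree above onto its outcomes."""
    if knowledge_gap:
        return "Use RAG"
    if behaviour_issue:
        if labeled_examples >= 1000:
            return "Fine-tuning is viable"
        return "Use RAG + prompt engineering until you have the data"
    return "Try better prompt engineering first"
```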

The combined RAG + fine-tuning approach

Some applications need both.

A medical coding assistant is a good example. It needs to extract ICD-10 codes from clinical notes (a narrow task with a specific output format — fine-tuning territory) and it also needs to reference the latest clinical guidelines and payer-specific rules (a knowledge problem — RAG territory).

For this application, you would:

  1. Fine-tune the base model on thousands of examples of correct ICD-10 extraction (format and reasoning)
  2. Build a RAG layer over your clinical guidelines database (knowledge retrieval)
  3. At inference time, retrieve relevant guideline sections and pass them to the fine-tuned model, as in the sketch below
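
Here is a sketch of step 3 at inference time, assuming a hypothetical retrieve_guidelines helper as the RAG layer from step 2 and a fine-tuned model id of the kind your provider returns after training.

```python
from openai import OpenAI

client = OpenAI()

def code_clinical_note(note: str) -> str:
    # Step 2's RAG layer: retrieve_guidelines is a hypothetical helper that
    # returns the most relevant guideline passages for this note.
    guidelines = retrieve_guidelines(note, k=5)
    prompt = (
        "Current guidelines:\n" + "\n\n".join(guidelines)
        + "\n\nClinical note:\n" + note
        + "\n\nReturn the ICD-10 codes as JSON."
    )
    resp = client.chat.completions.create(
        # Hypothetical fine-tuned model id from step 1's training job
        model="ft:gpt-4o-mini-2024-07-18:your-org::example",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```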

The combined approach costs $50,000–$200,000+ and requires real ML expertise. It is the right answer for complex domain-specific applications where neither RAG alone nor fine-tuning alone is sufficient.

Build order matters. Always build RAG first. Evaluate it thoroughly. Only add fine-tuning if RAG + prompt engineering leaves a clear gap that fine-tuning would close. Most teams that start with fine-tuning skip this sequence and pay for it.


Real costs and timelines side by side

| Method | Build cost | Timeline | Ongoing cost | Maintenance burden |
|---|---|---|---|---|
| Prompt engineering | $0–$5,000 | Days–2 weeks | $0 | Low — update prompts as needed |
| RAG (simple) | $10,000–$50,000 | 4–8 weeks | $100–$800/month | Medium — update documents, monitor accuracy |
| RAG (production) | $50,000–$80,000+ | 8–16 weeks | $300–$1,500/month | Medium-high — multi-source maintenance, evaluations |
| Fine-tuning | $20,000–$150,000+ | 9–24 weeks | $200–$2,000/month | High — retrain when behaviour drifts, maintain data pipeline |
| RAG + fine-tuning | $50,000–$200,000+ | 16–32 weeks | $500–$3,000/month | Very high — both systems require ongoing attention |

A few things worth noting from this table.

First, the fine-tuning ongoing cost is real and often underestimated. A fine-tuned model's performance drifts over time. Requirements change. The domain evolves. You will need to retrain — and retraining means going through the data preparation process again.

Second, RAG's ongoing cost is dominated by document maintenance, not infrastructure. If your documents are clean and well-organized to start, ongoing maintenance is manageable. If they are not, the problem compounds.

Third, prompt engineering has essentially no ongoing infrastructure cost but does require human time to update and maintain prompts as the product evolves. It is not truly "free" — just cheap.


The most common mistake

Here is what we see repeatedly.

A team wants to build an AI assistant for their business. The model does not know their internal data — their products, their pricing, their policies. They read about fine-tuning. It sounds powerful. They scope a fine-tuning project.

Three months in, they have labeled 2,000 examples, run the fine-tuning job, and deployed the model. It sounds better. It has absorbed some of the domain vocabulary. But it still gets facts wrong.

The pricing it learned during fine-tuning is already outdated. A product was discontinued after training. A policy changed. The model is confidently wrong.

They rebuild with RAG. The system is in production six weeks later.

This is not a made-up scenario. It is a pattern we have seen enough times that we ask every prospective client the same diagnostic question before scoping anything: "Would the model answer correctly if you pasted the relevant document into its context?"

If the answer is yes, the problem is knowledge access — and that is a RAG problem, not a fine-tuning problem.

The question saves teams months of work.

Fine-tuning is a real tool. It solves real problems. But the problems it solves are narrower than most teams assume. Format adaptation. Tone. A specific reasoning pattern on a well-defined task with abundant labeled data. These are fine-tuning problems.

"The model doesn't know our data" is almost never a fine-tuning problem.


Questions to ask before committing to an approach

Before choosing fine-tuning:

  • Do we have 1,000+ high-quality labeled examples ready, or do we need to create them?

  • How often does the underlying behaviour we want to teach the model change? (Frequent changes mean expensive retraining cycles.)

  • Has a prompt engineering approach been tried and found insufficient? (If not, start there.)

  • Would RAG solve the core problem if we built it first?

Before choosing RAG:

  • Are our source documents clean, current, and well-structured?

  • Do we have a clear test set of 20–30 representative questions we can score the system against?

  • Do we understand who gets access to which documents? (Multi-tenant access control adds significant complexity.)

  • Is there a knowledge problem here at all, or is the issue how the model writes — which RAG cannot fix?

Before choosing prompt engineering as a long-term solution:

  • Is the context we need to inject small enough to fit reliably in the prompt, or will it grow over time?

  • Will this approach scale if query volume increases significantly?

  • Are we deferring a knowledge or behaviour problem that will eventually require RAG or fine-tuning?


Ready to choose the right approach?

Most business AI applications need RAG, not fine-tuning. Most teams that start with fine-tuning discover this three months later.

At RaftLabs, we help engineering teams and CTOs evaluate their specific use case, match it to the right adaptation method, and build the technical implementation — whether that is a RAG pipeline, LLM integration, fine-tuning pipeline, or a combination.

We start every engagement with the diagnostic conversation — not with a default recommendation. If you have a specific AI use case and want an honest assessment of what it will actually take, talk to our team.

Frequently Asked Questions

What is LLM fine-tuning?

LLM fine-tuning is the process of continuing the training of a pre-trained language model (GPT-4, Llama 3, Mistral) on a smaller dataset of your domain-specific examples. The model's weights are updated to improve performance on your specific task. Fine-tuning costs $20,000–$150,000+ to implement and requires 1,000–10,000 high-quality labeled examples. It is best suited for applications where the model needs to adopt a specific tone, follow proprietary output formats, or perform well on a narrow task.

What is RAG (Retrieval-Augmented Generation)?

RAG is a technique that gives an LLM access to external documents at the time of generating a response. Instead of relying only on what the model learned during training, a RAG system retrieves relevant document chunks from a vector database and includes them in the model's context window. The model then generates a response grounded in those retrieved documents. RAG costs $10,000–$80,000 to build and is the right choice for document Q&A, knowledge bases, and any application where the LLM needs to answer questions about your company's data.

When is fine-tuning better than RAG?

Fine-tuning is better than RAG when the problem is behaviour, not knowledge. If you need the model to respond in a specific tone (formal legal language, brand voice), follow a proprietary output format (structured JSON for your system), or perform well on a narrow task (medical code extraction, financial statement parsing), fine-tuning adjusts the model's weights to internalise these patterns. RAG cannot change how the model behaves — it can only change what information the model has access to.

Can you combine RAG and fine-tuning?

Yes. RAG + fine-tuning is used for complex applications where you need both accurate knowledge retrieval and domain-specific behaviour. Example — a legal AI assistant might use fine-tuning to adopt legal writing style and RAG to retrieve relevant case law and statutes. The combined approach costs $50,000–$200,000+ and requires significant ML expertise. Most businesses should exhaust prompt engineering and RAG before considering fine-tuning, and only add fine-tuning if both prove insufficient.

How long does fine-tuning take?

The fine-tuning timeline breaks into three phases — data preparation (collecting and labeling 1,000–10,000 examples: 4–12 weeks), training (running fine-tuning via the model provider API or self-hosted: 1–4 weeks), and evaluation and iteration (testing and improving: 4–8 weeks). Total: 9–24 weeks. If you are using a provider API (OpenAI, Anthropic), training itself is fast (hours). The bottleneck is almost always data preparation — getting enough high-quality labeled examples.