Spending significant tokens on system prompts trying to get the model to behave consistently?
Base model producing outputs in the wrong format or style despite detailed prompting?
LLM Fine-Tuning Services
General-purpose language models are trained to be useful to everyone. Fine-tuning makes them specifically useful to you -- adapting their behaviour, vocabulary, tone, and output format to your domain, your data, and your product requirements.
We fine-tune language models on your datasets to improve accuracy on your specific tasks, reduce prompt length and inference cost, and produce outputs that match your brand voice and format requirements without extensive prompt engineering.
Fine-tuning on OpenAI, Llama 3, Mistral, and Phi models
Domain adaptation, output format alignment, and tone calibration
Training data curation, model evaluation, and production deployment
Cost and latency analysis -- fine-tuning vs. RAG vs. prompt engineering for your use case
RaftLabs provides LLM fine-tuning services for teams that need language models adapted to their specific domain, output format, or tone requirements. We handle training data curation, fine-tuning on OpenAI (GPT-4o mini, GPT-3.5), open-source models (Llama 3, Mistral, Phi), evaluation against task-specific benchmarks, and production deployment. Fine-tuning is recommended when prompt engineering alone cannot achieve consistent output quality, when inference cost reduction is required at scale, or when domain-specific vocabulary significantly affects model performance.
When to fine-tune vs. prompt vs. RAG
Most teams reach for fine-tuning too early. The decision tree:
Try prompt engineering first. A well-structured system prompt with few-shot examples solves most output format and consistency problems without any training data.
Add RAG if the model needs your knowledge. When the model needs to answer questions about your specific documents, products, or data, retrieval-augmented generation gives it that knowledge without fine-tuning.
Fine-tune when prompt engineering cannot achieve a consistent output format despite detailed instructions, when inference cost at your expected volume makes large model usage uneconomical, or when domain-specific terminology significantly degrades base model accuracy.
We will tell you which path is right for your use case -- including if fine-tuning is not the answer.
What we do
Training data curation
The quality of your fine-tuning data determines the quality of the fine-tuned model. We help design the training data format, curate examples from your existing data sources, filter for consistency and quality, augment thin datasets with carefully generated synthetic examples, and structure prompt-completion pairs that teach the model the exact behaviour you need. Bad training data produces a confidently wrong model.
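To make the prompt-completion structure concrete, here is one example in the JSONL chat format that OpenAI's fine-tuning API accepts; the classification task, labels, and wording are hypothetical placeholders, not client data:

```python
import json

# One training example in OpenAI's chat fine-tuning JSONL format.
# The ticket-classification task and labels here are hypothetical.
example = {
    "messages": [
        {"role": "system",
         "content": "Classify the support ticket as one of: billing, bug, feature_request."},
        {"role": "user",
         "content": "I was charged twice for my March invoice."},
        {"role": "assistant",
         "content": "billing"},
    ]
}

# A training file is one JSON object per line (JSONL), hundreds to
# thousands of lines like this one.
line = json.dumps(example)
print(line)
```

Every example teaches the model one instance of the target behaviour, which is why inconsistent labels or phrasing across examples degrade the result.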
Domain adaptation
Fine-tuning a general-purpose model on your domain vocabulary, technical terminology, and document structure. Medical, legal, financial, and technical domains all have terminology patterns that general models handle poorly. Domain-adapted models produce more accurate outputs on domain-specific tasks without requiring extensive prompting to establish context on every call.
Output format alignment
When your application requires structured JSON output, specific markdown formatting, or constrained response styles that the base model ignores despite instructions, fine-tuning can reliably enforce the format. Examples include: structured data extraction in a defined schema, customer-facing responses in your brand voice and tone, classification outputs in a specific label format.
Inference cost reduction
A fine-tuned smaller model (GPT-4o mini, Llama 3 8B) can match the output quality of a larger base model on a specific task -- at significantly lower inference cost. For high-volume production applications where token cost compounds, replacing a GPT-4o production deployment with a fine-tuned smaller model for the specific task can cut inference costs by 60--80%. We benchmark the cost-quality trade-off before recommending this approach.
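A back-of-the-envelope version of that trade-off looks like this; the volumes and per-million-token prices below are illustrative assumptions, not current list prices:

```python
# Hypothetical monthly volume and illustrative per-1M-token prices.
requests_per_month = 1_000_000
tokens_per_request = 1_500   # prompt + completion combined

large_model_price = 5.00     # $ per 1M tokens, assumed
small_ft_price = 1.20        # $ per 1M tokens, fine-tuned small model, assumed

monthly_tokens = requests_per_month * tokens_per_request

large_cost = monthly_tokens / 1_000_000 * large_model_price
small_cost = monthly_tokens / 1_000_000 * small_ft_price
savings_pct = (1 - small_cost / large_cost) * 100

print(f"large model: ${large_cost:,.0f}/mo, "
      f"fine-tuned small model: ${small_cost:,.0f}/mo "
      f"({savings_pct:.0f}% saving)")
```

At these assumed prices the saving lands at 76%, inside the 60--80% range quoted above; the real benchmark uses your actual volumes and the providers' current pricing.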
Open-source model fine-tuning
Fine-tuning Llama 3, Mistral, Phi-3, and Gemma models using LoRA/QLoRA adapters for deployment on your own infrastructure. Eliminates per-token costs entirely for high-volume applications. Required for air-gapped deployments or data residency requirements that prevent using hosted API providers. We handle GPU infrastructure setup, fine-tuning runs, model serving (vLLM, Ollama, or custom), and integration with your application.
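A minimal sketch of the adapter setup, assuming the Hugging Face transformers and peft libraries; the model name and hyperparameters are illustrative starting points, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model and LoRA hyperparameters -- tuned per task.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # a small fraction of total weights
```

Only the adapter weights are trained and saved, which is what makes runs on a single modest GPU practical.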
Evaluation and regression testing
A fine-tuned model without an evaluation framework is a liability. We design task-specific evaluation benchmarks, run before/after comparisons, and build regression test suites that catch output degradation when models or prompts are updated. We also build automated evaluation pipelines that run on every model update before promotion to production -- the infrastructure that makes fine-tuned model maintenance predictable.
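One such regression check might look like the sketch below: format compliance for a hypothetical JSON extraction task, with the model outputs hard-coded where a real pipeline would call the fine-tuned model on benchmark inputs:

```python
import json

# Hypothetical model outputs on a held-out test set; in a real pipeline
# these come from calling the fine-tuned model on benchmark inputs.
outputs = [
    '{"name": "Acme Corp", "amount": 1200}',
    '{"name": "Globex", "amount": 300}',
    'Sorry, I cannot help with that.',   # a format failure
]

REQUIRED_KEYS = {"name", "amount"}

def is_compliant(text: str) -> bool:
    """True if the output parses as JSON and contains the required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()

compliance = sum(is_compliant(o) for o in outputs) / len(outputs)
print(f"format compliance: {compliance:.0%}")
```

Promotion to production is then gated on thresholds like this one, alongside accuracy and other task-specific metrics.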
Not sure if fine-tuning is the right path?
Tell us the use case, your current prompt approach, and where the base model falls short. We'll tell you whether fine-tuning is the answer -- or whether there's a faster fix.
Related services
Generative AI Integration -- integrating LLMs into existing applications
RAG Pipeline Development -- knowledge grounding without fine-tuning
Generative AI Consulting -- strategy and architecture before committing to a build approach
Custom AI Development -- AI-native products from scratch
Machine Learning Development -- ML systems for structured data and classification tasks
Frequently asked questions
What is fine-tuning, and when should I use it?
Fine-tuning is the process of continuing to train a pre-trained language model on your specific data so it adapts to your task, domain, and output requirements. Use fine-tuning when prompt engineering alone cannot produce a consistent output format (the model keeps ignoring your instructions), when you need significant inference cost reduction at scale (a fine-tuned smaller model can outperform a larger model with a long system prompt), or when domain-specific vocabulary significantly degrades base model performance. Fine-tuning is not always the right answer -- start with prompt engineering and RAG first.
Which models do you fine-tune?
OpenAI fine-tuning API: GPT-4o mini and GPT-3.5 Turbo (hosted fine-tuning, no infrastructure required). Open-source models: Llama 3 (8B, 70B), Mistral 7B, Phi-3, and Gemma (require GPU infrastructure for training). Google Gemini fine-tuning via Vertex AI. The right model depends on your budget (open-source eliminates per-token costs), your data privacy requirements (open-source runs on your infrastructure), and your accuracy requirements (larger models generally fine-tune to higher accuracy but cost more to run).
How much training data do I need?
For OpenAI fine-tuning: 50--100 high-quality examples is the minimum; 500--1,000 is recommended for reliable improvement; 5,000+ for significant domain adaptation. Quality matters more than quantity -- 100 carefully curated examples outperform 10,000 inconsistent ones. For open-source model fine-tuning (full fine-tuning or LoRA/QLoRA adapters): 1,000--50,000 examples depending on the degree of adaptation required. We assess your existing data and help curate or generate training examples if your dataset is thin.
What is LoRA, and when do you use it?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains a small set of adapter weights rather than the full model. It is dramatically cheaper in compute and memory than full fine-tuning while achieving comparable results for most tasks. QLoRA extends this with quantisation for even lower memory requirements. LoRA is the standard approach for fine-tuning open-source models on modest GPU infrastructure. We use LoRA/QLoRA for open-source model fine-tuning, reserving full fine-tuning for tasks that genuinely require it.
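The scale difference is visible in the arithmetic: a rank-r LoRA adapter on a d_out x d_in weight matrix trains r x (d_in + d_out) parameters instead of d_in x d_out. The dimensions below are illustrative of a single attention projection in a 4096-wide transformer layer:

```python
# One weight matrix in a hypothetical 4096-wide transformer layer.
d_in, d_out = 4096, 4096
full_params = d_in * d_out          # full fine-tuning updates every weight

r = 16                              # LoRA rank
lora_params = r * (d_in + d_out)    # two low-rank factors: (d_out x r), (r x d_in)

print(f"full: {full_params:,} weights; LoRA r={r}: {lora_params:,} weights "
      f"({lora_params / full_params:.2%} of full)")
```

For this layer the adapter is under 1% of the full weight count, which is why LoRA runs fit on modest GPUs.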
How do you measure whether fine-tuning worked?
We establish a benchmark before fine-tuning -- a representative set of inputs with expected outputs, evaluated on your task-specific metrics (accuracy, format compliance, domain terminology usage, output length consistency). The fine-tuned model is evaluated against this benchmark on a held-out test set. We only recommend proceeding to production deployment when benchmark improvement is statistically significant. Fine-tuning that does not improve over the baseline prompt-engineered base model is not worth the cost.
How much does a fine-tuning project cost?
A fine-tuning project's cost covers training data curation, fine-tuning runs, evaluation, and deployment. For OpenAI fine-tuning (GPT-4o mini or GPT-3.5), the OpenAI training API costs are low ($1--10 for typical datasets) -- the project cost is primarily in data curation and evaluation work ($8,000--$25,000). For open-source model fine-tuning with infrastructure setup, $20,000--$60,000 including GPU compute, deployment infrastructure, and evaluation framework.