
What Is Fine-Tuning an AI Model?

Neil Ruaro·Founder, Conbersa
fine-tuning · llm · machine-learning · ai

Fine-tuning is the process of taking a pre-trained large language model and training it further on a smaller, task-specific dataset to improve its performance on a particular domain, style, or use case. Rather than training a model from scratch - which requires billions of data points and millions of dollars in compute - fine-tuning adapts an existing model's capabilities using hundreds or thousands of carefully selected examples. According to Stanford HAI's 2024 AI Index Report, fine-tuning has become the dominant approach for deploying specialized AI systems, with over 60 percent of enterprise AI projects using some form of fine-tuned model rather than general-purpose base models.

How Does Fine-Tuning Work?

Fine-tuning builds on the concept of transfer learning - the idea that knowledge learned from one task can be applied to another. A base LLM has already learned general language understanding from training on trillions of words. Fine-tuning adjusts the model's internal weights to specialize this general knowledge for a specific purpose.

The process follows three stages.

Data preparation. You assemble a dataset of input-output examples that demonstrate the behavior you want. For a customer support bot, this might be hundreds of pairs of customer questions and ideal responses. For a medical summarization tool, it might be clinical notes paired with concise summaries. The format varies by provider, but most use a conversational structure with system prompts, user messages, and assistant responses.
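As an illustration, the conversational format many providers accept is a JSONL file: one JSON object per line, each containing a full system/user/assistant exchange. The support-bot examples and exact field names below are hypothetical - check your provider's fine-tuning docs for the required schema.

```python
import json

# Hypothetical support-bot pairs; a real dataset needs hundreds of these.
examples = [
    {"question": "How do I reset my password?",
     "answer": "Go to Settings > Security and click 'Reset password'."},
    {"question": "Can I export my data?",
     "answer": "Yes - use the Export button on the Account page."},
]

# Convert each pair into a chat-style training record.
records = []
for ex in examples:
    records.append({"messages": [
        {"role": "system", "content": "You are a concise support agent."},
        {"role": "user", "content": ex["question"]},
        {"role": "assistant", "content": ex["answer"]},
    ]})

# JSONL: one serialized conversation per line.
jsonl = "\n".join(json.dumps(r) for r in records)
```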

Training. The model processes your examples and adjusts its weights to better reproduce the patterns in your data. This typically takes minutes to hours depending on dataset size and model architecture - a fraction of the weeks or months needed for pre-training from scratch. The training process uses a lower learning rate than pre-training to preserve the model's general capabilities while adapting to the new task.
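The role of the lower learning rate can be seen in a toy sketch. This is pure illustration, not a real LLM trainer: a one-parameter model stands in for the network, and fine-tuning is just continued gradient descent from the "pre-trained" weight, where small steps nudge the weight toward the new task without a drastic rewrite.

```python
# Toy model: y = weight * x, with squared-error loss.
def fine_tune(weight, examples, lr=1e-2, epochs=50):
    for _ in range(epochs):
        for x, y in examples:
            pred = weight * x
            grad = 2 * (pred - y) * x   # d(loss)/d(weight)
            weight -= lr * grad         # small step preserves prior knowledge
    return weight

pretrained = 1.0                        # stands in for base-model weights
task_data = [(1.0, 1.2), (2.0, 2.4)]    # new task prefers weight ~ 1.2

adapted = fine_tune(pretrained, task_data)
```

With a small learning rate the weight converges gradually toward the task optimum; crank the rate up and the updates overshoot, the toy equivalent of wiping out general capabilities.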

Evaluation. After training, you test the fine-tuned model against a held-out validation set to measure whether performance actually improved. Common pitfalls include overfitting - where the model memorizes training examples rather than learning generalizable patterns - and catastrophic forgetting - where the model loses general capabilities it had before fine-tuning.
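The held-out comparison can be sketched in a few lines. The model and datasets here are hypothetical stand-ins; the point is the structure: compute the same metric on data the model trained on and on data it never saw, and watch the gap.

```python
# Mean squared error of a model over a dataset of (input, target) pairs.
def mse(model, dataset):
    return sum((model(x) - y) ** 2 for x, y in dataset) / len(dataset)

# Hypothetical fine-tuned model.
model = lambda x: 1.2 * x

train_set = [(1.0, 1.2), (2.0, 2.4)]
val_set = [(3.0, 3.5), (4.0, 4.9)]   # held out, never seen during training

train_loss = mse(model, train_set)
val_loss = mse(model, val_set)
gap = val_loss - train_loss          # a large gap suggests overfitting
```

Near-zero training loss with much higher validation loss is the classic overfitting signature; if general-purpose benchmarks drop after fine-tuning, that points to catastrophic forgetting instead.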

When Should You Fine-Tune vs. Use Other Approaches?

Fine-tuning is one of several methods for customizing AI behavior. Choosing the right approach depends on your specific needs.

Fine-Tuning vs. Prompt Engineering

Prompt engineering is the simplest customization method - you craft better instructions and examples within the prompt itself, without modifying the model. Prompt engineering is the right starting point for most use cases because it is fast, cheap, and reversible. Fine-tuning becomes necessary when prompt engineering hits its limits: when you need a specific output style that is hard to describe in instructions, when you want to reduce token usage by embedding knowledge into the model rather than the prompt, or when you need the model to consistently follow complex formatting rules.

Fine-Tuning vs. RAG

Retrieval-augmented generation gives the model access to external knowledge at query time by retrieving relevant documents and including them in the context. RAG is better when you need the model to reference current, updatable information. Fine-tuning is better when you need to change the model's behavior, tone, or reasoning patterns. Many production systems combine both - a fine-tuned model that also uses RAG to access current data.
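A stripped-down sketch of the RAG side, for contrast. Production systems use embedding-based vector search; the keyword-overlap retriever and the documents below are illustrative only.

```python
# A tiny, updatable knowledge base - edit these strings and the system's
# answers change, no retraining required.
docs = [
    "Refunds are processed within 5 business days.",
    "Our API rate limit is 100 requests per minute.",
]

def retrieve(query, documents):
    """Pick the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(documents, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(query, documents):
    """Prepend the retrieved document to the query as context."""
    context = retrieve(query, documents)
    return f"Context: {context}\n\nQuestion: {query}"

prompt = build_prompt("How fast are refunds processed?", docs)
```

The key contrast with fine-tuning: here the knowledge lives in `docs` and can be updated at any time, while the model itself stays unchanged.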

When Fine-Tuning Makes Sense

Fine-tuning is typically the right choice in these scenarios.

Consistent style or voice. You need every output to match a specific brand voice, writing style, or formatting convention. A media company fine-tuning a model to write in their editorial voice gets more consistent results than prompting a general model to "write like our publication."

Domain specialization. The model needs deep knowledge of a specific field - legal terminology, medical coding, financial regulations - that is underrepresented in general training data.

Cost optimization. A shorter, fine-tuned prompt can replace a longer prompt that includes extensive instructions and examples, reducing per-query token costs. For high-volume applications, the savings can be substantial.
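The arithmetic is straightforward. All numbers below are hypothetical; plug in your own provider's pricing and traffic.

```python
# Hypothetical input-token price, in dollars per 1,000 tokens.
PRICE_PER_1K_INPUT_TOKENS = 0.003

long_prompt_tokens = 2_000   # instructions + few-shot examples on every call
short_prompt_tokens = 200    # fine-tuned model needs only the query

def monthly_input_cost(tokens_per_query, queries_per_month):
    return tokens_per_query / 1_000 * PRICE_PER_1K_INPUT_TOKENS * queries_per_month

queries = 1_000_000
before = monthly_input_cost(long_prompt_tokens, queries)   # ~ $6,000 / month
after = monthly_input_cost(short_prompt_tokens, queries)   # ~ $600 / month
savings = before - after                                   # ~ $5,400 / month
```

At a million queries a month, trimming 1,800 tokens of boilerplate from every prompt pays for a lot of fine-tuning runs; at low volume, the same trim may never recoup the training and maintenance cost.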

Latency reduction. Shorter prompts process faster. Fine-tuning can move information from the prompt into the model's weights, reducing both cost and response time.

What Are the Limitations of Fine-Tuning?

Data quality requirements. The output is only as good as the training data. Noisy, inconsistent, or biased examples produce a model that reproduces those flaws. Curating high-quality training datasets requires significant machine learning expertise and human effort.

Overfitting risk. With small datasets, the model may memorize specific examples rather than learning general patterns. Overfitted models perform well on inputs similar to the training data but poorly on novel inputs.

Maintenance burden. Fine-tuned models can become outdated as your domain evolves. Unlike RAG, where you simply update the knowledge base, updating a fine-tuned model requires assembling new training data and rerunning the training process.

Base model dependency. When the provider releases a new base model version, your fine-tuning does not automatically transfer. You need to re-fine-tune on the new base model, which means maintaining your training pipeline and dataset indefinitely.

Evaluation difficulty. Measuring whether fine-tuning actually improved performance is harder than it sounds. You need well-designed evaluation benchmarks that test the specific behaviors you care about, not just general quality metrics.

How Does Fine-Tuning Fit Into the AI Stack?

Fine-tuning sits between prompt engineering and pre-training in the spectrum of model customization. Most teams should start with prompt engineering, move to RAG if they need external knowledge, and consider fine-tuning only when the other approaches fall short.

In practice, the most effective AI systems layer all three approaches. A fine-tuned model provides the right base behavior and style. RAG supplies current, domain-specific knowledge. Prompt engineering handles per-query customization and instructions. Understanding when each technique applies - and how they complement each other - is becoming a core competency for teams building AI-powered products and workflows.
