Fine-tuning in one paragraph
Fine-tuning takes a pre-trained LLM and continues training it on a curated dataset specific to your domain or task. The result is a model whose weights are adjusted toward your target: it produces outputs more like the examples in your fine-tuning dataset, follows your instructions more reliably, or adopts your domain vocabulary. Fine-tuning is one of three primary mechanisms for adapting an LLM to a specific use, alongside prompt engineering and retrieval-augmented generation (RAG). It is rarely the right first answer.
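Concretely, the "curated dataset" is usually a file of input/output pairs in a chat-style format. The sketch below is a minimal, hypothetical example of building such a file as JSONL; the exact record schema varies by provider, so treat the field names here as assumptions, not any particular vendor's API.

```python
import json

# Illustrative fine-tuning records in a chat-style JSONL format.
# The "messages"/"role"/"content" schema is an assumption for illustration;
# check your provider's documentation for the real field names.
examples = [
    {
        "messages": [
            {"role": "user", "content": "Summarise this support ticket: ..."},
            {"role": "assistant", "content": "Priority: P2. Summary: ..."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Summarise this support ticket: ..."},
            {"role": "assistant", "content": "Priority: P1. Summary: ..."},
        ]
    },
]

# One JSON object per line — the usual shape for fine-tuning uploads.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

In practice the file would contain hundreds to thousands of such pairs, each reviewed for quality, since the model learns whatever patterns the examples contain, including their mistakes.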
When fine-tuning is the right answer
Fine-tuning is genuinely useful for:
- Style transfer — making the model adopt your brand voice or domain idioms reliably
- Structured output formats — ensuring the model always produces output in your specific JSON or XML shape
- Domain vocabulary — embedding deep medical, legal, or technical jargon into the model's preferred terminology
- Task-specific behaviour — when prompt engineering can't reliably get the behaviour at the volume and consistency you need
For most enterprise use cases, none of these apply; prompt engineering and RAG are sufficient.
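The structured-output case above is also the easiest to measure: you can check mechanically whether every response matches your required shape. A minimal sketch of such a conformance check, assuming a hypothetical three-key JSON schema:

```python
import json

# Hypothetical required output schema — the keys are placeholders for illustration.
REQUIRED_KEYS = {"intent", "entities", "confidence"}

def conforms(output: str) -> bool:
    """Return True if the model output parses as JSON with exactly the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(data) == REQUIRED_KEYS

conforms('{"intent": "refund", "entities": [], "confidence": 0.9}')  # True
conforms('Sure! Here is the JSON you asked for: ...')                # False
```

Running a check like this over a sample of outputs gives a conformance rate; if prompting alone already yields a rate you can live with, that is evidence fine-tuning is unnecessary for this case.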
When fine-tuning is the wrong answer
Fine-tuning is the wrong tool when:
- You need to add new knowledge — RAG is better; fine-tuning bakes knowledge into weights, where it cannot be updated without retraining
- You haven't yet tried prompt engineering on the problem — try prompts first; they are faster and cheaper to iterate on
- You don't have curated training data — fine-tuning needs hundreds to thousands of high-quality examples
- You can't measure quality — fine-tuning without evals will produce a model whose behaviour you can't verify
- The base model already does the task well — fine-tuning on what the model can already do is a waste of effort
Most enterprises that ask for fine-tuning are better served by prompt engineering plus RAG. NINtec's Discovery phase produces an honest recommendation.
Fine-tuning Claude
Anthropic's enterprise programme supports fine-tuning Claude for specific customer workloads. The process is more involved than prompt engineering: constructing a curated training dataset, defining an evaluation methodology, validating that the fine-tuned model improves on the baseline, and maintaining model-version migration discipline as new base models ship. NINtec engages with Anthropic's fine-tuning programme on customer engagements where the use case justifies it; we have completed customer engagements where the recommendation was "do not fine-tune, use prompt engineering instead."