Category: AI

Fine-Tuning vs RAG: How We Actually Choose

Retrieval-Augmented Generation has become the default recommendation for almost every enterprise LLM project, to the point where fine-tuning is treated as exotic or unnecessary. That’s an overcorrection. Both approaches solve real problems; they solve different ones.

What each approach actually solves

RAG solves the knowledge freshness problem. The model doesn’t need to know facts — it retrieves them at query time from a store you control. It’s the right tool when the information changes frequently, when you need source attribution, or when the knowledge base is too large to fit in a context window.

Fine-tuning solves the behaviour and style problem. You can’t RAG your way to a model that consistently responds in a specific tone, formats outputs a specific way, or handles a domain-specific task type reliably.

The decision matrix

Need	Approach
Access to up-to-date information	RAG
Consistent output format/structure	Fine-tuning
Domain-specific terminology and tone	Fine-tuning
Attribution and source transparency	RAG
Reducing hallucination on facts	RAG
Few-shot task specialisation	Fine-tuning

What we tell clients who want to start with fine-tuning

Build the RAG pipeline first. It’s faster, cheaper, and easier to iterate. Fine-tune only after you’ve identified a specific, persistent failure mode that retrieval can’t fix. Fine-tuning on top of a good RAG baseline almost always outperforms fine-tuning alone.

September 17, 2025

LLM Evaluation in Production: Beyond Vibes

Shipping an LLM-powered feature without an evaluation framework is the ML equivalent of deploying without tests. A prompt change that “seems fine” in dev introduces a regression in a tone edge case you didn’t test. A model upgrade improves average quality but degrades on a specific task segment. Without evals, you find out from users.

The three-layer eval stack

1. Unit evals

Deterministic assertions on known inputs. If the output for a specific input should always contain a citation, assert it. If it should never start with “I”, assert that. These run in CI on every prompt change and take seconds.
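
A minimal sketch of what such assertions can look like. The `[n]` citation format and the function names are assumptions for illustration, not a fixed convention; swap in whatever your outputs actually use.

```python
import re


def has_citation(output: str) -> bool:
    """True if the output contains a bracketed numeric citation such as [1].

    The [n] marker format is an assumption for this sketch.
    """
    return re.search(r"\[\d+\]", output) is not None


def starts_with_first_person(output: str) -> bool:
    """True if the first word of the output is "I" (including "I'm", "I'll")."""
    return re.match(r"\s*I\b", output) is not None


def run_unit_evals(output: str) -> list[str]:
    """Return the names of failed checks; an empty list means the output passes."""
    failures = []
    if not has_citation(output):
        failures.append("missing_citation")
    if starts_with_first_person(output):
        failures.append("starts_with_I")
    return failures
```

Because the checks are plain functions over strings, they can run under any test runner in CI without calling a model, provided the outputs for the known inputs are generated or cached beforehand.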

2. Model-graded evals

For qualities that can’t be asserted deterministically (helpfulness, tone, factual grounding), we use a judge model with a rubric. The judge prompt is versioned alongside the application prompt. These are slower and noisier, but catch regressions unit evals miss.
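
One way to structure a judge call is sketched below. The rubric wording, the 1 to 5 scale, the version tag, and the `judge` callable (a stand-in for whatever model client you use) are all hypothetical. The point is only that the judge prompt carries a version and that unparseable replies are surfaced rather than silently scored.

```python
from typing import Callable, Optional

# Hypothetical version tag, stored next to the application prompt's version.
JUDGE_PROMPT_VERSION = "v1"

# Hypothetical rubric; a real one would be more detailed.
JUDGE_PROMPT = (
    "You are grading an assistant's answer.\n"
    "Rubric: 1 = unhelpful or ungrounded, 5 = helpful, on-tone, and grounded.\n"
    "Reply with a single integer from 1 to 5.\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
)


def grade(question: str, answer: str, judge: Callable[[str], str]) -> dict:
    """Ask the judge model for a rubric score and record the prompt version.

    `judge` takes the full prompt text and returns the model's raw reply.
    """
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    reply = judge(prompt).strip()
    # Anything other than a bare 1-5 is treated as a failed grading, not a score.
    score: Optional[int] = int(reply) if reply in {"1", "2", "3", "4", "5"} else None
    return {"score": score, "judge_prompt_version": JUDGE_PROMPT_VERSION}
```

Passing the judge in as a callable keeps the sketch independent of any particular provider and makes it easy to stub in tests.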

3. Human evals

A sample of real production outputs, rated by a small panel on a defined rubric. We run these before any major prompt change or model upgrade. They’re expensive but they’re ground truth.

The golden dataset

The foundation of all three layers is a golden dataset of 200–500 examples: real user inputs, expected output characteristics, and known failure cases. Building this dataset is the hardest and most important work in LLM evaluation. It compounds over time — every production failure becomes a new golden example.
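
As an illustration only, a golden example could be represented like this. The field names and the `record_production_failure` helper are hypothetical; the structure just mirrors the three ingredients listed above plus a flag for the "never fail" inputs used at deployment time.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    user_input: str              # a real user input
    expected_traits: list[str]   # characteristics the output should have
    known_failure: bool = False  # True if this input has failed in production
    never_fail: bool = False     # True if a failure here should block a release


def record_production_failure(
    dataset: list[GoldenExample], user_input: str, expected_traits: list[str]
) -> GoldenExample:
    """Append a production failure to the golden dataset as a new example."""
    example = GoldenExample(user_input, expected_traits, known_failure=True)
    dataset.append(example)
    return example
```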

Making deployment decisions

We set a threshold: a change must not regress more than 2% on the golden dataset and must not introduce any new failures on a list of “never fail” inputs. If it clears both, it ships.
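
The gate can be sketched as a single function. This version assumes "2%" means an absolute drop in pass rate on the golden dataset, and that "never fail" results are tracked as sets of failing input IDs; both are interpretations for the sake of the example, and all names are hypothetical.

```python
def should_ship(
    baseline_passed: int,
    candidate_passed: int,
    total: int,
    baseline_never_fail_failures: set[str],
    candidate_never_fail_failures: set[str],
    max_regression: float = 0.02,
) -> bool:
    """Return True if the candidate clears both deployment checks."""
    if total <= 0:
        raise ValueError("golden dataset must not be empty")
    # Check 1: pass rate on the golden dataset may drop by at most max_regression.
    regression = (baseline_passed - candidate_passed) / total
    # Check 2: no "never fail" input may fail that did not already fail at baseline.
    new_failures = candidate_never_fail_failures - baseline_never_fail_failures
    return regression <= max_regression and not new_failures
```

For example, on a 400-example dataset a drop from 380 to 376 passes is a 1% regression and ships if no "never fail" input breaks, while a single new "never fail" failure blocks the change even when overall quality improves.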

August 17, 2025

How We Built a Support Chatbot That Actually Deflects Tickets

The graveyard of failed support chatbots is full of bots that were really just keyword-triggered FAQ lookups with a chat interface. Users ask in natural language, bots match on keywords, the answer misses the question, the user escalates. Everyone loses.

Retrieval-Augmented Generation, not a fine-tuned model

We didn’t fine-tune a model on historical tickets. Fine-tuning is expensive to maintain — every time the product changes, the training set goes stale. Instead we used RAG: a vector store of the current documentation, FAQs, and resolved tickets, with a retrieval step before every generation call.

The retrieval step is where most RAG implementations underperform. We use a hybrid search — dense vector similarity plus BM25 keyword matching, reranked by a cross-encoder. The extra latency (about 200 ms) is worth it for the precision improvement.
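
A rough sketch of the shape of such a pipeline follows. How the dense and BM25 results are merged is not specified above; this example uses reciprocal rank fusion as an assumed merging step, and `rerank_score` is a hypothetical stand-in for the cross-encoder. The two input rankings are taken as already computed.

```python
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in; higher is better.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(
    query: str,
    dense_ranking: list[str],
    bm25_ranking: list[str],
    rerank_score: Callable[[str, str], float],
    top_n: int = 3,
    candidates: int = 20,
) -> list[str]:
    """Fuse dense and BM25 rankings, then rerank the top candidates."""
    fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])[:candidates]
    # The reranker sees only the fused shortlist, which bounds its latency cost.
    return sorted(fused, key=lambda d: -rerank_score(query, d))[:top_n]
```

Limiting the reranker to a shortlist is the usual way to keep the added latency bounded, since a cross-encoder scores each query and document pair separately.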

Confidence thresholds and graceful escalation

When the model isn’t confident — low similarity scores across the retrieved chunks, or a query that doesn’t match the domain — it escalates explicitly: “I’m not confident I can answer this accurately. Let me connect you with the support team.” Users prefer honest escalation to a confidently wrong answer.
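
The escalation rule can be sketched as follows. The threshold value, the `in_domain` flag (assumed to come from a separate domain check), and the `generate` callable are hypothetical; a real threshold would be tuned against labelled examples.

```python
from typing import Callable

ESCALATION_MESSAGE = (
    "I’m not confident I can answer this accurately. "
    "Let me connect you with the support team."
)

# Hypothetical cut-off; tune it on real retrieval scores.
MIN_TOP_SIMILARITY = 0.75


def respond(
    similarities: list[float],
    in_domain: bool,
    generate: Callable[[], str],
) -> str:
    """Escalate when retrieval looks weak or off-domain; otherwise generate."""
    no_strong_match = not similarities or max(similarities) < MIN_TOP_SIMILARITY
    if not in_domain or no_strong_match:
        return ESCALATION_MESSAGE
    return generate()
```

Checking confidence before calling `generate` means a weak retrieval never reaches the model, so there is no confidently wrong answer to suppress afterwards.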

What actually drives the 40% deflection

The biggest factor wasn’t model quality; it was documentation quality. The bot can only be as good as the content it retrieves. We spent two weeks rewriting the top 30 FAQ entries to be more specific and answer-first. That single change improved the deflection rate by 12 percentage points.

If you’re building a support bot, audit your documentation before you build anything else.

May 22, 2025
