Category: AI

  • Fine-Tuning vs RAG: How We Actually Choose

    Retrieval-Augmented Generation has become the default recommendation for almost every enterprise LLM project, to the point where fine-tuning is treated as exotic or unnecessary. That’s an overcorrection. Both approaches solve real problems; they solve different ones.

    What each approach actually solves

    RAG solves the knowledge freshness problem. The model doesn’t need to know facts — it retrieves them at query time from a store you control. It’s the right tool when the information changes frequently, when you need source attribution, or when the knowledge base is too large to fit in a context window.

    Fine-tuning solves the behaviour and style problem. You can’t RAG your way to a model that consistently responds in a specific tone, formats outputs a specific way, or handles a domain-specific task type reliably.

    The decision matrix

    Need                                    Approach
    Access to up-to-date information        RAG
    Consistent output format/structure      Fine-tuning
    Domain-specific terminology and tone    Fine-tuning
    Attribution and source transparency     RAG
    Reducing hallucination on facts         RAG
    Few-shot task specialisation            Fine-tuning

    What we tell clients who want to start with fine-tuning

    Build the RAG pipeline first. It’s faster, cheaper, and easier to iterate. Fine-tune only after you’ve identified a specific, persistent failure mode that retrieval can’t fix. Fine-tuning on top of a good RAG baseline almost always outperforms fine-tuning alone.

  • LLM Evaluation in Production: Beyond Vibes

    Shipping an LLM-powered feature without an evaluation framework is the ML equivalent of deploying without tests. A prompt change that “seems fine” in dev introduces a regression in a tone edge case you didn’t test. A model upgrade improves average quality but degrades on a specific task segment. Without evals, you find out from users.

    The three-layer eval stack

    1. Unit evals

    Deterministic assertions on known inputs. If the output for a specific input should always contain a citation, assert it. If it should never start with “I”, assert that. These run in CI on every prompt change and take seconds.
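
    A minimal sketch of what these two assertions might look like in Python. The citation convention (a bracketed number such as [1]), the reading of “I” as the first-person pronoun, and all function names are assumptions for illustration, not a fixed recipe.

```python
import re

# Assumed citation convention: a bracketed number such as [1].
CITATION_PATTERN = re.compile(r"\[\d+\]")


def has_citation(output: str) -> bool:
    """True if the output contains at least one bracketed citation marker."""
    return bool(CITATION_PATTERN.search(output))


def starts_with_i(output: str) -> bool:
    """True if the first word is the pronoun "I" (also matches I'm, I've)."""
    return bool(re.match(r"\s*I\b", output))


def run_unit_evals(output: str) -> list[str]:
    """Return the names of failed checks; an empty list means the output passed."""
    failures = []
    if not has_citation(output):
        failures.append("missing_citation")
    if starts_with_i(output):
        failures.append("starts_with_I")
    return failures
```

    In CI, a test would call run_unit_evals on the model output for each known input and fail the build if the returned list is not empty.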

    2. Model-graded evals

    For qualities that can’t be asserted deterministically (helpfulness, tone, factual grounding), we use a judge model with a rubric. The judge prompt is versioned alongside the application prompt. These are slower and noisier, but catch regressions unit evals miss.
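
    One way a rubric-based judge could be wired up, sketched under assumptions: the rubric wording, the 1-5 scale, the "SCORE:" response format, and the names grade and fake_judge are all hypothetical. The judge is passed in as a plain callable so the example runs without any model API.

```python
import re
from typing import Callable

# Versioned alongside the application prompt; the version string and the
# rubric wording are illustrative only.
JUDGE_PROMPT_VERSION = "judge-v1"
JUDGE_PROMPT = """You are grading an assistant's reply.
Rubric: score 1-5 for {quality}, where 5 is best.
User input: {user_input}
Reply: {reply}
Answer with a line of the form "SCORE: <n>"."""


def grade(judge: Callable[[str], str], quality: str, user_input: str, reply: str) -> int:
    """Ask the judge for a 1-5 score and parse it from the response text."""
    prompt = JUDGE_PROMPT.format(quality=quality, user_input=user_input, reply=reply)
    response = judge(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", response)
    if match is None:
        # Judges are noisy; surface unparseable responses instead of guessing.
        raise ValueError(f"unparseable judge response: {response!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model call so the sketch is runnable."""
    return "The reply is grounded in the input.\nSCORE: 4"
```

    Because these scores are noisy, it usually makes sense to compare averages over the whole dataset between two prompt versions rather than to act on any single score.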

    3. Human evals

    A sample of real production outputs, rated by a small panel on a defined rubric. We run these before any major prompt change or model upgrade. They’re expensive but they’re ground truth.

    The golden dataset

    The foundation of all three layers is a golden dataset of 200–500 examples: real user inputs, expected output characteristics, and known failure cases. Building this dataset is the hardest and most important work in LLM evaluation. It compounds over time — every production failure becomes a new golden example.
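
    A golden example could be stored as a small record like the one below. This is a sketch: the field names, the substring-based checks, and the source labels are assumptions, chosen only to show "expected output characteristics" as data rather than as an exact expected string.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenExample:
    example_id: str
    user_input: str
    # Expected characteristics of the output, not an exact expected string.
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    never_fail: bool = False
    source: str = "curated"  # e.g. "curated" or "production_failure"


def check(example: GoldenExample, output: str) -> bool:
    """True if the output has every required substring and none of the banned ones."""
    text = output.lower()
    has_required = all(s.lower() in text for s in example.must_contain)
    has_banned = any(s.lower() in text for s in example.must_not_contain)
    return has_required and not has_banned


# A production failure turned into a new golden example (invented data).
refund_example = GoldenExample(
    example_id="refund-001",
    user_input="How do I get a refund?",
    must_contain=["refund"],
    must_not_contain=["guarantee"],
    source="production_failure",
)
```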

    Making deployment decisions

    We set a threshold: a change must not regress by more than 2% on the golden dataset, and it must not introduce any new failures on a list of “never fail” inputs. If it clears both, it ships.
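
    The two gates can be expressed as a single check. In this sketch the 2% is read as an absolute drop in pass rate (0.90 to 0.88), which is one possible interpretation; the function and parameter names are hypothetical.

```python
def should_ship(
    baseline_pass_rate: float,
    candidate_pass_rate: float,
    never_fail_failures: list[str],
    max_regression: float = 0.02,
) -> bool:
    """Apply both gates: bounded regression and zero new never-fail failures."""
    regression = baseline_pass_rate - candidate_pass_rate
    # Small tolerance so floating-point rounding does not block a change
    # that sits exactly on the threshold.
    within_budget = regression <= max_regression + 1e-9
    return within_budget and not never_fail_failures
```

    Note that the second gate is absolute: a candidate that improves the overall pass rate is still blocked if never_fail_failures contains even one id.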

  • How We Built a Support Chatbot That Actually Deflects Tickets

    The graveyard of failed support chatbots is full of bots that were really just keyword-triggered FAQ lookups with a chat interface. Users ask in natural language, bots match on keywords, the answer misses the question, the user escalates. Everyone loses.

    Retrieval-Augmented Generation, not a fine-tuned model

    We didn’t fine-tune a model on historical tickets. Fine-tuning is expensive to maintain — every time the product changes, the training set goes stale. Instead we used RAG: a vector store of the current documentation, FAQs, and resolved tickets, with a retrieval step before every generation call.

    The retrieval step is where most RAG implementations underperform. We use hybrid search: dense vector similarity plus BM25 keyword matching, with the combined candidates reranked by a cross-encoder. The extra latency (about 200 ms) is worth it for the precision improvement.
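
    The shape of such a pipeline can be sketched in a few lines. Two assumptions to flag: the two result lists are merged here with reciprocal rank fusion, which is one common choice and not necessarily what any given system uses, and rerank_score is a stand-in callable for a cross-encoder. The dense and BM25 retrievers themselves are out of scope; the sketch starts from their ranked lists of document ids.

```python
from typing import Callable


def fuse_rankings(dense: list[str], bm25: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (dense, bm25):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(
    query: str,
    dense: list[str],
    bm25: list[str],
    rerank_score: Callable[[str, str], float],
    candidates: int = 20,
    top_n: int = 5,
) -> list[str]:
    """Fuse both rankings, then reorder the leading candidates by rerank_score."""
    pool = fuse_rankings(dense, bm25)[:candidates]
    return sorted(pool, key=lambda d: -rerank_score(query, d))[:top_n]
```

    Reranking only the fused candidate pool, rather than the whole corpus, is what keeps the added latency bounded: the expensive scorer runs at most candidates times per query.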

    Confidence thresholds and graceful escalation

    When the model isn’t confident — low similarity scores across the retrieved chunks, or a query that doesn’t match the domain — it escalates explicitly: “I’m not confident I can answer this accurately. Let me connect you with the support team.” Users prefer honest escalation to a confidently wrong answer.
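
    A sketch of the escalation check, assuming similarity scores in the 0-1 range. The 0.75 threshold and the function names are illustrative, and the out-of-domain check mentioned above is not covered here.

```python
from typing import Callable

ESCALATION_MESSAGE = (
    "I'm not confident I can answer this accurately. "
    "Let me connect you with the support team."
)


def should_escalate(similarities: list[float], min_top: float = 0.75) -> bool:
    """Escalate when nothing was retrieved or every chunk scores below min_top."""
    if not similarities:
        return True
    # If the best chunk is below the threshold, all of them are.
    return max(similarities) < min_top


def answer_or_escalate(similarities: list[float], generate: Callable[[], str]) -> str:
    """Call the generator only when retrieval looks trustworthy."""
    if should_escalate(similarities):
        return ESCALATION_MESSAGE
    return generate()
```

    The threshold is best tuned against real traffic: set it too high and the bot escalates questions it could have answered, too low and it answers from weak context.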

    What actually drives the 40% deflection

    The biggest factor wasn’t the model quality — it was documentation quality. The bot can only be as good as the content it retrieves. We spent two weeks rewriting the top 30 FAQ entries to be more specific and answer-first. That single change improved deflection rate by 12 percentage points.

    If you’re building a support bot, audit your documentation before you build anything else.