A B2B SaaS support team integrated an Anthropic Claude-powered agent into their existing helpdesk, with eval suites, guardrails, and a clean human hand-off. CSAT went up. Costs went down.
Inbound tickets had grown 4× in 18 months. The support team had grown 1.5×. The CEO wanted AI; the support lead wanted reliability and didn't want a chatbot her customers would resent.
Two prior vendors had pitched "AI support" that turned out to be templated chatbots with light intent detection. The client had been burned twice. Their brief: "Build something we'd be happy demoing to our customers, not embarrassed by."
Before writing a prompt, we built an eval set: 400 real tickets, each with the "correct" outcome judged by senior support staff. Every change we made — model, prompt, retrieval — was scored against this set. No vibes-based deployment.
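The harness behind that can be sketched in a few lines. This is a minimal illustration, not our production code: `EvalCase`, `run_eval`, and the outcome labels are hypothetical names, and real scoring would compare richer outcomes than an exact string match.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    ticket_text: str
    expected_outcome: str  # label from senior support staff, e.g. "resolve:billing"

def run_eval(agent, cases):
    """Score one agent configuration against the labeled ticket set.

    Returns (pass rate, list of failures) so every model/prompt/retrieval
    change can be compared on the same numbers.
    """
    passed = 0
    failures = []
    for case in cases:
        outcome = agent(case.ticket_text)
        if outcome == case.expected_outcome:
            passed += 1
        else:
            failures.append((case.ticket_text, case.expected_outcome, outcome))
    return passed / len(cases), failures

# Toy usage with a stub agent standing in for the real pipeline:
cases = [
    EvalCase("How do I update my card?", "resolve:billing"),
    EvalCase("Delete all my data", "escalate"),
]
stub = lambda text: "resolve:billing" if "card" in text else "escalate"
score, failures = run_eval(stub, cases)
# score == 1.0 for this stub
```

The point of the design is that `run_eval` takes the agent as a plain callable, so swapping Claude model versions or prompts means re-running one function, not rebuilding the harness.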
The agent uses Anthropic Claude with a RAG layer over the client's docs and historical tickets. It's allowed to resolve tickets in defined categories; everything else routes to a human with a one-paragraph summary already drafted. Failure modes (low confidence, sensitive flag, customer frustration signals) trigger immediate hand-off — and we built the dashboards to know when these were happening.
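The routing rule is simple enough to show. A minimal sketch, assuming the agent's classifier emits a category, a confidence score, and the two flags described above; the threshold, category names, and field names here are illustrative, not the client's actual configuration:

```python
ALLOWED_CATEGORIES = {"billing", "how_to", "account_access"}  # illustrative set
CONFIDENCE_FLOOR = 0.8  # assumed threshold, tuned against the eval set

def route(classification: dict) -> str:
    """Return "resolve" only when every guardrail passes; otherwise hand off.

    Any failure mode (low confidence, sensitive content, frustration signals)
    or an out-of-scope category triggers immediate human hand-off.
    """
    if (classification["confidence"] < CONFIDENCE_FLOOR
            or classification["sensitive"]
            or classification["frustrated"]):
        return "handoff"
    if classification["category"] not in ALLOWED_CATEGORIES:
        return "handoff"
    return "resolve"
```

Note the default: everything falls through to "handoff" unless it affirmatively qualifies, which is what makes the hand-off "clean" rather than a last resort.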
We deployed gradually: 10% of inbound traffic for two weeks, then 30%, then full rollout. Eval scores held. CSAT for AI-resolved tickets is now higher than the human team's average — partly because the agent answers within seconds.
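A staged rollout like this needs stable bucketing, so a ticket doesn't flip between cohorts as the percentage grows. One common way to do it, sketched here as an assumption about the approach rather than a description of the client's stack:

```python
import hashlib

def in_rollout(ticket_id: str, fraction: float) -> bool:
    """Deterministically assign a ticket to the AI cohort.

    Hashing the ID gives a stable position in [0, 1); raising `fraction`
    from 0.10 to 0.30 to 1.0 only ever adds tickets to the cohort,
    never removes them.
    """
    digest = int(hashlib.sha256(ticket_id.encode()).hexdigest(), 16)
    return (digest % 10_000) / 10_000 < fraction
```

Because the bucket position is fixed per ticket, eval scores and CSAT measured at 10% remain comparable as the fraction widens.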
"We rejected two vendors before this. The eval suite is what flipped me — finally something we could actually measure, not just hope worked."
More from the studio.
Tell us what you're working on. We'll send a proposal you can share internally.