Most conversational AI for customer service fails for a boring reason. The model is fine. The system around it was never designed. A team buys a capable language model, points it at an old help center, and waits for resolution rates to climb. Instead the agent confidently invents a refund policy, a customer escalates, and trust in the whole project quietly collapses.
We have deployed conversational agents across WhatsApp and web for support teams, and the same pattern repeats. The hard parts are almost never the parts vendors demo. Here is what actually decides whether a deployment holds up.
The model is the easy part
A modern language model can already hold a coherent support conversation. That capability is commoditized. What separates a working deployment from a pilot that gets switched off after a quarter is everything the model touches: the knowledge it draws from, the moment it decides to hand off, and the channel it runs on.
Treat the model as one component, not the product. The product is the system that keeps the model honest. When teams skip that framing, they spend weeks tuning prompts and zero time on the parts that break in front of customers.
The shift in mindset matters. A model is something you prompt. A system is something you operate, with owners, review cycles, and a feedback loop that carries real conversations back into the knowledge and the rules. Budget for the second one, because that is the work that compounds.
Your knowledge base is the real bottleneck
A conversational AI agent is only as good as the knowledge it can reach. Point one at a stale help center and you get fluent, confident wrong answers, which are worse than no answer at all. A wrong answer delivered with certainty is the fastest way to lose a customer's trust in the channel.
Most support knowledge lives in three messy places: outdated public articles, an internal wiki nobody maintains, and the heads of two senior agents who answer the same edge cases over Slack every week. Before you measure deflection, audit what the agent can actually retrieve. Can it find the current return window? Does it know which products are discontinued? If a human would need to ask a colleague, the agent will guess.
This is the part we push customers to invest in first with Reach: structuring business knowledge so the agent answers from a source you control, not from patterns in its training data. An agent that admits it does not have a detail and pulls in a teammate will always beat one that fabricates a number.
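As a rough illustration of answering from a controlled source, the sketch below replies only when a fact exists in a knowledge table and hands off otherwise. The table, the topic key, and the policy text are hypothetical placeholders, not Reach's actual data model.

```python
# Hypothetical controlled knowledge: topic key -> approved answer text.
CONTROLLED_KNOWLEDGE = {
    "return_window": "Items can be returned within 30 days of delivery.",
}


def answer(topic, knowledge=CONTROLLED_KNOWLEDGE):
    """Reply only from the controlled source; hand off when the fact is missing."""
    fact = knowledge.get(topic)
    if fact is None:
        # No approved fact available: admit the gap instead of guessing.
        return {
            "action": "handoff",
            "text": "I don't have that detail. Let me bring in a teammate.",
        }
    return {"action": "reply", "text": fact}


print(answer("return_window")["action"])      # reply
print(answer("warranty_transfer")["action"])  # handoff
```

The point of the design is that the fallback path is explicit. A missing entry produces a handoff, never a generated guess.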
Knowledge work is not a one-time setup either. Products change, policies change, and an agent that was accurate in March drifts quietly wrong by June if nobody feeds it the updates. The teams that win assign ownership of the knowledge the same way they assign ownership of code, with a clear person responsible for keeping it current.
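One way to make that ownership concrete is a periodic staleness check. The sketch below assumes each knowledge entry records an owner and a last-reviewed date. The field names, the 90-day threshold, and the sample entries are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class KnowledgeEntry:
    topic: str
    owner: str           # hypothetical: person accountable for this entry
    last_reviewed: date  # hypothetical: when a human last confirmed it


def stale_entries(entries, today, max_age_days=90):
    """Return entries not reviewed within max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    overdue = [e for e in entries if e.last_reviewed < cutoff]
    return sorted(overdue, key=lambda e: e.last_reviewed)


entries = [
    KnowledgeEntry("return window", "dana", date(2024, 3, 1)),
    KnowledgeEntry("discontinued products", "sam", date(2024, 5, 20)),
]

for entry in stale_entries(entries, today=date(2024, 6, 15)):
    print(f"{entry.topic}: ask {entry.owner} to review")
```

Run on a schedule, a check like this turns quiet drift into a review task with a named owner.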
The handoff is where systems fail
If a deployment breaks in public, it usually breaks at the handoff. The agent hits something it cannot resolve, passes the customer to a human, and the human opens a blank screen. Now the customer repeats the account number they typed ninety seconds ago, and the experience feels worse than if no agent had been involved.
Passing a raw transcript is not passing context. A useful handoff carries the customer's verified identity, the intent the agent already classified, what it already tried, and the specific reason it escalated. The receiving human should start a step ahead, not from scratch.
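A minimal sketch of what such a handoff could carry, with the transcript as one field among several. The class and field names are assumptions chosen for illustration, not a real product schema.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffContext:
    customer_id: str
    identity_verified: bool
    intent: str                                      # what the agent classified
    steps_tried: list = field(default_factory=list)  # what the agent already attempted
    escalation_reason: str = ""                      # why it gave up
    transcript: list = field(default_factory=list)   # raw turns, for reference only


def briefing(ctx):
    """Build the short summary a human sees before opening the thread."""
    verified = "verified" if ctx.identity_verified else "NOT verified"
    tried = ", ".join(ctx.steps_tried) or "nothing yet"
    return (
        f"Customer {ctx.customer_id} ({verified}) | intent: {ctx.intent} | "
        f"tried: {tried} | escalated because: {ctx.escalation_reason}"
    )


ctx = HandoffContext(
    customer_id="C-1042",
    identity_verified=True,
    intent="refund_request",
    steps_tried=["looked up order", "checked return window"],
    escalation_reason="order is outside the return window",
)
print(briefing(ctx))
```

The structured fields sit alongside the transcript, so the human reads a one-line briefing first and scrolls the thread only if needed.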
Three triggers should force a handoff every time: low confidence, anything that touches policy or personal data, and detected frustration. Get these wrong and you either escalate everything, which defeats the purpose, or escalate nothing, which burns people. Reach handles this with explicit handoff rules and a conversation view that hands the human the full thread plus the structured fields the agent already collected.
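The three triggers can be written as explicit rules, not left to the model's judgment. The sketch below is one possible shape. The thresholds, the intent names, and the assumption that confidence and frustration arrive as scores between 0 and 1 are all illustrative.

```python
from dataclasses import dataclass

# Hypothetical intents that touch policy or personal data.
SENSITIVE_INTENTS = {"refund_policy", "account_data", "billing_dispute"}


@dataclass
class TurnSignals:
    confidence: float   # agent's confidence in its answer, 0..1 (assumed)
    intent: str
    frustration: float  # detected frustration score, 0..1 (assumed)


def handoff_reason(signals, min_confidence=0.7, max_frustration=0.6):
    """Return the reason to escalate, or None if the agent may continue."""
    if signals.intent in SENSITIVE_INTENTS:
        return "policy_or_personal_data"
    if signals.frustration >= max_frustration:
        return "frustration"
    if signals.confidence < min_confidence:
        return "low_confidence"
    return None


print(handoff_reason(TurnSignals(0.95, "order_status", 0.1)))   # None
print(handoff_reason(TurnSignals(0.95, "refund_policy", 0.1)))  # policy_or_personal_data
print(handoff_reason(TurnSignals(0.40, "order_status", 0.1)))   # low_confidence
```

Returning the reason, not just a boolean, means the receiving human learns why the conversation landed with them. The thresholds are the tuning knobs: too strict and everything escalates, too loose and nothing does.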
WhatsApp and web are not the same channel
Teams often treat channel as a deployment checkbox. It is not. The same agent logic behaves differently on WhatsApp than it does in a website widget, and ignoring that gap is how good agents earn bad reputations.
Web chat is session-bound. The customer is on the page, expects fast turns, and leaves when the tab closes. WhatsApp is asynchronous and long-lived. Someone asks a question, disappears for four hours, then replies mid-thread expecting the agent to remember everything. An agent tuned for snappy web sessions feels broken on WhatsApp the moment it loses the thread.
A customer who starts on your site and follows up on WhatsApp should not have to start over. That continuity is an architecture decision made before launch, not a feature you bolt on after the first complaint. Decide early how identity and conversation state travel across channels, because retrofitting it later means rebuilding the core.
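One common way to get that continuity is to key conversation state on a customer identity, not on a channel session. The in-memory sketch below assumes channel-specific IDs (a web session, a phone number) can be linked to one customer ID. The class, its methods, and the sample IDs are hypothetical, and a real deployment would need persistent storage and a verified linking step.

```python
from collections import defaultdict


class ConversationStore:
    """Keeps one history per customer, regardless of channel."""

    def __init__(self):
        self._identity = {}                # (channel, channel_user_id) -> customer_id
        self._history = defaultdict(list)  # customer_id -> [(channel, text), ...]

    def link(self, channel, channel_user_id, customer_id):
        """Associate a channel-specific ID with a single customer."""
        self._identity[(channel, channel_user_id)] = customer_id

    def record(self, channel, channel_user_id, text):
        """Append a message to the customer's shared history."""
        customer_id = self._identity[(channel, channel_user_id)]
        self._history[customer_id].append((channel, text))
        return customer_id

    def history(self, customer_id):
        return list(self._history.get(customer_id, []))


store = ConversationStore()
store.link("web", "session-abc", "C-1042")
store.link("whatsapp", "+15550100", "C-1042")

store.record("web", "session-abc", "Where is my order?")
store.record("whatsapp", "+15550100", "Any update on that order?")

print(store.history("C-1042"))
```

Because both channel IDs resolve to the same customer, the WhatsApp follow-up lands in the thread that started on the web.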
Metrics that tell you it is working
Containment rate, the share of conversations resolved without a human, is the number most teams chase. On its own it is misleading. An agent can post high containment by frustrating people until they give up and close the chat.
Pair containment with handoff quality and what happens after resolution. Track how often handed-off conversations get solved on first human touch, and whether customers reopen the same issue within a few days. A deployment with lower containment and clean handoffs usually beats one with high containment and silent failures hiding underneath.
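A small sketch of computing the three numbers side by side. The record fields are assumptions about what a conversation log might contain, and the sample data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    handed_off: bool
    resolved_first_touch: bool = False  # meaningful only when handed_off is True
    reopened: bool = False              # same issue reopened within a few days


def report(conversations):
    """Compute containment alongside handoff quality and reopen rate."""
    total = len(conversations)
    handed = [c for c in conversations if c.handed_off]
    contained = total - len(handed)
    return {
        "containment_rate": contained / total if total else 0.0,
        "first_touch_resolution": (
            sum(c.resolved_first_touch for c in handed) / len(handed) if handed else 0.0
        ),
        "reopen_rate": sum(c.reopened for c in conversations) / total if total else 0.0,
    }


sample = [
    Conversation(handed_off=False),
    Conversation(handed_off=False, reopened=True),  # "contained" but not solved
    Conversation(handed_off=True, resolved_first_touch=True),
    Conversation(handed_off=True, resolved_first_touch=False),
]
print(report(sample))
```

In this sample, containment is 50 percent, but one of the two contained conversations was reopened. That is the silent failure that containment alone would hide.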
Run simulations before and during rollout. Replay real past conversations against the agent and read where it goes wrong. This catches the confident-wrong-answer problem on your side of the screen, before a customer finds it on theirs.
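A replay harness can start very small. The sketch below assumes the agent can be called as a function from question to answer, and that each past conversation has been annotated with a fact the correct answer must contain. The stub agent, the cases, and the policy details are invented for illustration. Substring matching is a deliberately crude check, and a real harness would likely need human review or a stronger grader.

```python
def replay(agent, cases):
    """Run past questions through the agent and collect answers missing the required fact."""
    failures = []
    for case in cases:
        answer = agent(case["question"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append(
                {"question": case["question"], "expected": case["must_contain"], "got": answer}
            )
    return failures


def stub_agent(question):
    """Hypothetical stand-in for the real agent."""
    if "return" in question.lower():
        return "You can return items within 30 days."
    return "Refunds are processed in 2 days."  # fluent, confident, and wrong


cases = [
    {"question": "What is the return window?", "must_contain": "30 days"},
    {"question": "How long do refunds take?", "must_contain": "5 business days"},
]

for failure in replay(stub_agent, cases):
    print(f"{failure['question']} -> expected '{failure['expected']}', got '{failure['got']}'")
```

Here the refund answer is flagged before any customer sees it. The same cases can be rerun after every knowledge or rule change.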
What it is still bad at
Be honest about the ceiling. Conversational AI handles bounded, well-documented questions well. It struggles with anything that needs judgment about an unusual situation, reading genuine emotion, or making an exception nobody wrote down. Those are exactly the moments that decide how a customer remembers your brand, so they are worth protecting, not automating away.
It also struggles when one question spans systems: billing, shipping, and a third-party warranty in a single thread. The agent can stitch those together only if it has access to all three, and most agents do not. Knowing where the agent's competence ends, and routing cleanly past that line, is worth more than pretending the line is not there.
A deployment sequence that holds up
Start narrow. Pick one high-volume, low-risk intent, such as order status, password resets, or basic billing questions, and get the agent resolving it cleanly with a working handoff path. Resist the urge to launch across every topic at once because the demo looked good.
Then expand by intent, not by channel. Add question types the agent handles confidently, widen the knowledge it draws from, and only open new channels once the core loop is stable. Most failed rollouts inverted this order: broad scope, thin knowledge, no handoff design, live everywhere on day one.
Conversational AI for customer service is not a model you switch on. It is a system you operate, watch, and correct. The teams that treat it that way are the ones whose customers stop noticing they are talking to an agent at all.