Building AI Agents That Know When to Step Back
The most dangerous AI support agent isn't one that gets things wrong. It's one that doesn't know it's getting things wrong — and keeps confidently going anyway. In AP/AR and fintech environments, that distinction isn't philosophical. It's a compliance risk, a damaged vendor relationship, and sometimes a six-figure payment error waiting to happen.
This is a guide to building AI agents for customer support that are genuinely useful without being reckless. I'll walk through the architecture, show you working code, and argue for something the AI hype cycle tends to skip: the escalation to a human is a feature, not a failure mode.
Why I wrote this from a CS perspective
I've spent two decades in customer-facing roles in fintech and B2B SaaS. I've been on the receiving end of bad automation — the chatbot that confidently tells a controller that her ACH run failed "due to a system error" when the real issue is a misconfigured bank connector that needs a human to fix it. The damage there isn't just the wrong answer. It's the erosion of trust that happens when a customer realizes the system that was supposed to help her just wasted 40 minutes of her month-end close.
Good AI support agents aren't about deflection rate. They're about resolution quality. And resolution quality in high-stakes environments requires knowing exactly when to stop and hand off — with full context intact — to a person who can actually help.
First-call resolution isn't just "did the bot answer." It's "did the customer leave with their problem solved." Sometimes that means the agent's best move is a fast, graceful handoff.
The architecture in plain terms
Before any code, here's the mental model. An AI support agent for a financial operations product needs to do five things reliably:
1. Classify intent and urgency. What is the customer actually asking? A "my payment didn't go through" message could be a user error, a platform bug, or a compliance incident. These need different responses.
2. Retrieve relevant knowledge. Pull the right context — documentation, past tickets, known issues — and inject it into the prompt so the LLM answers from facts, not training data alone.
3. Generate a response and score its own confidence. Have the model rate how certain it is. Low confidence isn't a bug — it's signal. A well-prompted LLM is a reasonably good judge of its own uncertainty.
4. Route based on confidence and tier. High confidence + low severity = answer and close. Low confidence or high severity = escalate immediately with a clean summary for the human taking over.
5. Log everything. Every conversation, classification, confidence score, and escalation decision becomes training data and a QA signal. This is how the system gets better over time.
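Stitched together, the loop looks roughly like the sketch below. This is a minimal illustration, not the demo's actual code: classifyIntent is the classifier shown in the next section, while retrieveContext, generateWithConfidence, escalateToHuman, and logInteraction are hypothetical stand-ins for your own knowledge base, response, queue, and logging layers.

// Minimal sketch of the five-step loop. Everything except classifyIntent
// is a hypothetical placeholder for your own infrastructure.
const CONFIDENCE_THRESHOLD = 0.72; // calibrate against your own ticket history

const handleSupportMessage = async (message) => {
  // 1. Classify intent, urgency, and sentiment
  const intent = await classifyIntent(message);

  // 2. Retrieve relevant knowledge for the classified category
  const context = await retrieveContext(intent.category, message);

  // 3. Generate a draft answer plus a self-rated confidence score
  const draft = await generateWithConfidence(message, context);

  // 4. Route: answer only when confidence is high and severity is low
  const mustEscalate =
    intent.tier >= 3 ||
    intent.urgency === "critical" ||
    draft.confidence < CONFIDENCE_THRESHOLD;

  const outcome = mustEscalate
    ? await escalateToHuman({ message, intent, draft })
    : { type: "resolved", reply: draft.reply };

  // 5. Log everything for QA and threshold calibration
  await logInteraction({ message, intent, draft, outcome });
  return outcome;
};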
The part everyone skips: intent classification
Most tutorials jump straight to "call the API and return the response." That works for a demo. In production, you need to know what kind of request you're dealing with before you generate anything. Get this wrong and you're routing compliance incidents to your FAQ bot and sending confused users to your escalation queue.
Here's the classifier prompt I use for an AP/AR context. It returns structured JSON that drives all the downstream routing logic:
// The classifier runs BEFORE the main response generation
// It returns structured data, not a chat response
const classifyIntent = async (message) => {
  const response = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Required by the Messages API; key sourcing here assumes a Node environment
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01"
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-20250514",
      max_tokens: 256,
      system: `You are an intent classifier for a fintech AP/AR support system.
Return ONLY valid JSON:
{
"tier": 1|2|3|4,
"category": "payment_failure"|"reconciliation"|"access"|"compliance"|"general",
"urgency": "low"|"medium"|"high"|"critical",
"sentiment": "neutral"|"frustrated"|"distressed",
"confidence": 0.0-1.0,
"summary": "one sentence"
}
T1: how-to, access. T2: payment delays, ERP errors. T3: funds errors, compliance. T4: outage, breach.
Return ONLY the JSON.`,
      messages: [{ role: "user", content: message }]
    })
  });
  const data = await response.json();
  return JSON.parse(data.content[0].text);
};
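Called on a typical inbound message, usage looks like this (the returned values are illustrative, not fixed):

// Example usage: the structured output drives routing, not the customer-facing reply
const intent = await classifyIntent(
  "Our ACH run to three vendors failed this morning and close is tomorrow."
);
// e.g. { tier: 2, category: "payment_failure", urgency: "high",
//        sentiment: "frustrated", confidence: 0.86, summary: "..." }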
The key insight: by separating classification from response generation, you get a clean decision point. The classifier output determines whether you even attempt an AI response, or route immediately to a human queue with the classification as context.
Confidence scoring: the mechanism that makes handoffs smart
Once you've classified the intent and retrieved relevant knowledge, you run response generation — but with a second ask baked in: rate your own confidence.
This isn't magic. LLMs are reasonably good at knowing when they're on solid ground versus when they're extrapolating. A question about resetting two-factor authentication? The model either has that in its knowledge base or it doesn't, and it usually knows which. A question about why a specific ACH transaction returned an R29 code and whether it triggers a NACHA reporting obligation? Uncertain territory — and a well-prompted model will tell you so.
In testing across AP/AR support scenarios, a confidence threshold of 0.72 worked well as the auto-respond cutoff. Below that, escalation with the AI-generated summary consistently outperformed attempting a low-confidence answer — customers preferred "I'm getting a specialist who can confirm this" over a hedged non-answer.
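Here's a minimal sketch of that generation step under the same assumptions as the classifier. The prompt wording, the shape of the retrieved context, and the routeDraft helper are illustrative, not the demo's exact implementation.

// Ask for the answer AND a self-rated confidence in one structured reply.
const CONFIDENCE_THRESHOLD = 0.72;

const generateWithConfidence = async (message, context) => {
  const response = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": process.env.ANTHROPIC_API_KEY,
      "anthropic-version": "2023-06-01"
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      system: `You are a support agent for a fintech AP/AR product.
Answer using ONLY the provided context. Return ONLY valid JSON:
{ "reply": "...", "confidence": 0.0-1.0, "reasoning": "one sentence on what you are unsure about" }
Context:
${context}`,
      messages: [{ role: "user", content: message }]
    })
  });
  const data = await response.json();
  return JSON.parse(data.content[0].text);
};

// Below the threshold, a graceful handoff beats a hedged non-answer.
const routeDraft = (draft, intent) =>
  draft.confidence >= CONFIDENCE_THRESHOLD && intent.tier <= 2
    ? { type: "resolved", reply: draft.reply }
    : { type: "escalated", reason: draft.reasoning };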
The escalation handoff: what it looks like in practice
When the agent escalates, it doesn't just punt. It passes a structured summary to the human queue: the classified intent, urgency, sentiment, the conversation so far, and a plain-language description of why it escalated. The human CSM picks up the ticket already knowing what they're dealing with.
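In code, the handoff can be as small as one well-structured object posted to the human queue. The field names and the createTicket call below are placeholders for whatever your support system expects:

// Hypothetical handoff payload: everything the human needs, pre-filled.
const escalateToHuman = async ({ message, intent, draft, transcript }) => {
  const ticket = {
    subject: intent.summary,
    tier: intent.tier,
    urgency: intent.urgency,
    sentiment: intent.sentiment,
    escalationReason: draft
      ? `AI confidence ${draft.confidence} below threshold: ${draft.reasoning}`
      : "Routed straight from classification; no AI answer was attempted",
    conversation: transcript ?? [message]
  };
  // createTicket stands in for your Zendesk / Intercom / Linear integration
  return createTicket(ticket);
};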
This is the piece most AI support demos skip entirely. They show the bot answering. They don't show what happens when the bot can't answer — and that's exactly what gets a fintech company in trouble at 11pm before a payment run.
Working demo — try it yourself
The demo below uses the Claude API to run a real intent classification and response pipeline against a simulated AP/AR knowledge base. Try the sample messages or type your own. Watch how the tier classification and confidence score change — and notice when it escalates versus resolves.
What to build next
This demo shows the core loop. Production readiness requires a few more layers:
A real knowledge base. Replace the simulated KB with a vector store (Supabase pgvector is a good starting point) populated with your actual documentation, resolved ticket history, and known-issue logs. The retrieval step is what separates a useful agent from a hallucinating one.
A human queue with structured handoff data. Wire the escalation path to your actual support system — Zendesk, Intercom, Linear, whatever you use. The AI's classification and summary should auto-populate the ticket so the human CSM is never starting from scratch.
An evaluation loop. Log every interaction. Tag escalations by whether the human agreed the escalation was warranted. Feed misclassifications back into your system prompt and threshold calibration. This is how you get from 70% first-call resolution to 85%.
Compliance guardrails. For fintech specifically: add a hard override that routes any message containing keywords like "duplicate payment," "NACHA," "audit," or "SOX" directly to Tier 3 regardless of the model's classification. Don't let a confident LLM talk a compliance incident down to a how-to question.
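A sketch of that override, applied to the classifier output before any routing decision. The keyword list is a starting point to maintain with your compliance team, not an exhaustive set:

// Hard override: these phrases force Tier 3 regardless of the model's classification.
const COMPLIANCE_KEYWORDS = ["duplicate payment", "nacha", "audit", "sox"];

const applyComplianceOverride = (message, intent) => {
  const text = message.toLowerCase();
  const hit = COMPLIANCE_KEYWORDS.find((kw) => text.includes(kw));
  if (hit && intent.tier < 3) {
    return { ...intent, tier: 3, urgency: "high", overrideReason: `keyword: ${hit}` };
  }
  return intent;
};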
The thing I want you to remember
AI support agents are not a headcount reduction strategy dressed up in friendly UI. The best implementations I've seen treat the AI as the first line of a triage system, not a replacement for the humans who understand the domain, carry the relationships, and can make judgment calls that no model should be making alone.
The measure of a good agent is not how rarely it escalates. It's how well it escalates when it needs to — fast, with full context, without making the customer feel like they've been passed off.
That's a CS standard, not a technology standard. It just happens to require technology to implement at scale.
I'd genuinely love to hear how it goes — especially if you're working in payments, AP/AR, or regulated financial operations. Reach me at sonya.freeney@gmail.com or LinkedIn.