Most Shopify storefronts have a conversion gap that nobody really talks about. A shopper lands on a product page, has a question (will this fit? does it ship to my country? do you have it in black?), and there is nobody there to answer. The default chat widget is usually a glorified "leave us a message" form, which is fine for support tickets but useless for a buying decision that has a five-minute window before the tab closes.
I wanted something that could actually behave like a salesperson on the floor. Greet the shopper, search the catalog, pull up an order, push a code when it makes sense, and quietly tag the lead for follow-up if they bounce. That became Chatflo, a Shopify app I shipped on Claude Sonnet 4.6 using the Vercel AI SDK and Cloud Run.
This post is a tour of the parts that mattered.
## The shape of the stack
The whole thing is a single React Router app (forked from Shopify's React Router template), running in a container on Cloud Run, talking to:
- Claude Sonnet 4.6 via `@ai-sdk/anthropic` for the agent loop, and Haiku 4.5 for cheap follow-up suggestions.
- Shopify Admin GraphQL for product search, order lookups, and minting real discount codes.
- Postgres + Prisma for chatbot config, conversation state, leads, and per-shop usage counters.
- Server-Sent Events (SSE) to stream tool calls and token deltas to a theme app extension on the storefront.
The interesting part isn't the framework choice. It's how the agent loop, the Shopify tools, and the streaming surface all fit together without melting the bill or the latency budget.
## The agent loop is just `streamText` with stop conditions
A quick note on naming, because this trips people up: the "AI SDK" I keep referring to is Vercel's AI SDK (the `ai` package), not Anthropic's Claude Agent SDK. They solve overlapping problems from different ends. Vercel's SDK is a provider-agnostic streaming primitive: you bring your own tools and your own loop, and it gives you a clean stream of tool calls and text deltas. Anthropic's Agent SDK is "Claude Code as a library": opinionated, batteries included (file I/O, bash, MCP, subagents), and excellent if you're building something Claude-Code-shaped. For a shopper-facing chatbot with custom Shopify tools and no filesystem in sight, Vercel's SDK was the cleaner fit.
I went back and forth between writing a hand-rolled tool loop on top of the Anthropic SDK and just using the AI SDK's `streamText`. The AI SDK won because it gives you three things that are tedious to build correctly: multi-step tool execution, a normalized stream you can fan out, and provider-agnostic prompt caching hooks.

The core of `runAgent` is roughly this:
```ts
import { streamText, stepCountIs, hasToolCall } from "ai";
import { aiAnthropic, DEFAULT_AGENT_MODEL } from "./ai.server";

const result = streamText({
  model: aiAnthropic(DEFAULT_AGENT_MODEL),
  messages: [systemMessage, ...withMessageCaching(history, model)],
  tools: withToolCaching(tools, model),
  temperature: chatbot.temperature,
  maxOutputTokens: chatbot.maxTokens,
  stopWhen: [
    stepCountIs(MAX_STEPS),
    hasToolCall("show_lead_form"),
    hasToolCall("soft_capture_email"),
  ],
});
```

`stopWhen` is the part I wish I'd discovered sooner. By default the SDK will keep looping until the model stops calling tools, which for a sales agent is exactly what you don't want when a UI tool fires. If the model calls `show_lead_form`, the next step shouldn't be more text; the conversation should pause and wait for the shopper to submit the form. `hasToolCall` lets you express that as data instead of as a hand-written state machine.
The same pattern works as a safety rail. `stepCountIs(5)` caps a runaway tool loop at five steps, which is plenty for "search → narrow → recommend → add to cart" and short enough that a confused model can't burn through a shop's monthly token cap in a single turn.
## Tools are where the product actually lives
The model on its own is a chatbot. The tools are what make it a salesperson. Chatflo registers ~10 tools per conversation, gated by what the merchant has turned on:
- `search_products`, `recommend_products`, `browse_collections` — read-only catalog access.
- `add_to_cart` — returns a payload the storefront client uses to call `/cart/add.js`.
- `lookup_order`, `get_my_orders`, `get_customer_profile` — only registered when the shopper is signed in (verified via Shopify App Proxy HMAC, more on that below).
- `show_lead_form`, `soft_capture_email` — UI tools that halt the loop and ask the shopper for an email.
- `give_discount` — mints a real, single-use Shopify discount code via `discountCodeBasicCreate`.
- `request_human_handoff` — flags the conversation for the merchant's inbox.
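The gating itself is just conditional construction of the tool map before the agent run. A minimal sketch, assuming the app's own tool modules and an `AgentContext` shape along these lines (all names here are illustrative, not the real module layout):

```ts
import type { ToolDefinition, AgentContext } from "./agent.types"; // hypothetical
import {
  searchProducts, recommendProducts, browseCollections,
  addToCart, lookupOrder, getMyOrders, giveDiscount,
} from "./tools"; // hypothetical

function buildTools(ctx: AgentContext): Record<string, ToolDefinition> {
  // Catalog tools are always on; everything else is feature- or auth-gated.
  const tools: Record<string, ToolDefinition> = {
    search_products: searchProducts,
    recommend_products: recommendProducts,
    browse_collections: browseCollections,
  };
  if (ctx.chatbot.cartEnabled) tools.add_to_cart = addToCart;
  // Order tools only exist when the App Proxy HMAC verified a signed-in customer.
  if (ctx.loggedInCustomerId) {
    tools.lookup_order = lookupOrder;
    tools.get_my_orders = getMyOrders;
  }
  // No configured offers means the discount tool is never even registered.
  if (ctx.sales?.offers?.length) tools.give_discount = giveDiscount;
  return tools;
}
```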
Each tool is a small object with a Zod schema and an `execute` function:
```ts
const giveDiscount: ToolDefinition = {
  name: "give_discount",
  description:
    "Issue a discount code to the shopper. Gated: only call when the shopper has hesitated about price OR has been engaged for 5+ turns without converting, AND an email has been captured. Never offer unprompted on the first message.",
  schema: z.object({
    offer_id: z.string().optional(),
  }),
  execute: async (args, ctx) => {
    const offers = ctx.sales?.offers ?? [];
    if (offers.length === 0) {
      return {
        ok: false,
        error: "no_offers_configured",
        summary: "No offers configured",
      };
    }
    const hasEmail = await emailCapturedForConversation(ctx.conversationId);
    if (!hasEmail) {
      return {
        ok: false,
        error: "gate_no_email",
        summary: "Discount blocked — capture an email first",
      };
    }
    const target = offers[0];
    const minted = await mintDynamicDiscountCode(ctx, target);
    // ... persist conversion event, return the live code
  },
};
```

Two things worth pulling out from this:
The first is that the gates live in the tool, not in the prompt. You can
write "only offer a discount after capturing an email" into the system
prompt and the model will mostly listen, but "mostly" is not what you want
when the alternative is bleeding margin. By returning `gate_no_email` from
the tool itself, the model gets a clear, machine-readable signal that the
offer can't go out yet, and the tool can never fire prematurely no matter
how the prompt drifts.
The second is that `give_discount` calls Shopify's `discountCodeBasicCreate` mutation and returns a real, redeemable code. The shopper sees `CHAT-7K2P9X` in the chat, copies it to checkout, and Shopify honors it because it actually exists. That sounds obvious in retrospect, but the lazy version (let the model invent a code and pray) is what most AI chatbot demos ship with.
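For reference, the mint call is a thin wrapper over that mutation. A hedged sketch, where `ctx.admin.graphql`, `AgentContext`, and `Offer` are illustrative stand-ins for the app's own plumbing; the GraphQL document targets Shopify's real `discountCodeBasicCreate`:

```ts
import { randomUUID } from "node:crypto";

const MINT_CODE = `#graphql
  mutation MintChatDiscount($discount: DiscountCodeBasicInput!) {
    discountCodeBasicCreate(basicCodeDiscount: $discount) {
      codeDiscountNode { id }
      userErrors { field message }
    }
  }`;

async function mintDynamicDiscountCode(ctx: AgentContext, offer: Offer) {
  const code = `CHAT-${randomUUID().replace(/-/g, "").slice(0, 6).toUpperCase()}`;
  const response = await ctx.admin.graphql(MINT_CODE, {
    variables: {
      discount: {
        title: `Chatflo: ${offer.label}`,
        code,
        startsAt: new Date().toISOString(),
        usageLimit: 1,                 // single-use
        appliesOncePerCustomer: true,
        customerSelection: { all: true },
        customerGets: {
          value: { percentage: offer.percentOff / 100 }, // 10 becomes 0.1
          items: { all: true },
        },
      },
    },
  });
  // ... check userErrors, persist the conversion event, return the live code
  return code;
}
```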
## The prompt is mostly a wrapper around merchant config
The system prompt is built per-turn from the merchant's onboarding config: business profile, tone, FAQs, configured offers, objection-handling notes, and social proof quotes. This is also where the live page context goes — the URL the shopper is on, the current product, what's in their cart — so the model doesn't have to ask "what are you looking at?" when it's literally in the request payload.
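Concretely, it's just string assembly. A sketch with hypothetical config shapes (the real config is wider than this):

```ts
// Hypothetical shapes; illustrative only.
type ChatbotConfig = {
  businessName: string;
  tone: string;
  faqs: { q: string; a: string }[];
  offers: { label: string }[];
};
type PageContext = {
  url: string;
  product?: { title: string };
  cart?: { items: { title: string }[] };
};

// Assembles the per-turn system prompt from merchant config plus the live
// page context, so the model never has to ask what the shopper is viewing.
function buildSystemPrompt(chatbot: ChatbotConfig, page: PageContext): string {
  return [
    `You are a sales assistant for ${chatbot.businessName}. Tone: ${chatbot.tone}.`,
    chatbot.faqs.map((f) => `Q: ${f.q}\nA: ${f.a}`).join("\n"),
    `Shopper is currently on: ${page.url}`,
    page.product ? `Current product: ${page.product.title}` : "",
    page.cart?.items.length
      ? `Cart contents: ${page.cart.items.map((i) => i.title).join(", ")}`
      : "",
  ]
    .filter(Boolean)
    .join("\n\n");
}
```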
I leaned hard on Anthropic's prompt caching to keep this affordable. The AI SDK exposes provider options on every message, which means you can mark specific blocks for caching:
```ts
const systemMessage: ModelMessage = {
  role: "system",
  content: system,
  providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
};
```

There's a four-breakpoint limit per request. I use three: the system prompt, the last tool definition (which caches the entire tool block), and the last message in history (which caches the conversation prefix incrementally as turns accumulate). The helpers look like this:
```ts
export function withMessageCaching(
  messages: ModelMessage[],
  model: LanguageModel
): ModelMessage[] {
  if (!isAnthropicModel(model)) return messages;
  if (messages.length === 0) return messages;
  return messages.map((message, index) =>
    index === messages.length - 1
      ? {
          ...message,
          providerOptions: {
            ...message.providerOptions,
            ...ANTHROPIC_CACHE,
          },
        }
      : message
  );
}
```

On a typical 4-turn conversation with ~3k tokens of system prompt and tool definitions, this drops the per-turn input cost by roughly 80%. The 5-minute TTL is short, but a shopper who is actively typing trips it on every turn, which is exactly the case you want to optimize for.
## Streaming through Shopify's App Proxy
The storefront chat widget can't talk to my Cloud Run instance directly. Shopify proxies the request through `/apps/chatflo/chat` so it stays on the merchant's domain. Every proxied request is signed with HMAC, and the `logged_in_customer_id` query param is only trustworthy after you verify that signature.
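`verifyAppProxySignature` follows the standard recipe from Shopify's app proxy docs: drop the `signature` param, sort the rest, concatenate the `key=value` pairs with no separator, and compare an HMAC-SHA256 hex digest against the `signature` value. A sketch:

```ts
import crypto from "node:crypto";

// App proxy signature check, per Shopify's documented scheme.
// Multi-value params are comma-joined before signing.
export function verifyAppProxySignature(url: URL, secret: string): boolean {
  const params = new Map<string, string[]>();
  for (const [key, value] of url.searchParams) {
    if (key === "signature") continue;
    params.set(key, [...(params.get(key) ?? []), value]);
  }
  const message = [...params.keys()]
    .sort()
    .map((key) => `${key}=${params.get(key)!.join(",")}`)
    .join("");
  const digest = crypto
    .createHmac("sha256", secret)
    .update(message)
    .digest("hex");
  const signature = url.searchParams.get("signature") ?? "";
  return (
    signature.length === digest.length &&
    crypto.timingSafeEqual(Buffer.from(digest), Buffer.from(signature))
  );
}
```

The route handler then refuses anything that fails the check: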
```ts
export const action = async ({ request }: ActionFunctionArgs) => {
  const url = new URL(request.url);
  const secret = process.env.SHOPIFY_API_SECRET ?? "";
  if (!verifyAppProxySignature(url, secret)) {
    return unauthorized("bad_signature");
  }
  const { shop, loggedInCustomerId } = readAppProxyContext(url);
  // ... safe to use loggedInCustomerId for order lookups now
};
```

Once the request is verified, the agent runs as an async generator and each event is pushed down the wire as SSE:
```ts
export function agentEventStream(
  gen: AsyncGenerator<unknown>
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  return new ReadableStream({
    async start(controller) {
      try {
        for await (const ev of gen) {
          const line = `data: ${JSON.stringify(ev)}\n\n`;
          controller.enqueue(encoder.encode(line));
        }
      } finally {
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
      }
    },
  });
}
```

The events are typed: `text_delta`, `tool_start`, `tool_end`, `tool_data`, `ui_component`, `suggestions`, `done`, `error`. The widget renders them differently: tool starts become little badges ("searching products..."), `tool_data` for `search_products` becomes a swipeable product carousel, `ui_component` swaps the input out for an inline lead form. Everything is incremental, so the shopper sees the agent doing things rather than staring at a typing indicator while the model burns through a tool loop in the background.
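For completeness, the union the widget switches on looks roughly like this (field names are illustrative; the real payloads carry more):

```ts
type AgentEvent =
  | { type: "text_delta"; delta: string }
  | { type: "tool_start"; tool: string }               // "searching products..." badge
  | { type: "tool_end"; tool: string }
  | { type: "tool_data"; tool: string; data: unknown } // e.g. product carousel payload
  | { type: "ui_component"; component: "lead_form" }   // swaps the input for a form
  | { type: "suggestions"; items: string[] }
  | { type: "done" }
  | { type: "error"; message: string };
```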
## Cloud Run was the boring, correct choice
I tried Vercel first. It works, but a streaming agent with 30-60 second tool loops fights the serverless model. Cold starts compound with first-token latency, and you end up paying for connection time you didn't budget for. Cloud Run gives you a long-lived container, generous concurrency per instance, and a simple Dockerfile contract.
The Dockerfile is a stock multi-stage Node 22 build:
```dockerfile
FROM node:22-alpine AS deps
RUN apk add --no-cache openssl
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --legacy-peer-deps

FROM node:22-alpine AS build
RUN apk add --no-cache openssl
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npx prisma generate
RUN npm run build

FROM node:22-alpine AS runtime
WORKDIR /app
ENV NODE_ENV=production
COPY package.json package-lock.json ./
RUN npm ci --omit=dev --legacy-peer-deps && npm cache clean --force
COPY --from=build /app/build ./build
COPY --from=build /app/node_modules/.prisma ./node_modules/.prisma
COPY --from=build /app/node_modules/@prisma ./node_modules/@prisma
ENV PORT=8080
EXPOSE 8080
CMD ["npx", "react-router-serve", "./build/server/index.js"]
```

A few things I learned the hard way:
- `openssl` is not in `node:22-alpine` by default. Prisma needs it; the build silently produces a broken client without it.
- Cloud Run injects `$PORT`. `react-router-serve` already respects it, but if you hardcode 3000 anywhere, the health check will hang and the deploy will roll back with a vague error.
- Set min instances to 1 for the chat path. Cold starts on the agent route are painful. The shopper is right there, watching. The cost of one always-on instance is a rounding error compared to losing a sale because the first token took eight seconds.
- Concurrency 80 works fine. The agent is mostly I/O-bound (Anthropic, Shopify GraphQL, Postgres) so a single small instance handles a lot of conversations.
Deploys are `gcloud run deploy chatflo --source .` from CI. Secrets live in Secret Manager and get mounted as env vars.
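The full invocation, with the knobs from the list above, looks roughly like this (region and secret names are illustrative):

```sh
gcloud run deploy chatflo \
  --source . \
  --region us-central1 \
  --min-instances 1 \
  --concurrency 80 \
  --allow-unauthenticated \
  --set-secrets "ANTHROPIC_API_KEY=anthropic-api-key:latest,SHOPIFY_API_SECRET=shopify-api-secret:latest,DATABASE_URL=database-url:latest"
```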
## What I'd do differently
Three things, in order of regret:
**Start with the eval harness, not the agent.** I spent two weeks tweaking prompts before I had a way to measure whether tweaks actually made the agent better. A lightweight harness that replays 50 canned conversations and diffs tool-call sequences would have saved me from a lot of vibes-based prompt engineering.
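Such a harness doesn't need to be clever. A hypothetical sketch, where `runAgent` and the fixture shape are stand-ins for the app's real entry point:

```ts
import { readFileSync } from "node:fs";
import { runAgent } from "./agent.server"; // hypothetical import

type Fixture = { name: string; turns: string[]; expectedTools: string[] };

const fixtures: Fixture[] = JSON.parse(
  readFileSync("evals/fixtures.json", "utf8")
);

for (const fx of fixtures) {
  const called: string[] = [];
  for (const turn of fx.turns) {
    // Drain the agent's event stream, recording only which tools fired.
    for await (const ev of runAgent({ message: turn })) {
      if (ev.type === "tool_start") called.push(ev.tool);
    }
  }
  const pass = JSON.stringify(called) === JSON.stringify(fx.expectedTools);
  console.log(pass ? "PASS" : "FAIL", fx.name);
  if (!pass) console.log({ got: called, want: fx.expectedTools });
}
```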
**Treat the conversation state machine as a first-class concept.** The agent's stage — greeting, qualifying, recommending, closing — ended up scattered across the prompt, the tool gates, and an intent classifier. I should have modeled it explicitly from day one and let the agent transition through it via a tool, the way the lead-form halt already works.
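Concretely, that would be a stage enum plus a transition tool in the same shape as the tools above. This is the retrofit I'd do, not shipped code, and `persistStage` is a hypothetical helper:

```ts
const STAGES = ["greeting", "qualifying", "recommending", "closing"] as const;
type Stage = (typeof STAGES)[number];

// The agent advances the stage explicitly, and the gate logic lives in
// code, just like give_discount's email gate.
const setStage: ToolDefinition = {
  name: "set_stage",
  description: "Advance the conversation to the next sales stage.",
  schema: z.object({ stage: z.enum(STAGES) }),
  execute: async (args, ctx) => {
    const from = STAGES.indexOf(ctx.stage);
    const to = STAGES.indexOf(args.stage);
    // Forward-only, so a confused model can't regress to greeting mid-close.
    if (to < from) return { ok: false, error: "no_backward_transition" };
    await persistStage(ctx.conversationId, args.stage); // hypothetical helper
    return { ok: true, stage: args.stage };
  },
};
```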
**Cache the tool block aggressively from the start.** I left ~40% on the table for the first month because I assumed prompt caching only mattered for the system prompt. The tool definitions are usually the largest stable prefix in an agent request. Cache them.
## Closing
The thing that surprised me most about building on Claude through the AI
SDK is how little glue code there ended up being. The agent loop is a
single `streamText` call. The "is this a real salesperson" feel comes from
the tools and the gates, not from clever prompting. Once the tool surface
is right, the model is mostly trying to be helpful in the direction you've
already pointed it.
If you're looking at a similar build — an embedded agent on someone else's platform, with real side effects and real money on the line — the parts that took the most thought were the boring ones: HMAC verification, idempotent single-flight per conversation, PII redaction before persisting, gating discounts behind email capture. The model is the easy part now. The infrastructure that makes it safe to deploy is where the work is.
Chatflo is live on the Shopify App Store. If you're a merchant, you can install it. If you're building something similar and want to compare notes, my DMs are open.