Beyond a Thin Wrapper: Building Production-Ready Chatbots on Top of ChatGPT
LLM providers keep shipping features that change how we build chat experiences—streaming responses, function/tool calling, structured outputs, JSON modes, and better moderation/filters. Frameworks and SDKs have matured for frontend streaming and backend orchestration, while vector tooling and caching continue to cut latency and cost. Despite this, many teams still ship a thin “prompt in, text out” wrapper and struggle with reliability, safety, and scale.
This post outlines a pragmatic architecture for going from a simple wrapper to a robust chatbot using:
- Next.js for streaming UX and edge-friendly routing
- Spring Boot for a typed, enforceable API boundary and orchestration
- Redis for fast state/caching/rate limits
- MongoDB for durable chat history and domain data (RAG)
- Optional vector index for retrieval augmentation
What a Simple Wrapper Misses
- No guarantees on output structure (brittle parsing of free text)
- Context bloat and runaway token usage
- Latency spikes from network + tool calls
- Missing safety rails and PII handling
- Poor observability: hard to debug failures and regressions
- No state model for memory beyond a growing transcript
- Rate limiting, retries, and backpressure handled ad hoc
Reference Architecture
- Frontend (Next.js)
  - Stream assistant tokens to the client via SSE or fetch + ReadableStream
  - Route handlers protect API keys on the server side
  - UI shows partial tokens, tool progress, cost/latency hints
- API Gateway (Spring Boot)
  - One entrypoint for chat events: message, tool request, rating/feedback
  - Applies auth, quotas, idempotency, and request validation
  - Emits observability spans and structured logs
- Orchestrator (can live inside the Spring service or as a microservice)
  - Builds prompts with system + policies + context
  - Invokes the model with function/tool schemas for structured outputs
  - Handles retries, fallbacks, and content moderation
- State & Data
  - Redis: conversation windows, semantic caches, rate limiting, dedupe
  - MongoDB: durable chat history, user profiles, domain docs for RAG
  - Vector index (optional): embeddings for retrieval augmentation
- Observability & Ops
  - Traces for every LLM call and tool step
  - Token accounting (prompt vs. completion)
  - Evaluation harness for prompts and regressions
Minimal Streaming Flow (Frontend)
// Next.js route (server): POST /api/chat
// - Validates input
// - Calls backend gateway
// - Streams tokens to client
export async function POST(req) {
  const { messages, sessionId } = await req.json();
  const response = await fetch(process.env.BACKEND_URL + "/chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ messages, sessionId, stream: true })
  });
  // Surface upstream failures instead of streaming an empty body
  if (!response.ok || !response.body) {
    return new Response("Upstream error", { status: 502 });
  }
  return new Response(response.body, {
    headers: { "content-type": "text/event-stream", "cache-control": "no-cache" }
  });
}
Backend Gateway Sketch (Spring)
// Receives a chat request, applies quotas, and streams model output as SSE
@PostMapping(value = "/chat", produces = "text/event-stream")
public SseEmitter chat(@RequestBody ChatRequest req) {
  validate(req);
  enforceQuota(req.userId());
  var emitter = new SseEmitter(0L); // no timeout; completion is driven by the orchestrator
  orchestrator.stream(req,
      chunk -> {
        try { emitter.send(chunk); } catch (IOException e) { emitter.completeWithError(e); }
      },
      emitter::completeWithError,
      emitter::complete);
  return emitter;
}
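The sketch above assumes a ChatRequest type but never defines one. A minimal shape it might carry (field names are illustrative, not a fixed contract):
import java.util.List;

// Hypothetical request shape for the gateway; adjust fields to your domain.
public record ChatRequest(
    String sessionId,          // conversation/session identifier
    String userId,             // resolved from auth at the gateway, not trusted from the client
    List<Message> messages,    // windowed transcript sent by the client
    boolean stream) {          // whether the caller expects SSE

  public record Message(String role, String content) {}
}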
Prompt and Tool Strategy
- System policy prompts: role, tone, compliance boundaries
- Tool calling for deterministic operations (search, DB lookup, calculations)
- Structured outputs for parsing safety (JSON schema, enums, number ranges)
- Guardrails in the prompt for citations and refusal conditions
- Few-shot examples kept small; prefer retrieval of fresh, relevant docs
Example tool contract:
{
  "name": "fetch_order_status",
  "description": "Get order status by id.",
  "parameters": {
    "type": "object",
    "properties": { "orderId": { "type": "string" } },
    "required": ["orderId"]
  }
}
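On the backend, a tool call from the model typically arrives as a tool name plus a JSON argument string. A minimal dispatch sketch using Jackson, building on the contract above (the ToolHandler interface and lookupOrderStatus helper are illustrative stand-ins):
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

// Maps tool names from the model to deterministic handlers; unknown tools are rejected.
public class ToolDispatcher {
  private final ObjectMapper mapper = new ObjectMapper();
  private final Map<String, ToolHandler> handlers = Map.of(
      "fetch_order_status", (ToolHandler) args -> lookupOrderStatus(args.get("orderId").asText()));

  public String dispatch(String toolName, String argumentsJson) throws Exception {
    ToolHandler handler = handlers.get(toolName);
    if (handler == null) {
      return "{\"error\":\"unknown tool: " + toolName + "\"}";
    }
    JsonNode args = mapper.readTree(argumentsJson); // arguments arrive as a JSON string
    return handler.handle(args);                    // result goes back to the model as the tool reply
  }

  // Stand-in for a real DB/API lookup.
  private String lookupOrderStatus(String orderId) {
    return "{\"orderId\":\"" + orderId + "\",\"status\":\"SHIPPED\"}";
  }

  interface ToolHandler {
    String handle(JsonNode args) throws Exception;
  }
}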
Retrieval-Augmented Generation (RAG) That Actually Helps
- Chunk domain docs with overlap and stable IDs
- Use a reliable embedding model; store vectors with metadata
- Hybrid search (semantic + keyword) improves recall for numbers and codes
- Send only top-k passages and cite them; keep the window tight
- Cache retrieval results in Redis, keyed by query and content version (sketched below)
Data freshness:
- Invalidate embeddings on document updates with versioned keys
- Prefer on-demand embedding for rapidly changing content
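One way to implement those versioned cache keys with Spring's StringRedisTemplate; the Retriever interface, key layout, and TTL are assumptions:
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

// Caches retrieval results keyed by a query hash and a corpus version,
// so a document update (new version) naturally invalidates old entries.
public class RetrievalCache {
  private final StringRedisTemplate redis;
  private final Retriever retriever; // wraps the vector/keyword index

  public RetrievalCache(StringRedisTemplate redis, Retriever retriever) {
    this.redis = redis;
    this.retriever = retriever;
  }

  public String retrieve(String query, String corpusVersion) {
    String key = "rag:" + corpusVersion + ":" + Integer.toHexString(query.hashCode());
    String cached = redis.opsForValue().get(key);
    if (cached != null) {
      return cached;                                  // cache hit: skip the search entirely
    }
    String passages = retriever.searchTopK(query);    // hypothetical hybrid-search call
    redis.opsForValue().set(key, passages, Duration.ofHours(6));
    return passages;
  }

  public interface Retriever {
    String searchTopK(String query);
  }
}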
Cost, Latency, and Reliability
- Budget tokens per turn (hard cap) and summarize when nearing limits
- Use streaming to improve perceived latency; prefetch likely tools
- Cache expensive tool results and common system prompts
- Retries: exponential backoff with jitter; classify transient vs. fatal (see the sketch after this list)
- Fallbacks: smaller model or shorter context when deadlines loom
- Batch embeddings and metadata lookups
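A minimal retry helper matching the backoff-with-jitter bullet above; the TransientException marker is an assumption, and mapping provider errors onto it is left to you:
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Retries a call on transient failures with exponential backoff plus jitter;
// anything not marked transient (bad request, auth) propagates immediately.
public final class Retry {
  public static <T> T withBackoff(Callable<T> call, int maxAttempts) throws Exception {
    long baseDelayMs = 250;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.call();
      } catch (TransientException e) {               // e.g. 429s, 5xx, timeouts
        if (attempt >= maxAttempts) throw e;
        long backoff = baseDelayMs * (1L << (attempt - 1));
        long jitter = ThreadLocalRandom.current().nextLong(backoff / 2 + 1);
        Thread.sleep(backoff + jitter);
      }
    }
  }

  // Marker for errors worth retrying; wrap provider-specific errors in it.
  public static class TransientException extends RuntimeException {
    public TransientException(String msg, Throwable cause) { super(msg, cause); }
  }
}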
Memory: Short-Term vs. Long-Term
- Short-term: windowed conversation state in Redis (n most recent turns; sketched below)
- Long-term: episode summaries in MongoDB linked to session
- Summarize on thresholds; store structured facts separately (key/value)
- Forgetting policy: decay or pin critical facts
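A sketch of the short-term window as a Redis list per session; the key format and the 20-turn cap are arbitrary choices, and summarization of evicted turns is assumed to happen elsewhere:
import java.util.List;
import org.springframework.data.redis.core.StringRedisTemplate;

// Keeps only the n most recent turns per session in Redis; older turns are
// expected to be summarized into MongoDB by a separate job.
public class ConversationWindow {
  private static final int MAX_TURNS = 20;
  private final StringRedisTemplate redis;

  public ConversationWindow(StringRedisTemplate redis) {
    this.redis = redis;
  }

  public void append(String sessionId, String turnJson) {
    String key = "session:" + sessionId + ":turns";
    redis.opsForList().leftPush(key, turnJson);
    redis.opsForList().trim(key, 0, MAX_TURNS - 1);   // drop turns beyond the window
  }

  public List<String> window(String sessionId) {
    return redis.opsForList().range("session:" + sessionId + ":turns", 0, -1);
  }
}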
Safety and Compliance
- Pre-call input filter: PII masking for logs, policy checks (masking sketch below)
- Post-call output filter: toxicity, prompt leakage, and PII detection
- Red-team test sets baked into CI
- Signed audit logs for admin and data access tools
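A minimal log-masking filter, assuming simple regexes for emails and card-like numbers; real deployments usually layer a dedicated PII/NER service on top of something like this:
import java.util.regex.Pattern;

// Masks obvious PII before text is written to logs or traces.
// The patterns are intentionally simple; treat them as a first line of defense only.
public final class LogMasker {
  private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+\\.[\\w.]+");
  private static final Pattern CARD  = Pattern.compile("\\b(?:\\d[ -]?){13,16}\\b");

  public static String mask(String text) {
    String masked = EMAIL.matcher(text).replaceAll("[email]");
    return CARD.matcher(masked).replaceAll("[card]");
  }
}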
Observability and Evaluation
- Trace spans per step: prompt build, model call, each tool, retrieval
- Log prompt, sampled outputs, tokens, cost, latency, and errors
- Offline evaluation: correctness on fixtures, hallucination checks, safety tests
- Online evaluation: thumbs up/down, comments, task success, containment rate
Minimal event shape for tracing:
{
  "traceId": "...",
  "step": "llm.call",
  "model": "...",
  "tokens": { "prompt": 512, "completion": 178 },
  "latencyMs": 820,
  "cost": 0.0023,
  "status": "ok"
}
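To produce events in that shape, one option is to wrap each step, measure latency, and log a structured record via SLF4J; field names mirror the example above, and how the record reaches your tracing backend is up to you:
import java.util.Map;
import java.util.function.Supplier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Wraps a step (prompt build, model call, tool, retrieval), measures latency,
// and logs a structured trace event whether it succeeds or fails.
public final class Tracing {
  private static final Logger log = LoggerFactory.getLogger(Tracing.class);

  public static <T> T traced(String traceId, String step, Supplier<T> body) {
    long start = System.nanoTime();
    String status = "ok";
    try {
      return body.get();
    } catch (RuntimeException e) {
      status = "error";
      throw e;
    } finally {
      long latencyMs = (System.nanoTime() - start) / 1_000_000;
      log.info("trace {}", Map.of(
          "traceId", traceId, "step", step, "latencyMs", latencyMs, "status", status));
    }
  }
}
A call such as Tracing.traced(traceId, "llm.call", () -> client.chat(prompt)) (client.chat here is hypothetical) then yields one event per model call.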
Deployment and Scaling
- Frontend: edge runtime for token streaming; fall back to a regional runtime when tool calls are heavy
- Backend: autoscale Spring instances; circuit breakers around LLM and vector services
- Tooling: isolate side effects behind queues; idempotency keys for replays
- Rate limiting: fixed-window + token bucket; per-user and per-API key (fixed-window example below)
- Secrets: server-only access; rotate keys and test with least privilege
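The fixed-window half is a few lines on Redis: increment a per-user counter and expire it with the window. The limit and window here are placeholders, and a token bucket can be layered on for burst control:
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;

// Fixed-window rate limiting: INCR a per-user counter keyed by the current minute.
public class RateLimiter {
  private static final int LIMIT_PER_MINUTE = 30;
  private final StringRedisTemplate redis;

  public RateLimiter(StringRedisTemplate redis) {
    this.redis = redis;
  }

  public boolean allow(String userId) {
    String key = "rl:" + userId + ":" + (System.currentTimeMillis() / 60_000);
    Long count = redis.opsForValue().increment(key);
    if (count != null && count == 1L) {
      redis.expire(key, 2, TimeUnit.MINUTES);   // keep the key slightly past the window, then drop it
    }
    return count != null && count <= LIMIT_PER_MINUTE;
  }
}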
Implementation Blueprint
- Define conversation schema and message store (MongoDB) and a Redis namespace for sessions, rate limits, and caches (an example message document follows this list).
- Build a streaming route in Next.js; render partial tokens and tool progress.
- Add a Spring Boot gateway that normalizes requests, enforces quotas, and emits SSE.
- Introduce structured outputs and tool schemas for critical actions.
- Add RAG with a vector index; implement citations and caching.
- Instrument tracing, metrics, and token/cost logging.
- Ship a safety layer: input/output filters, red-team tests, content policy prompts.
- Optimize latency and cost with caching, batching, and fallbacks.
- Establish evaluation datasets and automate regression checks in CI.
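For the first step, one possible Spring Data mapping for a stored message; the collection name and fields are illustrative, not a prescribed schema:
import java.time.Instant;
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.core.mapping.Document;

// One stored chat message; a session transcript is reconstructed by querying on sessionId.
@Document("chat_messages")
public class ChatMessage {
  @Id
  private String id;
  private String sessionId;
  private String userId;
  private String role;              // "system" | "user" | "assistant" | "tool"
  private String content;
  private Integer promptTokens;     // token accounting per message
  private Integer completionTokens;
  private Instant createdAt;
  // getters/setters omitted for brevity
}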
Common Pitfalls
- Treating the prompt as a monolith instead of modular policies
- Pushing entire chat history each turn (token blowups)
- No idempotency, causing duplicate tool execution (see the guard sketch below)
- Logging raw PII and secrets
- Ignoring non-200 responses and provider-specific error classes
- Over-reliance on one provider without graceful fallback
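Guarding tool execution against replays can be as simple as a set-if-absent check on an idempotency key; the key format and 24-hour retention are assumptions:
import java.time.Duration;
import org.springframework.data.redis.core.StringRedisTemplate;

// Executes a side-effecting tool at most once per idempotency key.
public class IdempotentToolRunner {
  private final StringRedisTemplate redis;

  public IdempotentToolRunner(StringRedisTemplate redis) {
    this.redis = redis;
  }

  public boolean runOnce(String idempotencyKey, Runnable sideEffect) {
    Boolean first = redis.opsForValue()
        .setIfAbsent("idem:" + idempotencyKey, "1", Duration.ofHours(24));
    if (Boolean.TRUE.equals(first)) {
      sideEffect.run();   // first time this key is seen: execute the tool
      return true;
    }
    return false;         // replay: skip the side effect
  }
}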
Roadmap Ideas
- Session-specific tool permissions and scoped credentials
- Background agents for long-running tasks with progress events
- Advanced memory with knowledge graphs for durable facts
- Multi-provider abstraction with cost/latency-aware routing
Tags: AI chatbots, ChatGPT, LLM tooling, Next.js, React, Spring Boot, Redis, MongoDB, RAG, function calling, structured outputs, streaming UX, observability, prompt engineering, edge runtime
