
Beyond a Thin Wrapper: Building Production-Ready Chatbots on Top of ChatGPT

LLM providers keep shipping features that change how we build chat experiences—streaming responses, function/tool calling, structured outputs, JSON modes, and better moderation/filters. Frameworks and SDKs have matured for frontend streaming and backend orchestration, while vector tooling and caching continue to cut latency and cost. Despite this, many teams still ship a thin “prompt in, text out” wrapper and struggle with reliability, safety, and scale.

This post outlines a pragmatic architecture for going from a simple wrapper to a robust chatbot using:

  • Next.js for streaming UX and edge-friendly routing
  • Spring Boot for a typed, enforceable API boundary and orchestration
  • Redis for fast state/caching/rate limits
  • MongoDB for durable chat history and domain data (RAG)
  • Optional vector index for retrieval augmentation

What a Simple Wrapper Misses

  • No guarantees on structure (parsing brittle free text)
  • Context bloat and runaway token usage
  • Latency spikes from network + tool calls
  • Missing safety rails and PII handling
  • Poor observability: hard to debug failures and regressions
  • No state model for memory beyond a growing transcript
  • Rate limiting, retries, and backpressure handled ad hoc

Reference Architecture

  • Frontend (Next.js)
    • Stream assistant tokens to the client via SSE or fetch + ReadableStream
    • Route handlers protect API keys on the server side
    • UI shows partial tokens, tool progress, and cost/latency hints
  • API Gateway (Spring Boot)
    • One entrypoint for chat events: message, tool request, rating/feedback
    • Applies auth, quotas, idempotency, and request validation
    • Emits observability spans and structured logs
  • Orchestrator (can live inside the Spring service or as a microservice)
    • Builds prompts from the system prompt, policies, and retrieved context
    • Invokes the model with function/tool schemas for structured outputs
    • Handles retries, fallbacks, and content moderation
  • State & Data
    • Redis: conversation windows, semantic caches, rate limiting, dedupe
    • MongoDB: durable chat history, user profiles, domain docs for RAG
    • Vector index (optional): embeddings for retrieval augmentation
  • Observability & Ops
    • Traces for every LLM call and tool step
    • Token accounting (prompt vs. completion)
    • Evaluation harness for prompts and regressions

Minimal Streaming Flow (Frontend)

// Next.js route handler (server): POST /api/chat
// - Validates input
// - Calls the backend gateway
// - Streams tokens back to the client

export async function POST(req) {
  const { messages, sessionId } = await req.json();
  if (!Array.isArray(messages) || !sessionId) {
    return new Response("Invalid request", { status: 400 });
  }

  const response = await fetch(process.env.BACKEND_URL + "/chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ messages, sessionId, stream: true })
  });

  if (!response.ok || !response.body) {
    return new Response("Upstream error", { status: 502 });
  }

  // Pipe the gateway's SSE stream straight through to the browser
  return new Response(response.body, {
    headers: { "content-type": "text/event-stream" }
  });
}
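
On the client, the same stream can be consumed with fetch + ReadableStream. A minimal sketch, assuming the gateway emits plain "data: <token>" SSE frames; onToken is a hypothetical callback that appends text to the UI:

// Client: read the SSE stream from /api/chat and surface tokens as they arrive.
async function streamChat(
  messages: unknown[],
  sessionId: string,
  onToken: (t: string) => void
) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ messages, sessionId })
  });
  if (!res.ok || !res.body) throw new Error("chat request failed");

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // Naive framing: assumes frames don't split across chunks; production
    // code should buffer partial lines before parsing.
    for (const line of decoder.decode(value, { stream: true }).split("\n")) {
      if (line.startsWith("data: ")) onToken(line.slice(6));
    }
  }
}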

Backend Gateway Sketch (Spring)

// Receives a chat request, applies quotas, and streams tokens over SSE
@PostMapping(value = "/chat", produces = "text/event-stream")
public SseEmitter chat(@RequestBody ChatRequest req) {
  validate(req);
  enforceQuota(req.userId);
  var emitter = new SseEmitter(0L); // 0L = no timeout; completion is explicit
  orchestrator.stream(
      req,
      chunk -> {
        try { emitter.send(chunk); }            // send() throws checked IOException
        catch (IOException e) { emitter.completeWithError(e); }
      },
      emitter::completeWithError,
      emitter::complete);
  return emitter;
}

Prompt and Tool Strategy

  • System policy prompts: role, tone, compliance boundaries
  • Tool calling for deterministic operations (search, DB lookup, calculations)
  • Structured outputs for parsing safety (JSON schema, enums, number ranges)
  • Guardrails in the prompt for citations and refusal conditions
  • Few-shot examples kept small; prefer retrieval of fresh, relevant docs

Example tool contract:

{
  "name": "fetch_order_status",
  "description": "Get order status by id.",
  "parameters": {
    "type": "object",
    "properties": { "orderId": { "type": "string" } },
    "required": ["orderId"]
  }
}
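
Wiring that contract into a model call might look like the sketch below, using the OpenAI Node SDK (v4-style API). The model id is only an example, and fetchOrderStatus is a hypothetical local implementation of the tool:

import OpenAI from "openai";

// Hypothetical local implementation of the tool declared above
declare function fetchOrderStatus(args: { orderId: string }): Promise<unknown>;

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const tools: OpenAI.Chat.Completions.ChatCompletionTool[] = [{
  type: "function",
  function: {
    name: "fetch_order_status",
    description: "Get order status by id.",
    parameters: {
      type: "object",
      properties: { orderId: { type: "string" } },
      required: ["orderId"]
    }
  }
}];

async function answerWithTools(messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[]) {
  const first = await client.chat.completions.create({
    model: "gpt-4o-mini", // example model id
    messages,
    tools,
    tool_choice: "auto"
  });

  const msg = first.choices[0].message;
  if (!msg.tool_calls?.length) return msg.content;

  // Run each requested tool and feed results back for the final answer
  const toolResults = await Promise.all(msg.tool_calls.map(async (call) => ({
    role: "tool" as const,
    tool_call_id: call.id,
    content: JSON.stringify(await fetchOrderStatus(JSON.parse(call.function.arguments)))
  })));

  const second = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [...messages, msg, ...toolResults]
  });
  return second.choices[0].message.content;
}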

Retrieval-Augmented Generation (RAG) That Actually Helps

  • Chunk domain docs with overlap and stable IDs (see the sketch after this list)
  • Use a reliable embedding model; store vectors with metadata
  • Hybrid search (semantic + keyword) improves recall for numbers and codes
  • Send only top-k passages and cite them; keep the window tight
  • Cache hit responses in Redis with content/version keys
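
A minimal chunker for the first bullet; chunk size and overlap are illustrative and should be tuned per embedding model and domain:

import { createHash } from "node:crypto";

// Split a document into overlapping windows with stable, content-addressed IDs.
// Requires overlap < chunkSize, or the loop would not advance.
function chunkDocument(docId: string, text: string, chunkSize = 800, overlap = 200) {
  const chunks: { id: string; docId: string; start: number; text: string }[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    const slice = text.slice(start, start + chunkSize);
    chunks.push({
      // Stable ID: same doc + offset + content always hashes to the same key,
      // so re-ingestion does not duplicate vectors.
      id: createHash("sha256").update(`${docId}:${start}:${slice}`).digest("hex").slice(0, 16),
      docId,
      start,
      text: slice
    });
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}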

Data freshness:

  • Invalidate embeddings on document updates with versioned keys (sketched below)
  • Prefer on-demand embedding for rapidly changing content
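
One way to implement versioned cache keys, sketched with node-redis; the key layout and TTL are assumptions, not a prescribed scheme:

import { createClient } from "redis";
import { createHash } from "node:crypto";

const redis = createClient(); // call redis.connect() at startup

// Key combines a hash of the normalized question with the corpus version, so
// bumping the version on re-ingestion implicitly invalidates stale entries.
function answerCacheKey(question: string, corpusVersion: string) {
  const qh = createHash("sha256").update(question.trim().toLowerCase()).digest("hex").slice(0, 16);
  return `rag:answer:${corpusVersion}:${qh}`;
}

async function getOrComputeAnswer(
  question: string,
  corpusVersion: string,
  compute: () => Promise<string>
) {
  const key = answerCacheKey(question, corpusVersion);
  const hit = await redis.get(key);
  if (hit) return hit;
  const answer = await compute();
  await redis.set(key, answer, { EX: 60 * 60 }); // 1h TTL as a backstop
  return answer;
}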

Cost, Latency, and Reliability

  • Budget tokens per turn (hard cap) and summarize when nearing limits
  • Use streaming to improve perceived latency; prefetch likely tools
  • Cache expensive tool results and common system prompts
  • Retries: exponential backoff with jitter; classify transient vs. fatal (sketch below)
  • Fallbacks: smaller model or shorter context when deadlines loom
  • Batch embeddings and metadata lookups
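
A sketch of the retry policy from the list above, with exponential backoff and full jitter; the transient-vs-fatal classifier is deliberately simplistic:

// Retry only transient errors (429s and 5xx-style failures here, as an example).
function isTransient(err: unknown): boolean {
  const status = (err as { status?: number }).status;
  return status === 429 || (status !== undefined && status >= 500);
}

async function withRetries<T>(fn: () => Promise<T>, maxAttempts = 4, baseMs = 250): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !isTransient(err)) throw err; // fatal or exhausted
      const capMs = baseMs * 2 ** (attempt - 1);
      const sleepMs = Math.random() * capMs; // full jitter avoids retry stampedes
      await new Promise((r) => setTimeout(r, sleepMs));
    }
  }
}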

Memory: Short-Term vs. Long-Term

  • Short-term: windowed conversation state in Redis (n most recent turns; sketch below)
  • Long-term: episode summaries in MongoDB linked to session
  • Summarize on thresholds; store structured facts separately (key/value)
  • Forgetting policy: decay or pin critical facts
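
A short-term window in Redis might look like this sketch; WINDOW and SUMMARY_EVERY are illustrative, and summarizeEpisode is a hypothetical job that writes the episode summary to MongoDB:

import { createClient } from "redis";

const redis = createClient(); // call redis.connect() at startup

const WINDOW = 20;        // n most recent turns kept hot in Redis
const SUMMARY_EVERY = 40; // trigger an episode summary every 40 turns

// Append a turn, trim the hot window, and periodically trigger summarization.
async function recordTurn(sessionId: string, turn: { role: string; content: string }) {
  const key = `chat:${sessionId}:turns`;
  await redis.lPush(key, JSON.stringify(turn));
  await redis.lTrim(key, 0, WINDOW - 1);
  const count = await redis.incr(`chat:${sessionId}:turnCount`);
  if (count % SUMMARY_EVERY === 0) {
    // enqueue summarizeEpisode(sessionId) here
  }
}

async function recentTurns(sessionId: string) {
  const raw = await redis.lRange(`chat:${sessionId}:turns`, 0, WINDOW - 1);
  return raw.map((s) => JSON.parse(s)).reverse(); // oldest first for prompt assembly
}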

Safety and Compliance

  • Pre-call input filter: PII masking for logs, policy checks (sketch below)
  • Post-call output filter: toxicity, prompt leakage, and PII detection
  • Red-team test sets baked into CI
  • Signed audit logs for admin and data access tools
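
A minimal log-masking filter, assuming regex-level detection is acceptable as a first layer; the patterns are illustrative, not exhaustive, and production systems usually layer a dedicated detector on top:

// Mask common PII patterns before anything reaches the logs.
const PII_PATTERNS: [RegExp, string][] = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "<email>"],       // email addresses
  [/\b(?:\d[ -]?){13,16}\b/g, "<card>"],         // card-like digit runs
  [/\b\d{3}[- ]?\d{2}[- ]?\d{4}\b/g, "<ssn>"]    // US SSN-like patterns
];

function maskForLogs(text: string): string {
  return PII_PATTERNS.reduce((acc, [pattern, label]) => acc.replace(pattern, label), text);
}

// maskForLogs("reach me at jane@example.com") -> "reach me at <email>"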

Observability and Evaluation

  • Trace spans per step: prompt build, model call, each tool, retrieval
  • Log prompt, sampled outputs, tokens, cost, latency, and errors
  • Offline evaluation: correctness on fixtures, hallucination checks, safety tests
  • Online evaluation: thumbs, comments, task success, containment rate

Minimal event shape for tracing:

{
  "traceId": "...",
  "step": "llm.call",
  "model": "...",
  "tokens": { "prompt": 512, "completion": 178 },
  "latencyMs": 820,
  "cost": 0.0023,
  "status": "ok"
}
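
A thin wrapper can emit that event around every model call. A sketch, assuming OpenAI-style usage field names; logEvent stands in for whatever tracing sink you use, and cost computation is omitted:

// logEvent is a placeholder for your tracing sink (stdout, OTLP exporter, etc.)
declare function logEvent(event: Record<string, unknown>): void;

type WithUsage = { usage?: { prompt_tokens?: number; completion_tokens?: number } };

// Wrap a model call so every invocation emits the event shape above
async function tracedLlmCall<T extends WithUsage>(
  traceId: string,
  model: string,
  call: () => Promise<T>
): Promise<T> {
  const startedAt = Date.now();
  try {
    const result = await call();
    logEvent({
      traceId,
      step: "llm.call",
      model,
      tokens: {
        prompt: result.usage?.prompt_tokens ?? 0,
        completion: result.usage?.completion_tokens ?? 0
      },
      latencyMs: Date.now() - startedAt,
      status: "ok"
    });
    return result;
  } catch (err) {
    logEvent({ traceId, step: "llm.call", model, latencyMs: Date.now() - startedAt, status: "error" });
    throw err;
  }
}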

Deployment and Scaling

  • Frontend: edge runtime for token streaming; fall back to a regional runtime when tool calls are heavy
  • Backend: autoscale Spring instances; circuit breakers around LLM and vector services
  • Tooling: isolate side effects behind queues; idempotency keys for replays
  • Rate limiting: fixed-window + token bucket; per-user and per-API key (sketch below)
  • Secrets: server-only access; rotate keys and run with least-privilege credentials
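
The fixed-window half of that scheme is a few lines with Redis; the limits here are illustrative, and the token-bucket half is typically a small Lua script on top:

import { createClient } from "redis";

const redis = createClient(); // call redis.connect() at startup

// Fixed-window limiter: INCR a per-user counter, set the TTL on first hit.
async function allowRequest(userId: string, limit = 60, windowSec = 60): Promise<boolean> {
  const window = Math.floor(Date.now() / 1000 / windowSec);
  const key = `ratelimit:${userId}:${window}`;
  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, windowSec);
  return count <= limit;
}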

Implementation Blueprint

  1. Define conversation schema and message store (MongoDB) and a Redis namespace for sessions, rate limits, and caches (see the schema sketch after this list).
  2. Build a streaming route in Next.js; render partial tokens and tool progress.
  3. Add a Spring Boot gateway that normalizes requests, enforces quotas, and emits SSE.
  4. Introduce structured outputs and tool schemas for critical actions.
  5. Add RAG with a vector index; implement citations and caching.
  6. Instrument tracing, metrics, and token/cost logging.
  7. Ship a safety layer: input/output filters, red-team tests, content policy prompts.
  8. Optimize latency and cost with caching, batching, and fallbacks.
  9. Establish evaluation datasets and automate regression checks in CI.
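
A possible starting point for step 1; the field names and Redis key layout are assumptions, not a prescribed schema:

// One possible message document for MongoDB; ULIDs assumed for _id
interface ChatMessageDoc {
  _id: string;
  sessionId: string;                                // ties messages to a conversation
  role: "system" | "user" | "assistant" | "tool";
  content: string;
  toolCallId?: string;                              // set when role === "tool"
  tokens?: { prompt: number; completion: number };  // per-message accounting
  createdAt: Date;
}

// A matching Redis namespace for the same session:
//   chat:{sessionId}:turns       recent turns (list, trimmed to the window)
//   ratelimit:{userId}:{window}  request counters
//   rag:answer:{version}:{qh}    cached RAG answers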

Common Pitfalls

  • Treating the prompt as a monolith instead of modular policies
  • Pushing entire chat history each turn (token blowups)
  • No idempotency, causing duplicate tool execution
  • Logging raw PII and secrets
  • Ignoring non-200 responses and provider-specific error classes
  • Over-reliance on one provider without graceful fallback

Roadmap Ideas

  • Session-specific tool permissions and scoped credentials
  • Background agents for long-running tasks with progress events
  • Advanced memory with knowledge graphs for durable facts
  • Multi-provider abstraction with cost/latency-aware routing

Tags: AI chatbots, ChatGPT, LLM tooling, Next.js, React, Spring Boot, Redis, MongoDB, RAG, function calling, structured outputs, streaming UX, observability, prompt engineering, edge runtime
