How streaming LLM tokens to browsers enhances real-time AI report generation

Many AI applications deliver an entire LLM response at once, leaving users staring at a blank screen while it generates. Streaming tokens as they are produced, the way ChatGPT does, turns that static wait into a report that visibly writes itself, which builds credibility and keeps users engaged. The technical challenge lies in bridging the model’s streaming output with browser rendering while handling generation windows of up to 40 seconds, cancellations, and errors. Here’s how a production-ready setup in Next.js 15 handles it.

Why streaming tokens beats static responses

A full LLM report may take 15 to 40 seconds to generate. When the response arrives at once, users see nothing until completion—creating uncertainty and impatience. Streaming tokens as they’re produced turns that wait into a visible, interactive process. Each token appears in real time, mirroring the experience of conversational AI assistants. This subtle shift transforms user perception from frustration to transparency, without changing the underlying latency.

The implementation relies on Server-Sent Events (SSE), but with a twist. Instead of a few progress updates, the server forwards hundreds of text fragments directly from the model. The transport is the same—SSE over a fetch stream—but the source of data, parsing logic, and failure handling differ entirely from a simple progress bar.

Why `EventSource` isn’t enough for LLM streaming

At first glance, the browser’s built-in EventSource API seems ideal for SSE. It handles automatic reconnection and simplifies event listening. However, it only supports GET requests, while LLM streaming often requires POST with a request body containing prompts or model configurations. Using fetch with response.body.getReader() provides the flexibility to send custom headers (such as authentication tokens) and expose an AbortController for cancellation—capabilities EventSource lacks.

| Feature | EventSource | fetch + Reader |
|---|---|---|
| POST with body | No | Yes |
| Custom headers (e.g., auth) | No | Yes |
| Manual cancellation | Awkward | AbortController |
| Auto-reconnect | Yes | You must implement it |

For LLM requests, auto-reconnect is undesirable. A reconnect would restart generation, leading to duplicated API calls and unnecessary billing. Therefore, manual control over the stream is essential to avoid costly mistakes.
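
As a rough sketch of the `fetch` column in that table, the snippet below builds a POST request with a body, an auth header, and an `AbortController`, then reads the response chunk by chunk. The `/api/report` path follows from the route handler file used later in this article; the `Authorization` header, the `authToken` parameter, and both helper names are illustrative assumptions rather than part of the original setup.

```typescript
// Hypothetical helper type: fetch options bundled with a cancel function.
interface CancellableRequest {
  init: RequestInit;
  cancel: () => void;
}

function createReportRequest(
  contract: string,
  authToken: string, // assumption: bearer-token auth
): CancellableRequest {
  const controller = new AbortController();
  return {
    init: {
      method: "POST", // EventSource can only issue GET requests
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${authToken}`,
      },
      body: JSON.stringify({ contract }),
      signal: controller.signal,
    },
    // Aborting the signal rejects the pending fetch or read.
    cancel: () => controller.abort(),
  };
}

// Reads the response body chunk by chunk and hands decoded text to onChunk.
async function readReport(
  request: CancellableRequest,
  onChunk: (text: string) => void,
): Promise<void> {
  const response = await fetch("/api/report", request.init);
  if (!response.ok || !response.body) {
    throw new Error(`Report request failed: ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}
```

Wiring `cancel` to a "Stop" button gives the user manual control, and because nothing here reconnects automatically, a dropped connection never triggers a second, separately billed generation.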

Consuming the model’s stream: turning raw data into clean tokens

Both Ollama and Claude support streaming responses. Ollama exposes an OpenAI-compatible endpoint at `/v1/chat/completions`. With `stream: true`, it returns SSE-formatted lines like `data: {json}\n\n`, ending with `data: [DONE]`. The Next.js server therefore acts as both an SSE client (reading from the model) and an SSE server (writing to the browser).

The following helper converts the model’s HTTP stream into an async iterator of plain text tokens:

// lib/stream-model.ts
interface ChatChunk {
  choices: { delta: { content?: string } }[];
}

export async function* streamModel(
  prompt: string,
  contract: string,
  signal: AbortSignal,
): AsyncGenerator<string> {
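  const response = await fetch(
    // Assumption: the URL was lost from the original listing. This is Ollama's
    // default local address plus the /v1/chat/completions path named above.
    "http://localhost:11434/v1/chat/completions",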
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      signal,
      body: JSON.stringify({
        model: "qwen2.5-coder:7b",
        stream: true,
        messages: [
          { role: "system", content: prompt },
          { role: "user", content: contract },
        ],
      }),
    },
  );

  if (!response.ok || !response.body) {
    throw new Error(`Model request failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";

    for (const line of lines) {
      const trimmed = line.trim();
      if (!trimmed.startsWith("data: ")) continue;

      const payload = trimmed.slice(6);
      if (payload === "[DONE]") return;

      const chunk = JSON.parse(payload) as ChatChunk;
      const token = chunk.choices[0]?.delta?.content;
      if (token) yield token;
    }
  }
}

A key detail is buffer management. Network reads don’t respect message boundaries, so a single read() might end halfway through a data: line. Splitting on \n and carrying the last fragment over to the next read ensures JSON.parse only ever sees complete lines; a truncated object would otherwise throw and abort the stream.
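
To make the boundary problem concrete, here is a small standalone sketch of the same split-and-carry logic. The `feed` helper and the sample payload are hypothetical and exist only for illustration; they are not part of the helper above.

```typescript
// Hypothetical helper mirroring the buffer logic in streamModel:
// returns the complete lines plus the unfinished tail to carry forward.
function feed(
  buffer: string,
  chunk: string,
): { lines: string[]; rest: string } {
  const parts = (buffer + chunk).split("\n");
  const rest = parts.pop() ?? "";
  // Drop the blank separator lines that sit between SSE messages.
  return { lines: parts.filter((line) => line.trim() !== ""), rest };
}

// One SSE message arriving split across two network reads.
const first = feed("", 'data: {"choices":[{"delta":{"con');
const second = feed(first.rest, 'tent":"Hi"}}]}\n\n');

console.log(first.lines); // empty: no complete line has arrived yet
console.log(second.lines); // one complete data: line, now safe to parse
console.log(second.rest); // empty string: nothing left to carry over
```

Parsing `first.rest` directly would fail, because it ends in the middle of a JSON key; only the line produced by the second call is valid JSON after the `data: ` prefix is removed.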

Claude’s streaming format differs (using the Anthropic SDK with content_block_delta events), but the server’s role remains the same: converting an async stream of text deltas into a format the client can render immediately. Swapping the generator’s body is sufficient; the route handler and the client stay unchanged.
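
For orientation, here is a hedged sketch of what the swapped body might reduce to. It assumes Anthropic's documented event shape, where text arrives in `content_block_delta` events whose `delta` has type `text_delta`. The local `StreamEvent` type and both function names are illustrative stand-ins, and the SDK call that produces the events is deliberately left out.

```typescript
// Minimal local model of a stream event. A real stream also carries
// other event types (message_start, message_stop, and so on).
interface StreamEvent {
  type: string;
  delta?: { type: string; text?: string };
}

// Returns the text carried by an event, or null for anything else.
function textFromEvent(event: StreamEvent): string | null {
  if (event.type !== "content_block_delta") return null;
  if (event.delta?.type !== "text_delta") return null;
  return event.delta?.text ?? null;
}

// Adapts any async iterable of events to the same AsyncGenerator<string>
// contract that streamModel exposes, skipping empty and non-text events.
async function* streamClaudeText(
  events: AsyncIterable<StreamEvent>,
): AsyncGenerator<string> {
  for await (const event of events) {
    const text = textFromEvent(event);
    if (text) yield text;
  }
}
```

Because the output is still a stream of plain strings, nothing downstream needs to know which model produced the tokens.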

Building a streaming route handler in Next.js 15

The final step wraps the token generator in a ReadableStream and exposes it via a route handler. Each token becomes a distinct SSE event, enabling the client to render it the moment it arrives.

// app/api/report/route.ts
import { NextRequest } from "next/server";
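import { streamModel } from "@/lib/stream-model";

// Assumption: SYSTEM_PROMPT is not defined in the original listing.
// This placeholder stands in for the real prompt.
const SYSTEM_PROMPT = "You are a contract analyst. Write a structured report.";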

export const runtime = "nodejs";
export const maxDuration = 60;

interface TokenEvent {
  type: "token";
  text: string;
}

interface DoneEvent {
  type: "done";
}

interface ErrorEvent {
  type: "error";
  message: string;
}

type ReportEvent = TokenEvent | DoneEvent | ErrorEvent;

export async function POST(request: NextRequest) {
  const { contract } = (await request.json()) as { contract: string };

  if (!contract?.trim()) {
    return Response.json({ error: "No contract" }, { status: 400 });
  }

  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      function send(event: ReportEvent) {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify(event)}\n\n`),
        );
      }

      try {
        for await (const token of streamModel(
          SYSTEM_PROMPT,
          contract,
          request.signal,
        )) {
          send({ type: "token", text: token });
        }
        send({ type: "done" });
      } catch (err) {
        if (request.signal.aborted) return; // User navigated away
        const message = err instanceof Error ? err.message : "Generation failed";
        send({ type: "error", message });
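      } finally {
        // close() throws if the consumer has already cancelled the stream
        try {
          controller.close();
        } catch {
          // Stream already closed; nothing left to do
        }
      }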
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      Connection: "keep-alive",
      "X-Accel-Buffering": "no",
    },
  });
}

Three design choices ensure reliability:

`request.signal` propagates end-to-end. The browser’s abort triggers request.signal, which travels through Next.js into the model’s fetch, allowing clean cancellation without orphaned processes.
Explicit event types (`token`, `done`, `error`) simplify client parsing. The client receives structured messages instead of raw text, making UI updates predictable and debuggable.
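Headers like `X-Accel-Buffering: no` prevent buffering. That header tells nginx-style reverse proxies not to hold the response back, and `Cache-Control: no-transform` stops intermediaries from compressing or rewriting it, so each chunk reaches the browser as soon as it is written. This is critical for real-time rendering.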
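
On the browser side, those typed events might be consumed roughly as follows. This is a sketch under assumptions: the `ReportState` shape, `parseEvents`, and `applyEvent` are hypothetical names, and only the `data: {json}\n\n` framing and the three event types come from the route handler above.

```typescript
type ReportEvent =
  | { type: "token"; text: string }
  | { type: "done" }
  | { type: "error"; message: string };

// Hypothetical UI state for the report view.
interface ReportState {
  text: string;
  status: "streaming" | "done" | "error";
  error?: string;
}

// Splits buffered SSE text into complete events. The unfinished tail is
// returned as rest and should be prepended to the next chunk.
function parseEvents(buffer: string): { events: ReportEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? "";
  const events: ReportEvent[] = [];
  for (const frame of frames) {
    if (!frame.startsWith("data: ")) continue;
    events.push(JSON.parse(frame.slice(6)) as ReportEvent);
  }
  return { events, rest };
}

// Pure reducer: folds one event into the current UI state.
function applyEvent(state: ReportState, event: ReportEvent): ReportState {
  switch (event.type) {
    case "token":
      return { ...state, text: state.text + event.text };
    case "done":
      return { ...state, status: "done" };
    case "error":
      return { ...state, status: "error", error: event.message };
  }
}

// Usage: two complete events followed by a partial third one.
const wire =
  'data: {"type":"token","text":"Clause 1 "}\n\n' +
  'data: {"type":"token","text":"looks fine."}\n\n' +
  'data: {"type":"do';
const initialState: ReportState = { text: "", status: "streaming" };
const parsedWire = parseEvents(wire);
const reportState = parsedWire.events.reduce(applyEvent, initialState);
```

Keeping the reducer pure makes it easy to plug into React state or any other store, and the `rest` value plays the same carry-over role as the line buffer on the server.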

The future of real-time AI streaming

As AI models become faster and more interactive, streaming token delivery will move from a novelty to a standard. The technical patterns here (async generators, SSE, and signal propagation) will remain foundational. Developers who master these integrations can deliver AI experiences that feel responsive, even when generation takes tens of seconds. The next step? Combining streaming with client-side caching and partial state preservation to create resilient, low-latency AI interfaces that work reliably across unreliable networks.
