Don't Make Them Wait: Improving AI UX with Streaming Thoughts
ai best-practices product web

Long LLM inference times can frustrate users. Learn how to use Operational Transparency and Firebase AI Logic to stream "thinking" steps, turning the black box into a glass box and keeping users engaged.

Waiting is hard, especially waiting for inference to finish from a large LLM. Google has spoiled us with years of near-instantaneous search results, so when we move to a very large model that takes time to process a request, the wait can feel even more painful. Sometimes I find myself wondering whether the application has hung and entered an error state or whether it is actually still processing. When a screen freezes, users assume it's broken. When a screen shows activity, users assume it's working. We need to turn the LLM black box into a glass box.

The studies

Research at transit stops in London (TfL) has shown that posting a countdown timetable helps riders feel less anxious while waiting for the bus to arrive. Anecdotally, regular commuters may already know the bus is late, but confirming the delay reduces the cognitive load of uncertainty. This operational transparency lends the system credibility and makes it feel more reliable, even if the underlying bus service hasn't actually improved. We can do the same thing with LLMs that take a while to respond by surfacing the thought summaries they emit while processing. The results don't arrive any sooner, but seeing that the system is working through the problem gives the user peace of mind about what is happening while they wait for inference to complete.

Adding thoughts as status updates

Let's start by using Firebase AI Logic to get these status updates. We initialize Firebase AI Logic as we normally would and then get a reference to our generative model.
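If you haven't set up Firebase AI Logic yet, the usual web setup only takes a few lines. Here is a minimal sketch, assuming the Gemini Developer API backend; the config values are placeholders for your own Firebase project.

import { initializeApp } from "firebase/app";
import { getAI, getGenerativeModel, GoogleAIBackend } from "firebase/ai";

// Placeholder config values for your own Firebase project.
const firebaseApp = initializeApp({
  apiKey: "...",
  authDomain: "...",
  projectId: "...",
  appId: "...",
});

// Create the Firebase AI Logic instance, here backed by the Gemini Developer API.
const ai = getAI(firebaseApp, { backend: new GoogleAIBackend() });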

Include Thoughts

const model = getGenerativeModel(ai, { 
  model: "gemini-3-flash-preview",
  tools: [{ googleSearch: {} }],
  generationConfig: {
    thinkingConfig: {
      includeThoughts: true, // Return summarized "thinking" alongside the answer
      thinkingBudget: -1 // Dynamic thinking: let the model decide how much to think
    }
  }
});

Here you can see that in our generationConfig we set a thinkingConfig and ask the model to include its thoughts. If we don't set includeThoughts, we won't be able to provide status updates, because the first thing we receive from the model will be the generated output itself.

Stream the thoughts

import type { AI } from "firebase/ai";

/** Structured update yielded to the UI layer: a thought summary or response text. */
export interface StreamUpdate {
  type: "thought" | "text";
  content: string;
  currentStep?: string;
}

/**
 * Async generator that yields structured updates for thoughts and text.
 */
export async function* streamWithThoughts(
  ai: AI,
  prompt: string
): AsyncGenerator<StreamUpdate> {
  // getReasoningModel wraps the getGenerativeModel call shown above
  // (the model configured with includeThoughts).
  const model = getReasoningModel(ai);
  const result = await model.generateContentStream(prompt);

  let accumulatedThoughts = "";

  for await (const chunk of result.stream) {
    // 1. Process Thoughts
    const thought = chunk.thoughtSummary?.();
    if (thought) {
      accumulatedThoughts += thought;

      // Extract the most recent thought title wrapped in ** **
      const matches = [...accumulatedThoughts.matchAll(/\*\*(.*?)\*\*/g)];
      const lastStep =
        matches.length > 0 ? matches[matches.length - 1][1] : undefined;

      yield {
        type: "thought",
        content: thought,
        currentStep: lastStep,
      };
    }

    // 2. Process Actual Response Text
    const text = chunk.text();
    if (text) {
      yield {
        type: "text",
        content: text,
      };
    }
  }
}

Now we use a generator function to stream the thoughts. As each chunk arrives, we hand a structured update back to the caller with yield, so the calling code receives a steady flow of thought and text updates instead of waiting for the full response.

In this snippet we use a regex to extract the bolded titles from the thought stream. The model typically follows this pattern today, but it may change in the future; it is the pattern I have observed with Gemini models, and it may vary depending on the model provider you are using.
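For example, if the accumulated thoughts contained the text below (invented for illustration, not real model output), the regex would surface "Checking live arrival data" as the current step:

// Illustrative only: this thought text is made up, not real model output.
const accumulated =
  "**Checking live arrival data**\nI should confirm the delay against the TfL feed before answering...";

const matches = [...accumulated.matchAll(/\*\*(.*?)\*\*/g)];
const lastStep = matches.length > 0 ? matches[matches.length - 1][1] : undefined;

console.log(lastStep); // "Checking live arrival data"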

Update the UI

import DOMPurify from "dompurify";
// `AI` and `streamWithThoughts` come from the snippets above.

/**
 * Example usage: updating the DOM with the stream
 */
export async function updateUIWithStream(
  ai: AI,
  prompt: string,
  uiElements: {
    currentThoughtEl: HTMLElement;
    thoughtsHistoryEl: HTMLElement;
    textEl: HTMLElement;
    headerEl: HTMLElement;
  }
) {
  const { currentThoughtEl, thoughtsHistoryEl, textEl, headerEl } = uiElements;
  let assistantThoughts = "";
  let assistantContent = "";

  const stream = streamWithThoughts(ai, prompt);

  for await (const chunk of stream) {
    if (chunk.type === "thought") {
      assistantThoughts += chunk.content;

      if (chunk.currentStep) {
        const currentText = currentThoughtEl.textContent;
        const newText = "Thinking: " + chunk.currentStep;

        if (currentText !== newText) {
          // Wrap in a span to allow CSS animations
          const span = `<span class="thought-text">${newText}</span>`;
          currentThoughtEl.innerHTML = DOMPurify.sanitize(span);
          headerEl.style.display = "flex";
        }
      }

      // Update the full thoughts history
      // (you'd typically use a markdown parser here)
      thoughtsHistoryEl.innerHTML = DOMPurify.sanitize(assistantThoughts);
    } else {
      assistantContent += chunk.content;
      // Update the main response text
      // (you'd typically use a markdown parser here)
      textEl.innerHTML = DOMPurify.sanitize(assistantContent);
    }
  }
}

Finally, we consume the stream and update the UI with TypeScript. In this contrived example I set innerHTML only after passing the content through DOMPurify, so if the model accidentally outputs JavaScript it shouldn't execute, limiting my exposure to cross-site scripting (XSS) attacks.
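Wiring it up is then just a matter of grabbing the elements and calling the function from an async context. The element IDs and the prompt below are hypothetical; use whatever your page and use case define.

// Hypothetical element IDs; adjust to match your own markup.
const uiElements = {
  currentThoughtEl: document.getElementById("current-thought") as HTMLElement,
  thoughtsHistoryEl: document.getElementById("thoughts-history") as HTMLElement,
  textEl: document.getElementById("response-text") as HTMLElement,
  headerEl: document.getElementById("thinking-header") as HTMLElement,
};

await updateUIWithStream(ai, "Plan a transit-friendly day trip from London", uiElements);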

Final result

Here is the final result! We can use CSS to animate each new thought as it arrives, providing timely status updates to the user. Just like the countdown clock at the bus stop, these thought bubbles show that the system is making progress, transforming the waiting experience from anxious to engaging and giving the user operational transparency and confidence in what the model is doing while they await the results.
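The animation itself can be a simple CSS rule on the .thought-text span; as an alternative sketch kept in TypeScript, the same fade-in can be done with the Web Animations API. The helper below is hypothetical and would be called right after currentThoughtEl.innerHTML is updated in the loop above.

// Hypothetical helper: fade and slide the freshly inserted thought title into view.
// Call it right after `currentThoughtEl.innerHTML` is set in updateUIWithStream.
function animateCurrentThought(container: HTMLElement): void {
  const span = container.querySelector<HTMLElement>(".thought-text");
  if (!span) return;

  span.animate(
    [
      { opacity: 0, transform: "translateY(4px)" },
      { opacity: 1, transform: "translateY(0)" },
    ],
    { duration: 250, easing: "ease-out" }
  );
}

Whichever approach you choose, the goal is the same: a small, continuously updating signal that the model is still making progress.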
