The Native Module That Never Says "Done": Coordinating Async Events, Refs, and React State in a Streaming LLM Hook

A Native Module With No “Done” Signal

Integrating a streaming LLM module into a React hook sounds straightforward until you read the documentation and discover the module never fires a “generation complete” event. It fires one event per token produced and then goes quiet. The event stream alone never tells your hook that generation is finished, so completion has to be detected some other way.

That’s one problem. But there are four more. The hook must:

Accumulate streaming tokens into a growing response string
Support user-initiated cancellation mid-stream
Clean up native event listeners on every possible exit path — completion, cancellation, error, and component unmount
Surface the right React state (generating, idle, error) without triggering redundant re-renders during the streaming hot path

Get any of these wrong and you either leak native memory, leave the UI stuck in a generating state forever, or flood the React reconciler with hundreds of state updates per second.

hooks/useLLM.ts is the hook that drives every LLM interaction in the app. It bypasses the SDK-provided hook entirely and talks directly to the ExpoLlmMediapipe native module. Working through it reveals when to reach for useRef instead of useState, how to write listener cleanup that’s leak-proof across all control flow paths, and how to reason about React’s rendering model when integrating with non-React async systems.

What You Must Already Know: Refs, the Event Loop, and Native Bridge Concepts

Knowledge Base:

useRef vs useState — when each triggers re-renders and why that matters
useCallback and useMemo for stable function references
JavaScript’s event loop, microtask queue, and macrotask queue
The React Native native module event emitter pattern (addListener / remove)
AbortController and the Fetch/async cancellation API
useEffect cleanup functions

Environment (from package.json):

react                       19.1.0
react-native                0.81.5
expo-llm-mediapipe          ^0.6.0
react-native-reanimated     ~4.1.0
typescript                  ~5.9.2

The Hook’s Five States, and Why the Transitions Matter

Think of the hook as an air traffic controller managing a single runway. The model handle (modelHandleRef) is the runway — once acquired, it’s reused for every inference request. Each request gets a unique flight number (requestId). Token events from the native module come over the radio (addListener). When a request finishes (Promise resolves), the controller tears down that request’s radio channel and marks the runway available again. If the tower is destroyed (component unmounts), all active channels are cut regardless of whether a request is in progress.

The states cycle through: idle → generating → back to idle on completion or cancellation, or into error on failure. The initialize flow runs once per model load. Every state transition has a corresponding listener cleanup obligation.
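
The transitions described above can be written down as a small lookup table. This is only a sketch: the three state names come from the prose, while the event names, the SketchState type, and the nextState helper are hypothetical and not part of the hook. Allowing a new request to start from the error state is also an assumption.

```typescript
// Hypothetical sketch: state names follow the article, event names are assumptions.
type SketchState = 'idle' | 'generating' | 'error';
type SketchEvent = 'start' | 'complete' | 'cancel' | 'fail';

const transitions: Record<SketchState, Partial<Record<SketchEvent, SketchState>>> = {
  idle: { start: 'generating' },
  generating: { complete: 'idle', cancel: 'idle', fail: 'error' },
  // Assumption: a new request may be started after a failure.
  error: { start: 'generating' },
};

// Returns the next state, or the current state if the event is not valid here.
function nextState(current: SketchState, event: SketchEvent): SketchState {
  return transitions[current][event] ?? current;
}
```

Every entry that leaves generating is a point where listeners have to be removed, which makes the table a convenient checklist for the cleanup paths.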

The Five Decisions That Prevent the Bugs

Refs for Mutable Values, State for Render-Triggering Values

The hook opens with a deliberate split:

export function useLLM(): UseLLMReturn {
  // STATE: These values drive UI re-renders — they MUST be state
  const [inferenceState, setInferenceState] = useState<InferenceState>('idle');
  const [error, setError] = useState<string | null>(null);
  const [isReady, setIsReady] = useState(false);
  
  // REFS: These values are read and written by async callbacks but are never
  // rendered. Keeping them in state would schedule a re-render on every write
  // and leave long-lived listeners reading the value captured at creation time.
  const modelHandleRef = useRef<number | null>(null);
  const requestIdCounterRef = useRef(0);
  const abortControllerRef = useRef<AbortController | null>(null);
  const partialListenerRef = useRef<NativeModuleSubscription | null>(null);
  const errorListenerRef = useRef<NativeModuleSubscription | null>(null);

During streaming, the native module fires an onPartialResponse event for each token; at 30 tokens/second, that’s 30 listener calls per second. The listener reads modelHandleRef and abortControllerRef on every one of those calls, and it must see their current values, not the values from the render in which the listener was created. State cannot provide that: a callback closes over the state value from a single render, and every write to state schedules a re-render for a value the UI doesn’t even display. useRef stores a mutable box whose .current value can be read by any closure, including async ones and event listeners, and assigning to it never notifies React.
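
The difference between a captured value and a ref read can be shown without React at all. In this minimal sketch a plain object with a current field stands in for what useRef returns, and demoStaleClosure is a hypothetical helper written only for illustration.

```typescript
// A ref is a stable object with a mutable `current` field.
interface SketchRef<T> { current: T }

function demoStaleClosure(): { fromSnapshot: number | null; fromRef: number | null } {
  const handleRef: SketchRef<number | null> = { current: 1 };
  // Snapshot taken when the listener is created, like a state value
  // captured by a closure during one particular render.
  const snapshot = handleRef.current;

  const listener = () => ({ fromSnapshot: snapshot, fromRef: handleRef.current });

  // Later, code outside the listener swaps the handle.
  handleRef.current = 2;

  // The listener still sees the old snapshot but the new ref value.
  return listener();
}
```

The snapshot stays at its original value for the lifetime of the listener, while the ref read follows every later assignment.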

The Unmount Safety Net

Before any generation logic, the hook installs a cleanup effect that runs when the component unmounts:

useEffect(() => {
  return () => {
    if (modelHandleRef.current !== null) {
      // releaseModel is async. A cleanup function must be synchronous, so
      // fire-and-forget here and log any failure.
      ExpoLlmMediapipe.releaseModel(modelHandleRef.current).catch(err => 
        console.error('[useLLM] Error releasing model:', err)
      );
      // Clear the handle so nothing can reuse a released model.
      modelHandleRef.current = null;
    }
    // Remove event listeners. They hold references to JS closures that in
    // turn hold references to this hook's scope. Without removal, the
    // garbage collector cannot reclaim any of it.
    partialListenerRef.current?.remove();
    partialListenerRef.current = null;
    errorListenerRef.current?.remove();
    errorListenerRef.current = null;
  };
}, []); // Empty deps: run cleanup only on unmount
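
Because the same removal logic is needed on several exit paths, it can help to make it idempotent. The helper below is a sketch, not code from the hook: SketchSubscription mirrors the remove() shape used above, and removeListener is a hypothetical name.

```typescript
interface SketchSubscription { remove(): void }
interface ListenerRef { current: SketchSubscription | null }

// Removes the subscription held by the ref, if any, and clears the ref first
// so a second call from another exit path is a harmless no-op.
function removeListener(ref: ListenerRef): void {
  const subscription = ref.current;
  ref.current = null;
  subscription?.remove();
}
```

With a helper like this, the unmount cleanup, the success path, and the catch block can all call the same function without checking whether another path got there first.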

Request IDs as Event Filters

The native module is a global event emitter — every onPartialResponse event goes to every listener in the app. Without filtering, a second inference request could receive events from a previous one:

const generateResponse = useCallback(async (messages, options, onToken) => {
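  // Monotonically increasing counter stored in a ref, so no re-render on increment
  const requestId = ++requestIdCounterRef.current;
  // Accumulator for the streamed text: a plain closure variable, not React state
  let fullResponse = '';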

  partialListenerRef.current = ExpoLlmMediapipe.addListener(
    'onPartialResponse',
    (event: PartialResponseEventPayload) => {
      // DOUBLE FILTER: request ID must match AND model handle must match.
      // The handle check guards against a race where the hook re-initializes
      // with a new model before the old request's events have all fired.
      if (event.requestId !== requestId || event.handle !== modelHandleRef.current) return;

      // Abort check: if stopGeneration() was called, the signal is already
      // aborted. We still receive events until the native side processes the
      // cancellation — silently discard them.
      if (abortControllerRef.current?.signal.aborted) return;

      // Accumulate chunk into the full response string (in closure scope)
      fullResponse += event.response;
      
      // Pass the CHUNK (not the full accumulated string) to the UI callback.
      onToken(event.response);
    }
  );
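
The filtering behaviour is easy to check in isolation. The sketch below uses a hypothetical FakeEmitter in place of the native module; only the requestId, handle, and response field names are taken from the listener above.

```typescript
type TokenEvent = { requestId: number; handle: number; response: string };
type TokenListener = (event: TokenEvent) => void;

// Minimal stand-in for a global emitter: every event reaches every listener.
class FakeEmitter {
  private listeners = new Set<TokenListener>();

  addListener(listener: TokenListener): { remove(): void } {
    this.listeners.add(listener);
    return { remove: () => { this.listeners.delete(listener); } };
  }

  emit(event: TokenEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}

// Collects only the chunks that belong to one request on one model handle.
function collectFor(emitter: FakeEmitter, requestId: number, handle: number) {
  let text = '';
  const subscription = emitter.addListener((event) => {
    if (event.requestId !== requestId || event.handle !== handle) return;
    text += event.response;
  });
  return { read: () => text, stop: () => subscription.remove() };
}
```

Two collectors attached to the same emitter each see every event, but each one keeps only the chunks tagged with its own request ID.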

Awaiting the Promise That Resolves When the Native Side Completes

Unlike a silence-heuristic approach, the hook relies on generateResponseAsync itself resolving when generation is finished:

  // This await suspends generateResponse until the native module signals
  // completion; it does not block the JS thread. While the function is
  // suspended, the event loop keeps delivering token events to the listener
  // above, so accumulation and the pending promise proceed side by side.
  await ExpoLlmMediapipe.generateResponseAsync(
    modelHandleRef.current,
    requestId,
    prompt
  );

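  // Execution resumes here once the native call resolves. The hook assumes
  // every token event has been delivered by then, so fullResponse is complete.

  // Remove BOTH listeners so this path matches the catch block.
  if (partialListenerRef.current) {
    partialListenerRef.current.remove();
    partialListenerRef.current = null;
  }
  if (errorListenerRef.current) {
    errorListenerRef.current.remove();
    errorListenerRef.current = null;
  }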

  setInferenceState('idle');
  return fullResponse;
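
The interplay between the suspended function and the listener can be reproduced with timers. This is a sketch under stated assumptions: fakeGenerate is a hypothetical stand-in for the native call that delivers each chunk in its own timer task and resolves only after the last one, which is the ordering the hook relies on.

```typescript
type OnChunk = (chunk: string) => void;

// Hypothetical stand-in for the native call: one chunk per timer task,
// then resolve after the final chunk has been delivered.
function fakeGenerate(chunks: string[], onChunk: OnChunk): Promise<void> {
  return new Promise<void>((resolve) => {
    let index = 0;
    const step = () => {
      if (index === chunks.length) { resolve(); return; }
      onChunk(chunks[index++]);
      setTimeout(step, 0);
    };
    setTimeout(step, 0);
  });
}

async function collectStream(chunks: string[]): Promise<string> {
  let full = '';
  // The function is suspended at this await while the callback keeps running.
  await fakeGenerate(chunks, (chunk) => { full += chunk; });
  return full;
}
```

If the real module ever resolved its promise before the final event arrived, the returned string would be missing its tail, so that ordering is worth confirming against the module’s behaviour.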

The Cleanup-on-Error Guarantee

The catch block must mirror the happy-path cleanup exactly. The most common source of listener leaks is a cleanup path that only runs on success:

  } catch (err) {
    // This block runs for: network errors, OOM, model corruption, timeouts.
    // ALL of them must clean up listeners.
    if (partialListenerRef.current) {
      partialListenerRef.current.remove();
      partialListenerRef.current = null;
    }
    if (errorListenerRef.current) {
      errorListenerRef.current.remove();
      errorListenerRef.current = null;
    }
    
    setError(err instanceof Error ? err.message : 'Failed to generate response');
    setInferenceState('error');
    throw err; // Re-throw so the caller (useChat) can handle it
  }
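
One way to make the symmetry structural instead of relying on two hand-copied blocks is a finally clause. The sketch below is an alternative shape, not the hook’s current code; withSubscriptions and FinallySubscription are hypothetical names.

```typescript
interface FinallySubscription { remove(): void }

// Runs `work` and removes every subscription afterwards, whether `work`
// resolves or rejects. A rejection still propagates to the caller.
async function withSubscriptions<T>(
  subscriptions: FinallySubscription[],
  work: () => Promise<T>,
): Promise<T> {
  try {
    return await work();
  } finally {
    for (const subscription of subscriptions) subscription.remove();
  }
}
```

State updates such as setError and setInferenceState('error') would still belong in a catch block; only the listener removal moves into finally.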

Why 30 Tokens Per Second Doesn’t Jank the UI

When generateResponseAsync is awaited, the JavaScript engine suspends the generateResponse function and yields back to the event loop. Each native token event is then delivered to the JS thread as its own task. The event loop processes them one at a time, calling the onPartialResponse callback for each.

Inside the callback, fullResponse += event.response updates a closure variable, not React state. No re-render is scheduled. Only onToken(event.response) crosses back into the caller’s concern — and in useChat, that callback calls setStreamingMessage(prev => prev + token), which does call setState.

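React’s automatic batching (introduced in React 18 and still in effect in React 19) merges multiple setState calls made within the same task into a single re-render. Token events that arrive in separate tasks can still each produce their own render, so batching alone does not explain the smoothness. What keeps streaming responsive is that the per-token work is small: one string append in the listener and one functional state update in useChat, about 30 times per second.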
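
If per-token renders ever do show up in a profile, one option is to coalesce chunks before they reach setState. The following is a sketch of that idea and not something the hook does today; createTokenBuffer and its injectable schedule parameter are hypothetical.

```typescript
type Flush = (text: string) => void;
type Schedule = (run: () => void) => void;

// Buffers chunks and hands them to `flush` at most once per scheduled tick.
function createTokenBuffer(flush: Flush, schedule: Schedule): (chunk: string) => void {
  let pending = '';
  let scheduled = false;
  return (chunk: string): void => {
    pending += chunk;
    if (scheduled) return; // a flush is already queued for this tick
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const text = pending;
      pending = '';
      flush(text);
    });
  };
}
```

In the app, flush could wrap setStreamingMessage(prev => prev + text) and schedule could be (run) => requestAnimationFrame(run), which would cap updates at one per frame. A pending buffer would need to be flushed or discarded when the request ends.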

The event listener closure captures requestId, onToken, and the fullResponse mutable variable. As long as the listener is attached, none of these can be garbage collected. For a 1000-token response, fullResponse grows to roughly 4-6 KB — trivial. The real risk is the listener staying attached indefinitely if cleanup is missed, keeping that entire closure scope alive.

Four Ways This Can Break

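The “ghost listener” race condition. If generateResponse is called again before the first call’s cleanup runs, two listeners for different requestId values are active at once. The requestId filter prevents cross-contamination, but the second call overwrites partialListenerRef.current, so the hook loses its only reference to the first subscription. When the first call finishes, its cleanup removes the second call’s listener instead, and the first listener stays attached. A defensive implementation would reject re-entrant calls at the top of generateResponse, as LLMService does with its isGenerating boolean guard. Such a guard is better driven by a ref than by inferenceState, because a memoized callback may see a stale state value. The hook doesn’t currently have this guard.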
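
A guard of this kind can be sketched as a wrapper around any async function. The names guardSingleFlight and BusyRef are hypothetical, and the error message simply reuses the one suggested above.

```typescript
interface BusyRef { current: boolean }

// Wraps an async function so that a second call made while one is still in
// flight rejects immediately instead of starting overlapping work.
function guardSingleFlight<A extends unknown[], R>(
  busyRef: BusyRef,
  fn: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    if (busyRef.current) throw new Error('Already generating');
    busyRef.current = true;
    try {
      return await fn(...args);
    } finally {
      // Reset on success and on failure so the hook never stays locked.
      busyRef.current = false;
    }
  };
}
```

The flag is set synchronously before the first await, so two calls made back to back in the same tick cannot both pass the check.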

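Component unmount during active generation. If the component unmounts while await generateResponseAsync is suspended, the useEffect cleanup runs, releases the model, and removes the listeners. The native module may keep generating, but its events no longer reach any listener. In React’s StrictMode (development only), React mounts the component, runs the effect cleanup, and mounts it again, so the unmount cleanup fires once before any real unmount. Refs survive that cycle, so the cleanup must leave them in a state the remounted hook can work with, for example by not leaving a released handle in modelHandleRef.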

Native module handle mismatch. If initialize() is called again while a generation is in progress, modelHandleRef.current is overwritten with a new handle number. The existing listener’s double-filter (event.handle !== modelHandleRef.current) then rejects every remaining event from the in-flight request, so fullResponse stops growing and the caller receives a truncated or empty response once the native call settles. Always call cleanup() before calling initialize() again.

AbortController without native-side cancellation. stopGeneration() sets the abort signal, causing the listener to silently discard future events. The native module is not told to stop — it keeps generating and firing events (which are now discarded). This wastes battery and CPU until the native module naturally completes. A production implementation would call a native cancelGeneration(requestId) method if the module exposes one.
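
If a native cancel call is available, the abort signal can drive it directly. In this sketch cancelNative is a hypothetical callback standing in for something like cancelGeneration(requestId); nothing in the source confirms that the module exposes such a method.

```typescript
// Forwards an abort to a native-side cancel call. `cancelNative` is a
// hypothetical callback; the real module may not offer an equivalent.
function linkAbortToNative(
  signal: AbortSignal,
  requestId: number,
  cancelNative: (requestId: number) => void,
): () => void {
  const onAbort = () => cancelNative(requestId);
  if (signal.aborted) {
    // Already aborted before we were linked: cancel right away.
    onAbort();
    return () => {};
  }
  signal.addEventListener('abort', onAbort, { once: true });
  // Call the returned function on normal completion to detach the handler.
  return () => signal.removeEventListener('abort', onAbort);
}
```

The returned unlink function belongs in the same cleanup paths as the listener removal, so a finished request does not leave an abort handler attached to the signal.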

The pattern here — refs for mutable state the async hot path needs, state for values that drive rendering, per-request IDs as event filters, and symmetric cleanup in every exit path — applies directly to any React Native integration with a streaming native module: audio processing, video frame pipelines, Bluetooth device events, or any native system that produces a stream of events without an explicit termination signal.