The Case for Running AI Locally
Most AI-powered wellness apps treat sensitive user data as an acceptable trade-off for capability. Your moments of anxiety, grief, or vulnerability get transmitted to a remote server, logged, potentially used for training, and exposed to data breaches you’ll never hear about. For users navigating genuine emotional difficulty, that trade-off is corrosive to trust in a way that’s hard to articulate but easy to feel.
There’s also the availability problem. The moments when someone most needs a grounding exercise — on a flight, in a rural area, during a network outage — are exactly the moments a cloud-dependent app goes dark. I built this app to be both private-by-design and always-available. That meant running the entire AI pipeline on the device.
What the App Actually Does
Users hold multi-session conversations with a coaching persona grounded in both Buddhist and Stoic traditions. Each conversation is stored locally and retrievable from a history view, with auto-generated titles and session previews so past contexts are never lost.
The coach’s replies stream word-by-word in real time, which mimics the natural cadence of conversation and dramatically reduces the perceived latency on a resource-constrained device. The prompt engine adjusts the coach’s posture based on detected user needs: topic emphasis (anxiety, focus, grief, relationships), time-of-day context, and the user’s self-reported emotional state. Pre-built coaching prompts let users immediately launch structured exercises — breathing practice, reflection prompts, body scans — without knowing how to phrase a request.
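A minimal sketch of how a prompt engine like this could combine those three signals into a system prompt. Every name here (`PromptContext`, `TOPIC_GUIDANCE`, `timeOfDayHint`, `buildSystemPrompt`), the guidance strings, and the hour boundaries are assumptions made for illustration, not the app's actual code.

```typescript
// Illustrative sketch only: every type, constant, and function name below is
// an assumption made for this example, not the app's actual code.
type Topic = "anxiety" | "focus" | "grief" | "relationships";

interface PromptContext {
  hour: number;  // local hour, 0-23
  topic?: Topic; // detected topic emphasis, if any
  mood?: string; // the user's self-reported emotional state, if any
}

const TOPIC_GUIDANCE: Record<Topic, string> = {
  anxiety: "Emphasize grounding and slow breathing.",
  focus: "Emphasize single-tasking and brief attention resets.",
  grief: "Emphasize patience and acknowledgment; do not rush to fix.",
  relationships: "Emphasize perspective-taking and what is within the user's control.",
};

// Maps the local hour to a coarse time-of-day instruction (boundaries are assumed).
function timeOfDayHint(hour: number): string {
  if (hour >= 21 || hour < 5) return "It is late; favor winding down.";
  if (hour < 12) return "It is morning; favor setting an intention for the day.";
  return "It is daytime; favor short, practical resets.";
}

// Assembles the system prompt one line per signal, skipping signals that are absent.
function buildSystemPrompt(ctx: PromptContext): string {
  const parts: string[] = [
    "You are a calm coach grounded in Buddhist and Stoic traditions.",
    timeOfDayHint(ctx.hour),
  ];
  if (ctx.topic) parts.push(TOPIC_GUIDANCE[ctx.topic]);
  if (ctx.mood) parts.push("The user reports feeling " + ctx.mood + ".");
  return parts.join("\n");
}
```

Keeping the builder a pure function of its context makes the coach's posture easy to unit test without loading a model.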
Why Every Choice Points Back to the Device
| Layer | Technology |
|---|---|
| Framework | React Native via Expo (managed + EAS builds) |
| Navigation | Expo Router (file-system routing, drawer + tabs) |
| On-Device LLM | expo-llm-mediapipe (Google MediaPipe, Gemma 1B ~1.5 GB quantized) |
| Storage | react-native-mmkv (synchronous, C++ key-value store) |
| UI | NativeWind (Tailwind CSS) + Gluestack UI component library |
| Animations | React Native Reanimated + Legendapp Motion |
| State | React Context + custom hooks (no Redux/Zustand) |
| Testing | Jest with react-native preset, @testing-library/react-native |
| Build/Deploy | EAS (Expo Application Services) |
Using expo-llm-mediapipe over a remote API was the only architecturally coherent choice given the product thesis. MediaPipe’s on-device inference pipeline, combined with a quantized Gemma 1B model, delivers acceptable latency on mid-range Android hardware while keeping the binary footprint manageable. The trade-off — a ~1.5 GB one-time model download — is consciously surfaced in a dedicated onboarding flow rather than hidden from the user.
I chose react-native-mmkv over AsyncStorage because chat history retrieval and settings reads happen on the JavaScript thread during navigation transitions. AsyncStorage’s asynchronous, JS-bridge-dependent I/O introduces frame drops at exactly those moments. MMKV’s synchronous, JSI-backed C++ implementation eliminates that class of jank. For an app whose UX depends on feeling calm and responsive, this isn’t a micro-optimization.
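To make the synchronous-read argument concrete, here is a hedged sketch of a chat-history helper. The `KeyValueStore` interface mirrors the two react-native-mmkv instance methods it relies on (`getString` and `set`). The key scheme, the `ChatMessage` shape, and the helper names are hypothetical, and the in-memory store is only a stand-in so the logic runs off-device.

```typescript
// Sketch under assumptions: the key scheme, ChatMessage shape, and helper
// names are hypothetical. KeyValueStore mirrors the two react-native-mmkv
// instance methods this example relies on (getString and set).
interface KeyValueStore {
  getString(key: string): string | undefined;
  set(key: string, value: string): void;
}

interface ChatMessage {
  role: "user" | "coach";
  text: string;
  sentAt: number; // epoch milliseconds
}

function historyKey(sessionId: string): string {
  return "chat:" + sessionId;
}

// Synchronous read: no promise to await before the history can be rendered.
function loadHistory(store: KeyValueStore, sessionId: string): ChatMessage[] {
  const raw = store.getString(historyKey(sessionId));
  if (raw === undefined) return [];
  try {
    const parsed: unknown = JSON.parse(raw);
    return Array.isArray(parsed) ? (parsed as ChatMessage[]) : [];
  } catch (err) {
    return []; // corrupt entry: treat as empty rather than crash the screen
  }
}

// Appends one message and writes the whole session back under its key.
function appendMessage(
  store: KeyValueStore,
  sessionId: string,
  message: ChatMessage,
): ChatMessage[] {
  const next = loadHistory(store, sessionId).concat([message]);
  store.set(historyKey(sessionId), JSON.stringify(next));
  return next;
}

// In-memory stand-in for tests; on device an MMKV instance would be passed instead.
function createMemoryStore(): KeyValueStore {
  const data: Record<string, string> = {};
  return {
    getString: (key) => data[key],
    set: (key, value) => {
      data[key] = value;
    },
  };
}
```

Because the helpers depend only on the small interface, the same code can be unit tested in Jest without mocking the native module.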
Choosing Expo Router plus React Context over a global state library was an easy call. The screen graph is shallow (drawer → chat, history, settings, model-setup), so Redux or Zustand would add unnecessary indirection. Two root-level React Contexts (LLMContext, AppInitializationContext) provide global LLM and app-initialization state without prop drilling, while each screen composes purpose-built hooks. The architecture matches the actual complexity of the problem.
Taming a Native Streaming API That Doesn’t Know When It’s Done
The most technically subtle challenge in the project lives inside the LLM inference layer. The expo-llm-mediapipe native module fires incremental partial-response events as the model generates tokens, but it doesn’t emit an explicit “generation complete” signal. From JavaScript’s perspective, a silent event bus is ambiguous: a model that has genuinely finished generating looks exactly like one that is simply pausing mid-thought.
I built a self-healing completion heuristic around this. A periodic check monitors the timestamp of the most recently received token. If no new token arrives within a two-second window, the system treats generation as complete, resolves the pending response promise, and commits the full streamed content to the message store. This interval-based observer runs alongside a global timeout guard and an abort controller — three independent levers to cleanly terminate a generation without leaving the UI hung. The entire flow is wrapped in retry logic that distinguishes retriable transient failures from terminal errors (out-of-memory, model not initialized), so the app degrades gracefully rather than crashing silently.
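The idle-window idea can be sketched as a small tracker with an injectable clock. This is an illustration under stated assumptions, not the app's implementation: `createCompletionTracker`, `waitForCompletion`, the status names, and the 250 ms polling interval are all hypothetical, and none of them belong to expo-llm-mediapipe. Only the two-second idle window comes from the description above. Requiring at least one token before the idle rule applies is also my assumption.

```typescript
// Sketch under assumptions: these names are hypothetical and are not part of
// expo-llm-mediapipe. Only the two-second idle window comes from the post.
type StreamStatus = "streaming" | "complete" | "timed-out" | "aborted";

interface TrackerOptions {
  idleMs: number;     // silence that counts as "done", e.g. 2000
  maxMs: number;      // global timeout guard (the value is an assumption)
  now?: () => number; // injectable clock so the logic is testable
}

function createCompletionTracker(opts: TrackerOptions) {
  const now = opts.now ?? (() => Date.now());
  const startedAt = now();
  let lastTokenAt = startedAt;
  let text = "";
  let aborted = false;

  return {
    // Call from the native partial-response listener.
    onToken(chunk: string): void {
      if (aborted) return;
      text += chunk;
      lastTokenAt = now();
    },
    // Lever 3: explicit cancellation, e.g. wired to an AbortController signal.
    abort(): void {
      aborted = true;
    },
    // Call periodically; returns why generation ended, or "streaming".
    poll(): StreamStatus {
      if (aborted) return "aborted";
      const t = now();
      // Lever 2: global timeout, regardless of token activity.
      if (t - startedAt >= opts.maxMs) return "timed-out";
      // Lever 1: idle window. Requiring at least one token first is an
      // assumption, so a slow first token is not mistaken for completion.
      if (text.length > 0 && t - lastTokenAt >= opts.idleMs) return "complete";
      return "streaming";
    },
    getText(): string {
      return text;
    },
  };
}

// Drives the tracker with an interval and resolves once streaming stops.
function waitForCompletion(
  tracker: ReturnType<typeof createCompletionTracker>,
  pollMs = 250,
): Promise<{ status: StreamStatus; text: string }> {
  return new Promise((resolve) => {
    const timer = setInterval(() => {
      const status = tracker.poll();
      if (status === "streaming") return;
      clearInterval(timer);
      resolve({ status, text: tracker.getText() });
    }, pollMs);
  });
}
```

Separating the decision (`poll`) from the timer that drives it keeps the heuristic deterministic under test, since a fake clock can stand in for real elapsed time.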
What Building This Taught Me
Privacy is an architecture decision, not a checkbox. The choice to run inference on-device cascades into nearly every other technical decision — the storage engine, the model format, the build pipeline, the UX of the onboarding flow. Committing to a privacy constraint early forced more rigorous thinking than any feature requirement would have.
The gap between “works on my machine” and “ships on device” is where the real engineering lives. React Native’s native module ecosystem is mature but unforgiving. The moment a library requires a custom dev client — as expo-llm-mediapipe and react-native-mmkv both do — the entire local development and CI/CD workflow has to be reconsidered. EAS build profiles, simulator vs. physical device testing strategies, and native module mocking for unit tests all become first-class concerns.
Streaming UX is a product feature disguised as an engineering problem. The decision to stream tokens to the screen rather than wait for a complete response wasn’t just a latency optimization — it fundamentally changes how the interaction feels. A 4-second wait for a complete response feels slow. The same 4 seconds spent watching text appear feels like thinking. That psychological dimension is as important as the technical implementation.
Where This Goes Next
The architecture already abstracts the sensing layer in anticipation of real-time posture and gesture detection. The highest-value next step is wiring actual MediaPipe pose landmark data into the prompt engine, so the coach can respond to detected physical cues — slumped posture, shallow breathing patterns, prolonged stillness — without the user needing to self-report their state.
Separate Android (NNAPI / GPU delegate) and iOS (Core ML / Metal) inference paths would unlock hardware acceleration on flagship devices and push response latency into the sub-two-second range that makes conversation feel truly natural. The current pipeline targets a generic quantized format that leaves performance on the table.
The data for a longitudinal coaching profile already exists in the local message store. A lightweight on-device analytics pass — tracking recurring topics, session frequency, and emotional patterns over time — would let the coach proactively reference past conversations and adapt its style to individual users. That’s the difference between a stateless chatbot and a tool someone actually returns to.
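As a rough illustration of what such an analytics pass might compute, the sketch below derives session frequency and recurring topics from stored sessions. It rests on loud assumptions: `SessionRecord`, `CoachingProfile`, and `summarizeSessions` are hypothetical names, and it presumes each session already carries topic tags, which the current message store may not provide. The "at least two sessions" threshold for a recurring topic is likewise arbitrary.

```typescript
// Hypothetical sketch: SessionRecord and its topic tags are assumptions; the
// post does not describe how sessions would be tagged.
interface SessionRecord {
  startedAt: number; // epoch milliseconds
  topics: string[];  // e.g. ["anxiety", "focus"]
}

interface CoachingProfile {
  sessionsPerWeek: number;
  recurringTopics: string[]; // seen in at least two sessions, most frequent first
}

const DAY_MS = 24 * 60 * 60 * 1000;

function summarizeSessions(
  sessions: SessionRecord[],
  windowDays: number,
  now: number,
): CoachingProfile {
  // Only sessions inside the trailing window contribute to the profile.
  const cutoff = now - windowDays * DAY_MS;
  const recent = sessions.filter((s) => s.startedAt >= cutoff);

  const counts: Record<string, number> = {};
  for (const session of recent) {
    // Count each topic once per session, even if it was tagged twice.
    const seen: Record<string, boolean> = {};
    for (const topic of session.topics) {
      if (seen[topic]) continue;
      seen[topic] = true;
      counts[topic] = (counts[topic] ?? 0) + 1;
    }
  }

  const recurringTopics = Object.keys(counts)
    .filter((topic) => (counts[topic] ?? 0) >= 2)
    .sort((a, b) => (counts[b] ?? 0) - (counts[a] ?? 0));

  return { sessionsPerWeek: recent.length / (windowDays / 7), recurringTopics };
}
```

The resulting profile is small enough to persist locally and feed back into the prompt, which keeps the whole loop on the device.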
Try It Out
Check out the source code on GitHub.