Your API Reloads the Model on Every Request: Here's the FastAPI Pattern That Fixes It for Good

An ML inference API has a constraint that a standard REST API does not: its most expensive resource is not a database connection or a network socket — it is the model artifact itself.

Published

Tue Oct 14 2025

Technologies Used

Python FastAPI
Intermediate 24 minutes

Purpose

The Problem

An ML inference API has a constraint that a standard REST API does not: its most expensive resource is not a database connection or a network socket — it is the model artifact itself. Loading a scikit-learn pipeline from disk is a blocking I/O operation followed by object deserialization and memory allocation. Do it inside a request handler and you are paying that cost on every single request, often adding hundreds of milliseconds of latency. VitalCheck loads eight models and two Parquet reference tables at startup. Naively reloading them per-request would make the API unusably slow.
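To make the cost concrete, here is a minimal sketch that counts how often an artifact is deserialized under each strategy. It uses the standard-library pickle module and a toy dictionary as a stand-in for a fitted pipeline; `load_model`, `handle_naive`, `make_cached_handler`, and `load_log` are hypothetical names for illustration, not part of VitalCheck.

```python
import pickle
import tempfile
from pathlib import Path

load_log: list[str] = []  # one entry per read-and-deserialize


def load_model(path: Path) -> dict:
    """Stand-in for joblib.load(): read an artifact from disk and deserialize it."""
    load_log.append(path.name)
    return pickle.loads(path.read_bytes())


def handle_naive(path: Path, x: float) -> float:
    # Anti-pattern: the artifact is reloaded inside every "request".
    model = load_model(path)
    return model["weight"] * x


def make_cached_handler(path: Path):
    # Load once, then close over the already-loaded object.
    model = load_model(path)

    def handle_cached(x: float) -> float:
        return model["weight"] * x

    return handle_cached


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "toy_model.pkl"
    artifact.write_bytes(pickle.dumps({"weight": 2.0}))

    naive_results = [handle_naive(artifact, x) for x in (1.0, 2.0, 3.0)]
    naive_loads = len(load_log)

    load_log.clear()
    handle_cached = make_cached_handler(artifact)
    cached_results = [handle_cached(x) for x in (1.0, 2.0, 3.0)]
    cached_loads = len(load_log)
```

Both handlers return the same answers, but the naive one touches the disk once per call while the cached one touches it once in total. The rest of this tutorial is about getting that "once in total" behavior inside FastAPI without a closure or a global.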

The solution requires three interlocking FastAPI patterns working together: the Lifespan Context Manager (the right place to run startup logic), the Singleton Registry (the right data structure to hold shared state), and Dependency Injection (the right mechanism to hand that state to route handlers without coupling them to global variables). Mastering this combination is a prerequisite for building any production-grade FastAPI service that manages expensive shared resources.

We will build an application-scoped singleton that loads all artifacts once, stores them on app.state, and injects them into route handlers via FastAPI’s Depends() system — without a single global variable or per-request I/O.

What You Need in Your Toolkit: Async Python and FastAPI’s Request Lifecycle

Knowledge Base:

  • Python classes and __init__ methods
  • What a context manager is (the with statement and __enter__/__exit__)
  • Basic async/await syntax in Python
  • What FastAPI route handlers look like (@router.post(...))
  • Optional but helpful: understanding of what “application state” means in a web framework

Environment (from pyproject.toml):

Python        >= 3.11
fastapi       >= 0.115.0
uvicorn       >= 0.34.0
joblib        >= 1.4.0
onnxruntime   >= 1.21.0
pandas        >= 2.2.0
pydantic-settings >= 2.7.0

🔵 Deep Dive: FastAPI is built on Starlette, which runs on ASGI — the Asynchronous Server Gateway Interface. ASGI separates the web server (Uvicorn) from the application (FastAPI). The server manages the event loop; the application defines what runs on it. Lifespan events are the ASGI-level hook for running code before and after the application handles any requests.
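For readers who want to see the hook itself, the sketch below plays both sides of the ASGI lifespan exchange using only asyncio. The message types are the ones defined by the ASGI lifespan protocol; `fake_server` and `events` are hypothetical names, and the "server" here is a toy that stands in for Uvicorn.

```python
import asyncio

events: list[str] = []  # records the order in which things happen


async def app(scope, receive, send):
    """A bare ASGI app that handles only the lifespan protocol."""
    if scope["type"] != "lifespan":
        return
    while True:
        message = await receive()
        if message["type"] == "lifespan.startup":
            events.append("load artifacts")  # startup work would go here
            await send({"type": "lifespan.startup.complete"})
        elif message["type"] == "lifespan.shutdown":
            events.append("release artifacts")  # teardown work would go here
            await send({"type": "lifespan.shutdown.complete"})
            return


async def fake_server() -> list[str]:
    """Plays the role of the ASGI server: startup, serve, shutdown."""
    inbox: asyncio.Queue = asyncio.Queue()
    started = asyncio.Event()

    async def send(message: dict) -> None:
        if message["type"] == "lifespan.startup.complete":
            started.set()

    task = asyncio.create_task(app({"type": "lifespan"}, inbox.get, send))
    await inbox.put({"type": "lifespan.startup"})
    await started.wait()  # the server waits for the app to finish starting
    events.append("serve requests")
    await inbox.put({"type": "lifespan.shutdown"})
    await task
    return events


timeline = asyncio.run(fake_server())
```

FastAPI's `lifespan` parameter wraps this exchange so that application code only has to write the work before and after a single `yield`.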

The Airport Analogy: Ground Crew, Terminal, and Passengers

Think of the VitalCheck startup process as an airport coming online before the first flight.

The Ground Crew (the lifespan function) sets everything up before passengers arrive — loading fuel (models), stocking the terminal (reference data), and checking all systems. The Terminal (app.state) is the shared facility that stores everything the ground crew prepared. The Passengers (incoming requests) never interact with the ground crew directly; they use what the terminal provides. The Check-in Desk (the get_registry dependency function) is a fixed counter that passengers walk up to — it always hands them what they need from the terminal.

sequenceDiagram
    participant Uvicorn as Uvicorn (ASGI Server)
    participant Lifespan as lifespan() context manager
    participant Registry as ModelRegistry
    participant AppState as app.state
    participant Handler as Route Handler
    participant Depends as get_registry()

    Uvicorn->>Lifespan: startup signal
    Lifespan->>Registry: ModelRegistry()
    Lifespan->>Registry: .load_all(models_dir, reference_dir)
    Note over Registry: Loads 7 .pkl files,<br/>1 ONNX session,<br/>2 Parquet tables
    Lifespan->>AppState: app.state.registry = registry
    Lifespan-->>Uvicorn: yield (ready to serve)

    loop Per Request
        Uvicorn->>Handler: POST /api/v1/risk/diabetes
        Handler->>Depends: Depends(get_registry)
        Depends->>AppState: request.app.state.registry
        AppState-->>Depends: ModelRegistry instance (already loaded)
        Depends-->>Handler: registry
        Handler->>Registry: registry.diabetes.predict_proba(...)
        Handler-->>Uvicorn: VitalCheckResponse[...]
    end

    Uvicorn->>Lifespan: shutdown signal
    Lifespan-->>Uvicorn: cleanup complete

The Three-Part Machinery: Settings, Registry, and Injection

We will walk through the code in three focused chunks that map directly to the three files involved: config.py, dependencies.py, and the route handler in risk.py.

Chunk 1 — Settings: The Configuration Singleton (app/config.py)

Before we can load models, we need to know where they live. pydantic-settings extends Pydantic’s BaseModel to read values from environment variables or a .env file automatically.

from pathlib import Path
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # This tells pydantic-settings where to look for env vars.
    # 'extra="ignore"' means unknown env vars are silently ignored rather
    # than raising a ValidationError — important in containerized environments
    # where many unrelated vars may be set.
    model_config = SettingsConfigDict(
        env_file=".env",
        env_file_encoding="utf-8",
        extra="ignore"
    )

    # Path fields — pydantic-settings converts the string "data/models"
    # from the env var or .env file into a pathlib.Path automatically.
    models_dir:   Path = Path("data/models")
    reference_dir: Path = Path("data/reference")
    static_dir:   Path = Path("static")
    log_level:    str  = "info"
    api_version:  str  = "1.0.0"

# Module-level singleton — created once per process.
_settings: Settings | None = None

def get_settings() -> Settings:
    global _settings
    if _settings is None:
        # First call reads from environment / .env file.
        # Every subsequent call returns the already-parsed object.
        _settings = Settings()
    return _settings

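This is the initialization-on-first-use singleton pattern, and it avoids module-level side effects at import time. The GIL makes the final assignment to _settings atomic, but it does not make the check-then-assign sequence atomic: two threads that call get_settings() concurrently before the first assignment can each construct a Settings object, and the later assignment wins. That race is harmless here because both objects are built from the same environment and are never mutated, and in this application the first call happens inside lifespan() before any request is served.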

Chunk 2 — The Registry: Holding All Artifacts in One Place (app/dependencies.py)

ModelRegistry is a plain Python class that acts as a typed container for every loaded artifact. All attributes start as None; load_all() populates them.

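import logging
from pathlib import Path
from typing import Any

import joblib
import onnxruntime as ort
import pandas as pd

# Module-level logger used by load_all() and lifespan().
logger = logging.getLogger(__name__)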

class ModelRegistry:
    """Holds all ML artifacts loaded once at startup."""

    def __init__(self) -> None:
        # All fields typed explicitly — None before load_all() is called,
        # populated after. Using typed attributes means IDE tooling can
        # autocomplete registry.diabetes without inspecting load_all().
        self.diabetes:             Any = None
        self.heart:                Any = None
        self.stroke:               Any = None
        self.breast_cancer:        Any = None
        self.sleep:                Any = None
        self.life_expectancy:      Any = None
        self.insurance:            tuple[Any, Any, Any] | None = None  # (mean, q05, q95)
        self.brain_tumor_session:  ort.InferenceSession | None = None
        self.fitbit_percentiles:   pd.DataFrame | None = None
        self.hospital_analytics:   pd.DataFrame | None = None
        self.registry_meta:        dict[str, Any] = {}

The load_all method is where the blocking I/O happens — intentionally, exactly once:

    def load_all(self, models_dir: Path, reference_dir: Path) -> None:
        logger.info("Loading ML artifacts from %s", models_dir)

        # joblib.load() deserializes a pickle-compatible binary file.
        # Each .pkl here is a fitted sklearn Pipeline object.
        self.diabetes      = joblib.load(models_dir / "diabetes_pipeline.pkl")
        self.heart         = joblib.load(models_dir / "heart_pipeline.pkl")
        self.stroke        = joblib.load(models_dir / "stroke_pipeline.pkl")
        self.breast_cancer = joblib.load(models_dir / "breast_cancer_pipeline.pkl")
        self.sleep         = joblib.load(models_dir / "sleep_pipeline.pkl")
        self.life_expectancy = joblib.load(models_dir / "life_expectancy_pipeline.pkl")
        self.insurance     = joblib.load(models_dir / "insurance_pipeline.pkl")

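        # ONNX Runtime requires explicit session configuration.
        # Thread counts are set to 1: this is a deliberate memory/CPU tradeoff
        # on a single-worker, 2GB VPS deployment (covered in Tutorial 3).
        # NOTE: the ONNX filename below is an assumption; the original listing
        # does not show where onnx_path is defined. Adjust it to your artifact.
        onnx_path = models_dir / "brain_tumor.onnx"
        opts = ort.SessionOptions()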
        opts.inter_op_num_threads = 1
        opts.intra_op_num_threads = 1
        opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        self.brain_tumor_session = ort.InferenceSession(
            str(onnx_path),
            sess_options=opts,
            providers=["CPUExecutionProvider"],
        )

        # Reference data loaded from Parquet — a columnar binary format
        # that is dramatically faster to read than CSV for structured data.
        self.fitbit_percentiles  = pd.read_parquet(reference_dir / "fitbit_percentiles.parquet")
        self.hospital_analytics  = pd.read_parquet(reference_dir / "hospital_analytics.parquet")
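Because a missing file only surfaces when its own load call is reached, one optional refinement is to check that every expected artifact exists before loading any of them, so a single error names all the gaps. The sketch below is an assumption layered on top of the tutorial code: `missing_artifacts` and `check_or_raise` are hypothetical helpers, and the caller supplies the list of filenames.

```python
import tempfile
from pathlib import Path


def missing_artifacts(directory: Path, filenames: list[str]) -> list[str]:
    """Return the expected filenames that are not present in directory."""
    return [name for name in filenames if not (directory / name).is_file()]


def check_or_raise(directory: Path, filenames: list[str]) -> None:
    """Fail fast with one error that names every missing file."""
    missing = missing_artifacts(directory, filenames)
    if missing:
        raise FileNotFoundError(
            f"Missing artifacts in {directory}: {', '.join(missing)}"
        )


# Demonstration against a temporary directory holding one of two files.
with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp)
    (models_dir / "diabetes_pipeline.pkl").write_bytes(b"")
    expected = ["diabetes_pipeline.pkl", "heart_pipeline.pkl"]

    report = missing_artifacts(models_dir, expected)
    try:
        check_or_raise(models_dir, expected)
        raised = False
    except FileNotFoundError:
        raised = True
```

A call like `check_or_raise(models_dir, [...])` could sit at the top of `load_all()`; it changes only the quality of the error message, not the fail-fast behavior.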

Chunk 3 — The Lifespan and the Injection Point (app/dependencies.py, continued)

The lifespan context manager is where the registry is wired into the application:

from contextlib import asynccontextmanager
from fastapi import FastAPI, Request

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Everything BEFORE yield runs at startup — before the first request.
    settings = get_settings()
    registry = ModelRegistry()
    registry.load_all(settings.models_dir, settings.reference_dir)

    # app.state is Starlette's built-in key-value store for application-scoped
    # objects. It is attached to the ASGI app instance itself, not to any
    # specific request. Every request can reach it via request.app.state.
    app.state.registry = registry
    logger.info("VitalCheck API startup complete")

    yield  # Application is now live and serving requests.

    # Everything AFTER yield runs at shutdown.
    logger.info("VitalCheck API shutting down")
    # If registry objects held file handles or connections, you would close
    # them here. sklearn pipelines and ONNX sessions are in-memory only.

The lifespan is passed to the FastAPI constructor in main.py:

app = FastAPI(
    title="VitalCheck API",
    version="1.0.0",
    lifespan=lifespan,  # <-- FastAPI calls this automatically at ASGI startup/shutdown
    ...
)

Finally, the dependency function that routes use to access the registry:

def get_registry(request: Request) -> ModelRegistry:
    # request.app is the FastAPI application instance.
    # request.app.state is the same Starlette State object we wrote to in lifespan().
    # This is a single attribute lookup — effectively free.
    return request.app.state.registry

And a route handler consuming it:

@router.post("/diabetes", response_model=VitalCheckResponse[DiabetesPrediction])
async def predict_diabetes(
    req: DiabetesRequest,
    # FastAPI resolves Depends(get_registry) by calling get_registry(request)
    # and injecting the return value as the 'registry' parameter.
    # The handler never constructs or loads a ModelRegistry itself; the name
    # appears here only as a type annotation.
    registry: ModelRegistry = Depends(get_registry),
) -> VitalCheckResponse[DiabetesPrediction]:
    # `features` is assumed to be derived from `req` in code not shown here.
    prob, contributors = predict_risk(registry.diabetes, features)
    ...

Why app.state Beats a Global Variable: Memory Layout and Thread Safety

The Global Variable Trap

The anti-pattern this design replaces looks like this:

# DO NOT DO THIS
_registry = None

def load_models():
    global _registry
    _registry = ModelRegistry()
    _registry.load_all(...)

# In route handler:
from app.some_module import _registry

This works in development but breaks in three ways in production:

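  1. Import order dependency. from app.some_module import _registry copies whatever _registry refers to at import time. If the route module is imported before load_models() runs, it binds None, and the later reassignment inside load_models() is never visible to the route module. This is a class of bug that is genuinely hard to reproduce outside production.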
  2. No lifecycle guarantee. A global variable has no mechanism to ensure it is initialized before requests are served. The lifespan hook is a first-class ASGI guarantee.
  3. Testability. To test a route with a mock registry when using a global, you must monkeypatch the module’s global — a fragile approach. With Depends(get_registry), you can override the dependency in tests with app.dependency_overrides[get_registry] = lambda: mock_registry.
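The import-order trap in the first point can be reproduced in a few lines. The sketch below builds a throwaway module in memory (the name `toy_registry_module` is hypothetical) so that the example is self-contained; in a real project the same thing happens across ordinary files.

```python
import sys
import types

# Build a throwaway module that mimics the anti-pattern above.
registry_module = types.ModuleType("toy_registry_module")
exec(
    "_registry = None\n"
    "def load_models():\n"
    "    global _registry\n"
    "    _registry = {'diabetes': 'loaded pipeline'}\n",
    registry_module.__dict__,
)
sys.modules["toy_registry_module"] = registry_module

# A "route module" that runs its import before load_models() is called.
from toy_registry_module import _registry

# Startup happens afterwards and rebinds the name inside the source module.
registry_module.load_models()

seen_by_route_module = _registry                   # the stale binding
seen_through_module = registry_module._registry    # the current binding
```

The route module keeps its stale `None` because `from ... import` copied the binding, while attribute access through the module sees the loaded object. Reading from `request.app.state` at request time avoids the problem because the lookup happens after startup has finished.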

Memory Implications

ModelRegistry holds 26+ MB of in-memory model artifacts. Because it is stored on app.state and passed by reference through the dependency system, this memory is allocated exactly once. get_registry returns the same object on every call — not a copy. Route handlers hold a reference to the registry for the duration of a single request and then release it. The garbage collector never sees these objects as candidates for collection because app.state maintains a live reference.
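The by-reference behavior is easy to confirm with an identity check. In this minimal sketch, `ToyRegistry` is a hypothetical stand-in for ModelRegistry, and `SimpleNamespace` objects stand in for the FastAPI app and two separate requests; only `get_registry` matches the tutorial's function.

```python
from types import SimpleNamespace


class ToyRegistry:
    """Hypothetical stand-in for ModelRegistry."""

    def __init__(self) -> None:
        self.diabetes = object()  # pretend this is a large fitted pipeline


def get_registry(request) -> ToyRegistry:
    # Same body as the tutorial's dependency: a plain attribute lookup.
    return request.app.state.registry


# Stand-ins for the application and two independent incoming requests.
app = SimpleNamespace(state=SimpleNamespace(registry=ToyRegistry()))
request_a = SimpleNamespace(app=app)
request_b = SimpleNamespace(app=app)

first = get_registry(request_a)
second = get_registry(request_b)

same_registry = first is second
same_model = first.diabetes is second.diabetes
```

Both requests receive the very same registry and the very same model object, so serving more requests does not allocate more model memory.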

🔵 Deep Dive: Starlette’s State object is a thin wrapper around a plain Python dictionary. Setting app.state.registry = registry stores the object under the key "registry" in that internal dictionary (held in a private _state attribute in Starlette releases at the time of writing, rather than in the instance's __dict__), and reading app.state.registry looks it up again. The Request object exposes request.app as a reference to the ASGI application, not a copy. This is why get_registry(request) amounts to a dictionary lookup rather than any real work.
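A dictionary-backed attribute store of this kind can be modeled in a few lines. The class below is a simplified illustration written for this tutorial, not Starlette's actual source; `ToyState` is a hypothetical name.

```python
from typing import Any


class ToyState:
    """Simplified model of a dict-backed attribute store (not Starlette's code)."""

    def __init__(self) -> None:
        # Bypass our own __setattr__ so the backing dict lands on the instance.
        super().__setattr__("_state", {})

    def __setattr__(self, key: str, value: Any) -> None:
        # Every attribute assignment becomes a dictionary write.
        self._state[key] = value

    def __getattr__(self, key: str) -> Any:
        # Called only when normal attribute lookup fails.
        try:
            return self._state[key]
        except KeyError:
            raise AttributeError(
                f"{type(self).__name__!r} has no attribute {key!r}"
            ) from None


state = ToyState()
registry = {"diabetes": "fitted pipeline"}
state.registry = registry

stored_is_same_object = state.registry is registry
backing_dict = dict(state._state)

try:
    state.not_loaded
    missing_raises = False
except AttributeError:
    missing_raises = True
```

Reading an attribute that was never set raises AttributeError, which is what a handler would see if it asked for `app.state.registry` in an application whose lifespan never stored one.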

When Startup Fails, When the Registry is Missing, and the Test Override Trick

What happens if load_all() raises an exception?

If any joblib.load() call fails (file missing, corrupted pickle, version mismatch), the exception propagates up through lifespan(), which causes Uvicorn to log the error and refuse to start. The application never enters a partially-initialized state where some models are loaded and others are not. This is the correct behavior — a half-initialized registry is more dangerous than a dead server, because it would serve some endpoints successfully and fail others in unpredictable ways.
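The fail-fast behavior follows from how an async context manager works, and it can be observed without a web server. In the sketch below, `load_all` is a hypothetical stand-in that always fails, and `run_server` plays the part of the ASGI server entering the lifespan.

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []


def load_all() -> None:
    """Stand-in for ModelRegistry.load_all() with a missing artifact."""
    raise FileNotFoundError("diabetes_pipeline.pkl")


@asynccontextmanager
async def lifespan(app):
    events.append("startup begins")
    load_all()  # raises before the yield is ever reached
    events.append("startup complete")
    yield
    events.append("shutdown")


async def run_server() -> str:
    try:
        async with lifespan(app=None):
            events.append("serving requests")
    except FileNotFoundError as exc:
        return f"startup failed: {exc}"
    return "clean exit"


outcome = asyncio.run(run_server())
```

Only "startup begins" is recorded: the body of the `async with` block, which corresponds to serving requests, never runs, and neither does the teardown code after the `yield`.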

What if a model is None at request time?

Route handlers guard against this explicitly:

# From app/routers/imaging.py
if registry.brain_tumor_session is None:
    raise HTTPException(status_code=503, detail="Brain tumor model not loaded")

This is the defensive check for the case where load_all() is called with a missing optional model. In practice, the startup failure above prevents this from ever occurring in production, but the 503 guard makes the failure mode explicit and debuggable rather than producing a cryptic AttributeError inside inference code.

🔴 Danger: The @asynccontextmanager decorator on lifespan means the function must yield exactly once each time it runs. If a conditional skips the yield, entering the context manager raises a RuntimeError and the application fails at startup. If a loop or a second branch yields again, the RuntimeError is raised when the context manager exits, which is at shutdown. Always keep the lifespan body to the simple pattern: setup → yield → teardown.
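Both ways of breaking the one-yield rule can be reproduced with contextlib alone. The two lifespans below are deliberately wrong, and `try_lifespan` is a hypothetical helper that enters and exits each one the way a server would.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def never_yields(app):
    if app is not None:  # false in this demo, so the yield is skipped
        yield


@asynccontextmanager
async def yields_twice(app):
    for _phase in ("first", "second"):
        yield  # bug: the loop makes the generator yield a second time


async def try_lifespan(lifespan) -> str:
    try:
        async with lifespan(None):
            pass  # the application would serve requests here
    except RuntimeError as exc:
        return str(exc)
    return "ok"


skipped_message = asyncio.run(try_lifespan(never_yields))
doubled_message = asyncio.run(try_lifespan(yields_twice))
```

The first case fails on entry, before the application would have served anything; the second fails only on exit, which makes it easy to miss until shutdown.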

The Test Override Pattern

Because the registry is injected via Depends, tests can swap it without touching production code:

# In your test file:
from fastapi.testclient import TestClient
from app.main import app
from app.dependencies import get_registry

mock_registry = MockModelRegistry()  # your test double

app.dependency_overrides[get_registry] = lambda: mock_registry

client = TestClient(app)
response = client.post("/api/v1/risk/diabetes", json={...})

This is the canonical FastAPI testing pattern. dependency_overrides is a dictionary on the app instance; FastAPI checks it before resolving any Depends() call and substitutes the override if one is present. No monkeypatching, no import manipulation.
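The test above leaves MockModelRegistry undefined. One possible shape for it is sketched below, assuming the handler only needs an object with a predict_proba method; `FakeClassifier` is a hypothetical helper, and it returns plain lists where a real scikit-learn pipeline would return a NumPy array.

```python
class FakeClassifier:
    """Test double exposing the one method a handler is assumed to call."""

    def __init__(self, positive_probability: float) -> None:
        self.positive_probability = positive_probability

    def predict_proba(self, rows: list[list[float]]) -> list[list[float]]:
        # Mirror the scikit-learn layout: one [p_negative, p_positive] pair per row.
        p = self.positive_probability
        return [[1.0 - p, p] for _ in rows]


class MockModelRegistry:
    """Stand-in registry: only the attributes a test touches are populated."""

    def __init__(self) -> None:
        self.diabetes = FakeClassifier(positive_probability=0.25)
        self.brain_tumor_session = None  # lets a test exercise the 503 guard


mock_registry = MockModelRegistry()
probabilities = mock_registry.diabetes.predict_proba([[5.1, 120.0], [6.3, 98.0]])
```

Two practical notes when using a double like this. A TestClient created without a `with` block does not run lifespan events, so the real artifacts are never loaded during such a test. And because dependency_overrides lives on the shared app object, it is worth calling `app.dependency_overrides.clear()` after each test so one test's override does not leak into the next.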

You Now Know How to Wire Expensive Shared State into FastAPI Without Global Variables

You have learned a complete three-layer pattern for managing application-scoped resources:

  1. pydantic-settings + singleton getter — type-safe, environment-aware configuration that reads once and caches forever.
  2. ModelRegistry class + lifespan context manager — a guaranteed-once initialization hook that stores artifacts on app.state before the first request and provides a natural teardown point.
  3. Depends(get_registry) — a zero-cost dependency injection mechanism that decouples route handlers from the registry’s storage location, enabling clean test overrides.

The core skill transfer is this: app.state is the correct home for application-scoped singletons in FastAPI, and Depends() is the correct mechanism to access them in handlers. This pattern applies equally to database connection pools, HTTP clients, caches, feature flag clients, and any resource that is expensive to create and safe to share across requests.
