
Splitting Atoms in a Word Document: Surgical Run Manipulation and Real-Time Audio Normalization

A deep dive into the algorithms that power STORM DAT's document annotation and audio normalization engines, revealing how to transform complex data structures in place without corruption.

Published

Sat Feb 21 2026

Technologies Used

Python · Whisper

Difficulty: Advanced · Estimated time: 35 minutes

The Problem

Two of the hardest engineering challenges in STORM DAT operate at entirely different layers of the stack, but share a common theme: transforming data in place without corrupting its structure.

Challenge 1 — Document Annotation: When the acronym sweep engine identifies a three-character acronym buried in the middle of a 200-character text run, it must highlight only those three characters. But Word documents do not have a “highlight character range” API. The smallest unit of formatting is a run — a contiguous text segment with uniform properties. To highlight a subset of a run, you must split it into three new runs, apply the color to the middle segment, preserve all formatting on the outer segments, and remove the original run from the XML tree. One mistake corrupts the document.

Challenge 2 — Audio Normalization: When a user uploads a recording for AI transcription, the audio may arrive as stereo int16 at 44.1kHz, but the Whisper model demands mono float32 at 16kHz. The pipeline must convert the data type, merge channels, resample the waveform using Fourier interpolation, and do all of this on arrays that can be hundreds of megabytes — without doubling memory consumption or losing audio fidelity.

The Solution

We will dissect two algorithms: the highlight_text() method that performs surgical run splitting on python-docx’s XML tree, and the audio normalization pipeline in routes.py that transforms raw WAV data into Whisper-ready tensors. Together, they demonstrate advanced patterns in in-place data structure mutation, memory-conscious array processing, and the tradeoffs of working with library internals.

What This Tutorial Demands: XML Trees, NumPy Internals, and Signal Processing Basics

Knowledge Base

  • python-docx run model: how Paragraph.runs maps to underlying <w:r> XML elements
  • lxml element manipulation: _element, .remove(), and the difference between Python object references and XML tree nodes
  • NumPy dtype system: int16, float32, uint8 — how arrays store numbers and what happens during type casting
  • Signal processing fundamentals: sample rate, Nyquist theorem, and why resampling is not simple array slicing

Environment

  • Python 3.12
  • python-docx 1.1.2 — Word document manipulation (built on lxml)
  • NumPy 2.2.5 — array operations
  • SciPy 1.15.3 — scipy.signal.resample for Fourier-based resampling
  • OpenAI Whisper — transcription model (loaded as a singleton)
pip install python-docx==1.1.2 numpy==2.2.5 scipy==1.15.3 openai-whisper

Two Engines, One Principle: Transforming Data Without Destroying Structure

flowchart TD
    subgraph Document_Engine["Document Annotation Engine"]
        A1[Identify target character range] --> A2[Walk runs, track position counter]
        A2 --> A3{Target within this run?}
        A3 -->|Yes| A4[Split into before / target / after]
        A4 --> A5[Create 3 new runs with preserved formatting]
        A5 --> A6[Remove original XML element]
        A3 -->|No| A7[Clone run as-is, remove original]
        A7 --> A2
    end

    subgraph Audio_Engine["Audio Normalization Engine"]
        B1[Read WAV file] --> B2{Stereo?}
        B2 -->|Yes| B3[Mean channels → mono]
        B2 -->|No| B4[Skip]
        B3 --> B5[Normalize dtype → float32]
        B4 --> B5
        B5 --> B6{Sample rate = 16kHz?}
        B6 -->|No| B7[Fourier resample to 16kHz]
        B6 -->|Yes| B8[Skip]
        B7 --> B9[Pass to Whisper]
        B8 --> B9
    end

    style A4 fill:#ffcccc,stroke:#333
    style B7 fill:#ffcccc,stroke:#333

Analogy: Both engines perform the same conceptual operation — they are jewelers resetting a gemstone. The document engine removes a run from its setting (the XML tree), carefully cuts it into pieces, resets each piece with its original properties, and places them back. The audio engine takes a raw recording, reshapes it to fit a specific mount (Whisper’s input format), and polishes it (normalization) so the model can work with it. In both cases, the transformation must be lossless with respect to what matters: formatting for documents, fidelity for audio.

Part A: The Run-Splitting Algorithm — Rewriting a Document’s XML Skeleton

Understanding Why This Is Hard

A Word .docx file is a ZIP archive containing XML files. The main document body is word/document.xml. Each paragraph is a <w:p> element containing one or more <w:r> (run) elements:

<w:p>
  <w:r>
    <w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="24"/></w:rPr>
    <w:t>The Department of Defense published the report.</w:t>
  </w:r>
</w:p>

If we need to highlight only “Defense” (characters 18-24, zero-indexed), we cannot simply “color characters 18-24.” We must split the single run into three:

"The Department of " | "Defense" | " published the report."

Each new run must carry the original’s formatting (Arial, 12pt), and only the middle run gets the highlight color. The original run must then be removed from the XML tree.
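The carving itself is plain string slicing. A minimal sketch of the three-way split (a hypothetical helper for illustration, not the actual STORM DAT method):

```python
def carve(text: str, offset: int, length: int) -> tuple[str, str, str]:
    """Split text into (before, target, after) around a character range."""
    return text[:offset], text[offset:offset + length], text[offset + length:]

before, target, after = carve("The Department of Defense published the report.", 18, 7)
print(target)  # Defense
```

A target at the very start or end of a run falls out naturally as an empty `before` or `after` string — which is why the real method guards each segment with an `if` before creating a run for it.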

The Position-Tracking Walker

The highlight_text() method walks through all runs, maintaining a current_pos counter — a manual index into the paragraph’s concatenated text.

def highlight_text(self, para, position, length, color):
    """Find the correct run, split it if needed, and highlight."""
    current_pos = 0

    for run in para.runs:
        run_length = len(run.text)

        # Capture formatting BEFORE any mutation
        font = run.font.name
        size = run.font.size
        highlight = run.font.highlight_color

        # Store the XML element reference for later removal
        delete_element = run._element

🔴 Danger: run._element accesses python-docx’s internal lxml element — a private API. This is necessary because python-docx does not expose a public method to remove a run from a paragraph. Using private APIs means this code could break in a future python-docx release. The tradeoff is accepted because there is no public alternative.

The Split Decision

The critical branch: does the target range fall within this run?

        if current_pos <= position < current_pos + run_length:
            # Target IS within this run — split it
            offset = position - current_pos

            # Carve the text into three segments
            before_text = run.text[:offset]
            affected_text = run.text[offset:offset + length]
            after_text = run.text[offset + length:]

Three new runs are then created, each inheriting the original’s formatting:

            # Segment 1: text before the highlight (original formatting)
            if before_text:
                new_run = para.add_run(before_text)
                new_run.font.name = font
                new_run.font.size = size
                new_run.font.highlight_color = highlight

            # Segment 2: the highlighted text (receives the finding color)
            if affected_text:
                highlighted_run = para.add_run(affected_text)
                highlighted_run.font.name = font
                highlighted_run.font.size = size
                highlighted_run.font.highlight_color = color  # THE highlight

            # Segment 3: text after the highlight (original formatting)
            if after_text:
                new_run = para.add_run(after_text)
                new_run.font.name = font
                new_run.font.size = size
                new_run.font.highlight_color = highlight

The Non-Matching Run Case

If the target does not fall in this run, the run is still cloned and the original removed. This is because the method rebuilds the entire paragraph’s run sequence — it cannot selectively modify one run and leave others untouched, since para.add_run() appends to the end of the run list.

        else:
            # Target is NOT in this run — clone it verbatim
            new_run = para.add_run(run.text)
            new_run.font.name = font
            new_run.font.size = size
            new_run.font.highlight_color = highlight

        # Advance the position counter
        current_pos += run_length

        # Remove the original XML element from the paragraph
        para._element.remove(delete_element)

🔵 Deep Dive: para._element.remove(delete_element) operates on the lxml etree level. lxml elements are nodes in a tree structure; .remove() detaches the node from its parent without destroying it (Python’s garbage collector will handle that). The new runs added via para.add_run() are appended as new <w:r> children of the <w:p> element. The net effect: the original XML is replaced with a semantically identical but structurally refined version.
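The detach-then-append dance is easy to verify outside python-docx. A standalone sketch using the stdlib's xml.etree.ElementTree, whose .remove()/.append() semantics match what the lxml calls above do:

```python
import xml.etree.ElementTree as ET  # remove/append behave like lxml's for this demo

p = ET.fromstring("<p><r>one</r><r>two</r></p>")
first = p[0]

p.remove(first)             # detach the node from its parent tree
assert first.text == "one"  # the detached object still holds its content

p.append(ET.fromstring("<r>new</r>"))  # new children are appended at the end
print(ET.tostring(p).decode())  # <p><r>two</r><r>new</r></p>
```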

Part B: The Audio Normalization Pipeline — From Raw WAV to Whisper Tensor

The Format Gauntlet

OpenAI’s Whisper model expects in-memory audio in exactly one format: a 1D float32 NumPy array at 16,000 samples per second. But audio captured by a browser’s MediaRecorder API can arrive in any combination of:

| Property | Possible Values |
| --- | --- |
| Channels | Mono (1D array) or Stereo (2D array) |
| Data type | int16, int32, uint8, float32, float64 |
| Sample rate | 8kHz, 22.05kHz, 44.1kHz, 48kHz, or others |

The pipeline must handle every combination.

Step 1 — Stereo to Mono Conversion

sample_rate, data = wavfile.read(save_path)

# Check dimensionality: stereo audio has shape (samples, 2)
if len(data.shape) == 2:
    # Average both channels into one
    data = data.mean(axis=1)

    # Normalize to [-1, 1] range to prevent clipping
    max_val = np.max(np.abs(data))
    if max_val > 0:
        data = data / max_val

data.mean(axis=1) computes the element-wise mean across the channel dimension. For a stereo array of shape (480000, 2), this produces a mono array of shape (480000,). The normalization step divides by the maximum absolute value to ensure no sample exceeds the [-1, 1] float range — preventing clipping artifacts in the transcription.
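The shape and dtype transitions are easy to confirm in isolation (a minimal sketch with a synthetic stereo buffer):

```python
import numpy as np

stereo = np.zeros((480_000, 2), dtype=np.int16)  # ~10 s of 48 kHz stereo silence
mono = stereo.mean(axis=1)

print(mono.shape)  # (480000,)
print(mono.dtype)  # float64 — note the upcast from int16
```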

Step 2 — Data Type Normalization

Each integer format has a different range that must be mapped to float32 in the [-1, 1] range:

if data.dtype == np.int16:
    # int16 range: [-32768, 32767]
    data = data.astype(np.float32) / 32768.0

elif data.dtype == np.int32:
    # int32 range: [-2147483648, 2147483647]
    data = data.astype(np.float32) / 2147483648.0

elif data.dtype == np.uint8:
    # uint8 range: [0, 255] — center at 128, then normalize
    data = (data.astype(np.float32) - 128) / 128.0

elif data.dtype == np.float64:
    # Already float, just downcast for Whisper compatibility
    data = data.astype(np.float32)

elif data.dtype != np.float32:
    return jsonify({'error': f'Unsupported sample type: {data.dtype}'}), 400

🔵 Deep Dive: The division constants (32768, 2147483648, 128) are not arbitrary — they are 2^15, 2^31, and 2^7: half the range of each integer type. Dividing by these values maps the full integer range to exactly [-1.0, 1.0] in float32. The uint8 case is asymmetric (0-255, center at 128) because unsigned audio uses 128 as the “silence” value rather than 0.
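The endpoint mapping can be checked directly: dividing by 2^15 sends the most negative int16 sample exactly to -1.0, and the most positive to just under +1.0 (a quick sketch):

```python
import numpy as np

samples = np.array([-32768, 0, 32767], dtype=np.int16)
scaled = samples.astype(np.float32) / 32768.0  # divide by 2**15

print(scaled[0])              # -1.0 (exactly)
print(float(scaled[2]) < 1.0) # True — one quantization step below +1.0
```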

Step 3 — Fourier Resampling

If the sample rate is not 16kHz, the array must be resampled:

target_sr = 16000
if sample_rate != target_sr:
    # Calculate the new array length proportionally
    new_len = int(len(data) * target_sr / sample_rate)

    # Fourier-based resampling
    data = resample(data, new_len)
    sample_rate = target_sr

scipy.signal.resample uses the Fourier method: it transforms the signal into the frequency domain via FFT, truncates or zero-pads the frequency components to match the target length, and then transforms back via inverse FFT. This preserves all frequency content below the Nyquist limit of the target sample rate (8kHz for 16kHz sampling), producing higher-quality results than simple linear interpolation.
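A quick check that the Fourier method changes array length without moving frequency content (sketch: a 440 Hz tone, 44.1 kHz → 16 kHz over exactly one second, so FFT bins land on 1 Hz boundaries):

```python
import numpy as np
from scipy.signal import resample

sr, target_sr = 44_100, 16_000
t = np.arange(sr) / sr                  # exactly 1 second of samples
tone = np.sin(2 * np.pi * 440 * t)      # 440 Hz sine

new_len = int(len(tone) * target_sr / sr)
resampled = resample(tone, new_len)     # Fourier-based resampling

print(len(resampled))                             # 16000
print(np.argmax(np.abs(np.fft.rfft(resampled))))  # 440 — peak bin unchanged
```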

Step 4 — Model Invocation and Cleanup

# Singleton model loaded once at app startup
model = current_app.whisper_model
if model is None:
    return jsonify({"error": "Transcription service unavailable"}), 503

# Transcribe — fp16=False because not all hardware supports half-precision
result = model.transcribe(audio=data, language='en', fp16=False)

return jsonify({
    "transcription": result["text"],
    "segments": result["segments"]
})

The finally block ensures the temporary upload file is deleted even if transcription fails:

finally:
    try:
        if os.path.exists(save_path):
            os.remove(save_path)
    except Exception as cleanup_error:
        current_app.logger.warning(f"Failed to cleanup: {cleanup_error}")

Memory Profiles, Allocation Chains, and Why Resampling Is Expensive

Document Engine — Memory Characteristics

The run-splitting algorithm’s memory behavior is proportional to the number of runs in a paragraph, not the document size. Each para.add_run() call creates a new python-docx Run object (~200 bytes Python overhead) and a new lxml <w:r> element (~400 bytes). For a paragraph with 5 runs and 3 highlights, the peak is approximately 8 run objects * 600 bytes = ~5KB. Document-level memory is dominated by the python-docx Document object itself (typically 5-50MB for large documents), not by the annotation process.

Audio Engine — The Allocation Chain

For a 5-minute stereo recording at 48kHz/16-bit:

| Step | Array Shape | Dtype | Memory |
| --- | --- | --- | --- |
| Raw read | (14,400,000, 2) | int16 | 55 MB |
| After .mean(axis=1) | (14,400,000,) | float64 | 110 MB (NumPy upcasts) |
| After /max_val | (14,400,000,) | float64 | 110 MB (new array) |
| After .astype(float32) | (14,400,000,) | float32 | 55 MB (new array) |
| After resample() | (4,800,000,) | float64 | 37 MB (SciPy returns float64) |

🔴 Danger: Notice the peak at the mean step: NumPy’s mean() on an int16 array returns float64 by default (to avoid overflow during summation). This means a 55MB int16 array becomes a 110MB float64 array — a 2x memory spike. For a 500MB media file (the configured maximum), this spike reaches 1GB. Combined with the FFT buffers that resample() allocates internally (which also operate in float64), peak memory for the audio pipeline can reach 3-4x the input file size.

This is why Gunicorn is configured with only 4 workers: each worker may consume up to 2GB during audio processing, and the server needs headroom for the Whisper model itself (~1.5GB for the “medium” model loaded at startup).
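One possible mitigation for the mean() spike (not what the current pipeline does) is to request float32 accumulation up front via NumPy's dtype argument, halving the intermediate:

```python
import numpy as np

stereo = np.zeros((14_400_000, 2), dtype=np.int16)  # ~55 MB input

mono64 = stereo.mean(axis=1)                        # default float64: ~110 MB
mono32 = stereo.mean(axis=1, dtype=np.float32)      # float32 accumulator: ~55 MB

print(mono64.nbytes == 2 * mono32.nbytes)  # True
```

The tradeoff is a float32 accumulator, which loses precision when summing very large values; for averaging just two int16 channels per sample, the error is negligible.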

Big-O Summary

| Operation | Time Complexity | Space Complexity |
| --- | --- | --- |
| Run splitting (per paragraph) | O(R), where R = runs | O(R) |
| Stereo → mono | O(N), where N = samples | O(N) — new array |
| dtype cast | O(N) | O(N) — new array |
| Fourier resample | O(N log N) — FFT | O(N) — frequency buffer |
| Whisper transcription | O(N) — sequential model inference | O(model) — ~1.5GB |

The most expensive operation is resample at O(N log N) due to the FFT. For a 5-minute recording at 48kHz (14.4 million samples), the forward FFT alone performs on the order of 14.4M × log2(14.4M) ≈ 14.4M × 24 ≈ 345 million operations — still well under a second on modern CPUs, but it is the bottleneck in the pipeline.

What Breaks: XML Corruption, NaN Propagation, and the Singleton Trap

Document Engine — The Dangling Element Risk

If para._element.remove(delete_element) is called before the new runs are appended, the paragraph temporarily has fewer runs than expected. If an exception occurs between the remove and the append, the document is left in a corrupted state with missing text. The current implementation mitigates this by always appending new runs before removing the original — but this relies on para.add_run() never throwing. If the document’s XML tree is malformed (e.g., missing namespace declarations), lxml could raise an XMLSyntaxError during add_run(), leaving both the original and partial new runs in the paragraph.

Audio Engine — NaN Propagation

If the audio data contains NaN values (possible with corrupted WAV files), np.max(np.abs(data)) returns NaN, and data / NaN produces an array of all NaNs. Whisper would then receive an array of NaNs and typically return an empty or nonsensical transcription — a silent failure with no error message. A production-hardened version would add a np.isnan(data).any() check before normalization.
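The check described above is a one-liner. A sketch of how it could be wired in ahead of normalization (hypothetical guard function, not the current routes.py):

```python
import numpy as np

def reject_nan(data: np.ndarray) -> np.ndarray:
    """Fail loudly on corrupted audio instead of propagating NaNs downstream."""
    if np.isnan(data).any():
        raise ValueError("audio contains NaN samples (corrupted WAV?)")
    return data

clean = reject_nan(np.array([0.1, -0.2, 0.3], dtype=np.float32))
print(clean.shape)  # (3,)
```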

Audio Engine — The Division-by-Zero Guard

The code checks if max_val > 0 before dividing. This guard catches the edge case of a completely silent recording (all zeros). Without it, the pipeline would divide by zero and produce inf or NaN values in the array.

The Singleton Model — Cold Start and Memory Pinning

Whisper is loaded once at startup via app.whisper_model = whisper.load_model("medium"). This means:

  • Cold start penalty: The first application boot takes 30-60 seconds to load the model. In a Kubernetes environment with health checks, the readiness probe must account for this delay.
  • Memory pinning: The ~1.5GB model lives in each Gunicorn worker’s memory for the entire process lifetime. With 4 workers, that is 6GB dedicated to models alone. This is why the Dockerfile uses gunicorn with a --timeout 600 — the extended timeout accommodates both model loading and long transcription jobs.
  • No hot-reload: If the model needs updating, the entire application must restart. There is no mechanism to swap models at runtime.

🔵 Deep Dive: Gunicorn’s preload_app option (not currently used in STORM DAT) can share the model across workers via copy-on-write memory in forked processes. This would reduce total memory from 6GB (4 copies) to ~1.5GB (1 shared copy) — a significant optimization for memory-constrained deployments. The tradeoff is that preload_app is incompatible with some debugging and reload workflows.
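The pattern looks like this in a gunicorn.conf.py (an illustrative sketch — the worker count and timeout are taken from this article; as noted, STORM DAT does not currently ship this configuration):

```python
# gunicorn.conf.py — illustrative sketch, not the deployed configuration
workers = 4
timeout = 600       # headroom for model loading and long transcription jobs
preload_app = True  # import the app (and load Whisper) once in the master,
                    # then fork; workers share the weights copy-on-write
```

Copy-on-write sharing only holds while pages stay untouched; in-place tensor mutation, or even Python's reference counting writing to object headers, can gradually re-duplicate memory across workers.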

You Now Know How to Mutate Complex Data Structures In Place Without Corruption

The unifying skill across both engines is structural transformation with invariant preservation:

  1. Capture state before mutation. Both algorithms snapshot the original properties (font name/size for runs, sample rate/dtype for audio) before modifying anything. This “capture-transform-apply” pattern ensures no information is lost during the transformation.

  2. Rebuild, don’t patch. The run-splitting algorithm does not try to modify a run in place — it rebuilds the entire paragraph’s run sequence. The audio pipeline does not try to modify int16 samples in place — it creates a new float32 array. Rebuilding is more memory-expensive but eliminates an entire class of partial-mutation bugs.

  3. Clean up unconditionally. Both engines use cleanup mechanisms (_element.remove() for XML, finally block for temp files) that execute regardless of success or failure. Leaked resources — orphaned XML elements, abandoned temporary files — are treated as bugs, not acceptable edge cases.

  4. Understand your library’s internal model. The run-splitting algorithm works because the developer understood that python-docx runs are backed by lxml elements, and that para.add_run() appends to the XML tree. The audio pipeline works because the developer understood that NumPy’s mean() upcasts to float64 and that SciPy’s resample uses FFT internally. Surface-level API knowledge is not enough for advanced work — you must understand the layer beneath.
