On this page
- The Problem
- The Solution
- What This Tutorial Demands: XML Trees, NumPy Internals, and Signal Processing Basics
- Knowledge Base
- Environment
- Two Engines, One Principle: Transforming Data Without Destroying Structure
- Part A: The Run-Splitting Algorithm — Rewriting a Document’s XML Skeleton
- Understanding Why This Is Hard
- The Position-Tracking Walker
- The Split Decision
- The Non-Matching Run Case
- Part B: The Audio Normalization Pipeline — From Raw WAV to Whisper Tensor
- The Format Gauntlet
- Step 1 — Stereo to Mono Conversion
- Step 2 — Data Type Normalization
- Step 3 — Fourier Resampling
- Step 4 — Model Invocation and Cleanup
- Memory Profiles, Allocation Chains, and Why Resampling Is Expensive
- Document Engine — Memory Characteristics
- Audio Engine — The Allocation Chain
- Big-O Summary
- What Breaks: XML Corruption, NaN Propagation, and the Singleton Trap
- Document Engine — The Dangling Element Risk
- Audio Engine — NaN Propagation
- Audio Engine — The Division-by-Zero Guard
- The Singleton Model — Cold Start and Memory Pinning
- You Now Know How to Mutate Complex Data Structures In Place Without Corruption
The Problem
Two of the hardest engineering challenges in STORM DAT operate at entirely different layers of the stack, but share a common theme: transforming data in place without corrupting its structure.
Challenge 1 — Document Annotation: When the acronym sweep engine identifies a three-character acronym buried in the middle of a 200-character text run, it must highlight only those three characters. But Word documents do not have a “highlight character range” API. The smallest unit of formatting is a run — a contiguous text segment with uniform properties. To highlight a subset of a run, you must split it into three new runs, apply the color to the middle segment, preserve all formatting on the outer segments, and remove the original run from the XML tree. One mistake corrupts the document.
Challenge 2 — Audio Normalization: When a user uploads a recording for AI transcription, the audio may arrive as stereo int16 at 44.1kHz, but the Whisper model demands mono float32 at 16kHz. The pipeline must convert the data type, merge channels, resample the waveform using Fourier interpolation, and do all of this on arrays that can be hundreds of megabytes — without doubling memory consumption or losing audio fidelity.
The Solution
We will dissect two algorithms: the highlight_text() method that performs surgical run splitting on python-docx’s XML tree, and the audio normalization pipeline in routes.py that transforms raw WAV data into Whisper-ready tensors. Together, they demonstrate advanced patterns in in-place data structure mutation, memory-conscious array processing, and the tradeoffs of working with library internals.
What This Tutorial Demands: XML Trees, NumPy Internals, and Signal Processing Basics
Knowledge Base
- python-docx run model: how `Paragraph.runs` maps to underlying `<w:r>` XML elements
- lxml element manipulation: `_element`, `.remove()`, and the difference between Python object references and XML tree nodes
- NumPy dtype system: `int16`, `float32`, `uint8` — how arrays store numbers and what happens during type casting
- Signal processing fundamentals: sample rate, Nyquist theorem, and why resampling is not simple array slicing
Environment
- Python 3.12
- python-docx 1.1.2 — Word document manipulation (built on lxml)
- NumPy 2.2.5 — array operations
- SciPy 1.15.3 — `scipy.signal.resample` for Fourier-based resampling
- OpenAI Whisper — transcription model (loaded as a singleton)
pip install python-docx==1.1.2 numpy==2.2.5 scipy==1.15.3 openai-whisper
Two Engines, One Principle: Transforming Data Without Destroying Structure
flowchart TD
subgraph Document_Engine["Document Annotation Engine"]
A1[Identify target character range] --> A2[Walk runs, track position counter]
A2 --> A3{Target within this run?}
A3 -->|Yes| A4[Split into before / target / after]
A4 --> A5[Create 3 new runs with preserved formatting]
A5 --> A6[Remove original XML element]
A3 -->|No| A7[Clone run as-is, remove original]
A7 --> A2
end
subgraph Audio_Engine["Audio Normalization Engine"]
B1[Read WAV file] --> B2{Stereo?}
B2 -->|Yes| B3[Mean channels → mono]
B2 -->|No| B4[Skip]
B3 --> B5[Normalize dtype → float32]
B4 --> B5
B5 --> B6{Sample rate = 16kHz?}
B6 -->|No| B7[Fourier resample to 16kHz]
B6 -->|Yes| B8[Skip]
B7 --> B9[Pass to Whisper]
B8 --> B9
end
style A4 fill:#ffcccc,stroke:#333
style B7 fill:#ffcccc,stroke:#333
Analogy: Both engines perform the same conceptual operation — they are jewelers resetting a gemstone. The document engine removes a run from its setting (the XML tree), carefully cuts it into pieces, resets each piece with its original properties, and places them back. The audio engine takes a raw recording, reshapes it to fit a specific mount (Whisper’s input format), and polishes it (normalization) so the model can work with it. In both cases, the transformation must be lossless with respect to what matters: formatting for documents, fidelity for audio.
Part A: The Run-Splitting Algorithm — Rewriting a Document’s XML Skeleton
Understanding Why This Is Hard
A Word .docx file is a ZIP archive containing XML files. The main document body is word/document.xml. Each paragraph is a <w:p> element containing one or more <w:r> (run) elements:
<w:p>
<w:r>
<w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="24"/></w:rPr>
<w:t>The Department of Defense published the report.</w:t>
</w:r>
</w:p>
If we need to highlight only “Defense” (zero-based positions 18-24), we cannot simply “color characters 18-24.” We must split the single run into three:
"The Department of " | "Defense" | " published the report."
Each new run must carry the original’s formatting (Arial, 12pt), and only the middle run gets the highlight color. The original run must then be removed from the XML tree.
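The slice arithmetic can be checked in plain Python with zero-based positions:

```python
text = "The Department of Defense published the report."
position, length = text.index("Defense"), len("Defense")  # 18, 7

before = text[:position]                   # "The Department of "
target = text[position:position + length]  # "Defense"
after = text[position + length:]           # " published the report."

print(position, target)  # 18 Defense
# Lossless-split invariant: the three segments reassemble the original
assert before + target + after == text
```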
The Position-Tracking Walker
The highlight_text() method walks through all runs, maintaining a current_pos counter — a manual index into the paragraph’s concatenated text.
def highlight_text(self, para, position, length, color):
"""Find the correct run, split it if needed, and highlight."""
current_pos = 0
for run in para.runs:
run_length = len(run.text)
# Capture formatting BEFORE any mutation
font = run.font.name
size = run.font.size
highlight = run.font.highlight_color
# Store the XML element reference for later removal
delete_element = run._element
🔴 Danger: `run._element` accesses python-docx’s internal lxml element — a private API. This is necessary because python-docx does not expose a public method to remove a run from a paragraph. Using private APIs means this code could break in a future python-docx release. The tradeoff is accepted because there is no public alternative.
The Split Decision
The critical branch: does the target range fall within this run?
if current_pos <= position < current_pos + run_length:
# Target IS within this run — split it
offset = position - current_pos
# Carve the text into three segments
before_text = run.text[:offset]
affected_text = run.text[offset:offset + length]
after_text = run.text[offset + length:]
Three new runs are then created, each inheriting the original’s formatting:
# Segment 1: text before the highlight (original formatting)
if before_text:
new_run = para.add_run(before_text)
new_run.font.name = font
new_run.font.size = size
new_run.font.highlight_color = highlight
# Segment 2: the highlighted text (receives the finding color)
if affected_text:
highlighted_run = para.add_run(affected_text)
highlighted_run.font.name = font
highlighted_run.font.size = size
highlighted_run.font.highlight_color = color # THE highlight
# Segment 3: text after the highlight (original formatting)
if after_text:
new_run = para.add_run(after_text)
new_run.font.name = font
new_run.font.size = size
new_run.font.highlight_color = highlight
The Non-Matching Run Case
If the target does not fall in this run, the run is still cloned and the original removed. This is because the method rebuilds the entire paragraph’s run sequence — it cannot selectively modify one run and leave others untouched, since para.add_run() appends to the end of the run list.
else:
# Target is NOT in this run — clone it verbatim
new_run = para.add_run(run.text)
new_run.font.name = font
new_run.font.size = size
new_run.font.highlight_color = highlight
# Advance the position counter
current_pos += run_length
# Remove the original XML element from the paragraph
para._element.remove(delete_element)
🔵 Deep Dive: `para._element.remove(delete_element)` operates on the lxml etree level. lxml elements are nodes in a tree structure; `.remove()` detaches the node from its parent without destroying it (Python’s garbage collector will handle that). The new runs added via `para.add_run()` are appended as new `<w:r>` children of the `<w:p>` element. The net effect: the original XML is replaced with a semantically identical but structurally refined version.
Part B: The Audio Normalization Pipeline — From Raw WAV to Whisper Tensor
The Format Gauntlet
OpenAI’s Whisper model accepts exactly one input format: a 1D float32 NumPy array at 16,000 samples per second. But audio captured by a browser’s MediaRecorder API can arrive in any combination of:
| Property | Possible Values |
|---|---|
| Channels | Mono (1D array) or Stereo (2D array) |
| Data type | int16, int32, uint8, float32, float64 |
| Sample rate | 8kHz, 22.05kHz, 44.1kHz, 48kHz, or others |
The pipeline must handle every combination.
Step 1 — Stereo to Mono Conversion
# Excerpt from routes.py — assumes: from scipy.io import wavfile; import numpy as np
sample_rate, data = wavfile.read(save_path)
# Check dimensionality: stereo audio has shape (samples, 2)
if len(data.shape) == 2:
# Average both channels into one
data = data.mean(axis=1)
# Normalize to [-1, 1] range to prevent clipping
max_val = np.max(np.abs(data))
if max_val > 0:
data = data / max_val
data.mean(axis=1) computes the element-wise mean across the channel dimension. For a stereo array of shape (480000, 2), this produces a mono array of shape (480000,). The normalization step divides by the maximum absolute value to ensure no sample exceeds the [-1, 1] float range — preventing clipping artifacts in the transcription.
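A miniature version of this step, using a three-sample stereo buffer, shows both the shape change and the silent float64 upcast:

```python
import numpy as np

# Tiny stand-in for a stereo recording: shape (samples, 2), int16 as from wavfile.read
stereo = np.array([[1000, 3000],
                   [-2000, 2000],
                   [0, 0]], dtype=np.int16)

mono = stereo.mean(axis=1)    # shape (3,); NumPy upcasts int16 -> float64 here
max_val = np.max(np.abs(mono))
if max_val > 0:
    mono = mono / max_val     # peak-normalize into [-1, 1]

print(mono.shape, mono.dtype)  # (3,) float64
print(mono)                    # [1. 0. 0.]
```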
Step 2 — Data Type Normalization
Each integer format has a different range that must be mapped to float32 in the [-1, 1] range:
if data.dtype == np.int16:
# int16 range: [-32768, 32767]
data = data.astype(np.float32) / 32768.0
elif data.dtype == np.int32:
# int32 range: [-2147483648, 2147483647]
data = data.astype(np.float32) / 2147483648.0
elif data.dtype == np.uint8:
# uint8 range: [0, 255] — center at 128, then normalize
data = (data.astype(np.float32) - 128) / 128.0
elif data.dtype == np.float64:
# Already float, just downcast for Whisper compatibility
data = data.astype(np.float32)
elif data.dtype != np.float32:
return jsonify({'error': f'Unsupported sample type: {data.dtype}'}), 400
🔵 Deep Dive: The division constants (32768, 2147483648, 128) are not arbitrary — they are 2^15, 2^31, and 2^7: half the range of each integer type. Dividing by these values maps each integer range into [-1.0, 1.0): the negative extreme lands exactly on -1.0, while the positive extreme falls just short of 1.0 (e.g. 32767/32768). The `uint8` case is asymmetric (0–255, center at 128) because unsigned audio uses 128 as the “silence” value rather than 0.
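A quick check confirms where the extremes land (note the positive int16 extreme maps just under 1.0):

```python
import numpy as np

# int16 extremes: dividing by 2**15 maps [-32768, 32767] into [-1.0, 1.0)
extremes = np.array([-32768, 0, 32767], dtype=np.int16)
scaled = extremes.astype(np.float32) / 32768.0
print(scaled)  # approximately [-1.  0.  0.99997]

# uint8 is unsigned with silence at 128, so re-center before scaling by 2**7
u8 = np.array([0, 128, 255], dtype=np.uint8)
scaled_u8 = (u8.astype(np.float32) - 128) / 128.0
print(scaled_u8)  # [-1.  0.  0.9921875]
```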
Step 3 — Fourier Resampling
If the sample rate is not 16kHz, the array must be resampled:
target_sr = 16000
if sample_rate != target_sr:
# Calculate the new array length proportionally
new_len = int(len(data) * target_sr / sample_rate)
# Fourier-based resampling
data = resample(data, new_len)
sample_rate = target_sr
scipy.signal.resample uses the Fourier method: it transforms the signal into the frequency domain via FFT, truncates or zero-pads the frequency components to match the target length, and then transforms back via inverse FFT. This preserves all frequency content below the Nyquist limit of the target sample rate (8kHz for 16kHz sampling), producing higher-quality results than simple linear interpolation.
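The mechanics can be sketched in plain NumPy. This is a conceptual illustration of the Fourier method, not SciPy's actual implementation, which handles spectrum edge bins and optional windowing more carefully:

```python
import numpy as np

def fourier_resample(x, new_len):
    """Sketch of Fourier resampling: FFT, truncate/pad the spectrum, inverse FFT."""
    spectrum = np.fft.rfft(x)              # frequency-domain representation
    target_bins = new_len // 2 + 1         # bins a real signal of new_len needs
    if target_bins <= len(spectrum):
        spectrum = spectrum[:target_bins]  # drop content above the new Nyquist
    else:
        spectrum = np.pad(spectrum, (0, target_bins - len(spectrum)))
    # Inverse FFT at the new length; rescale amplitude for the length change
    return np.fft.irfft(spectrum, n=new_len) * (new_len / len(x))

# A single sine cycle at 48 samples, resampled down to 16 samples:
x = np.sin(2 * np.pi * np.arange(48) / 48)
y = fourier_resample(x, 16)
print(len(y), np.allclose(y, np.sin(2 * np.pi * np.arange(16) / 16)))  # 16 True
```

Because the sine's frequency is well below the new Nyquist limit, the downsampled signal matches the analytically expected values almost exactly — the property the prose above describes.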
Step 4 — Model Invocation and Cleanup
# Singleton model loaded once at app startup
model = current_app.whisper_model
if model is None:
return jsonify({"error": "Transcription service unavailable"}), 503
# Transcribe — fp16=False because not all hardware supports half-precision
result = model.transcribe(audio=data, language='en', fp16=False)
return jsonify({
"transcription": result["text"],
"segments": result["segments"]
})
The finally block ensures the temporary upload file is deleted even if transcription fails:
finally:
try:
if os.path.exists(save_path):
os.remove(save_path)
except Exception as cleanup_error:
current_app.logger.warning(f"Failed to cleanup: {cleanup_error}")
Memory Profiles, Allocation Chains, and Why Resampling Is Expensive
Document Engine — Memory Characteristics
The run-splitting algorithm’s memory behavior is proportional to the number of runs in a paragraph, not the document size. Each para.add_run() call creates a new python-docx Run object (~200 bytes Python overhead) and a new lxml <w:r> element (~400 bytes). For a paragraph with 5 runs and 3 highlights, the peak is approximately 8 run objects * 600 bytes = ~5KB. Document-level memory is dominated by the python-docx Document object itself (typically 5-50MB for large documents), not by the annotation process.
Audio Engine — The Allocation Chain
For a 5-minute stereo recording at 48kHz/16-bit:
| Step | Array Shape | Dtype | Memory |
|---|---|---|---|
| Raw read | (14,400,000, 2) | int16 | 55 MB |
| After .mean(axis=1) | (14,400,000,) | float64 | 110 MB (NumPy upcasts) |
| After /max_val | (14,400,000,) | float64 | 110 MB (new array) |
| After .astype(float32) | (14,400,000,) | float32 | 55 MB (new array) |
| After resample() | (4,800,000,) | float64 | 37 MB (SciPy returns float64) |
🔴 Danger: Notice the peak at the `mean` step: NumPy’s `mean()` on an int16 array returns float64 by default (to avoid overflow during summation). This means a 55MB int16 array becomes a 110MB float64 array — a 2x memory spike. For a 500MB media file (the configured maximum), this spike reaches 1GB. Combined with the FFT buffers that `resample()` allocates internally (which also operate in float64), peak memory for the audio pipeline can reach 3–4x the input file size.
This is why Gunicorn is configured with only 4 workers: each worker may consume up to 2GB during audio processing, and the server needs headroom for the Whisper model itself (~1.5GB for the “medium” model loaded at startup).
Big-O Summary
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Run splitting (per paragraph) | O(R) where R = runs | O(R) |
| Stereo → mono | O(N) where N = samples | O(N) — new array |
| dtype cast | O(N) | O(N) — new array |
| Fourier resample | O(N log N) — FFT | O(N) — frequency buffer |
| Whisper transcription | O(N) — sequential model inference | O(model) — ~1.5GB |
The most expensive operation is resample at O(N log N) due to the FFT. For a 5-minute recording at 48kHz (14.4 million samples), the forward FFT alone performs roughly N log2 N ≈ 14.4 million × 24 ≈ 350 million operations — still under a second on modern CPUs, but it is the bottleneck in the pipeline.
What Breaks: XML Corruption, NaN Propagation, and the Singleton Trap
Document Engine — The Dangling Element Risk
If para._element.remove(delete_element) is called before the new runs are appended, the paragraph temporarily has fewer runs than expected. If an exception occurs between the remove and the append, the document is left in a corrupted state with missing text. The current implementation mitigates this by always appending new runs before removing the original — but this relies on para.add_run() never throwing. If the document’s XML tree is malformed (e.g., missing namespace declarations), lxml could raise an XMLSyntaxError during add_run(), leaving both the original and partial new runs in the paragraph.
Audio Engine — NaN Propagation
If the audio data contains NaN values (possible with corrupted WAV files), np.max(np.abs(data)) returns NaN, and data / NaN produces an array of all NaNs. Whisper would then receive silence and return an empty transcription — a silent failure with no error message. A production-hardened version would add a np.isnan(data).any() check before normalization.
Audio Engine — The Division-by-Zero Guard
The code checks if max_val > 0 before dividing. This guard catches the edge case of a completely silent recording (all zeros). Without it, the pipeline would divide by zero and produce inf or NaN values in the array.
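Combining both guards, a hardened normalizer might look like the sketch below (`safe_normalize` is a hypothetical helper, not a function in routes.py):

```python
import numpy as np

def safe_normalize(data):
    """Hypothetical hardened normalizer: rejects NaN input, skips the silent case."""
    if np.isnan(data).any():
        # Fail loudly instead of letting NaN propagate into a silent empty transcript
        raise ValueError("audio contains NaN samples (corrupted WAV?)")
    max_val = np.max(np.abs(data))
    if max_val > 0:   # silent recording: leave the zeros untouched
        data = data / max_val
    return data

print(safe_normalize(np.array([0.0, 0.5, -0.25])))  # scales to 0.0, 1.0, -0.5
print(safe_normalize(np.zeros(4)))                  # all-zero input passes through
```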
The Singleton Model — Cold Start and Memory Pinning
Whisper is loaded once at startup via app.whisper_model = whisper.load_model("medium"). This means:
- Cold start penalty: The first application boot takes 30-60 seconds to load the model. In a Kubernetes environment with health checks, the readiness probe must account for this delay.
- Memory pinning: The ~1.5GB model lives in each Gunicorn worker’s memory for the entire process lifetime. With 4 workers, that is 6GB dedicated to models alone. This is why the Dockerfile uses `gunicorn` with `--timeout 600` — the extended timeout accommodates both model loading and long transcription jobs.
- No hot-reload: If the model needs updating, the entire application must restart. There is no mechanism to swap models at runtime.
🔵 Deep Dive: Gunicorn’s `preload_app` option (not currently used in STORM DAT) can share the model across workers via copy-on-write memory in forked processes. This would reduce total memory from 6GB (4 copies) to ~1.5GB (1 shared copy) — a significant optimization for memory-constrained deployments. The tradeoff is that `preload_app` is incompatible with some debugging and reload workflows.
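On the command line the equivalent flag is `--preload`; a hypothetical launch command (STORM DAT's actual Gunicorn invocation and app factory path may differ):

```shell
# --preload loads the app (and the Whisper model) once in the master process
# before forking, so the 4 workers share the model pages copy-on-write.
gunicorn --workers 4 --timeout 600 --preload "app:create_app()"
```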
You Now Know How to Mutate Complex Data Structures In Place Without Corruption
The unifying skill across both engines is structural transformation with invariant preservation:
- Capture state before mutation. Both algorithms snapshot the original properties (font name/size for runs, sample rate/dtype for audio) before modifying anything. This “capture-transform-apply” pattern ensures no information is lost during the transformation.
- Rebuild, don’t patch. The run-splitting algorithm does not try to modify a run in place — it rebuilds the entire paragraph’s run sequence. The audio pipeline does not try to modify int16 samples in place — it creates a new float32 array. Rebuilding is more memory-expensive but eliminates an entire class of partial-mutation bugs.
- Clean up unconditionally. Both engines use cleanup mechanisms (`_element.remove()` for XML, the `finally` block for temp files) that execute regardless of success or failure. Leaked resources — orphaned XML elements, abandoned temporary files — are treated as bugs, not acceptable edge cases.
- Understand your library’s internal model. The run-splitting algorithm works because the developer understood that python-docx runs are backed by lxml elements, and that `para.add_run()` appends to the XML tree. The audio pipeline works because the developer understood that NumPy’s `mean()` upcasts to float64 and that SciPy’s `resample` uses FFT internally. Surface-level API knowledge is not enough for advanced work — you must understand the layer beneath.