Two Hard Problems That Share One Principle
Two of the hardest engineering challenges in STORM DAT operate at entirely different layers of the stack, but share a common theme: transforming data in place without corrupting its structure.
Challenge 1: Document Annotation. When the acronym sweep engine identifies a three-character acronym buried in the middle of a 200-character text run, it must highlight only those three characters. But Word documents don’t have a “highlight character range” API. The smallest unit of formatting is a run: a contiguous text segment with uniform properties. To highlight a subset of a run, you must split it into three new runs, apply the color to the middle segment, preserve all formatting on the outer segments, and remove the original run from the XML tree. One mistake corrupts the document.
Challenge 2: Audio Normalization. When a user uploads a recording for AI transcription, the audio may arrive as stereo int16 at 44.1kHz, but Whisper demands mono float32 at 16kHz. The pipeline must merge channels, convert the data type, and resample the waveform using Fourier interpolation. It does all of this on arrays that can be hundreds of megabytes, so peak memory has to stay within what a worker can afford, and audio fidelity must not suffer.
What This Tutorial Demands: XML Trees, NumPy Internals, and Signal Processing Basics
Knowledge Base:
- python-docx run model: how Paragraph.runs maps to underlying <w:r> XML elements
- lxml element manipulation: _element, .remove(), and the difference between Python object references and XML tree nodes
- NumPy dtype system: int16, float32, uint8; how arrays store numbers and what happens during type casting
- Signal processing fundamentals: sample rate, Nyquist theorem, and why resampling isn’t simple array slicing
Environment:
- Python 3.12
- python-docx 1.1.2
- NumPy 2.2.5
- SciPy 1.15.3
- OpenAI Whisper (loaded as a singleton)
pip install python-docx==1.1.2 numpy==2.2.5 scipy==1.15.3 openai-whisper
Both engines perform the same conceptual operation — they’re jewelers resetting a gemstone. The document engine removes a run from its setting (the XML tree), carefully cuts it into pieces, resets each piece with its original properties, and places them back. The audio engine takes a raw recording, reshapes it to fit a specific mount (Whisper’s input format), and polishes it so the model can work with it. In both cases, the transformation must be lossless with respect to what matters: formatting for documents, fidelity for audio.
The Run-Splitting Algorithm
Why Word Documents Make This Hard
A .docx file is a ZIP archive containing XML. The main document body is word/document.xml. Each paragraph is a <w:p> element containing one or more <w:r> (run) elements:
<w:p>
<w:r>
<w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="24"/></w:rPr>
<w:t>The Department of Defense published the report.</w:t>
</w:r>
</w:p>
To highlight only “Defense” (zero-based character offsets 18 through 24, i.e. the slice [18:25]), we can’t simply color those characters. We must split the single run into three:
"The Department of " | "Defense" | " published the report."
Each new run must carry the original’s formatting (Arial, 12pt; w:sz is measured in half-points, so a value of 24 means 12pt), and only the middle one gets the highlight color. Then the original run must be removed from the XML tree.
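The offsets are easy to get wrong by hand, so here is a minimal sketch of the slicing arithmetic using plain Python strings. It does not touch python-docx; the variable names are illustrative only.

```python
text = "The Department of Defense published the report."
target = "Defense"

start = text.find(target)      # zero-based offset of the first match
end = start + len(target)      # exclusive end offset

# The three pieces that will become three separate runs.
before, affected, after = text[:start], text[start:end], text[end:]

print(start, end)
print([before, affected, after])
```

Concatenating the three pieces must reproduce the original text exactly; that is the invariant the run-splitting code has to keep.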
The Position-Tracking Walker
The highlight_text() method walks through all runs, maintaining a current_pos counter — a manual index into the paragraph’s concatenated text.
def highlight_text(self, para, position, length, color):
    current_pos = 0
    for run in para.runs:
        run_length = len(run.text)
        # Capture formatting BEFORE any mutation
        font = run.font.name
        size = run.font.size
        highlight = run.font.highlight_color
        # Store the XML element reference for later removal
        delete_element = run._element
run._element accesses python-docx’s internal lxml element — a private API. This is necessary because python-docx doesn’t expose a public method to remove a run from a paragraph. Using private APIs means this code could break in a future python-docx release. The tradeoff is accepted because there’s no public alternative.
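The position-tracking idea can be shown in isolation. The sketch below works on a plain list of run texts rather than python-docx objects (an assumption made to keep it self-contained), and `locate` is a hypothetical helper, not part of the STORM DAT code.

```python
def locate(run_texts, position):
    """Map a paragraph-level character position to (run_index, offset_in_run)."""
    current_pos = 0  # offset at which the current run starts
    for index, text in enumerate(run_texts):
        run_length = len(text)
        if current_pos <= position < current_pos + run_length:
            return index, position - current_pos
        current_pos += run_length
    raise IndexError(f"position {position} is outside the paragraph")


runs = ["The ", "Department of Defense", " published the report."]
run_index, offset = locate(runs, 18)
print(run_index, offset, runs[run_index][offset:offset + 7])
```

Empty runs have zero length and are skipped naturally, because the half-open range check can never match them.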
The Split Decision
        if current_pos <= position < current_pos + run_length:
            offset = position - current_pos
            before_text = run.text[:offset]
            affected_text = run.text[offset:offset + length]
            after_text = run.text[offset + length:]
            if before_text:
                new_run = para.add_run(before_text)
                new_run.font.name = font
                new_run.font.size = size
                new_run.font.highlight_color = highlight
            if affected_text:
                highlighted_run = para.add_run(affected_text)
                highlighted_run.font.name = font
                highlighted_run.font.size = size
                highlighted_run.font.highlight_color = color  # The highlight
            if after_text:
                new_run = para.add_run(after_text)
                new_run.font.name = font
                new_run.font.size = size
                new_run.font.highlight_color = highlight
The Non-Matching Run Case
If the target doesn’t fall in this run, the run is still cloned and the original removed. The method rebuilds the entire paragraph’s run sequence — it can’t selectively modify one run and leave others untouched, because para.add_run() appends to the end of the run list. So every run gets cloned in order.
        else:
            new_run = para.add_run(run.text)
            new_run.font.name = font
            new_run.font.size = size
            new_run.font.highlight_color = highlight
        current_pos += run_length
        para._element.remove(delete_element)
para._element.remove(delete_element) operates at the lxml etree level. lxml elements are nodes in a tree; .remove() detaches the node from its parent without destroying it — Python’s garbage collector handles that. The new runs added via para.add_run() are appended as new <w:r> children of the <w:p> element. The net effect: the original XML is replaced with a semantically identical but structurally refined version.
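The detach-without-destroy behaviour can be observed without python-docx. This sketch uses the standard library’s xml.etree.ElementTree as a stand-in for lxml (an assumption made so the example runs anywhere); both expose Element.remove() with the same detach semantics.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

para = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:r><w:t>The Department of Defense</w:t></w:r></w:p>'
)
original = para.find(f"{{{W}}}r")

# Append the replacement runs first, mirroring the order used in highlight_text().
for piece in ("The Department of ", "Defense"):
    run = ET.SubElement(para, f"{{{W}}}r")
    ET.SubElement(run, f"{{{W}}}t").text = piece

para.remove(original)  # detaches the node from the tree

remaining = [t.text for t in para.iter(f"{{{W}}}t")]
detached_text = original.find(f"{{{W}}}t").text  # the Python object is still usable
print(remaining, detached_text)
```

The detached element stays alive for as long as something references it, which is why storing `delete_element` before mutating the paragraph is safe.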
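One risk worth knowing: the order of operations determines the failure mode. If para._element.remove(delete_element) ran before the new runs were appended, an exception during add_run() would leave the paragraph with missing text. The current implementation always appends the new runs before removing the original, so a failure partway through leaves the original run alongside a partial set of clones: duplicated text rather than lost text. Neither state is correct, and the method has no rollback. Note that lxml’s XMLSyntaxError is a parse-time error, so malformed XML would normally surface when the document is opened rather than inside add_run(); the more plausible failures here are exceptions raised while setting text or font properties, for example a value that python-docx or lxml rejects.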
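One way to narrow that window, sketched here as plain Python rather than python-docx calls, is to compute the whole split plan as data first and only then touch the tree. `plan_segments` is a hypothetical helper, not the STORM DAT implementation. Like the original method, it only handles a match that lies inside a single run.

```python
def plan_segments(run_texts, position, length):
    """Return [(text, is_highlighted), ...] without mutating anything."""
    segments = []
    current_pos = 0
    for text in run_texts:
        run_length = len(text)
        if current_pos <= position < current_pos + run_length:
            offset = position - current_pos
            pieces = [
                (text[:offset], False),
                (text[offset:offset + length], True),
                (text[offset + length:], False),
            ]
            segments.extend(p for p in pieces if p[0])  # drop empty pieces
        else:
            segments.append((text, False))
        current_pos += run_length
    # Invariant: the plan must reproduce the paragraph text exactly.
    if "".join(t for t, _ in segments) != "".join(run_texts):
        raise ValueError("split plan does not match the original text")
    return segments


plan = plan_segments(["The Department of Defense published the report."], 18, 7)
print(plan)
```

Because the plan is validated before any run is added or removed, a bad offset is caught while the document is still untouched.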
The Audio Normalization Pipeline
The Format Problem
Whisper’s transcribe() accepts either a file path or an in-memory waveform. When handed a NumPy array, as this pipeline does, it expects a 1D float32 array at 16,000 samples per second. Browser audio can arrive in any combination of:
| Property | Possible Values |
|---|---|
| Channels | Mono (1D) or Stereo (2D) |
| Data type | int16, int32, uint8, float32, float64 |
| Sample rate | 8kHz, 22.05kHz, 44.1kHz, 48kHz, or others |
The pipeline handles every combination.
Step 1: Stereo to Mono
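import numpy as np
from scipy.io import wavfile

sample_rate, data = wavfile.read(save_path)
if len(data.shape) == 2:
    # Average the channels; mean() returns float64 for integer input
    data = data.mean(axis=1)
    # Peak-normalize so the averaged signal fits within [-1, 1]
    max_val = np.max(np.abs(data))
    if max_val > 0:
        data = data / max_val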
data.mean(axis=1) computes the element-wise mean across the channel dimension. For a stereo array of shape (480000, 2), this produces a mono array of shape (480000,). The normalization step divides by the maximum absolute value to ensure no sample exceeds the [-1, 1] float range — preventing clipping artifacts in the transcription.
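A tiny array makes the shape and dtype changes visible. This is a sketch with made-up sample values, not audio from the application.

```python
import numpy as np

# Four stereo frames of int16 audio: column 0 is left, column 1 is right.
stereo = np.array(
    [[1000, 3000], [-2000, -4000], [0, 0], [32767, 32767]], dtype=np.int16
)

mono = stereo.mean(axis=1)       # average the two channels of each frame
peak = np.max(np.abs(mono))
normalized = mono / peak if peak > 0 else mono

print(stereo.shape, stereo.dtype)   # 2D int16 input
print(mono.shape, mono.dtype)       # 1D output, upcast by mean()
print(normalized)
```

Note that the result is float64 even though the input was int16, which matters for both the dtype handling in the next step and the memory budget later on.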
Step 2: Data Type Normalization
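Each integer format maps to a different divisor. Stereo input has already been averaged into float64 and peak-normalized by Step 1, so it takes the float64 branch here; the integer branches apply to data that still carries its original integer dtype: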
if data.dtype == np.int16:
    data = data.astype(np.float32) / 32768.0
elif data.dtype == np.int32:
    data = data.astype(np.float32) / 2147483648.0
elif data.dtype == np.uint8:
    # uint8 uses 128 as "silence" rather than 0
    data = (data.astype(np.float32) - 128) / 128.0
elif data.dtype == np.float64:
    data = data.astype(np.float32)
elif data.dtype != np.float32:
    return jsonify({'error': f'Unsupported sample type: {data.dtype}'}), 400
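The division constants (32768, 2147483648, and 128) are 2^15, 2^31, and 2^7: half the range of each integer type. Dividing by these values maps the full integer range into [-1.0, 1.0): the most negative sample lands exactly on -1.0, while the most positive lands just below 1.0 (32767 / 32768 ≈ 0.99997 for int16).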
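The endpoints of each mapping can be checked directly. The helper below is a hypothetical standalone version of the branch logic (it raises instead of returning a Flask response), intended only as a sketch.

```python
import numpy as np

# Divisors for the signed formats: half of each type's full range.
DIVISORS = {np.dtype(np.int16): 32768.0, np.dtype(np.int32): 2147483648.0}


def to_float32(data):
    """Scale PCM samples to float32 using the divisors described above."""
    if data.dtype == np.uint8:
        # uint8 is offset-binary: 128 is silence.
        return (data.astype(np.float32) - 128.0) / 128.0
    if data.dtype in DIVISORS:
        return data.astype(np.float32) / DIVISORS[data.dtype]
    if data.dtype in (np.float32, np.float64):
        return data.astype(np.float32)
    raise ValueError(f"Unsupported sample type: {data.dtype}")


int16_ends = to_float32(np.array([-32768, 0, 32767], dtype=np.int16))
uint8_ends = to_float32(np.array([0, 128, 255], dtype=np.uint8))
print(int16_ends, uint8_ends)
```

In both cases the lower bound is exactly -1.0 and the upper bound falls one quantization step short of 1.0.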
Step 3: Fourier Resampling
from scipy.signal import resample

target_sr = 16000
if sample_rate != target_sr:
    new_len = int(len(data) * target_sr / sample_rate)
    data = resample(data, new_len)
    sample_rate = target_sr
scipy.signal.resample uses the Fourier method: it transforms the signal into the frequency domain via FFT, truncates or zero-pads the frequency components to match the target length, and transforms back via inverse FFT. When downsampling, content above the Nyquist limit of the target rate (8kHz for 16kHz sampling) is discarded and everything below it is preserved, which avoids the aliasing that naive decimation or linear interpolation would introduce. Because the method treats the signal as periodic, the first and last few samples can show small edge artifacts.
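To make the truncate-and-invert idea concrete, here is a simplified sketch built on numpy.fft. It is not SciPy’s implementation (SciPy also handles the Nyquist bin and optional windowing); `fourier_resample` is a hypothetical name, and the test tone is chosen to be exactly periodic over the buffer so the result can be compared with an analytically sampled sine.

```python
import numpy as np


def fourier_resample(x, new_len):
    """Resample a real 1D signal by truncating or zero-padding its spectrum."""
    old_len = len(x)
    spectrum = np.fft.rfft(x)
    new_bins = new_len // 2 + 1
    resized = np.zeros(new_bins, dtype=spectrum.dtype)
    keep = min(new_bins, len(spectrum))
    resized[:keep] = spectrum[:keep]
    # irfft divides by new_len, so rescale to keep the original amplitude.
    return np.fft.irfft(resized, n=new_len) * (new_len / old_len)


source_sr, target_sr = 48_000, 16_000
t = np.arange(4_800) / source_sr              # 0.1 s of audio
tone = np.sin(2 * np.pi * 440 * t)            # 440 Hz, well below 8 kHz

new_len = int(len(tone) * target_sr / source_sr)
resampled = fourier_resample(tone, new_len)
expected = np.sin(2 * np.pi * 440 * np.arange(new_len) / target_sr)
print(new_len, float(np.max(np.abs(resampled - expected))))
```

For real recordings the signal is not periodic over the buffer, so the match will not be this clean near the edges.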
Step 4: Model Invocation
model = current_app.whisper_model
if model is None:
    return jsonify({"error": "Transcription service unavailable"}), 503
result = model.transcribe(audio=data, language='en', fp16=False)
return jsonify({
    "transcription": result["text"],
    "segments": result["segments"]
})
The finally block ensures the temporary upload file is deleted even if transcription fails:
finally:
    try:
        if os.path.exists(save_path):
            os.remove(save_path)
    except Exception as cleanup_error:
        current_app.logger.warning(f"Failed to cleanup: {cleanup_error}")
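The same cleanup guarantee can be demonstrated outside Flask. In this sketch, `process_upload` and `failing_handler` are hypothetical names, and the handler deliberately raises to show that the temporary file is still removed.

```python
import os
import tempfile


def process_upload(payload, handler):
    """Write payload to a temp file, run handler on it, and always delete the file."""
    fd, save_path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        return handler(save_path)
    finally:
        # Runs on success and on failure alike.
        if os.path.exists(save_path):
            os.remove(save_path)


seen = []


def failing_handler(path):
    seen.append(path)
    raise RuntimeError("transcription failed")


try:
    process_upload(b"RIFF", failing_handler)
except RuntimeError:
    pass

file_still_exists = os.path.exists(seen[0])
print(file_still_exists)
```

The exception still propagates to the caller; the finally block only guarantees that the file does not outlive the request.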
The Memory Spike You Need to Plan For
For a 5-minute stereo recording at 48kHz/16-bit:
| Step | Array Shape | Dtype | Memory |
|---|---|---|---|
| Raw read | (14,400,000, 2) | int16 | ~55 MB |
| After .mean(axis=1) | (14,400,000,) | float64 | ~110 MB |
| After .astype(float32) | (14,400,000,) | float32 | ~55 MB |
| After resample() | (4,800,000,) | float64 | ~37 MB |
The spike at the mean step is the one that bites: NumPy’s mean() on an int16 array returns float64 by default (to avoid overflow during summation). A 55MB int16 array becomes a 110MB float64 array. For a 500MB media file, this spike reaches 1GB. Combined with the FFT buffers that resample() allocates internally, peak memory for the audio pipeline can reach 3-4x the input file size.
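The figures in the table follow from frame count times bytes per sample. A quick back-of-the-envelope check, assuming “MB” in the table means 2^20 bytes:

```python
MIB = 1024 ** 2
seconds, sample_rate, channels = 5 * 60, 48_000, 2
frames = seconds * sample_rate  # number of sample frames in the recording

stages = {
    "raw int16 stereo": frames * channels * 2,
    "mono float64 (after mean)": frames * 8,
    "mono float32 (after astype)": frames * 4,
    "resampled float64 at 16 kHz": (frames * 16_000 // sample_rate) * 8,
}

for name, nbytes in stages.items():
    print(f"{name:30s} {nbytes / MIB:6.1f} MiB")
```

These are sizes of the individual arrays. Peak usage is higher, because the old and new arrays coexist until the old one is released.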
This is why Gunicorn is configured with only 4 workers: each worker may consume up to 2GB during audio processing, and the server needs headroom for the Whisper model itself (~1.5GB for the “medium” model loaded at startup). The --timeout 600 flag accommodates both model loading (30-60 seconds cold start) and long transcription jobs.
Gunicorn’s preload_app setting (not currently used in STORM DAT) could share the model across workers via copy-on-write memory in forked processes, potentially reducing total model memory from about 6GB (4 copies) to about 1.5GB (1 shared copy). The tradeoff is incompatibility with some debugging and reload workflows.
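For reference, the settings discussed above could be expressed in a Gunicorn configuration file. This is a hypothetical gunicorn.conf.py, not the project’s actual configuration; the worker count and timeout are taken from the text, and preload_app is shown only to illustrate the option that is not currently enabled.

```python
# gunicorn.conf.py (hypothetical file; values taken from the description above)
workers = 4          # worker count described in the text
timeout = 600        # seconds; equivalent to --timeout 600
preload_app = True   # import the app (and load the model) once, before forking
```

How much memory ends up shared in practice depends on how often the forked workers write to the pages that hold the model, so the saving is worth measuring rather than assuming.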
What Both Engines Have in Common
The unifying skill is structural transformation with invariant preservation:
Capture state before mutation. Both algorithms snapshot the original properties before modifying anything — font name and size for runs, sample rate and dtype for audio. This “capture-transform-apply” pattern ensures no information is lost during the transformation.
Rebuild, don’t patch. The run-splitting algorithm doesn’t try to modify a run in place — it rebuilds the paragraph’s run sequence. The audio pipeline doesn’t try to modify int16 samples in place — it creates a new float32 array. Rebuilding is more memory-expensive but eliminates an entire class of partial-mutation bugs.
Clean up unconditionally. Both engines use cleanup mechanisms that execute regardless of success or failure. Leaked resources — orphaned XML elements, abandoned temporary files — are treated as bugs, not acceptable edge cases.
Understand your library’s internal model. The run-splitting algorithm works because I understood that python-docx runs are backed by lxml elements, and that para.add_run() appends to the XML tree. The audio pipeline works because I understood that NumPy’s mean() upcasts to float64 and that SciPy’s resample uses FFT internally. Surface-level API knowledge isn’t enough for this kind of work.