Splitting Atoms in a Word Document: Surgical Run Manipulation and Real-Time Audio Normalization

Two of the hardest engineering challenges in STORM DAT operate at entirely different layers of the stack, but share a common theme: transforming data in place without corrupting its structure.

Challenge 1 (Document Annotation): When the acronym sweep engine identifies a three-character acronym buried in the middle of a 200-character text run, it must highlight only those three characters. But Word documents don’t have a “highlight character range” API. The smallest unit of formatting is a run: a contiguous text segment with uniform properties. To highlight a subset of a run, you must split it into three new runs, apply the color to the middle segment, preserve the formatting on the outer segments, and remove the original run from the XML tree. One mistake corrupts the document.

Challenge 2 (Audio Normalization): When a user uploads a recording for AI transcription, the audio may arrive as stereo int16 at 44.1kHz, but Whisper demands mono float32 at 16kHz. The pipeline must convert the data type, merge channels, and resample the waveform using Fourier interpolation, all on arrays that can be hundreds of megabytes, while keeping peak memory in check and preserving the audio fidelity that transcription depends on.

What This Tutorial Demands: XML Trees, NumPy Internals, and Signal Processing Basics

Knowledge Base:

python-docx run model: how Paragraph.runs maps to underlying <w:r> XML elements
lxml element manipulation: _element, .remove(), and the difference between Python object references and XML tree nodes
NumPy dtype system: int16, float32, uint8 — how arrays store numbers and what happens during type casting
Signal processing fundamentals: sample rate, Nyquist theorem, and why resampling isn’t simple array slicing

Environment:

Python 3.12
python-docx 1.1.2
NumPy 2.2.5
SciPy 1.15.3
OpenAI Whisper (loaded as a singleton)

pip install python-docx==1.1.2 numpy==2.2.5 scipy==1.15.3 openai-whisper

Both engines perform the same conceptual operation — they’re jewelers resetting a gemstone. The document engine removes a run from its setting (the XML tree), carefully cuts it into pieces, resets each piece with its original properties, and places them back. The audio engine takes a raw recording, reshapes it to fit a specific mount (Whisper’s input format), and polishes it so the model can work with it. In both cases, the transformation must be lossless with respect to what matters: formatting for documents, fidelity for audio.

The Run-Splitting Algorithm

Why Word Documents Make This Hard

A .docx file is a ZIP archive containing XML. The main document body is word/document.xml. Each paragraph is a <w:p> element containing one or more <w:r> (run) elements:

<w:p>
  <w:r>
    <w:rPr><w:rFonts w:ascii="Arial"/><w:sz w:val="24"/></w:rPr>
    <w:t>The Department of Defense published the report.</w:t>
  </w:r>
</w:p>

To highlight only “Defense” (zero-based character positions 18-24), we can’t simply color that character range. We must split the single run into three:

"The Department of " | "Defense" | " published the report."

Each new run must carry the original’s formatting (Arial, 12pt), and only the middle one gets the highlight color. Then the original run must be removed from the XML tree.
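The slicing arithmetic can be checked in plain Python before any document is touched. This is a minimal sketch that uses ordinary strings only; the variable names are illustrative and do not come from STORM DAT.

```python
text = "The Department of Defense published the report."
target = "Defense"

# str.find returns the zero-based index of the first match, or -1 if absent.
position = text.find(target)
length = len(target)

# The same three slices the run splitter needs.
before = text[:position]
affected = text[position:position + length]
after = text[position + length:]

print(position, repr(before), repr(affected), repr(after))
```

Concatenating the three pieces reproduces the original string, which is the invariant the run splitter has to maintain.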

The Position-Tracking Walker

The highlight_text() method walks through all runs, maintaining a current_pos counter — a manual index into the paragraph’s concatenated text.

def highlight_text(self, para, position, length, color):
    current_pos = 0

    for run in para.runs:
        run_length = len(run.text)

        # Capture formatting BEFORE any mutation
        font = run.font.name
        size = run.font.size
        highlight = run.font.highlight_color

        # Store the XML element reference for later removal
        delete_element = run._element

run._element accesses python-docx’s internal lxml element — a private API. This is necessary because python-docx doesn’t expose a public method to remove a run from a paragraph. Using private APIs means this code could break in a future python-docx release. The tradeoff is accepted because there’s no public alternative.
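The position tracking in the walker above can be isolated and tested without a document. The sketch below is an illustration, not STORM DAT code: locate_run is a hypothetical helper, and the list of strings stands in for [run.text for run in para.runs].

```python
def locate_run(run_texts, position):
    """Return (run_index, offset) for a paragraph-level character position.

    Returns None when the position lies past the end of the paragraph.
    """
    current_pos = 0
    for index, text in enumerate(run_texts):
        run_length = len(text)
        # Half-open interval test: the run covers [current_pos, current_pos + run_length).
        if current_pos <= position < current_pos + run_length:
            return index, position - current_pos
        current_pos += run_length
    return None


runs = ["The Department of ", "Defense published", " the report."]
print(locate_run(runs, 18))
```

One limit is worth keeping in mind: a match that starts in one run and ends in the next needs its remainder carried into the following run. The excerpt shown in this tutorial slices within a single run only.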

The Split Decision

        if current_pos <= position < current_pos + run_length:
            offset = position - current_pos

            before_text = run.text[:offset]
            affected_text = run.text[offset:offset + length]
            after_text = run.text[offset + length:]

            if before_text:
                new_run = para.add_run(before_text)
                new_run.font.name = font
                new_run.font.size = size
                new_run.font.highlight_color = highlight

            if affected_text:
                highlighted_run = para.add_run(affected_text)
                highlighted_run.font.name = font
                highlighted_run.font.size = size
                highlighted_run.font.highlight_color = color  # The highlight

            if after_text:
                new_run = para.add_run(after_text)
                new_run.font.name = font
                new_run.font.size = size
                new_run.font.highlight_color = highlight

The Non-Matching Run Case

If the target doesn’t fall in this run, the run is still cloned and the original removed. The method rebuilds the entire paragraph’s run sequence — it can’t selectively modify one run and leave others untouched, because para.add_run() appends to the end of the run list. So every run gets cloned in order.

        else:
            new_run = para.add_run(run.text)
            new_run.font.name = font
            new_run.font.size = size
            new_run.font.highlight_color = highlight

        current_pos += run_length
        para._element.remove(delete_element)

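para._element.remove(delete_element) operates at the lxml etree level. lxml elements are nodes in a tree; .remove() detaches the node from its parent without destroying it, and Python’s garbage collector reclaims it once nothing references it. The new runs added via para.add_run() are appended as new <w:r> children of the <w:p> element. The net effect: the original run sequence is replaced by one with the same text in the same order, with the target characters isolated in their own run. As shown, only the three captured properties (font name, size, and highlight) carry over to the clones; other run-level formatting such as bold, italics, or a character style would have to be captured and reapplied the same way.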
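For comparison, here is a sketch of the same split done directly on the XML, using the standard library’s xml.etree.ElementTree instead of python-docx and lxml so that it runs on its own. It is not the STORM DAT implementation: split_run is a hypothetical helper, and it makes two different design choices. It deep-copies the whole run so that every property in <w:rPr> survives, and it inserts the replacements at the original run’s index rather than appending them.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)


def split_run(paragraph, run, offset, length, color="yellow"):
    """Replace `run` in place with up to three runs; the middle one is highlighted."""
    text = run.find(f"{{{W}}}t").text or ""
    parts = [
        (text[:offset], False),
        (text[offset:offset + length], True),
        (text[offset + length:], False),
    ]
    index = list(paragraph).index(run)
    new_runs = []
    for segment, highlighted in parts:
        if not segment:
            continue
        clone = copy.deepcopy(run)  # carries the full <w:rPr>, not just font and size
        t = clone.find(f"{{{W}}}t")
        t.text = segment
        t.set(XML_SPACE, "preserve")  # keep leading and trailing spaces
        if highlighted:
            rpr = clone.find(f"{{{W}}}rPr")
            if rpr is None:
                rpr = ET.Element(f"{{{W}}}rPr")
                clone.insert(0, rpr)
            # Simplification: appended last; the schema defines an order for rPr children.
            ET.SubElement(rpr, f"{{{W}}}highlight", {f"{{{W}}}val": color})
        new_runs.append(clone)
    # Insert the replacements where the original sits, then detach the original.
    for i, new_run in enumerate(new_runs):
        paragraph.insert(index + i, new_run)
    paragraph.remove(run)
    return new_runs


xml = (
    f'<w:p xmlns:w="{W}"><w:r><w:rPr><w:b/><w:sz w:val="24"/></w:rPr>'
    "<w:t>The Department of Defense published the report.</w:t></w:r></w:p>"
)
para = ET.fromstring(xml)
split_run(para, para[0], 18, 7)
print([r.find(f"{{{W}}}t").text for r in para])
```

Because the replacements go in at the original position, runs that do not contain the target never need to be touched. The cost is working entirely below the python-docx API.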

One risk worth knowing: the order of operations matters. If para._element.remove(delete_element) were called before the new runs were appended, an exception during add_run() would leave the paragraph with missing text. The current implementation always appends the new runs before removing the original, so a failure partway through leaves duplicated text (the original run plus some of its clones) rather than lost text. Neither state is valid, and the method has no rollback, so it still relies on para.add_run() and the font setters never raising mid-paragraph.
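To see why appending forces every run to be rebuilt, the walker can be simulated on a plain list, where append plays the role of para.add_run() and remove plays the role of para._element.remove(). This is an illustrative model only; rebuild is a hypothetical name, and strings stand in for runs.

```python
def rebuild(runs, position, length):
    """Simulate the walker: append clones to the end, then drop each original."""
    paragraph = list(runs)  # stands in for the children of <w:p>
    current_pos = 0
    for text in runs:  # iterate over a snapshot, as para.runs provides
        if current_pos <= position < current_pos + len(text):
            offset = position - current_pos
            pieces = [text[:offset], text[offset:offset + length], text[offset + length:]]
            paragraph.extend(piece for piece in pieces if piece)
        else:
            paragraph.append(text)  # untouched runs are still cloned to keep order
        current_pos += len(text)
        paragraph.remove(text)  # originals sit at the front, so this drops the original
    return paragraph


print(rebuild(["AB", "CDEFG", "HI"], 3, 2))
```

If the non-matching runs were left in place and only the split pieces were appended, the pieces would end up after the final run, which is why the method clones everything in order.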

The Audio Normalization Pipeline

The Format Problem

Whisper accepts exactly one input format: a 1D float32 NumPy array at 16,000 samples per second. Browser audio can arrive in any combination of:

Property	Possible Values
Channels	Mono (1D) or Stereo (2D)
Data type	`int16`, `int32`, `uint8`, `float32`, `float64`
Sample rate	8kHz, 22.05kHz, 44.1kHz, 48kHz, or others

The pipeline handles every combination.

Step 1: Stereo to Mono

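import numpy as np
from scipy.io import wavfile

sample_rate, data = wavfile.read(save_path)

# Stereo arrives as (frames, channels); average the channels into one
if data.ndim == 2:
    data = data.mean(axis=1)
    # Peak-normalize so the loudest sample sits at +/-1.0
    max_val = np.max(np.abs(data))
    if max_val > 0:
        data = data / max_val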

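data.mean(axis=1) averages the two channels of each frame. For a stereo array of shape (480000, 2), this produces a mono array of shape (480000,). Because the mean of an integer array is float64, the averaged samples are still on the integer scale (up to ±32768 for int16) but no longer match any integer branch in Step 2. That is why the code peak-normalizes here: dividing by the maximum absolute value scales the loudest sample to ±1.0 and keeps every sample inside the [-1, 1] float range, preventing clipping artifacts in the transcription.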
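The shape and dtype change is easy to observe on synthetic data. The following sketch uses random int16 samples rather than a real recording, so the values carry no meaning; only the shapes, dtypes, and byte counts matter.

```python
import numpy as np

# Two seconds of fake stereo int16 audio at 8 kHz: shape (frames, channels).
rng = np.random.default_rng(0)
stereo = rng.integers(-32768, 32767, size=(16000, 2), dtype=np.int16)

mono = stereo.mean(axis=1)
print(stereo.shape, stereo.dtype, stereo.nbytes)  # the input array
print(mono.shape, mono.dtype, mono.nbytes)        # upcast by mean()

# Peak normalization, as in the excerpt above.
peak = np.max(np.abs(mono))
if peak > 0:
    mono = mono / peak
print(np.max(np.abs(mono)))
```

The mono array has half as many values as the stereo input but occupies twice as many bytes, because each value grew from 2 bytes to 8.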

Step 2: Data Type Normalization

Each integer format maps to a different divisor:

if data.dtype == np.int16:
    data = data.astype(np.float32) / 32768.0

elif data.dtype == np.int32:
    data = data.astype(np.float32) / 2147483648.0

elif data.dtype == np.uint8:
    # uint8 uses 128 as "silence" rather than 0
    data = (data.astype(np.float32) - 128) / 128.0

elif data.dtype == np.float64:
    data = data.astype(np.float32)

elif data.dtype != np.float32:
    return jsonify({'error': f'Unsupported sample type: {data.dtype}'}), 400

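The division constants (32768, 2147483648, and 128) are 2^15, 2^31, and 2^7: half the range of each integer type. Dividing by them maps the integer range onto [-1.0, 1.0]. For int16 and uint8 the most negative value lands exactly on -1.0 and the most positive lands just below 1.0 (32767/32768 and 127/128 respectively), so no sample falls outside that interval.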
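The branches above can also be written as a lookup table, which makes the endpoint behavior easy to test. This is a sketch under assumptions: to_float32 and _INT_SCALES are hypothetical names, and raising ValueError replaces the Flask error response so the function can run outside a request.

```python
import numpy as np

# Divisor and zero offset per supported integer dtype (mirrors the branches above).
_INT_SCALES = {
    np.dtype(np.int16): (32768.0, 0.0),
    np.dtype(np.int32): (2147483648.0, 0.0),
    np.dtype(np.uint8): (128.0, 128.0),
}


def to_float32(samples):
    """Scale PCM samples to float32 within [-1, 1]; raise on unknown dtypes."""
    if samples.dtype in _INT_SCALES:
        divisor, offset = _INT_SCALES[samples.dtype]
        return (samples.astype(np.float32) - np.float32(offset)) / np.float32(divisor)
    if samples.dtype in (np.float32, np.float64):
        return samples.astype(np.float32, copy=False)
    raise ValueError(f"Unsupported sample type: {samples.dtype}")


print(to_float32(np.array([-32768, 0, 32767], dtype=np.int16)))
print(to_float32(np.array([0, 128, 255], dtype=np.uint8)))
```

Applying a conversion like this before averaging the channels would let stereo and mono input share one scaling path. That ordering is a design alternative, not what the excerpts in this tutorial do.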

Step 3: Fourier Resampling

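from scipy.signal import resample

target_sr = 16000
if sample_rate != target_sr:
    new_len = int(len(data) * target_sr / sample_rate)
    data = resample(data, new_len)
    sample_rate = target_sr
# Whisper expects float32; cast defensively in case an earlier step produced float64
data = data.astype(np.float32, copy=False)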

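scipy.signal.resample uses the Fourier method: it transforms the signal into the frequency domain via FFT, truncates or zero-pads the frequency components to match the target length, and transforms back via inverse FFT. When downsampling, truncation discards everything above the new Nyquist limit (8kHz for 16kHz sampling) and preserves the content below it, which produces better results than linear interpolation. Because the method is Fourier-based, it treats the signal as periodic, so a recording that starts and ends at very different levels can show ringing near its edges.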
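The truncate-and-invert idea can be sketched with NumPy’s FFT routines alone. This is not SciPy’s implementation: fft_resample is a hypothetical function that ignores the special handling of the Nyquist bin and supports only real 1-D input. It is here to make the mechanism concrete.

```python
import numpy as np


def fft_resample(x, num):
    """Resample a real 1-D signal to `num` samples by truncating or padding its spectrum."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    # Keep only the bins that exist at both lengths; any extra bins stay zero.
    target = np.zeros(num // 2 + 1, dtype=complex)
    keep = min(len(spectrum), len(target))
    target[:keep] = spectrum[:keep]
    # irfft normalizes by the new length, so rescale to preserve amplitude.
    return np.fft.irfft(target, n=num) * (num / n)


sr_in, sr_out = 48000, 16000
t_in = np.arange(sr_in) / sr_in      # one second at 48 kHz
t_out = np.arange(sr_out) / sr_out   # one second at 16 kHz
tone = np.sin(2 * np.pi * 440 * t_in)

resampled = fft_resample(tone, sr_out)
error = np.max(np.abs(resampled - np.sin(2 * np.pi * 440 * t_out)))
print(len(resampled), error)
```

The test tone completes a whole number of cycles in the window, so it satisfies the periodicity assumption and the resampled signal matches a tone generated directly at 16 kHz to within floating-point error. Real recordings will not be this clean.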

Step 4: Model Invocation

model = current_app.whisper_model
if model is None:
    return jsonify({"error": "Transcription service unavailable"}), 503

result = model.transcribe(audio=data, language='en', fp16=False)

return jsonify({
    "transcription": result["text"],
    "segments": result["segments"]
})

The finally block ensures the temporary upload file is deleted even if transcription fails:

finally:
    try:
        if os.path.exists(save_path):
            os.remove(save_path)
    except Exception as cleanup_error:
        current_app.logger.warning(f"Failed to cleanup: {cleanup_error}")
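The same cleanup guarantee can be shown outside Flask. The sketch below uses only the standard library; process_upload and its handler argument are hypothetical stand-ins for the upload route and the transcription step.

```python
import os
import tempfile


def process_upload(payload, handler):
    """Write payload to a temp file, run handler(path), and always delete the file."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        return handler(path)
    finally:
        # Runs whether handler returned normally or raised.
        if os.path.exists(path):
            os.remove(path)


print(process_upload(b"abc", os.path.getsize))
```

The finally clause runs before the exception propagates to the caller, so a failed transcription still leaves no file behind.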

The Memory Spike You Need to Plan For

For a 5-minute stereo recording at 48kHz/16-bit:

Step	Array Shape	Dtype	Memory
Raw read	(14,400,000, 2)	int16	~55 MB
After `.mean(axis=1)`	(14,400,000,)	float64	~110 MB
After `.astype(float32)`	(14,400,000,)	float32	~55 MB
After `resample()`	(4,800,000,)	float64	~37 MB

The spike at the mean step is the one that bites: NumPy’s mean() on an int16 array returns float64 by default (to avoid overflow during summation). A 55MB int16 array becomes a 110MB float64 array. For a 500MB media file, this spike reaches 1GB. Combined with the FFT buffers that resample() allocates internally, peak memory for the audio pipeline can reach 3-4x the input file size.
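The figures in the table follow from the element count times the item size. The sketch below reproduces that arithmetic and then shows one possible mitigation, passing dtype=np.float32 to mean(). That mitigation is an assumption for illustration; the text does not say STORM DAT uses it.

```python
import numpy as np

frames = 5 * 60 * 48_000  # 5 minutes at 48 kHz


def mib(n_items, dtype):
    """Size in MiB of an array holding n_items values of the given dtype."""
    return n_items * np.dtype(dtype).itemsize / 2**20


print(f"raw stereo int16 : {mib(frames * 2, np.int16):6.1f} MiB")
print(f"mono float64     : {mib(frames, np.float64):6.1f} MiB")
print(f"mono float32     : {mib(frames, np.float32):6.1f} MiB")

# Possible mitigation (an assumption, not STORM DAT's code): request float32 from mean().
stereo = np.zeros((1000, 2), dtype=np.int16)
mono32 = stereo.mean(axis=1, dtype=np.float32)
print(mono32.dtype)
```

Averaging two int16 values per frame stays well within float32 precision, so requesting float32 halves the intermediate array at little cost in accuracy.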

This is why Gunicorn is configured with only 4 workers: each worker may consume up to 2GB during audio processing, and the server needs headroom for the Whisper model itself (~1.5GB for the “medium” model loaded at startup). The --timeout 600 flag accommodates both model loading (30-60 seconds cold start) and long transcription jobs.

Gunicorn’s preload_app option (not currently used in STORM DAT) could share the Whisper model across workers via copy-on-write memory in forked processes, reducing total model memory from 6GB (4 copies) to roughly 1.5GB (1 shared copy). The tradeoff is incompatibility with some debugging and reload workflows.
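Gunicorn reads its settings from a Python file, so the configuration described above could be expressed as follows. This is a hypothetical gunicorn.conf.py: the worker count and timeout come from the text, while preload_app is the option the text says is not currently enabled.

```python
# gunicorn.conf.py (hypothetical file; values taken from the text above)
workers = 4          # bounded by per-worker memory during audio processing
timeout = 600        # seconds; covers model loading and long transcription jobs
preload_app = True   # load the application once in the master, before forking
```

With preloading, the model has to be loaded at import time in the master process for the workers to inherit it. How much memory stays shared after forking depends on how the workers touch it.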

What Both Engines Have in Common

The unifying skill is structural transformation with invariant preservation:

Capture state before mutation. Both algorithms snapshot the original properties before modifying anything — font name and size for runs, sample rate and dtype for audio. This “capture-transform-apply” pattern ensures no information is lost during the transformation.

Rebuild, don’t patch. The run-splitting algorithm doesn’t try to modify a run in place — it rebuilds the paragraph’s run sequence. The audio pipeline doesn’t try to modify int16 samples in place — it creates a new float32 array. Rebuilding is more memory-expensive but eliminates an entire class of partial-mutation bugs.

Clean up unconditionally. Both engines use cleanup mechanisms that execute regardless of success or failure. Leaked resources — orphaned XML elements, abandoned temporary files — are treated as bugs, not acceptable edge cases.

Understand your library’s internal model. The run-splitting algorithm works because I understood that python-docx runs are backed by lxml elements, and that para.add_run() appends to the XML tree. The audio pipeline works because I understood that NumPy’s mean() upcasts to float64 and that SciPy’s resample uses FFT internally. Surface-level API knowledge isn’t enough for this kind of work.
