STORM DAT: Automating Government Document Compliance So Analysts Can Focus on What Matters

When a Single Misplaced Marking Can Halt a Deliverable

Anyone who’s worked inside a government or defense environment knows the weight a document carries. A technical report destined for a program office isn’t just prose — it’s a controlled artifact. Security markings must appear in precise locations. Every acronym must be defined before its first abbreviated use, never duplicated, never orphaned. Body text must conform to specific font families and point sizes. Headers and footers carry classification banners that, if absent or incorrect, can delay an entire review cycle.

In large-scale defense programs, formal qualification testing generates enormous volumes of data. Test Summary files, Run for Record logs, and Comment Logs arrive in slightly different Excel formats with inconsistent column headers, free-text failure comments, and issue tracker references buried in narrative prose. An analyst producing a Verification, Validation, and Accreditation (VV&A) report must manually sift through these, extract every defect ID, cross-reference each one against Azure DevOps (ADO) for priority, categorize each failure by root cause, and compile pass/fail statistics, all while ensuring nothing slips through. It’s tedious, error-prone, and it pulls experienced engineers away from the analytical work they were actually hired to do.

STORM DAT automates that burden.

What the Platform Does

The tool replaces manual compliance review with a structured, repeatable analysis pipeline across several distinct functions:

Security marking validation inspects every section header and footer against a configurable set of approved classification banners — CUI, UNCLASSIFIED, SECRET, TOP SECRET, and their variants — surfacing any deviations immediately.
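A minimal sketch of the banner check, assuming the header and footer text of each section has already been extracted as plain strings. The `APPROVED_BANNERS` set and the function name are illustrative, not the tool's actual configuration.

```python
# Hypothetical approved set; the tool described here loads its banners from configuration.
APPROVED_BANNERS = {"CUI", "UNCLASSIFIED", "SECRET", "TOP SECRET"}


def find_marking_deviations(sections):
    """Return (section_index, location, text) for every banner that is not approved.

    `sections` is a list of (header_text, footer_text) tuples, one per section.
    """
    deviations = []
    for index, (header, footer) in enumerate(sections):
        for location, text in (("header", header), ("footer", footer)):
            # Collapse whitespace and ignore case before comparing.
            normalized = " ".join(text.split()).upper()
            if normalized not in APPROVED_BANNERS:
                deviations.append((index, location, text))
    return deviations
```

An empty header or footer is reported as a deviation, which matches the requirement that a missing banner is as much a problem as an incorrect one.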

Typographic standards enforcement checks each paragraph and table cell against mandated font and sizing rules (12-point Arial for body text, 10-point Arial for tables), flagging non-conforming runs at the exact character position.

Acronym lifecycle auditing cross-references document content against a provided acronym reference list, catching duplicate definitions, undefined acronyms, abbreviations used before their full-form introduction, and potential new acronyms not yet cataloged.
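The four checks can be sketched with two regular expressions. This is a simplified illustration, assuming an acronym is two to six capital letters and that a definition is an acronym in parentheses after its full form; the function name and result keys are hypothetical, and acronyms containing symbols (such as VV&A) would need a broader pattern.

```python
import re

ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")
DEFINITION_RE = re.compile(r"\(([A-Z]{2,6})\)")


def audit_acronyms(text, reference):
    """Audit acronym usage in `text` against `reference`, a set of known acronyms."""
    # Record the character position of every parenthesized definition.
    definitions = {}
    for match in DEFINITION_RE.finditer(text):
        definitions.setdefault(match.group(1), []).append(match.start(1))

    findings = {
        "duplicate": {a for a, pos in definitions.items() if len(pos) > 1},
        "undefined": set(),
        "used_before_definition": set(),
        "uncataloged": set(),
    }
    for match in ACRONYM_RE.finditer(text):
        acronym = match.group(0)
        if acronym not in reference:
            findings["uncataloged"].add(acronym)
        if acronym not in definitions:
            findings["undefined"].add(acronym)
        elif match.start() < definitions[acronym][0]:
            # Appears earlier in the text than its first definition.
            findings["used_before_definition"].add(acronym)
    return {key: sorted(values) for key, values in findings.items()}
```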

Automated defect extraction scans free-text failure comments using 29 distinct issue-type classifiers, extracts Helix and ADO defect IDs via pattern recognition, and fetches each work item’s priority directly from Azure DevOps. What would take hours to compile by hand comes back as a fully cross-referenced issue table.
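As a rough sketch of the extraction step: the ID formats below ("ADO-12345", "Helix #678") and the two classifiers are assumptions for illustration only, since the real comment formats and the 29 classifiers are not shown here. The priority helper reads the `Microsoft.VSTS.Common.Priority` field from a work item payload that has already been fetched from the Azure DevOps REST API; the HTTP call itself is omitted.

```python
import re

# Hypothetical ID formats, e.g. "ADO 12345", "ADO-12345", "Helix #678".
DEFECT_ID_RE = re.compile(r"\b(ADO|Helix)[\s#:-]*(\d+)\b", re.IGNORECASE)

# Two illustrative classifiers standing in for the full set.
ISSUE_CLASSIFIERS = {
    "timeout": re.compile(r"\btimed?[\s-]?out\b", re.IGNORECASE),
    "display": re.compile(r"\b(?:display|screen|render)", re.IGNORECASE),
}


def extract_defects(comment):
    """Pull defect IDs and issue types out of one free-text failure comment."""
    defects = [
        (match.group(1).upper(), int(match.group(2)))
        for match in DEFECT_ID_RE.finditer(comment)
    ]
    issue_types = sorted(
        name for name, pattern in ISSUE_CLASSIFIERS.items() if pattern.search(comment)
    )
    return {"defects": defects, "issue_types": issue_types}


def priority_from_work_item(payload):
    """Read the priority from a work item payload, or None if it is absent."""
    return payload.get("fields", {}).get("Microsoft.VSTS.Common.Priority")
```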

Missile test data alignment handles comparisons that spreadsheet tools do poorly. When aligning two or three flight test logs side by side, the system uses similarity-scored fuzzy matching to pair rows that represent the same test event even when naming conventions differ between files. It computes timing deltas down to the microsecond and flags any variance exceeding a defined threshold.
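One way to sketch the pairing for two logs uses `difflib.SequenceMatcher` from the standard library as the similarity score. This is an illustration, not necessarily the scoring method the tool uses: the greedy best-match strategy, the 0.6 minimum score, and the 500-microsecond threshold are all assumed values, and timestamps are taken to be integer microseconds.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two event names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_logs(log_a, log_b, min_score=0.6, max_delta_us=500):
    """Pair each row of log_a with its best unused match in log_b.

    Rows are (event_name, timestamp_us) tuples. Returns a list of
    (name_a, name_b, delta_us, exceeds_threshold); unmatched rows get None values.
    """
    unused = list(range(len(log_b)))
    pairs = []
    for name_a, ts_a in log_a:
        best = max(unused, key=lambda j: similarity(name_a, log_b[j][0]), default=None)
        if best is None or similarity(name_a, log_b[best][0]) < min_score:
            pairs.append((name_a, None, None, False))
            continue
        unused.remove(best)
        name_b, ts_b = log_b[best]
        delta = ts_b - ts_a
        pairs.append((name_a, name_b, delta, abs(delta) > max_delta_us))
    return pairs
```

A greedy pass like this is simple but order-dependent; a production matcher might instead score all candidate pairs and resolve conflicts globally.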

There’s also a built-in screen recording and transcription module. It captures audio during test execution or walkthroughs and routes it through a locally hosted OpenAI Whisper model to produce timestamped transcriptions — searchable text records of verbal observations that would otherwise be lost.

Stack and Reasoning

Layer	Technology
Backend	Python 3.12, Flask, Gunicorn
Data Processing	Pandas, NumPy, SciPy
Visualization	Matplotlib
Document I/O	python-docx, openpyxl, XlsxWriter
AI/ML	OpenAI Whisper, PyTorch
External Integration	Azure DevOps REST API
Infrastructure	Docker, HTTPS/SSL, GitLab CI

Python and Flask made sense here because the application’s value lives entirely in its backend processing logic. Flask’s minimal footprint keeps the dependency surface small, a real advantage when the target environment may be an air-gapped network where every dependency must be vetted. Python’s document manipulation ecosystem (python-docx, openpyxl, Pandas) is also mature and well suited to programmatic Word and Excel processing.

Whisper runs locally, not through a cloud endpoint. This was a deliberate call driven by operational constraints: classified networks often can’t reach external APIs. By loading the model once at application startup and holding it as a singleton, the application eliminates network dependency for transcription while amortizing the expensive model-load cost across all session requests.
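The load-once pattern can be sketched as a lock-guarded module-level singleton. The names here are hypothetical, the "base" model size is an assumption, and the optional `loader` argument exists only so the pattern can be exercised without the model installed.

```python
import threading

_model = None
_model_lock = threading.Lock()


def _load_whisper():
    # Imported lazily so this module can be imported without openai-whisper present.
    import whisper

    return whisper.load_model("base")  # model size is an assumption


def get_model(loader=None):
    """Return the shared model, loading it on the first call only."""
    global _model
    if _model is None:
        with _model_lock:
            # Re-check inside the lock so two threads cannot both load the model.
            if _model is None:
                _model = (loader or _load_whisper)()
    return _model
```

Calling `get_model()` once during application startup moves the load cost out of the first request. A module-level singleton is per process, so with several Gunicorn workers each worker process would typically hold its own copy of the model.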

The deployment configuration sets a 600-second Gunicorn worker timeout so that long-running transcription jobs are not killed prematurely. Combined with four workers running two threads each, this balances concurrency for lightweight document analysis against the heavier demands of media processing.
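Expressed as a Gunicorn configuration file, those settings might look like the following. Only the three values come from the description above; the file name and the idea of using a config file rather than command-line flags are assumptions.

```python
# gunicorn.conf.py (hypothetical file name and location)

# Four worker processes, each serving requests on two threads.
workers = 4
threads = 2

# Allow up to 600 seconds before a silent worker is killed and restarted,
# leaving room for long transcription jobs.
timeout = 600
```

Setting `threads` above 1 causes Gunicorn to use its threaded (`gthread`) worker type, so this layout can serve up to eight requests concurrently.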

The entire analysis pipeline operates on DataFrames. Test data arrives as Excel, gets parsed into tabular structures, filtered and categorized through vectorized operations, and written back as formatted Excel. Pandas is the right tool for this kind of columnar data transformation, and because it reads and writes Excel through openpyxl and XlsxWriter, the platform can take Excel input and produce formatted Excel output without leaving the DataFrame pipeline.

Matplotlib’s non-interactive Agg backend removes any dependency on a display server, so visualization is equally reliable in a Docker container, a CI pipeline, or a headless production server.

The Hardest Part: Annotating Word Documents at the Run Level

The most technically demanding part of the project isn’t the analysis — it’s the annotation. Word documents store text internally as “runs,” which are contiguous segments sharing identical formatting. Run boundaries rarely align with finding boundaries. That mismatch creates a real problem.

Consider a paragraph where only a single acronym — three characters buried in the middle of a 200-character run — needs to be highlighted yellow. Highlighting the entire run would produce a misleading result. The correct approach requires surgically splitting the run into three segments (before, target, after), applying the highlight only to the target segment, preserving every formatting property (font family, size, color, bold, italic) across all three new runs, and then removing the original run from the document’s XML tree — without corrupting the document structure.

I built a run-splitting algorithm that maintains a character-position counter as it walks through each run in a paragraph, identifies the exact run and offset where a finding begins and ends, reconstructs the run sequence with new segments inserted, and transfers formatting attributes programmatically. The result looks as though a human reviewer placed each highlight by hand, yet it is produced in milliseconds across hundreds of findings.
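The position arithmetic behind that walk can be sketched on a simplified model of runs. This is not the python-docx implementation: `Run` here is a hypothetical dataclass standing in for a Word run, and the sketch omits creating new run elements and removing the original from the paragraph's XML tree.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """Simplified stand-in for a Word run: text plus its formatting."""
    text: str
    font: str = "Arial"
    size_pt: int = 12
    bold: bool = False
    italic: bool = False
    highlight: Optional[str] = None


def highlight_span(runs: List[Run], start: int, end: int, color: str = "yellow") -> List[Run]:
    """Highlight paragraph characters [start, end), splitting runs where needed."""
    result = []
    position = 0  # character offset of the current run within the paragraph
    for run in runs:
        run_start, run_end = position, position + len(run.text)
        position = run_end
        lo, hi = max(start, run_start), min(end, run_end)
        if lo >= hi:
            result.append(run)  # the finding does not touch this run
            continue
        before = run.text[: lo - run_start]
        target = run.text[lo - run_start : hi - run_start]
        after = run.text[hi - run_start :]
        # replace() copies every formatting attribute and changes only what is named.
        if before:
            result.append(replace(run, text=before))
        result.append(replace(run, text=target, highlight=color))
        if after:
            result.append(replace(run, text=after))
    return result
```

Because the overlap is computed per run, a finding that straddles a run boundary is highlighted in both runs, and each piece keeps the formatting of the run it came from.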

What Building This Taught Me

Externalizing security markings, allowed file types, size limits, and environment configurations into a dedicated configuration module was one of the first decisions I made, and it paid dividends on every subsequent feature. When a new classification banner needed support, it was a one-line configuration change rather than a modification to the analysis code that would require regression testing.

Security isn’t a feature to add before deployment — in a tool handling controlled documents, it has to be architectural. Input validation, filename sanitization, HTML escaping, Content Security Policy headers, and restricted file-type whitelists are woven into the middleware and utility layers from the ground up. No single layer’s failure should expose the system.
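Three of those layers can be sketched with the standard library alone. The function names and the allowed-extension set are hypothetical, and the Content Security Policy headers and middleware wiring are not shown.

```python
import html
import re
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".docx", ".xlsx"}  # hypothetical allowed set


def sanitize_filename(name):
    """Strip directory components and replace unsafe characters."""
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "upload"


def is_allowed(name):
    """Accept a file only if its final extension is in the allowed set."""
    return PurePosixPath(name).suffix.lower() in ALLOWED_EXTENSIONS


def render_finding(text):
    """Escape document-derived text before it is placed into an HTML page."""
    return html.escape(text)
```

Each check is independent, so a filename that slips past one layer still has to pass the others. In a Flask application, `werkzeug.utils.secure_filename` offers comparable filename sanitization.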

The decision to integrate screen recording and transcription alongside document analysis wasn’t a technical exercise — it was a product observation. Analysts who review documents also conduct walkthroughs, record demonstrations, and produce meeting notes. Housing both capabilities in one platform consolidates the analyst’s digital workspace and cuts context-switching. The best software doesn’t just automate a task; it reshapes the workflow around it.

Test data formats in defense programs aren’t standardized — they vary by contractor, by program phase, and sometimes by individual engineer. Rather than enforcing a strict input schema and rejecting anything that doesn’t conform, the platform uses a dictionary of column header variants for each logical field. A “pass/fail” column might be labeled “Test Pass/Fail,” “Overall Pass Fail,” or “PrSM 103.4735 Test SIT Pass/Fail,” and the system recognizes all of them. Match what you can, log what you can’t, never crash.
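A minimal sketch of that lookup, assuming headers are compared after lowercasing and collapsing punctuation. The pass/fail variants come from the examples above; the `comments` entry, the field names, and the function names are hypothetical.

```python
import logging
import re

logger = logging.getLogger(__name__)

HEADER_VARIANTS = {
    "pass_fail": [
        "Test Pass/Fail",
        "Overall Pass Fail",
        "PrSM 103.4735 Test SIT Pass/Fail",
    ],
    "comments": ["Comments", "Failure Comments"],  # hypothetical variants
}


def normalize(header):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", str(header).lower()).strip()


def resolve_columns(headers):
    """Map raw column headers to logical fields; collect the ones not recognized."""
    lookup = {
        normalize(variant): field
        for field, variants in HEADER_VARIANTS.items()
        for variant in variants
    }
    mapping, unmatched = {}, []
    for header in headers:
        field = lookup.get(normalize(header))
        if field is None:
            # Log and continue rather than rejecting the whole file.
            logger.warning("Unrecognized column header: %r", header)
            unmatched.append(header)
        else:
            mapping[header] = field
    return mapping, unmatched
```

Supporting a new contractor's header then means adding one string to the dictionary, and an unrecognized column produces a log entry instead of an exception.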

What Comes Next

Batch processing is the most obvious gap. The current architecture handles one document at a time. Supporting batch uploads with a consolidated findings dashboard — one that highlights cross-document inconsistencies like an acronym defined differently in two companion documents — would make the tool useful for reviewing entire deliverable packages at once.

Apart from externalized settings such as the approved marking list, the compliance rules are currently embedded in the analysis engine. Exposing a rule-builder interface where users define their own font requirements, marking formats, or acronym policies would transform the tool from a single-purpose analyzer into a configurable compliance platform adaptable to any organization’s document standards.

The codebase already contains scaffolding for a ChromaDB-backed vector store with sentence-transformer embeddings. The logical next step is connecting this to the analysis pipeline so engineers can semantically search across historical test results, finding similar failure patterns and past resolutions without knowing the exact terminology.

And currently each analysis session is stateless — findings are generated, downloaded, and gone. A lightweight persistence layer would enable trend tracking: whether a team’s document quality improves across successive drafts, which error categories are most common, where targeted training might help.