
One Dictionary, Seven Colors, and 200 Paragraphs: Engineering a Document Compliance Engine

A deep dive into the acronym sweep engine of STORM DAT, a Python-based document analysis tool that uses regex and state machines to enforce complex compliance rules across multi-page Word documents.

Published: Thu Feb 19 2026

Technologies used: Python, Flask

Intermediate, 18 minutes

Government and defense organizations produce technical documents under strict formatting mandates. Every acronym must be defined on first use. Security classification banners must appear in every header and footer. Body text must be 12-point Arial — no exceptions. Tables must be 10-point Arial. An acronym defined once must never be fully spelled out again.

Enforcing these rules manually means a senior analyst opens a 60-page Word document and reads every paragraph, checking each acronym against a master spreadsheet, verifying font properties run by run, flagging deviations with colored highlights. This takes hours per document and is still error-prone — human reviewers miss things, especially on page 47 of a 60-page report.

We’ll walk through the acronym_sweep() method in STORM DAT’s WordAnalyzer class — a single function that ingests a Word document and an acronym reference list, then produces a fully annotated document with color-coded highlights and a structured findings report. The engine uses regex pattern matching with word boundaries, a dictionary-based state machine to track acronym lifecycle, and a multi-pass architecture that handles paragraphs, tables, headers, and footers in distinct phases.

Before Starting

Knowledge required:

  • Regex fundamentals: character classes, word boundaries (\b), re.finditer() vs re.search()
  • python-docx object model: Document, Section, Paragraph, Run — and how Word stores text as runs within paragraphs
  • Dictionary patterns: using dicts as lookup tables and as simple state machines

Environment:

  • Python 3.12, python-docx 1.1.2, Pandas 2.2.3, openpyxl 3.1.5

    pip install python-docx==1.1.2 pandas==2.2.3 openpyxl==3.1.5

An Assembly Line With a Shared Clipboard

Think of the acronym sweep as an assembly line where the document moves through five inspection stations. Each station has a specific job and passes its findings downstream.

flowchart LR
    A[Phase 1\nSecurity Markings] --> B[Phase 2\nFont Inspection]
    B --> C[Phase 3\nAcronym Lifecycle]
    C --> D[Phase 4\nUnknown Acronym\nDetection]
    D --> E[Phase 5\nTable Inspection]

    F[acronym_usage dict] -.->|State carried across phases| C
    G[findings list] -.->|Appended at every phase| A
    G -.-> B
    G -.-> C
    G -.-> D
    G -.-> E

    style F fill:#ffffcc,stroke:#333
    style G fill:#ccffcc,stroke:#333

The findings list is a shared clipboard — every station appends its findings there. The acronym_usage dictionary is the memory that travels with the document, recording which acronyms have been seen and whether they’ve been reused.
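A minimal sketch of the shared-clipboard pattern, assuming each phase is a plain function. The phase names and signatures here are hypothetical and are not STORM DAT's actual methods; the point is only that every phase receives the same two mutable structures.

```python
from typing import Callable

# Hypothetical phase signature: paragraphs plus the two shared structures.
Phase = Callable[[list[str], dict[str, str], list[str]], None]


def check_nonempty(paragraphs, acronym_usage, findings):
    """Toy phase: flag empty paragraphs."""
    for index, text in enumerate(paragraphs):
        if not text.strip():
            findings.append(f"Paragraph {index} is empty")


def track_api(paragraphs, acronym_usage, findings):
    """Toy phase: record the first sighting of the token 'API'."""
    for text in paragraphs:
        if "API" in text.split() and "API" not in acronym_usage:
            acronym_usage["API"] = "first"


def run_pipeline(paragraphs: list[str], phases: list[Phase]):
    acronym_usage: dict[str, str] = {}   # state carried across phases
    findings: list[str] = []             # appended to by every phase
    for phase in phases:
        phase(paragraphs, acronym_usage, findings)
    return acronym_usage, findings


usage, findings = run_pipeline(
    ["The API is stable.", "   "],
    [check_nonempty, track_api],
)
```

Because both structures are created inside run_pipeline, nothing outlives a single call.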

Phase 0: Data Preparation

Before any analysis, the method converts the Excel acronym reference (loaded as a Pandas DataFrame) into a flat Python dictionary:

def acronym_sweep(self, doc, acronyms):
    acronyms = acronyms.set_index('Acronym')['Definition'].to_dict()

    acronym_usage = {}
    findings = []
    ignore_words = {
        'of', 'and', 'the', 'in', 'for',
        'on', 'to', 'with', 'by', 'as', 'at', 'from', '&'
    }

set_index('Acronym')['Definition'].to_dict() converts a two-column DataFrame into a {'ABC': 'Alpha Beta Charlie'} dictionary. Dictionary lookups are O(1) on average, whereas scanning a DataFrame row by row for each acronym match is O(n) in the number of rows. For a document with 200 paragraphs and 150 acronyms, that is 200 × 150 = 30,000 paragraph-acronym checks, each of which would otherwise need its own DataFrame scan.
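To make the lookup-cost argument concrete, here is a small standard-library sketch. The rows list is a hypothetical stand-in for the two-column spreadsheet, and scan_lookup exists only for contrast with the dictionary.

```python
# Hypothetical stand-in for the (Acronym, Definition) rows of the spreadsheet.
rows = [
    ("API", "Application Programming Interface"),
    ("EMU", "English Metric Unit"),
    ("IT", "Information Technology"),
]


def scan_lookup(rows, acronym):
    """O(n): walk the rows until the acronym is found."""
    for key, definition in rows:
        if key == acronym:
            return definition
    return None


# O(1) on average: build the dictionary once, then hash straight to the entry.
acronyms = dict(rows)

by_scan = scan_lookup(rows, "IT")
by_dict = acronyms.get("IT")
missing = acronyms.get("XYZ")   # absent keys return None instead of raising
```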

Phase 1: Security Marking Validation

The engine iterates through every section, inspecting headers and footers for approved classification banners:

    valid_security = config.ConfigOptionsWord["SECURITY_MARKINGS"]

    for section in doc.sections:
        header_texts = [p.text.strip() for p in section.header.paragraphs]
        # Record one finding per section when no header paragraph is an approved banner
        if all(text not in valid_security for text in header_texts):
            findings.append(
                f"INVALID HEADER SECURITY MARKINGS: {' | '.join(header_texts)}"
            )

The all() call over a generator expression is true only when no paragraph in the header matches a valid marking. If the entire header lacks any recognized banner, the finding is recorded. The same logic runs for footers.
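The same check can be sketched as a pure function over plain strings, which makes it easy to test without a Word file. The function name, the input shape, and the marking values below are all assumptions for illustration; the real list lives in config.py.

```python
def find_invalid_banners(banner_texts_by_section, valid_markings):
    """Return one finding per section whose banner has no approved marking.

    banner_texts_by_section holds one inner list of paragraph strings per
    section header (or footer).
    """
    findings = []
    for section_index, texts in enumerate(banner_texts_by_section):
        stripped = [text.strip() for text in texts]
        if not any(text in valid_markings for text in stripped):
            findings.append(
                f"INVALID HEADER SECURITY MARKINGS (section {section_index}): "
                f"{' | '.join(stripped)}"
            )
    return findings


# Hypothetical marking values, not the project's configuration.
valid = {"UNCLASSIFIED", "CONFIDENTIAL"}
headers = [
    ["  UNCLASSIFIED  ", "Project Report"],   # has an approved banner
    ["Project Report"],                       # no banner at all
]
banner_findings = find_invalid_banners(headers, valid)
```

not any(x in valid) is logically the same test as all(x not in valid); which one reads better is a matter of taste.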

The valid markings list lives in config.py — externalized so that adding a new classification level requires changing one line in configuration, not modifying analysis logic.

Phase 2: Font and Styling Inspection

For every paragraph, the engine walks through each run (a contiguous text segment with uniform formatting) and checks font name and size:

    for para_index, para in enumerate(doc.paragraphs):
        for run in para.runs:
            # 152400 EMU = 12pt (python-docx reports font sizes in English Metric Units)
            if (run.font.size and run.font.size != 152400) or \
               (run.font.name and run.font.name != 'Arial'):

                run.font.highlight_color = WD_COLOR_INDEX.PINK

                message = f"(Pink) Paragraph {para_index}, position "\
                          f"{para.text.find(run.text)}. {run.text} "

                if run.font.size and run.font.size != 152400:
                    message += f"(Font Size = {run.font.size / 12700}pt)"
                if run.font.name and run.font.name != 'Arial':
                    message += f"(Font = {run.font.name})"

                findings.append(message)

python-docx reports lengths in EMU (English Metric Units), where 1 point = 12,700 EMU. So 12pt = 152,400 EMU and 10pt (for tables) = 127,000 EMU. This is why the comparison checks != 152400 rather than != 12. (The .docx file itself stores font sizes in half-points; python-docx converts them when you read run.font.size.) If you’re new to python-docx, the EMU system is one of the most common sources of confusion, and it’s not documented prominently.
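The arithmetic behind those constants, as a small sketch. The helper names are hypothetical; they only spell out the 12,700 EMU-per-point relationship so the magic numbers can be replaced with named values.

```python
EMU_PER_POINT = 12_700  # 914,400 EMU per inch divided by 72 points per inch


def pt_to_emu(points: float) -> int:
    """Convert a size in points to EMU."""
    return round(points * EMU_PER_POINT)


def emu_to_pt(emu: int) -> float:
    """Convert an EMU length back to points for human-readable messages."""
    return emu / EMU_PER_POINT


BODY_SIZE_EMU = pt_to_emu(12)    # body text requirement
TABLE_SIZE_EMU = pt_to_emu(10)   # table text requirement
```

python-docx also ships docx.shared.Pt, so comparing against Pt(12) should be an equivalent and more readable alternative to a hand-written constant.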

Phase 3: The Acronym Lifecycle State Machine

This is the heart of the engine. The state machine has three states per acronym:

State                   Meaning                                   Stored value
Not in acronym_usage    Never seen in the document                (key absent)
'first'                 Defined once but never reused             'first'
'used'                  Used at least once after its definition   'used'

        for acronym, full_form in acronyms.items():
            acronym_pattern = r'\b' + re.escape(acronym.strip()) + r'\b'
            full_form_pattern = r'\b' + re.escape(full_form.strip()) + r'\b'

            for match in re.finditer(acronym_pattern, para.text):
                position = match.start()

                if acronym not in acronym_usage:
                    # First encounter
                    if not re.search(full_form_pattern, para.text):
                        # Full definition not in this paragraph — violation
                        findings.append(f"(Yellow) ...")
                        self.highlight_text(para, position, len(acronym),
                                            WD_COLOR_INDEX.DARK_YELLOW)
                    acronym_usage[acronym] = 'first'

                else:
                    # Subsequent encounter — already defined
                    if re.search(full_form_pattern, para.text):
                        # Full definition appears again — redundant
                        findings.append(f"(Teal) ...")
                        self.highlight_text(para, para.text.find(full_form),
                                            len(full_form), WD_COLOR_INDEX.TEAL)
                    acronym_usage[acronym] = 'used'

A second loop catches the inverse case, where the full form appears on its own in a paragraph after the acronym has already been introduced:

            for match in re.finditer(full_form_pattern, para.text):
                position = match.start()
                if acronym in acronym_usage and \
                   acronym not in acronym_found_in_para:
                    findings.append(f"(Green) ...")
                    self.highlight_text(para, position, len(full_form),
                                        WD_COLOR_INDEX.BRIGHT_GREEN)

The acronym_found_in_para guard prevents false positives: if both the acronym and its full form appear in the same paragraph (as they should on first definition), the full-form match shouldn’t be flagged. This avoids highlighting the definition itself as a violation.
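To see the whole lifecycle run end to end, here is a sketch that applies the same transitions to plain strings instead of python-docx paragraphs. The sweep function and its finding messages are illustrative, and acronym_found_in_para is assumed to be a per-paragraph set, since the excerpts above do not show how it is built.

```python
import re


def sweep(paragraphs, acronyms):
    """Run the acronym lifecycle over plain paragraph strings."""
    acronym_usage = {}
    findings = []
    for index, text in enumerate(paragraphs):
        # Assumption: the guard is a set of acronyms matched in this paragraph.
        acronym_found_in_para = set()
        for acronym, full_form in acronyms.items():
            acronym_re = r"\b" + re.escape(acronym.strip()) + r"\b"
            full_re = r"\b" + re.escape(full_form.strip()) + r"\b"

            for _ in re.finditer(acronym_re, text):
                acronym_found_in_para.add(acronym)
                if acronym not in acronym_usage:
                    # First encounter: the full form must be in this paragraph.
                    if not re.search(full_re, text):
                        findings.append(
                            f"(Yellow) {acronym} undefined, paragraph {index}")
                    acronym_usage[acronym] = "first"
                else:
                    # Later encounter: the full form should not reappear.
                    if re.search(full_re, text):
                        findings.append(
                            f"(Teal) {acronym} redefined, paragraph {index}")
                    acronym_usage[acronym] = "used"

            for _ in re.finditer(full_re, text):
                if (acronym in acronym_usage
                        and acronym not in acronym_found_in_para):
                    findings.append(
                        f"(Green) {full_form} spelled out, paragraph {index}")
    return acronym_usage, findings


usage, findings = sweep(
    [
        "The Application Programming Interface (API) is stable.",
        "The API returns JSON.",
        "Use the Application Programming Interface carefully.",
    ],
    {"API": "Application Programming Interface",
     "JSON": "JavaScript Object Notation"},
)
```

In this run, API moves from absent to 'first' to 'used', while JSON stops at 'first' and would be picked up later by the reconciliation pass.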

Phase 4: Unknown Acronym Detection

After checking known acronyms, the engine scans for uppercase words not in the reference list:

        for position, word in enumerate(para.text.split()):
            cleaned_word = re.sub(r'[^\w/&]', '', word)

            if cleaned_word.isupper() and \
               cleaned_word not in acronyms and \
               len(cleaned_word) > 1:
                findings.append(f"(Violet) ...")
                self.highlight_text(...)

It also detects potential new acronyms — sequences of capitalized words that look like they could be abbreviated (“Software Simulation Systems Engineering” → “S3E”). The algorithm tracks consecutive capitalized words, skips common filler words (of, and, the), and flags any sequence of two or more capitalized words whose combined form isn’t already a known definition.
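The candidate-phrase heuristic described above might look something like the following. This is a sketch under stated assumptions, not the engine's code: candidate_phrases is a hypothetical name, sentence punctuation is treated as ending a phrase, and all-caps tokens are left to the unknown-acronym check.

```python
import re

IGNORE_WORDS = {"of", "and", "the", "in", "for", "on", "to",
                "with", "by", "as", "at", "from", "&"}


def candidate_phrases(text, known_definitions):
    """Return runs of two or more capitalized words not already defined."""
    candidates = []
    run = []          # words in the current run, fillers included
    capitalized = 0   # how many capitalized words the run holds

    def flush():
        nonlocal run, capitalized
        # Drop trailing fillers such as a dangling "of".
        while run and run[-1].lower() in IGNORE_WORDS:
            run.pop()
        phrase = " ".join(run)
        if capitalized >= 2 and phrase not in known_definitions:
            candidates.append(phrase)
        run, capitalized = [], 0

    for raw in text.split():
        word = re.sub(r"[^\w&]", "", raw)
        if word.lower() in IGNORE_WORDS:
            if run:
                run.append(word)      # a filler inside a run does not break it
        elif word[:1].isupper() and not word.isupper():
            run.append(word)
            capitalized += 1
        else:
            flush()
        # Assumption: trailing sentence punctuation ends the phrase.
        if run and raw[-1:] in ".,;:!?":
            flush()
    flush()
    return candidates


phrases = candidate_phrases(
    "We adopted Software Simulation Systems Engineering last year. "
    "The Department of Defense approved the Ground Control Station.",
    known_definitions={"Ground Control Station"},
)
```

A heuristic like this will still join unrelated phrases that happen to be separated only by filler words, so its output is best treated as suggestions for a reviewer rather than violations.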

Phase 5: Reconciliation and Table Inspection

After all paragraphs, two reconciliation checks run:

    for acronym in acronyms.keys():
        if acronym not in acronym_usage:
            findings.append(f"Acronym {acronym} in table but not used")

    for acronym, usage in acronym_usage.items():
        if usage == 'first':
            findings.append(f"Acronym {acronym} defined but not used")

Tables are processed separately with their own font rules — 10pt Arial (127,000 EMU) instead of 12pt.

Why Word Boundaries Change Everything

Without \b, a regex for “IT” (Information Technology) would match inside “ITEM”, “ITERATION”, “SUITABLE”, and “COMMIT” — dozens of false positives per page.

The pattern r'\b' + re.escape(acronym) + r'\b' anchors the match to word boundaries. In Python’s re engine, \b matches the zero-width position between a word character and a non-word character, or between a word character and the start or end of the string. \bIT\b matches “IT” as a standalone word but not the “IT” inside “ITEM”.
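A short demonstration of the difference, using a made-up sentence that packs in the problem words:

```python
import re

text = "IT must COMMIT each ITEM; the IT team found it SUITABLE."

# Without boundaries: every "IT" substring matches, even inside other words.
loose = [m.start() for m in re.finditer(r"IT", text)]

# With boundaries: only the standalone token matches.
strict = [m.start() for m in re.finditer(r"\bIT\b", text)]
```

Here loose contains five offsets while strict contains only the two standalone occurrences. The lowercase “it” is ignored in both cases because the patterns are case-sensitive.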

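re.escape() is equally important. If an acronym contains regex metacharacters, as “C++” does, re.escape converts it to C\+\+ so the plus signs are matched literally instead of being read as quantifiers. “TCP/IP” passes through unchanged, because / is not a metacharacter. One caveat: \b requires a word character on one side, so \bC\+\+\b will not match “C++” when it is followed by a space or punctuation; acronyms that begin or end with a symbol need a different boundary test.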
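A quick check of how escaping and boundaries interact, offered as a sketch rather than the engine's actual code. token_pattern is a hypothetical helper that swaps the \b anchors for lookarounds, which is one possible way to handle acronyms that start or end with a non-word character such as “+”.

```python
import re


def token_pattern(acronym: str) -> re.Pattern:
    """Hypothetical helper: match `acronym` as a standalone token.

    Lookarounds replace the word-boundary anchors, so the edges still work
    when the acronym begins or ends with a symbol.
    """
    return re.compile(r"(?<!\w)" + re.escape(acronym.strip()) + r"(?!\w)")


text = "Written in C++ and served over TCP/IP."

escaped = re.escape("C++")   # plus signs become literal characters

# \b after the final '+' needs a word character next to it, and a space is not one.
with_b = re.findall(r"\b" + escaped + r"\b", text)

with_lookaround = token_pattern("C++").findall(text)
tcp = token_pattern("TCP/IP").findall(text)
```

With this input, with_b comes back empty while the lookaround version finds “C++”, and “TCP/IP” matches either way because it starts and ends with letters.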

For a 60-page document (~600 paragraphs, average length 500 characters) and 150 acronyms, the acronym patterns alone scan roughly 600 × 500 × 150 = 45 million character positions. This completes in under a second on modern hardware because Python’s re module compiles each pattern into bytecode for a matching engine implemented in C, and it caches recently compiled patterns, so rebuilding the same pattern string for every paragraph does not trigger a fresh compile each time.

Where the State Machine Can Break

Definition in a different section than first use. The state machine tracks acronyms globally across the document and processes paragraphs in document order. If an acronym appears undefined in Section 1 but is defined in Section 3, the engine correctly flags the Section 1 usage — the reading order matches the analysis order.

Duplicate definitions in the same paragraph. The acronym_found_in_para guard prevents a first-definition paragraph (which contains both the acronym and its full form) from generating a false-positive Green finding on the full-form match. Without this guard, every correctly formatted first definition would also produce a violation.

Concurrent request safety. WordAnalyzer is instantiated fresh in every request. The acronym_usage dictionary, the findings list, and all other state are local to the method call. Two users analyzing different documents simultaneously in separate Gunicorn workers share no state — each worker has its own process memory. If WordAnalyzer were a singleton with instance-level state persisted across requests, one user’s document would contaminate the next user’s analysis. The per-request instantiation is not accidental — it’s a concurrency safety mechanism.
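The contamination risk is easy to reproduce in miniature. Both classes below are hypothetical toys, not WordAnalyzer itself; they differ only in where the findings list lives.

```python
class SharedAnalyzer:
    """Anti-pattern: state lives on a long-lived instance."""

    def __init__(self):
        self.findings = []

    def sweep(self, paragraphs):
        for text in paragraphs:
            if "TBD" in text:
                self.findings.append(f"Placeholder found: {text}")
        return self.findings


class PerRequestAnalyzer:
    """State is local to the call, so nothing survives between requests."""

    def sweep(self, paragraphs):
        findings = []
        for text in paragraphs:
            if "TBD" in text:
                findings.append(f"Placeholder found: {text}")
        return findings


shared = SharedAnalyzer()
shared.sweep(["Budget TBD"])                   # first user's document
leaked = shared.sweep(["All sections done"])   # second user's document

clean = PerRequestAnalyzer().sweep(["All sections done"])
```

The second user's clean document inherits the first user's finding in the shared version, while the per-request version reports nothing.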

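The re.escape safety net. Without escaping, an acronym like C++ would produce the pattern \bC++\b. On Python versions before 3.11 that raises re.error (“multiple repeat”) and crashes the entire analysis. On 3.11 and later, including the 3.12 used here, ++ is a valid possessive quantifier, so the pattern compiles and silently matches runs of the letter C instead of the literal text. re.escape avoids both outcomes.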

Three things I’d take away from this design: a Python dictionary with two possible values is a valid state machine when transitions are simple and deterministic. Word boundaries are non-negotiable in text analysis — without \b, any acronym engine drowns in false positives. And externalizing rules (the valid security markings, font requirements, acronym reference list) into configuration keeps the analysis logic policy-agnostic. Adapting this engine to a different organization’s standards requires changing config, not code.
