One Dictionary, Seven Colors, and 200 Paragraphs: Engineering a Document Compliance Engine

A deep dive into the acronym sweep engine of STORM DAT, a Python-based document analysis tool that uses regex and state machines to enforce complex compliance rules across multi-page Word documents.

Published

Thu Feb 19 2026

Technologies Used

Python · Flask
Intermediate · 18 minutes

The Problem

Government and defense organizations produce technical documents under strict formatting mandates. Every acronym must be defined on first use. Security classification banners must appear in every header and footer. Body text must be 12-point Arial — no exceptions. Tables must be 10-point Arial. An acronym defined once must never be fully spelled out again.

Enforcing these rules manually means a senior analyst opens a 60-page Word document and reads every paragraph, checking each acronym against a master spreadsheet, verifying font properties run by run, and flagging deviations with colored highlights. This process takes hours per document and is still error-prone — human reviewers miss things, especially on page 47 of a 60-page report.

The Solution

We will walk through the acronym_sweep() method in STORM DAT’s WordAnalyzer class — a single function that ingests a Word document and an acronym reference list, then produces a fully annotated document with color-coded highlights and a structured findings report. The engine uses regex pattern matching with word boundaries, a dictionary-based state machine to track acronym lifecycle, and a multi-pass architecture that handles paragraphs, tables, headers, and footers in distinct phases.

By the end, you will understand how to model stateful document analysis in Python using little more than the standard library’s re module and python-docx (Pandas appears only to load the reference list).

Prerequisites: Regex, python-docx Internals, and Dictionary-as-State

Knowledge Base

  • Regex fundamentals: character classes, word boundaries (\b), re.finditer() vs re.search()
  • python-docx object model: Document, Section, Paragraph, Run — and how Word stores text as a sequence of runs within paragraphs
  • Dictionary patterns: using dicts as lookup tables and as simple state machines

Environment

  • Python 3.12
  • python-docx 1.1.2 — Word document manipulation
  • Pandas 2.2.3 — to load the acronym Excel reference file
  • openpyxl 3.1.5 — pandas’ backend for .xlsx reading
pip install python-docx==1.1.2 pandas==2.2.3 openpyxl==3.1.5

The Five-Phase Pipeline: How a Document Becomes a Findings Report

Think of the acronym sweep as an assembly line where the document moves through five inspection stations. Each station has a specific job and passes its findings downstream.

flowchart LR
    A[Phase 1\nSecurity Markings] --> B[Phase 2\nFont Inspection]
    B --> C[Phase 3\nAcronym Lifecycle]
    C --> D[Phase 4\nUnknown Acronym\nDetection]
    D --> E[Phase 5\nTable Inspection]

    F[acronym_usage dict] -.->|State carried across phases| C
    G[findings list] -.->|Appended at every phase| A
    G -.-> B
    G -.-> C
    G -.-> D
    G -.-> E

    style F fill:#ffffcc,stroke:#333
    style G fill:#ccffcc,stroke:#333

Analogy: Imagine a quality control conveyor belt in a factory. The document is the product. Each station inspector (phase) examines one aspect — markings, fonts, acronyms — stamps a colored flag on any defect, and writes a note on a shared clipboard (the findings list). The acronym_usage dictionary is the memory that travels with the product, recording which acronyms have been seen and whether they have been reused.

Walking Through the Engine: From Raw Document to Annotated Output

Phase 0 — Data Preparation

Before any analysis begins, the method converts the acronym Excel file (loaded as a Pandas DataFrame) into a flat Python dictionary. This is not just a convenience — it is a performance decision.

def acronym_sweep(self, doc, acronyms):
    # Convert DataFrame to dict for O(1) lookups instead of O(n) row scans
    acronyms = acronyms.set_index('Acronym')['Definition'].to_dict()

    # State machine: tracks each acronym's lifecycle
    acronym_usage = {}

    # Accumulator: every finding appended here
    findings = []

    # Words to ignore when detecting potential new acronyms
    ignore_words = {
        'of', 'and', 'the', 'in', 'for',
        'on', 'to', 'with', 'by', 'as', 'at', 'from', '&'
    }

🔵 Deep Dive: set_index('Acronym')['Definition'].to_dict() is a Pandas idiom that pivots a two-column DataFrame into a {'ABC': 'Alpha Beta Charlie'} dictionary. Dictionary lookups are O(1) amortized, while scanning a DataFrame row-by-row for each acronym match would be O(n). For a document with 200 paragraphs and an acronym list of 150 entries, this avoids approximately 30,000 DataFrame scans.
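As a minimal, self-contained illustration of the idiom (the column names match the article; the sample rows are invented):

```python
import pandas as pd

# Two-column reference list, as it would arrive from pd.read_excel()
df = pd.DataFrame({
    "Acronym":    ["IT", "CUI"],
    "Definition": ["Information Technology",
                   "Controlled Unclassified Information"],
})

# Pivot into a flat {acronym: definition} dict for O(1) lookups
acronyms = df.set_index("Acronym")["Definition"].to_dict()

print(acronyms["IT"])   # Information Technology
```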

Phase 1 — Security Marking Validation

The engine iterates through every section of the Word document, inspecting both headers and footers for approved classification banners.

    valid_security = config.ConfigOptionsWord["SECURITY_MARKINGS"]
    # Valid markings: ["CUI", "[CUI]", "UNCLASSIFIED", "TOP SECRET", ...]

    for section in doc.sections:
        # Flag the header only when NO paragraph contains a valid marking
        if all(
            paragraph.text.strip() not in valid_security
            for paragraph in section.header.paragraphs
        ):
            header_text = " | ".join(
                p.text.strip() for p in section.header.paragraphs
            )
            findings.append(
                f"INVALID HEADER SECURITY MARKINGS: {header_text}"
            )

The all() comprehension checks whether no paragraph in the header contains a valid marking. If the entire header lacks any recognized banner, the finding is recorded. The same logic runs for footers.

The marking list itself lives in config.py — externalized so that adding a new classification level (e.g., a NATO marking) requires changing one line in configuration, not modifying analysis logic.
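A sketch of what the relevant config.py entry might look like — the exact structure of ConfigOptionsWord is our assumption, inferred from the snippet above; only the SECURITY_MARKINGS key is referenced by the engine:

```python
# config.py -- hypothetical sketch, not the project's actual file
ConfigOptionsWord = {
    "SECURITY_MARKINGS": [
        "CUI", "[CUI]", "UNCLASSIFIED", "TOP SECRET",
    ],
}
```

Adding a NATO marking would then be a one-line append to this list, with no change to the analysis code.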

Phase 2 — Font and Styling Inspection

For every paragraph, the engine walks through each run (a contiguous text segment with uniform formatting) and checks two properties: font name and font size.

    for para in doc.paragraphs:
        # ... heading detection omitted for clarity ...

        for run in para.runs:
            # 152400 EMU = 12pt (Word uses English Metric Units internally)
            if (run.font.size and run.font.size != 152400) or \
               (run.font.name and run.font.name != 'Arial'):

                message = f"(Pink) Paragraph {para_index}, position "\
                          f"{para.text.find(run.text)}. {run.text} "

                # Direct highlight — no run splitting needed for full-run issues
                run.font.highlight_color = WD_COLOR_INDEX.PINK

                if run.font.size and run.font.size != 152400:
                    message += f"(Font Size = {run.font.size / 12700}pt)"
                if run.font.name and run.font.name != 'Arial':
                    message += f"(Font = {run.font.name})"

                findings.append(message)

🔵 Deep Dive: Word’s internal unit is the EMU (English Metric Unit), where 1 point = 12,700 EMU. So 12pt = 152,400 EMU and 10pt (for tables) = 127,000 EMU. This is why the comparison checks != 152400 rather than != 12. If you are new to python-docx, this unit system is one of the most common sources of confusion.
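The conversion is plain arithmetic; here is a small helper to make the magic numbers self-documenting (the function names are ours — python-docx ships its own Pt and Emu classes in docx.shared):

```python
EMU_PER_POINT = 12700  # Word's English Metric Units per typographic point

def pt_to_emu(points: float) -> int:
    """Convert a font size in points to EMU, as Word stores it."""
    return int(points * EMU_PER_POINT)

def emu_to_pt(emu: int) -> float:
    """Convert an EMU value back to points for human-readable findings."""
    return emu / EMU_PER_POINT

print(pt_to_emu(12))   # 152400 -- body text
print(pt_to_emu(10))   # 127000 -- table text
```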

When a font violation is found in a complete run, the engine applies the highlight directly to that run via WD_COLOR_INDEX.PINK. This is simpler than the partial-run highlighting covered in Tutorial 3.

Phase 3 — The Acronym Lifecycle State Machine

This is the heart of the engine. For each paragraph, the method checks every acronym in the reference dictionary against the paragraph text using regex.

The state machine has three states per acronym:

| State | Meaning | Stored Value |
| --- | --- | --- |
| Not in acronym_usage | Never seen in the document yet | (key absent) |
| 'first' | Defined once but never reused afterward | 'first' |
| 'used' | Has been used at least once after its definition | 'used' |

        for acronym, full_form in acronyms.items():
            # Word-boundary regex prevents "IT" from matching inside "ITEM"
            acronym_pattern = r'\b' + re.escape(acronym.strip()) + r'\b'
            full_form_pattern = r'\b' + re.escape(full_form.strip()) + r'\b'

            for match in re.finditer(acronym_pattern, para.text):
                position = match.start()

                if acronym not in acronym_usage:
                    # FIRST ENCOUNTER — acronym has never appeared before
                    if not re.search(full_form_pattern, para.text):
                        # Full definition is NOT in this paragraph → violation
                        # Yellow: "First instance of ABC should have definition"
                        findings.append(f"(Yellow) ...")
                        self.highlight_text(
                            para, position, len(acronym),
                            WD_COLOR_INDEX.DARK_YELLOW
                        )
                    acronym_usage[acronym] = 'first'

                else:
                    # SUBSEQUENT ENCOUNTER — acronym was already defined
                    if re.search(full_form_pattern, para.text):
                        # Full definition IS in this paragraph → redundant
                        # Teal: "Already defined, replace with acronym"
                        findings.append(f"(Teal) ...")
                        self.highlight_text(
                            para, para.text.find(full_form),
                            len(full_form), WD_COLOR_INDEX.TEAL
                        )
                    acronym_usage[acronym] = 'used'

The second loop catches the inverse case — where the full-form definition appears in a paragraph but the acronym was already defined earlier:

            for match in re.finditer(full_form_pattern, para.text):
                position = match.start()
                if acronym in acronym_usage and \
                   acronym not in acronym_found_in_para:
                    # Green: "Should be replaced with acronym in subsequent uses"
                    findings.append(f"(Green) ...")
                    self.highlight_text(
                        para, position, len(full_form),
                        WD_COLOR_INDEX.BRIGHT_GREEN
                    )

The acronym_found_in_para guard prevents false positives: if both the acronym and its full form appear in the same paragraph (as they should on first definition), the full-form match should not be flagged.
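The lifecycle logic can be exercised outside python-docx on plain strings. This condensed sketch is our simplification — no highlighting, findings reduced to tuples — but it runs the same three-state transitions:

```python
import re

def sweep(paragraphs, acronyms):
    """Condensed acronym lifecycle: yields (finding_kind, acronym, para_idx)."""
    usage, findings = {}, []
    for idx, text in enumerate(paragraphs):
        for acronym, full_form in acronyms.items():
            acro_pat = r'\b' + re.escape(acronym) + r'\b'
            full_pat = r'\b' + re.escape(full_form) + r'\b'
            for _ in re.finditer(acro_pat, text):
                if acronym not in usage:
                    # First encounter: definition must be in this paragraph
                    if not re.search(full_pat, text):
                        findings.append(('missing-definition', acronym, idx))
                    usage[acronym] = 'first'
                else:
                    # Subsequent encounter: definition must NOT reappear
                    if re.search(full_pat, text):
                        findings.append(('redundant-definition', acronym, idx))
                    usage[acronym] = 'used'
    return usage, findings

paras = [
    "Information Technology (IT) is central.",   # correct first use
    "IT teams handle this.",                     # correct reuse
    "Information Technology (IT) again.",        # redundant definition
]
usage, findings = sweep(paras, {"IT": "Information Technology"})
print(usage)     # {'IT': 'used'}
print(findings)  # [('redundant-definition', 'IT', 2)]
```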

Phase 4 — Unknown Acronym Detection

After checking known acronyms, the engine scans for uppercase words that are not in the reference list — potential undocumented acronyms:

        for position, word in enumerate(para.text.split()):
            cleaned_word = re.sub(r'[^\w/&]', '', word)

            if cleaned_word.isupper() and \
               cleaned_word not in acronyms and \
               len(cleaned_word) > 1:
                # Violet: "Found acronym not in acronym list"
                findings.append(f"(Violet) ...")
                self.highlight_text(...)

It also detects potential new acronyms — sequences of capitalized words that look like they could be abbreviated (e.g., “Software Simulation Systems Engineering” might become “S3E”). The algorithm tracks consecutive capitalized words, skips common filler words (of, and, the), and flags any sequence of two or more capitalized words whose combined form is not already a known definition.
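A hedged sketch of both heuristics — the function names, return shapes, and run-tracking details are our assumptions; only the cleaning regex and the ignore-word list come from the article:

```python
import re

# Filler words that should not break a capitalized-word run
IGNORE_WORDS = {'of', 'and', 'the', 'in', 'for', 'on', 'to',
                'with', 'by', 'as', 'at', 'from', '&'}

def find_unknown_acronyms(text, known):
    """Violet check: uppercase words (len > 1) absent from the reference list."""
    unknown = []
    for word in text.split():
        cleaned = re.sub(r'[^\w/&]', '', word)  # strip punctuation
        if cleaned.isupper() and len(cleaned) > 1 and cleaned not in known:
            unknown.append(cleaned)
    return unknown

def find_candidate_phrases(text, known_definitions):
    """Flag runs of 2+ capitalized words not already a known definition."""
    run, phrases = [], []
    for word in text.split():
        cleaned = re.sub(r'[^\w/&]', '', word)
        if cleaned[:1].isupper() and cleaned.lower() not in IGNORE_WORDS:
            run.append(cleaned)
        elif cleaned.lower() in IGNORE_WORDS and run:
            continue  # filler word inside a run: keep the run alive
        else:
            if len(run) >= 2 and ' '.join(run) not in known_definitions:
                phrases.append(' '.join(run))
            run = []
    if len(run) >= 2 and ' '.join(run) not in known_definitions:
        phrases.append(' '.join(run))
    return phrases

print(find_unknown_acronyms("The NASA and DOD budgets.", {"NASA"}))
# ['DOD']
print(find_candidate_phrases(
    "The Software Simulation Systems Engineering team met.", set()))
# ['Software Simulation Systems Engineering']
```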

Phase 5 — Post-Analysis and Table Inspection

After all paragraphs are processed, the engine performs two reconciliation checks:

    # Acronyms in the reference list but never found in the document
    for acronym in acronyms.keys():
        if acronym not in acronym_usage:
            findings.append(f"Acronym {acronym} in table but not used")

    # Acronyms found once but never reused
    for acronym, usage in acronym_usage.items():
        if usage == 'first':
            findings.append(f"Acronym {acronym} defined but not used")

Tables are then processed separately with their own font rules (10pt Arial = 127,000 EMU instead of 12pt = 152,400 EMU).

Why Word Boundaries Change Everything: The Regex Precision Layer

The Problem Without \b

Consider an acronym list containing “IT” (Information Technology). Without word boundaries, the regex IT would match inside “ITEM”, “ITERATION”, “SUITABLE”, and “COMMIT” — producing dozens of false positives per page.

The Solution With \b

The pattern \b + re.escape(acronym) + \b anchors the match to word boundaries. In Python’s re engine, \b matches the zero-width position between a word character (\w: letters, digits, underscore) and a non-word character. So \bIT\b matches “IT” as a standalone word but not “ITEM”.

re.escape() is equally important: if an acronym contains regex metacharacters (e.g., “C++”), re.escape converts it to C\+\+, preventing the + from being interpreted as a quantifier.
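Both behaviors are easy to verify in isolation (the sample sentence is ours):

```python
import re

pattern = r'\b' + re.escape('IT') + r'\b'

# Word boundaries: standalone "IT" matches; embedded "IT" does not
print(re.findall(pattern, 'IT staff update each ITEM and COMMIT changes.'))
# ['IT']

# re.escape neutralizes metacharacters: '+' would otherwise be a quantifier
print(re.escape('C++'))   # C\+\+
```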

Complexity

For a document with P paragraphs and A acronyms, the inner loop executes P × A regex searches. Each re.finditer() call scans the paragraph text in O(L) time, where L is the paragraph length. Total complexity: O(P × A × L).

For a typical 60-page document (~600 paragraphs, average length 500 characters) and 150 acronyms, this is approximately 45 million character comparisons — which completes in under a second on modern hardware because regex matching is implemented in C within Python’s re module.

🔵 Deep Dive: Python’s re module compiles regex patterns into bytecode that runs on a specialized virtual machine. For simple patterns like \bWORD\b, the engine can use a fast literal string search (similar to Boyer-Moore) before applying the boundary checks — so the actual performance is much better than the theoretical O(L) per pattern.

When the State Machine Breaks: Concurrent Access, Edge Cases, and Silent Failures

The “Definition in a Different Section” Problem

The state machine tracks acronyms globally across the entire document. But what if a document defines an acronym in Section 3, and the same acronym appears undefined in Section 1 (which the reader encounters first)? The engine processes paragraphs in document order, so it will flag the Section 1 usage as “first instance without definition” — which is the correct finding. The reading order matches the analysis order.

Duplicate Definitions in the Same Paragraph

If a paragraph contains both “Information Technology (IT)” and then “IT” again, the acronym_found_in_para dictionary prevents the second occurrence from being flagged as a redundant definition. Without this guard, the engine would produce a false-positive Teal finding on every first-definition paragraph.

Concurrent Request Safety

WordAnalyzer is instantiated fresh in every request (WordAnalyzer().acronym_sweep(...)). The acronym_usage dictionary, the findings list, and all other state are local to the method call. Two users analyzing different documents simultaneously in separate Gunicorn workers share no state — each worker has its own process memory space. This stateless design eliminates race conditions entirely.

🔴 Danger: If WordAnalyzer were a singleton with acronym_usage stored as an instance variable persisted across requests, one user’s document would contaminate the next user’s analysis. The per-request instantiation pattern is not accidental — it is a concurrency safety mechanism.

The re.escape Safety Net

If the acronym list contains an entry like C++, the unescaped pattern \bC++\b is an invalid regex — the second + has nothing to quantify, so re.compile raises re.error. re.escape turns it into \bC\+\+\b, preventing a single malformed entry from crashing the entire analysis. (An entry like TCP/IP is safe either way: the forward slash is not a metacharacter in Python’s re, and re.escape leaves it untouched.)
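A quick demonstration of the failure mode and the fix. The caveat about the trailing \b is our addition, not from the article:

```python
import re

# Unescaped: the second '+' has nothing to repeat -> re.error at compile time
try:
    re.compile(r'\bC++\b')
    compiled_ok = True
except re.error:
    compiled_ok = False
print(compiled_ok)  # False

# Escaped: compiles cleanly
escaped = re.compile(r'\b' + re.escape('C++') + r'\b')

# Caveat: the trailing \b only fires when a word character follows, so this
# pattern matches 'C++11' but not 'C++ code' -- a known limitation of
# boundary anchors next to punctuation
print(bool(escaped.search('C++11')))     # True
print(bool(escaped.search('C++ code')))  # False
```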

You Now Know How to Build a Stateful Document Analysis Engine

The core skill from this tutorial is modeling document compliance as a state machine driven by regex pattern matching. Specifically:

  1. State machines do not need frameworks. A Python dictionary with two possible values ('first' and 'used') is a perfectly valid state machine when the state transitions are simple and deterministic.

  2. Word boundaries are non-negotiable in text analysis. Without \b, any acronym matching engine will drown in false positives. This applies beyond document analysis — search engines, linters, and code analyzers all rely on boundary-aware matching.

  3. Configuration-driven rules scale. By externalizing security markings, font requirements, and the acronym reference list, the engine can adapt to different organizations’ standards without code changes. The analysis logic is policy-agnostic; the configuration is the policy.

  4. Multi-pass architecture separates concerns. Each phase (markings, fonts, acronyms, unknowns, tables) can be understood, tested, and modified independently. A new rule type (e.g., “check for passive voice”) would be a new phase appended to the pipeline, not a modification to an existing one.
