Mapping Native ESI to Review Platforms: Fixing Broken Text Extraction & Hash Mismatches

When native ESI is wired into a review platform without a deterministic translation layer, ingestion surfaces a recognizable trio of defects: empty or truncated text panels, SHA256 MISMATCH entries in the processing manifest, and orphaned child documents whose parent container never finished unpacking. These failures land at the mapping stage — after signature detection but before the review platform commits a document record — and they violate the chain-of-custody boundary because a payload that was hashed at collection no longer matches the payload the platform indexed. This guide is the step-by-step procedure for wiring registry rules into a specific platform under the ESI Format Mapping Standards, diagnosing the resource-constrained failures that break mapping, and restoring defensibility so every native format resolves to a canonical review representation without evidentiary loss.

Diagnostic Log Signatures

The failure is rarely a clean exception. Under memory pressure the mapping worker degrades non-deterministically — it emits an HTTP 503 to the platform API, silently drops a hash comparison, or aborts a container mid-unpack. The ingestion log makes the compounding failure vectors legible:

text

[INGEST-PROC] WARN  FormatMapper: Native handler timeout for DOCX/ZIP hybrid (PID: 4491)
[EXTRACT-ENG] ERROR Memory allocation exceeded threshold (2.1GB > 1.5GB cap) during OLE2 stream parsing
[HASH-VERIF] MISMATCH: SHA256(expected: a3f8...) != SHA256(actual: 7b2c...) for native payload
[PRIV-LOG]   WARN  Orphaned child document detected; parent container extraction aborted
[MAP-EXIT]   ERROR Worker exited non-zero (code 137) after OOM-kill of PID 4491

Exit code 137 (128 + 9, the signal number of SIGKILL) is the tell: the OS killed the worker before it finalized the mapping record, so the platform received a partial commit. Use this symptom checklist to confirm you are looking at a mapping-stage failure rather than a downstream extraction bug:

Review panel renders no text, but the native file opens correctly outside the platform.
The manifest shows a recomputed hash that differs from the collection-time digest.
Child items (attachments, embedded objects) appear without a parent, or the parent lists a child count higher than the ingested count.
Worker logs show a code 137 (SIGKILL) or code 139 (128 + 11, SIGSEGV) exit adjacent to a 503 returned to the platform API.
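The checklist can be partly automated. The sketch below scans ingestion log lines for the signatures shown earlier; it assumes the log format matches the sample above, and `triage_log`, `SIGNAL_EXITS`, and the result keys are illustrative names rather than part of any platform API.

```python
import re
from typing import Dict, Iterable, List, Union

# POSIX shells report "killed by signal N" as exit status 128 + N.
SIGNAL_EXITS: Dict[int, str] = {137: "SIGKILL", 139: "SIGSEGV"}

# Matches the "(code 137)" fragment used in the sample log above.
EXIT_CODE_RE = re.compile(r"\(code (\d+)\)")


def triage_log(lines: Iterable[str]) -> Dict[str, Union[bool, List[str]]]:
    """Flag the mapping-stage signatures found in an ingestion log."""
    signal_exits: List[str] = []
    hash_mismatch = False
    orphaned_child = False
    for line in lines:
        match = EXIT_CODE_RE.search(line)
        if match and int(match.group(1)) in SIGNAL_EXITS:
            signal_exits.append(SIGNAL_EXITS[int(match.group(1))])
        if "[HASH-VERIF]" in line and "MISMATCH" in line:
            hash_mismatch = True
        if "Orphaned child document" in line:
            orphaned_child = True
    return {
        "signal_exits": signal_exits,
        "hash_mismatch": hash_mismatch,
        "orphaned_child": orphaned_child,
        # A signal exit next to either integrity symptom points at mapping.
        "mapping_stage_suspected": bool(signal_exits)
        and (hash_mismatch or orphaned_child),
    }
```

A result with `mapping_stage_suspected` set is only a prompt to check the remaining checklist items by hand; the review-panel symptom in particular cannot be read from the worker log.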

Root-Cause Breakdown

Three mechanics, usually acting together, produce the signatures above:

Hybrid container overhead. Modern Office files (DOCX, XLSX, PPTX) are ZIP archives that frequently embed OLE2 streams. Decompressing the ZIP and parsing the embedded compound file simultaneously pushes thread-local allocation past the worker’s memory ceiling, triggering a stop-the-world garbage-collection pause and, eventually, the OOM-kill.
Hash-state corruption. When a rolling hash buffer is interrupted mid-read — because the process was paused or killed — the digest is finalized over an incomplete byte stream. The result is a mismatch against the collection-time value even though the file on disk is intact, which is why disciplined cryptographic hash generation must stream over bounded reads and finalize only after the whole payload is consumed.
Fallback that skips validation. With no explicit fallback rule, an unrecognized signature routes to a legacy parser that bypasses magic-byte reconciliation entirely — the same class of misrouting analyzed in MIME type detection with libmagic. A spoofed .docx that is really a ZIP bomb then reaches an unguarded extractor, and the memory ceiling is breached before any integrity check runs.
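The hash-state mechanic is easy to reproduce in isolation. In this minimal sketch, `chunked_sha256` and its `stop_after` parameter are hypothetical names used to simulate a read that ends early; the payload is a synthetic stand-in for a native file.

```python
import hashlib
from typing import Optional


def chunked_sha256(data: bytes, chunk_size: int = 8192,
                   stop_after: Optional[int] = None) -> str:
    """Hash data in bounded chunks; stop_after simulates an interrupted read."""
    digest = hashlib.sha256()
    limit = len(data) if stop_after is None else min(stop_after, len(data))
    for offset in range(0, limit, chunk_size):
        digest.update(data[offset:min(offset + chunk_size, limit)])
    return digest.hexdigest()


payload = b"native payload bytes\n" * 10_000  # synthetic stand-in

complete = chunked_sha256(payload)
interrupted = chunked_sha256(payload, stop_after=len(payload) // 2)

# Chunking alone does not change the digest...
assert complete == hashlib.sha256(payload).hexdigest()
# ...but finalizing over a truncated stream yields a different value.
assert interrupted != complete
```

The bytes on disk are identical in both calls; only the point at which the digest is finalized differs, which is the situation the mismatch entry in the log reflects.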

Remediation Architecture

The fix is to make mapping deterministic and memory-bounded before it ever calls the platform API. Normalization must follow a fixed sequence so that no parser runs until the payload has been identified, isolated, and hashed:

Signature verification — validate magic bytes against a curated registry to defeat extension spoofing.
Container boundary enforcement — isolate embedded streams (OLE2, ZIP entries, PDF attachments) before invoking any downstream parser.
Hash chain initialization — compute SHA-256 and MD5 over the raw native payload prior to any transformation.
Metadata propagation — attach privilege tags, custodian identifiers, and family pointers to the normalized record.

The diagram below shows this deterministic normalization sequence end to end.

The following implementation makes each stage explicit. A memory guardrail rejects an archive whose declared uncompressed payload would breach the ceiling, closing the zip-bomb path, and streamed hashing keeps peak memory independent of file size. It relies on Python’s standard zipfile and hashlib modules; SHA-256 is the algorithm specified in FIPS 180-4, whereas MD5 is not part of that standard.

python

import hashlib
import os
import zipfile
from pathlib import Path
from typing import Dict

# Curated magic bytes for common ESI formats.
MAGIC_BYTES: Dict[bytes, str] = {
    b"PK\x03\x04": "application/zip",
    b"\xd0\xcf\x11\xe0": "application/x-ole2",
    b"%PDF": "application/pdf",
}


def compute_hashes(file_path: Path, chunk_size: int = 8192) -> Dict[str, str]:
    """Stream SHA-256 and MD5 over bounded reads so peak memory stays flat."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    try:
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                sha256.update(chunk)
                md5.update(chunk)
        # Finalize only after the whole payload is consumed — never mid-stream.
        return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}
    except OSError as e:
        raise RuntimeError(f"Hash computation failed for {file_path}: {e}") from e


def map_native_format(file_path: Path, max_memory_mb: int = 1500) -> Dict:
    """Deterministic format mapping with memory guardrails and an audit trail."""
    if not file_path.exists():
        raise FileNotFoundError(f"Native ESI not found: {file_path}")

    # Stage 3, computed first so the custody anchor exists before any transform.
    pre_hashes = compute_hashes(file_path)

    # Stage 1: magic-byte identification overrides the file extension.
    try:
        with open(file_path, "rb") as f:
            header = f.read(4)
    except OSError as e:
        raise RuntimeError(f"Header read failed for {file_path}: {e}") from e

    mime_type = MAGIC_BYTES.get(header, "application/octet-stream")

    # Stage 2: container boundary enforcement before any downstream parser.
    if mime_type == "application/zip":
        try:
            with zipfile.ZipFile(file_path) as zf:
                # Reject archives whose uncompressed payload would breach the
                # ceiling, preventing the OOM-kill seen at code 137.
                total_uncompressed = sum(info.file_size for info in zf.infolist())
                if total_uncompressed > max_memory_mb * 1024 * 1024:
                    raise ValueError(
                        f"Uncompressed size {total_uncompressed} bytes exceeds "
                        f"{max_memory_mb}MB memory cap"
                    )
                # Validate archive integrity before mapping any entry.
                first_bad = zf.testzip()
                if first_bad is not None:
                    raise ValueError(f"Corrupt entry detected in archive: {first_bad}")
                # Memory-aware extraction would route to the platform API here.
        except zipfile.BadZipFile as e:
            raise RuntimeError(f"ZIP mapping aborted: {e}") from e

    # Stage 4: return a validated payload for platform ingestion.
    return {
        "file_path": str(file_path),
        "detected_mime": mime_type,
        "pre_ingest_hashes": pre_hashes,
        "mapping_status": "VALIDATED",
        "audit_trail_id": f"MAP-{os.urandom(8).hex()}",
    }

Because the guardrail rejects oversized archives before extraction and the hash is finalized only over a complete read, the two dominant failure vectors — the OOM-kill and the mid-stream digest corruption — can no longer reach the platform commit.
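Container boundary enforcement can go one step further for hybrid files by locating embedded OLE2 streams before any parser touches them. The following is a sketch under stated assumptions: `find_embedded_ole2` is a hypothetical helper, and it identifies OLE2 content purely by the 8-byte Compound File signature at the start of each ZIP entry.

```python
import zipfile
from pathlib import Path
from typing import BinaryIO, List, Union

# Full 8-byte Compound File Binary (OLE2) signature.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def find_embedded_ole2(archive: Union[str, Path, BinaryIO]) -> List[str]:
    """Return names of ZIP entries whose leading bytes carry the OLE2 signature."""
    hits: List[str] = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Read only the signature bytes; the entry is not extracted to disk
            # and its full contents are never held in memory.
            with zf.open(info) as entry:
                if entry.read(len(OLE2_MAGIC)) == OLE2_MAGIC:
                    hits.append(info.filename)
    return hits
```

Entries returned here could be routed to a dedicated OLE2 handler one at a time, so ZIP decompression and compound-file parsing do not compete for the same memory budget.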

Incident Recovery & Reconciliation

For batches that already failed, immediate isolation and cryptographic reconciliation restore defensibility. The recovery flow below moves a failing batch from isolation back to a clean re-extraction.

Quarantine failing batches. Route documents that triggered an HTTP 503 or a hash mismatch to a secure staging directory. Do not retry against the primary ingestion queue without first lowering the memory cap.
Recompute and reconcile hashes. Run compute_hashes against the quarantined files and compare the output to the collection-time digests in the original processing manifest. If the recomputed digest matches, the file is intact and the earlier mismatch came from a digest finalized over an incomplete read; if it still diverges, the payload itself was corrupted in transit. This is the reconciliation discipline detailed in generating SHA-256 hashes for chain of custody.
Rebuild family trees. Correlate quarantined items by their audit_trail_id to locate the affected mapping events, then re-extract the container hierarchies with a single-threaded parser so that concurrent unpacking cannot abort a parent mid-extraction and orphan its children again.
Preserve audit trails. Log every mapping decision, memory-threshold adjustment, and hash verification. Immutable audit logs are mandatory for privilege-schema validation and downstream compliance review.
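The reconciliation step can be sketched as follows. The manifest layout is an assumption: the code treats it as a mapping of file name to collection-time SHA-256 hex digest, which may not match how a given platform exports its processing manifest. `sha256_file` and `reconcile` are illustrative names.

```python
import hashlib
from pathlib import Path
from typing import Dict


def sha256_file(path: Path, chunk_size: int = 8192) -> str:
    """Stream SHA-256 over bounded reads and finalize after the last chunk."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def reconcile(manifest: Dict[str, str], quarantine_dir: Path) -> Dict[str, str]:
    """Classify each manifest entry as MATCH, DIVERGED, or MISSING."""
    results: Dict[str, str] = {}
    for name, expected in sorted(manifest.items()):
        candidate = quarantine_dir / name
        if not candidate.is_file():
            results[name] = "MISSING"
        elif sha256_file(candidate) == expected.lower():
            # File is intact; the earlier mismatch was a bad digest.
            results[name] = "MATCH"
        else:
            # Payload differs from what was hashed at collection.
            results[name] = "DIVERGED"
    return results
```

Items classified MATCH are candidates for constrained reprocessing, while DIVERGED and MISSING items need to be traced back to collection before anything is re-ingested.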

Verification Checklist

Confirm the fix before releasing the batch back to the primary queue:

Recomputed SHA-256 matches the collection-time digest for every item in the batch.
No worker exited with code 137 or code 139 during the reprocessing run.
Every child document resolves to a parent, and parent child-counts equal the ingested count.
The uncompressed-size guardrail rejected or passed each archive explicitly — no silent extraction.
The audit log records a VALIDATED status and an audit_trail_id for each mapped item.
The platform review panel renders extracted text for a sampled item from the batch.
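The family-integrity item in the checklist lends itself to a mechanical check. This is a minimal sketch: the `DocRecord` fields (`doc_id`, `parent_id`, `declared_children`) are hypothetical and would need to be mapped onto whatever the platform actually stores for each document.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Optional


class DocRecord(NamedTuple):
    doc_id: str
    parent_id: Optional[str]    # None for a top-level container
    declared_children: int = 0  # child count the parent reported at unpack time


def check_families(records: Iterable[DocRecord]) -> Dict[str, List[str]]:
    """Return orphaned children and parents whose child counts do not reconcile."""
    records = list(records)
    known_ids = {r.doc_id for r in records}
    # Number of ingested children actually pointing at each parent.
    ingested = Counter(r.parent_id for r in records if r.parent_id is not None)
    orphans = [
        r.doc_id for r in records
        if r.parent_id is not None and r.parent_id not in known_ids
    ]
    count_mismatch = [
        r.doc_id for r in records
        if r.declared_children != ingested.get(r.doc_id, 0)
    ]
    return {"orphans": orphans, "count_mismatch": count_mismatch}
```

Both lists should be empty before the batch is released; a non-empty `count_mismatch` corresponds to the symptom where a parent lists more children than were ingested.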

Conclusion

By hashing before transformation, enforcing container boundaries with an uncompressed-size guardrail, and reconciling quarantined batches against the original manifest, native-to-platform mapping stops failing non-deterministically. The OOM-kill, the corrupted digest, and the orphaned child become predictable, recoverable events with a documented audit trail — which is exactly what a defensible pipeline must demonstrate under cross-examination.

Frequently Asked Questions

Why does the recomputed hash differ when the file on disk is intact?

Because the original digest was finalized over an incomplete read. When the worker is OOM-killed mid-stream (code 137), the rolling buffer is finalized before the whole payload is consumed, producing a value that will never match a clean recomputation. Streaming the hash over bounded reads and finalizing only after the full file is consumed — as compute_hashes does — makes the mismatch disappear on the reprocessing run, which is how you distinguish a killed worker from genuine payload corruption.

How do I set the memory cap for hybrid DOCX/OLE2 files?

Size the cap to the worker’s thread-local heap minus headroom for the platform client, not to total host RAM, because a DOCX decompresses its ZIP and parses embedded OLE2 streams at the same time. Start with the sum of the largest expected uncompressed archive plus the OLE2 working set, then confirm the guardrail rejects anything above it before extraction. A cap set from host RAM will still OOM once several large hybrids land on one worker concurrently.
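As a rough illustration of that sizing arithmetic, the sketch below derives a per-archive cap from the worker heap. Dividing the usable heap evenly across concurrent hybrids is an assumption made here for simplicity, and the function and parameter names are hypothetical; real figures should come from profiling the worker.

```python
def guardrail_cap_mb(worker_heap_mb: int, client_headroom_mb: int,
                     concurrent_hybrids: int = 1) -> int:
    """Per-archive cap: usable heap split across hybrids processed at once."""
    if concurrent_hybrids < 1:
        raise ValueError("concurrent_hybrids must be at least 1")
    usable = worker_heap_mb - client_headroom_mb
    if usable <= 0:
        raise ValueError("headroom leaves no usable heap")
    return usable // concurrent_hybrids


def fits_under_cap(largest_uncompressed_mb: int, ole2_working_set_mb: int,
                   cap_mb: int) -> bool:
    """True when the expected peak (archive plus OLE2 working set) fits the cap."""
    return largest_uncompressed_mb + ole2_working_set_mb <= cap_mb


# Illustrative numbers only: a 2048 MB heap, 256 MB of client headroom,
# and two hybrids in flight on the same worker.
cap = guardrail_cap_mb(2048, 256, concurrent_hybrids=2)
print(cap, fits_under_cap(600, 200, cap), fits_under_cap(800, 200, cap))
```

The resulting value would be passed as `max_memory_mb` to the guardrail so that anything above it is rejected before extraction.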

Can I re-run a quarantined batch through the primary queue?

Not without lowering the memory cap first. The same batch that OOM-killed a worker will do it again under identical limits, and a blind retry against evidentiary material risks a second partial commit. Reprocess quarantined items on a constrained, single-threaded path, verify the hashes reconcile against the manifest, and only then release them.

Generating SHA-256 Hashes for Chain of Custody — the streaming-digest procedure that anchors reconciliation.
Attachment Parent-Child Mapping — preserving family lineage so containers never orphan their children.
Production Compliance Frameworks — the load-file, Bates, and audit checks mapping outputs must satisfy before delivery.
Privilege Schema Design — the privilege tags and redaction boundaries propagated during Stage 4.
