Resolving MD5 and SHA-256 Hash Divergence Across Distributed eDiscovery Processing Nodes

Two ingestion workers hash the same PST-extracted .msg, and the central manifest logs sync_commit_failed: md5_divergence — the same file yields a1b2c3… on Node A and d4e5f6… on Node B. This failure lands squarely in the Processing stage of the EDRM pipeline, at the exact-match filter built in Hash-Based Deduplication Strategies, and it breaks the one property that stage exists to guarantee: that identity-by-content is reproducible. When digests diverge by node, the composite md5:sha256 dedup key becomes non-deterministic, byte-identical files register as distinct masters, and the chain-of-custody promise that “the same input always produces the same digest” collapses — precisely the reproducibility opposing counsel probes first. This page isolates the root cause (non-deterministic I/O buffering under memory pressure, not a cryptographic flaw) and gives an auditable, deployable fix.

Diagnostic Log Signatures

The divergence is reproducible, not random: it recurs whenever two nodes run identical Python 3.11+ runtimes under different cgroup memory limits (for example 2 GB vs. 4 GB) against the same 5 GB+ container of nested MSG/OLE objects. Worker logs carry a recognizable signature:

text

WARN  hash_worker: memory_pressure_detected, switching_to_disk_spool
DEBUG chunk_offset_mismatch: node_A=1048576 node_B=1048572
ERROR sync_commit_failed: md5_divergence (expected a1b2c3…, got d4e5f6…)
INFO  sha256_verification: partial_update_rejected

The chunk_offset_mismatch line — a four-byte gap between nodes — confirms non-deterministic buffering rather than corruption. The partial_update_rejected line means the fallback chain caught the divergence, but without an atomic rollback the pipeline still commits mismatched state. Symptom checklist:

Same file, same algorithm, different digest keyed to the worker node, reproducible on re-run.
Divergence correlates with memory_pressure_detected warnings, never with clean-memory runs.
Only a small fraction of files (those large enough to trigger disk-spool fallback) are affected.
Downstream family grouping shows duplicate “masters” for a message that should have collapsed to one.
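
The log signature above can be checked mechanically. The following is a minimal sketch, assuming worker logs use exactly the line shapes shown in the sample; the function name summarize_worker_log and the node_A/node_B field names in the pattern are illustrative and may need adjusting for a real deployment.

```python
import re

# Pattern for the DEBUG line in the sample log; the node_A / node_B field
# names come from that sample and are an assumption about the log format.
OFFSET_RE = re.compile(r"chunk_offset_mismatch: node_A=(\d+) node_B=(\d+)")


def summarize_worker_log(lines):
    """Return which signature lines were seen and the byte gap between nodes."""
    summary = {"memory_pressure": False, "md5_divergence": False, "offset_gap": None}
    for line in lines:
        if "memory_pressure_detected" in line:
            summary["memory_pressure"] = True
        if "md5_divergence" in line:
            summary["md5_divergence"] = True
        match = OFFSET_RE.search(line)
        if match:
            node_a, node_b = int(match.group(1)), int(match.group(2))
            summary["offset_gap"] = abs(node_a - node_b)
    return summary


SAMPLE = [
    "WARN  hash_worker: memory_pressure_detected, switching_to_disk_spool",
    "DEBUG chunk_offset_mismatch: node_A=1048576 node_B=1048572",
    "ERROR sync_commit_failed: md5_divergence (expected a1b2c3…, got d4e5f6…)",
    "INFO  sha256_verification: partial_update_rejected",
]

print(summarize_worker_log(SAMPLE))
```

A non-zero offset_gap alongside memory_pressure and md5_divergence matches the buffering signature described here; a divergence with no offset gap points elsewhere.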

Root-Cause Breakdown

Digest divergence across nodes is an I/O determinism problem, not an algorithm problem. Four contributing factors compound:

Dynamic buffer resizing under memory pressure. In the affected workers, the buffered read path built on Python’s io.BufferedReader does not hold a fixed read window when RAM is constrained. Because the fallback chunker keys its read window off that buffer, two nodes with different cgroup limits feed hashlib.update() different byte windows, and if a chunk boundary lands mid-object, the final fragment can inherit trailing padding or a UTF-8 BOM artifact. Both MD5 and SHA-256 exhibit the avalanche effect, so a single-byte difference in the input yields a completely different digest.
Disk-spool fallback without fixed alignment. When workers spill to disk under pressure, spooled chunk boundaries are not aligned to a fixed block size, so the byte ranges handed to the hasher differ from the in-memory path on the other node.
Non-atomic cross-node state commits. Concurrent writes to a shared registry (Redis, DynamoDB) under network partition or clock skew let a stale read commit a mismatched digest silently, so a divergence that started as a buffering artifact becomes durable corrupted state.
Text-mode or read-ahead contamination. Any node that opens a file with OS read-ahead or newline translation active hashes a different byte stream than one reading strictly unbuffered binary — the same defect that surfaces in cryptographic hash generation when files are opened in text mode.

The digest never changes because of an algorithm — it changes because memory pressure let the OS move the chunk boundary. A fixed, unbuffered read window removes that freedom.
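
To make the failure mode concrete, here is a small simulation, offered as an illustration rather than a reproduction of the production chunker. It hashes the same in-memory payload through contiguous windows, then through windows where the second one starts four bytes early or four bytes late. The helper digest_windows and the sample payload are hypothetical.

```python
import hashlib


def digest_windows(data: bytes, windows) -> str:
    """Hash the byte ranges given as (start, end) pairs, in the order listed."""
    h = hashlib.sha256()
    for start, end in windows:
        h.update(data[start:end])
    return h.hexdigest()


data = bytes(range(256)) * 64  # 16,384 bytes of sample content
reference = hashlib.sha256(data).hexdigest()

# Contiguous windows: every byte is hashed exactly once, in source order.
clean = digest_windows(data, [(0, 8192), (8192, 16384)])
# Second window starts 4 bytes early, so 4 bytes are hashed twice.
overlap = digest_windows(data, [(0, 8192), (8188, 16384)])
# Second window starts 4 bytes late, so 4 bytes are never hashed.
gap = digest_windows(data, [(0, 8192), (8196, 16384)])

print("clean matches reference:  ", clean == reference)
print("overlap matches reference:", overlap == reference)
print("gap matches reference:    ", gap == reference)
```

Only the contiguous run reproduces the reference digest; a four-byte overlap or gap changes the hashed input and therefore the digest.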

Remediation Architecture

The digest of a stream is independent of how it is chunked only if every byte reaches hashlib.update() exactly once, in source order, with no chunk boundary that can shift between nodes. The fix enforces that with three controls: deterministic fixed-size unbuffered reads, a byte-count assertion against filesystem size, and an atomic compare-first commit so a divergence can never silently overwrite good state.

1. Deterministic dual-hash computation

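Open the file in binary mode with buffering=0 so Python returns a raw, unbuffered file object and no BufferedReader sits between the file and the hasher. Request fixed 8,192-byte blocks until EOF (a raw read may return fewer bytes than requested, which does not affect the digest as long as every byte is hashed once and in order), and validate the processed byte count against st_size to catch a truncated read such as one caused by an OOM kill. The routine keeps all hashing state local, so it is safe to dispatch across an async batch processing worker pool.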

python

import hashlib
import logging
from pathlib import Path
from typing import NamedTuple

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("hash_sync")

CHUNK_SIZE = 8192  # Fixed block size: identical on every node, regardless of RAM.


class DualDigest(NamedTuple):
    md5: str
    sha256: str
    bytes_processed: int


def compute_dual_hash(file_path: Path, node_id: str) -> DualDigest:
    """Compute MD5 and SHA-256 with node-independent, byte-level determinism.

    Reads through a raw, unbuffered file object in fixed windows so no
    Python-level buffer can shift a chunk boundary between nodes, then asserts
    the processed byte count equals the on-disk size to reject any
    truncated/partial read.
    """
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"Artifact not found: {path}")

    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    bytes_processed = 0

    # buffering=0 returns a raw file object with no Python-level buffer;
    # "rb" disables newline/encoding translation.
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(CHUNK_SIZE):
            md5.update(chunk)
            sha256.update(chunk)
            bytes_processed += len(chunk)

    actual_size = path.stat().st_size
    if bytes_processed != actual_size:
        raise ValueError(
            f"Byte mismatch on {node_id}: read {bytes_processed} vs st_size "
            f"{actual_size} for {path.name}. Possible OOM truncation or "
            "mid-ingestion modification; chain of custody cannot be certified."
        )

    md5_hex, sha256_hex = md5.hexdigest(), sha256.hexdigest()
    if len(md5_hex) != 32 or len(sha256_hex) != 64:
        raise ValueError(f"Digest length validation failed for {path.name}")

    logger.info(
        "hash_ok node=%s file=%s md5=%s sha256=%s bytes=%d",
        node_id, path.name, md5_hex, sha256_hex, bytes_processed,
    )
    return DualDigest(md5_hex, sha256_hex, bytes_processed)
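
The routine above reads st_size through path.stat() after the handle is closed. One possible hardening, sketched here under the assumption that the file may be modified while it is being hashed, is to take the size from the open handle with os.fstat both before and after the read loop. The function name hash_with_size_guard is hypothetical and the sketch computes only SHA-256 for brevity.

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 8192  # Same fixed window as the routine above.


def hash_with_size_guard(path: str) -> tuple[str, int]:
    """Hash a file and compare its size on the open handle before and after."""
    sha256 = hashlib.sha256()
    total = 0
    with open(path, "rb", buffering=0) as f:
        size_before = os.fstat(f.fileno()).st_size
        while chunk := f.read(CHUNK_SIZE):
            sha256.update(chunk)
            total += len(chunk)
        size_after = os.fstat(f.fileno()).st_size
    # All three values must agree; otherwise the read was partial or the
    # file changed underneath the hasher.
    if not (size_before == total == size_after):
        raise ValueError(
            f"size changed or read was partial: before={size_before} "
            f"read={total} after={size_after}"
        )
    return sha256.hexdigest(), total


# Small self-check against a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 20000)
try:
    digest, size = hash_with_size_guard(tmp.name)
finally:
    os.unlink(tmp.name)

print(size, digest)
```

Checking the size on the same handle that was read avoids a race in which the path is replaced between the read and the stat call.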

2. Atomic cross-node commit

Stage the digest locally, then publish to the shared registry with an idempotent, compare-first upsert. If another node already registered a digest for the same file, the two must match before either commits; a mismatch is dead-lettered for a deterministic re-hash rather than silently overwriting.

python

from typing import Optional


def commit_digest(registry, path_key: str, digest: DualDigest, node_id: str) -> str:
    """Two-phase commit: register if absent, else require cross-node consensus.

    `registry.set_if_absent` must be an atomic operation (e.g. Redis SETNX) so a
    race between two workers hashing the same file resolves to one authoritative
    record rather than two divergent masters.
    """
    composite = f"{digest.md5}:{digest.sha256}"
    won = registry.set_if_absent(path_key, composite)
    if won:
        return "registered"

    existing: Optional[str] = registry.get(path_key)
    if existing == composite:
        return "consensus"  # Both nodes agree — deterministic fix confirmed.

    registry.dead_letter(path_key, node_id, computed=composite, existing=existing)
    logger.error("md5_divergence path=%s node=%s existing=%s got=%s",
                 path_key, node_id, existing, composite)
    return "divergence"
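
The registry object above is only described by the three methods it must expose. For local testing, a lock-guarded in-memory stand-in can exercise the same register, consensus, and divergence paths. This is a sketch: InMemoryRegistry and commit_composite are hypothetical names, and a production registry would rely on the store's own atomic primitive rather than a process-local lock.

```python
import threading


class InMemoryRegistry:
    """Hypothetical stand-in for the shared registry; for local tests only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}
        self.dead_letters = []

    def set_if_absent(self, key, value) -> bool:
        # The lock makes check-and-set atomic within this process.
        with self._lock:
            if key in self._records:
                return False
            self._records[key] = value
            return True

    def get(self, key):
        with self._lock:
            return self._records.get(key)

    def dead_letter(self, key, node_id, computed, existing):
        with self._lock:
            self.dead_letters.append(
                {"key": key, "node": node_id, "computed": computed, "existing": existing}
            )


def commit_composite(registry, path_key, composite, node_id) -> str:
    """Same decision logic as commit_digest, taking the composite key directly."""
    if registry.set_if_absent(path_key, composite):
        return "registered"
    existing = registry.get(path_key)
    if existing == composite:
        return "consensus"
    registry.dead_letter(path_key, node_id, computed=composite, existing=existing)
    return "divergence"


registry = InMemoryRegistry()
outcomes = [
    commit_composite(registry, "custodian/mail.msg", "aaa:bbb", "node-a"),
    commit_composite(registry, "custodian/mail.msg", "aaa:bbb", "node-b"),
    commit_composite(registry, "custodian/mail.msg", "ccc:ddd", "node-c"),
]
print(outcomes)  # ['registered', 'consensus', 'divergence']
```

The first writer registers, an agreeing second writer reaches consensus, and a disagreeing third writer is dead-lettered without touching the stored record.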

Sequence diagram: Node A and Node B each report dual digests to the central manifest; a chunk-offset mismatch triggers a deterministic re-hash. The manifest never overwrites silently: it reconciles both reports, pinpoints the diverging offset, and drives a deterministic re-hash before any digest is committed.

3. Operational recovery when divergence is already in production

Isolate. Quarantine the affected node pool and halt downstream indexing so corrupted state does not cascade into family grouping.
Recompute. Re-run compute_dual_hash against the original source media — never against cached or spooled intermediates.
Reconcile. Cross-reference recomputed digests against the staging registry and apply atomic upserts only for verified matches.
Certify. Emit a signed manifest of the corrected digests and archive it alongside the production as the record of defensible recovery.
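
The Reconcile and Certify steps can be sketched as follows. This is an illustration under stated assumptions: digests are held as plain dicts keyed by file path, reconcile, sign_manifest, and verify_manifest are hypothetical helpers, and HMAC-SHA256 with a shared key stands in for whatever signing scheme the matter actually requires (an HMAC is an integrity tag, not a public-key signature).

```python
import hashlib
import hmac
import json


def reconcile(recomputed: dict, staged: dict):
    """Split recomputed digests into verified matches and entries held back."""
    verified = {k: v for k, v in recomputed.items() if staged.get(k) == v}
    held = sorted(k for k, v in recomputed.items() if staged.get(k) != v)
    return verified, held


def _canonical(digests: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte form to sign.
    return json.dumps(digests, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(verified: dict, key: bytes) -> dict:
    tag = hmac.new(key, _canonical(verified), hashlib.sha256).hexdigest()
    return {"digests": verified, "hmac_sha256": tag}


def verify_manifest(manifest: dict, key: bytes) -> bool:
    expected = hmac.new(key, _canonical(manifest["digests"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, manifest["hmac_sha256"])


recomputed = {"a.msg": "m1:s1", "b.msg": "m2:s2"}
staged = {"a.msg": "m1:s1", "b.msg": "mX:sX"}
verified, held = reconcile(recomputed, staged)
manifest = sign_manifest(verified, key=b"demo-key-not-for-production")
print(verified, held, verify_manifest(manifest, b"demo-key-not-for-production"))
```

Entries in held are the ones that should stay quarantined until a deterministic re-hash against source media settles which digest is correct.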

Verification Checklist

Every node opens files with buffering=0 and mode rb: no Python-level buffering, no newline translation.
CHUNK_SIZE is a fixed constant, identical across all worker images and cgroup limits.
bytes_processed equals st_size for every file; any mismatch is raised, not logged-and-skipped.
The same file hashed on a 2 GB node and a 4 GB node produces byte-identical MD5 and SHA-256.
Registry writes use an atomic set-if-absent; divergent digests route to the dead-letter manifest.
chunk_offset_mismatch and md5_divergence no longer appear in worker logs across a full re-run.
Every digest event is logged with node_id, offsets, byte count, and timestamp to the immutable audit ledger.

Conclusion

Hash divergence across nodes is almost never a broken algorithm — it is a chunk boundary the operating system was allowed to move under memory pressure. Reading unbuffered in a fixed window makes the digest independent of every node’s RAM, the byte-count assertion rejects any partial read before it can commit, and the atomic compare-first upsert guarantees that two workers hashing the same file converge on one authoritative record. With those three controls in place, the composite dedup key is deterministic again, byte-identical files collapse as they should, and every suppression decision is reproducible from a signed audit trail — the defensibility guarantee the deduplication stage is built to provide.

Frequently Asked Questions

Why do the digests differ across nodes when the algorithm is identical?

Because the input windows differ, not the algorithm. Under memory pressure the buffered read path and the disk-spool fallback shift chunk boundaries, so two nodes feed hashlib.update() different byte ranges, and if a boundary lands mid-object, trailing padding or a BOM byte gets included on one node and not the other. Reading unbuffered in a fixed 8,192-byte window leaves nothing that can move the boundary, so both nodes hash identical bytes in identical order and converge on one digest.

Does chunk size affect the final digest value?

No — the digest of a stream is mathematically independent of how it is split, provided every byte reaches hashlib.update() exactly once and in source order. The bug is never the chunk size itself; it is a boundary shifting so a byte is duplicated, skipped, or padded. Pin CHUNK_SIZE to one constant across every worker image purely for reproducibility and clean audit logs, not because the value changes the result.
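
A quick way to confirm this property locally is to hash one payload at several read sizes and compare the results. The sketch below uses an in-memory stream; stream_md5_sha256 and the sample payload are illustrative, not part of the pipeline.

```python
import hashlib
import io


def stream_md5_sha256(stream, chunk_size: int):
    """Hash a binary stream in chunk_size reads; return (md5, sha256) hex digests."""
    md5, sha256 = hashlib.md5(), hashlib.sha256()
    while chunk := stream.read(chunk_size):
        md5.update(chunk)
        sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


payload = b"eDiscovery sample payload\r\n" * 5000

# Hash the same bytes with very different read sizes.
results = {
    size: stream_md5_sha256(io.BytesIO(payload), size)
    for size in (1, 4096, 8192, 65536)
}

one_shot = (hashlib.md5(payload).hexdigest(), hashlib.sha256(payload).hexdigest())
print(all(pair == one_shot for pair in results.values()))
```

Every read size yields the same pair of digests as hashing the payload in one call, which is why the chunk size is pinned for auditability rather than correctness.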

A file passes on both nodes but still fails the byte-count assertion — what does that mean?

That the read was truncated, usually by an OOM kill mid-stream, so bytes_processed is less than st_size. The digest computed over a partial read is well-formed but wrong. Treat the assertion failure as authoritative: quarantine the file, re-run against the original source media on a node with adequate memory, and never commit a digest whose byte count does not match the filesystem size.

Hash-Based Deduplication Strategies — the composite md5:sha256 key and registry this divergence corrupts.
Debugging SHA-256 Hash Generation Failures — the ingestion-time variant of the same binary-streaming defect.
Similarity Threshold Configuration — where files that cannot be exact-matched are routed for near-duplicate scoring.
Production Compliance Frameworks — the matter-wide reproducibility and audit obligations this fix satisfies.
Deduplication & Family Grouping — the processing stage whose family integrity depends on deterministic digests.
