Hash-Based Deduplication Strategies: Defensible Exact-Match Filtering at ESI Scale

Hash-based deduplication is the deterministic gatekeeper of the EDRM Processing stage: it collapses byte-identical file instances into a single canonical record before threading, family grouping, or privilege review ever runs, and it does so with a cryptographic proof that the collapse was correct. Inside the broader Deduplication & Family Grouping pipeline, exact-match hashing is the cheapest and most defensible cull available — it removes redundancy without ever inspecting content semantics — but it is also the stage where naive implementations quietly fail at scale. An unbounded in-memory hash set exhausts worker RAM halfway through a terabyte collection; a single-algorithm key invites a collision argument from opposing counsel; a swallowed I/O error silently drops a responsive document from the production set. This guide builds a subsystem that survives all three: async memory-aware batching, a dual-algorithm composite key, deterministic fallback routing to near-duplicate detection, and an immutable audit trail engineered to withstand a Daubert challenge to the process.

Deduplication Subsystem Flow

Exact-match hashing is not a single function call; it is a staged pipeline where each stage owns a distinct failure mode and a distinct compliance obligation. Path enumeration streams work in without materializing the directory tree, chunked digest computation bounds memory, the registry lookup makes the suppress-or-register decision, and a dead-letter branch rescues files that cannot be hashed before they can silently vanish from the manifest.

Exact-Match Hashing: Algorithm Selection and the Dedup Key

Deduplication resolves identity by content, not by filename, path, or database row id. The identity function is a cryptographic digest, and the choice of algorithm is not cosmetic — it governs both interoperability with downstream review platforms and defensibility under adversarial scrutiny. Production systems run two algorithms in parallel because they answer two different questions.

MD5 remains the de facto deduplication key across Relativity, Nuix, and Concordance, so a pipeline that intends to hand off to those platforms must emit an MD5 that matches theirs bit-for-bit or the receiving system will re-duplicate the collection. SHA-256 provides the collision-resistant integrity anchor that chain-of-custody regimes require, and it is the digest computed once at ingestion through cryptographic hash generation and then propagated, never recomputed, through every downstream transformation. Storing both and keying the registry on the composite md5:sha256 tuple means an MD5 collision alone can never suppress a genuinely distinct document — the SHA-256 half of the key would differ — which is precisely the argument that neutralizes a collision challenge before it starts.

Algorithm	Digest size	Collision resistance	Throughput	eDiscovery role
MD5	128 bit	Broken (chosen-prefix)	Fastest	Interop dedup key for review platforms
SHA-1	160 bit	Broken (SHAttered)	Fast	Legacy only; avoid for new matters
SHA-256	256 bit	Strong	Moderate	Chain-of-custody integrity anchor
BLAKE2b	256–512 bit	Strong	Faster than SHA-256	Optional high-throughput integrity digest

The residual collision risk of the composite key is quantifiable, and quantifying it is part of being defensible. For a hash space of $2^b$ values and a collection of $n$ items, the probability of at least one accidental collision follows the birthday bound:

$$p(n) \approx 1 - e^{-n^2 / 2^{b+1}}$$

For SHA-256 ($b = 256$) across a billion-document matter ($n = 10^9$), $p$ sits below $10^{-58}$, comfortably beyond any threshold a court would treat as material. This is why the SHA-256 half of the key carries the integrity guarantee while the MD5 half carries interoperability.
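The bound can be evaluated directly as a sanity check. This is a minimal sketch; the function name `collision_probability` is illustrative rather than part of any library, and it uses `math.expm1` so that a very small result is not lost to floating-point rounding.

```python
import math


def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound estimate p(n) ~= 1 - exp(-n^2 / 2^(bits+1))."""
    # Big-int division is exact until the single rounding to float.
    exponent = n * n / 2 ** (bits + 1)
    # 1 - exp(-x) written as -expm1(-x) keeps precision when x is tiny.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    n = 10**9  # a billion-document matter
    for name, bits in (("MD5", 128), ("SHA-256", 256)):
        p = collision_probability(n, bits)
        print(f"{name:8s} b={bits:3d}  p ~= {p:.3e}")
```

The bound models accidental collisions only. MD5's accidental figure at this scale is also small (on the order of $10^{-21}$); its weakness is deliberately constructed collisions, which the birthday bound does not describe and which the composite key is there to neutralize.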

One niche subtlety dominates real datasets: loose files and email are not hashed the same way. A loose document (DOCX, XLSX, PDF) is hashed over its raw bytes as they arrive from native file ingestion. An email, however, must be hashed over a normalized projection of its fields — typically From, To, CC, BCC, Subject, SentOn, and the body — because two exports of the same message from different mail stores carry different envelope bytes but represent the identical communication. Hashing the raw MSG bytes would treat them as distinct and defeat the entire purpose of the cull. The routing contract therefore selects the digest input by ESI class before a single byte is fed to hashlib.
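The email branch of that routing contract can be sketched with the standard library's `email` package. The field list follows the paragraph above; the specific normalization rules (lower-casing and sorting addresses, collapsing whitespace, converting the sent time to UTC) and the function names are assumptions made for illustration. Each review platform defines its own normalization, so a digest produced this way should not be expected to match any platform's email hash.

```python
import hashlib
from datetime import timezone
from email import message_from_bytes, policy
from email.utils import getaddresses, parsedate_to_datetime

ADDRESS_HEADERS = ("From", "To", "Cc", "Bcc")


def _addresses(msg, header: str) -> str:
    """Lower-cased, sorted bare addresses, so display names and order do not matter."""
    raw_values = [str(v) for v in msg.get_all(header, [])]
    found = (addr.lower() for _, addr in getaddresses(raw_values) if addr)
    return ";".join(sorted(found))


def _sent_utc(msg) -> str:
    """Sent time as a UTC ISO string; falls back to the raw header if unparseable."""
    raw = msg.get("Date")
    if raw is None:
        return ""
    try:
        sent = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return str(raw).strip()
    if sent.tzinfo is None:  # assumption: treat zone-less dates as UTC
        sent = sent.replace(tzinfo=timezone.utc)
    return sent.astimezone(timezone.utc).isoformat()


def normalized_projection(raw: bytes) -> bytes:
    """Build the canonical byte string that stands in for the message."""
    msg = message_from_bytes(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    lines = [f"{h.lower()}={_addresses(msg, h)}" for h in ADDRESS_HEADERS]
    lines.append("subject=" + " ".join(str(msg.get("Subject", "")).split()))
    lines.append("sent=" + _sent_utc(msg))
    lines.append("body=" + " ".join(body.split()))
    return "\n".join(lines).encode("utf-8")


def email_dedup_key(raw: bytes) -> str:
    """Composite md5:sha256 key over the projection, not over the raw bytes."""
    data = normalized_projection(raw)
    return f"{hashlib.md5(data).hexdigest()}:{hashlib.sha256(data).hexdigest()}"
```

Transport headers such as `Received` never enter the projection, so two exports that differ only in routing metadata, header order, or time-zone rendering collapse to one key, while any change to the body or recipients produces a different one.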

Memory and Resource Constraints at ESI Scale

The naive implementation (walk the tree, read each file fully into memory, add its hash to a Python set) works flawlessly on a 10,000-file sample and then dies on the first real custodian collection. Two independent resources are exhausted at once: reading a 4 GB PST fully into RAM before hashing spikes per-worker memory into OOM territory, and an in-memory set holding hundreds of millions of 97-character composite keys (32 hex characters of MD5, a colon, and 64 of SHA-256) becomes a multi-gigabyte structure that cannot be shared across worker processes and vanishes entirely on restart, forcing a full reprocess.

Three constraints drive the design, and each one is a hard requirement rather than an optimization:

Fixed-window digest computation. Files are read in fixed-size blocks (the reference engine uses 8 MB) and fed incrementally to each hash object’s update() method. Peak memory per in-flight file is the block size, not the file size, so a 40 GB mail store hashes with the same footprint as a 40 KB memo.
Externalized registry. The seen-hash set lives in a durable, shared store (Redis, RocksDB, or a transactional SQLite/Postgres table), not in interpreter memory. This makes the registry survive worker restarts, lets parallel workers share one deduplication view, and turns the suppress decision into an atomic insert rather than a read-modify-write race. A Bloom filter can front the registry to absorb the overwhelmingly common unique-file case: a miss proves the key was never added to that filter, so the duplicate lookup is skipped and only the registering insert remains, while a hit may be a false positive and is always confirmed against the authoritative store. With several workers the filter must be shared or kept in sync, because a worker-local filter cannot see keys that another worker registered.
Backpressure at the event loop. Ingestion yields control after each batch so that I/O-bound disk reads overlap with CPU-bound digest work instead of one starving the other. This is the same semaphore-bounded discipline used across async batch processing elsewhere in the pipeline, and it is what keeps memory flat across the entire collection.

The governing rule is that memory footprint must be predictable and flat regardless of custodian volume or file-size distribution. A pipeline whose RAM usage scales with the collection is not production-ready, because the one collection that matters is always the largest one you have not seen yet.
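The externalized-registry constraint can be sketched with SQLite, one of the stores named above. The class name, table layout, and default file name are hypothetical; the point is that a single `INSERT OR IGNORE` is both the lookup and the registration, so there is no window between "check" and "add".

```python
import sqlite3
from typing import Optional


class SqliteRegistry:
    """Durable seen-hash registry; the INSERT itself is the suppress decision."""

    def __init__(self, db_path: str = "dedup_registry.db") -> None:
        # isolation_level=None puts the connection in autocommit mode,
        # so every insert is committed as its own transaction.
        self.conn = sqlite3.connect(db_path, isolation_level=None)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen ("
            " key TEXT PRIMARY KEY,"
            " master_path TEXT NOT NULL)"
        )

    def register(self, key: str, path: str) -> bool:
        """Return True if this call registered a new master, False for a duplicate."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (key, master_path) VALUES (?, ?)",
            (key, path),
        )
        # rowcount is 1 when the row was written and 0 when the key already existed.
        return cur.rowcount == 1

    def master_of(self, key: str) -> Optional[str]:
        """Path of the canonical instance for `key`, or None if never registered."""
        row = self.conn.execute(
            "SELECT master_path FROM seen WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
```

Because the table lives on disk, a restarted worker reopens the same file and resumes with the full deduplication view. The calls are blocking, so an async engine would need to run them off the event loop; that wiring is omitted here.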

Async Execution and Concurrency Model

Hashing is dominated by disk I/O, not CPU, so the concurrency model is tuned for overlap rather than parallel computation. The engine wraps each file’s read-and-hash in a coroutine, gates the number of concurrent coroutines with an asyncio.Semaphore, and drains completed tasks in batches so that results stream out as fast as digests finish rather than blocking on the slowest file in a batch.

Semaphore sizing is the one parameter operators must get right. Because the work is I/O-bound, the useful concurrency ceiling is set by the storage backend’s queue depth, not by CPU core count: an NVMe array tolerates far more in-flight reads than a networked share. A practical starting point is 8–16 concurrent reads per worker, tuned upward until throughput plateaus and downward if the storage layer begins returning timeouts. Oversizing the semaphore does not speed anything up; it only inflates memory (more 8 MB windows live at once) and can tip a shared filesystem into thrashing. Backpressure against an unbounded producer comes from draining each batch with asyncio.as_completed before any further paths are pulled from the enumerator; the await asyncio.sleep(0) that follows adds an explicit scheduling point between batches so other coroutines get a turn.

Resilience and Failure Routing

Real collections are hostile. Files are zero-byte, truncated mid-transfer, permission-locked, or corrupt inside a nested container. A defensible engine treats every one of these as a routed event, never as a silent skip — a file that fails to hash and is dropped without a record is a responsive document that vanished from the production, and that is the exact failure a chain-of-custody regime exists to prevent.

When an exact match fails not because the file is broken but because it differs only in non-substantive bytes — a re-saved PDF, a container repackaged with a new timestamp, an email with a rewritten gateway header — the file is not a duplicate and must not be suppressed, but it is also a strong near-duplicate candidate. The engine routes these to similarity threshold configuration for fuzzy-hash near-duplicate scoring before they reach a reviewer, preserving processing velocity while guaranteeing coverage. Files that cannot be hashed at all — I/O errors, decryption failures, corruption — are written to a dead-letter manifest with their path, error class, and timestamp, so a human can adjudicate them rather than a try/except swallowing them.
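A dead-letter entry carrying the three fields named above might look like the following. This is a sketch under assumptions: the function names, the extra `error_detail` field, and the JSON Lines layout are illustrative choices, not a prescribed manifest format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def dead_letter_record(path: Path, exc: BaseException) -> dict:
    """Capture what a reviewer needs to adjudicate an unhashable file."""
    return {
        "path": str(path),
        "error_class": type(exc).__name__,
        "error_detail": str(exc),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_dead_letter(manifest: Path, record: dict) -> None:
    """Append one JSON object per line so earlier entries are never rewritten."""
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

A caller would build the record inside the `except` clause that catches the failure, where the exception object is still available, and append it before moving on to the next file.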

A circuit breaker guards the whole run. Error rates above a rolling threshold (commonly 2%) signal a systemic fault — a mounted share going stale, a decryption key expiring, a corrupt source image — rather than isolated bad files, and continuing past that point only manufactures an undefensible result faster. Tripping the breaker halts ingestion and flags the run for operator review.
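The reference implementation later in this guide computes its error rate cumulatively over the whole run. A breaker that is rolling in the strict sense tracks only the most recent outcomes, for example with a bounded `deque`. The class below is a hypothetical sketch; the window size and minimum sample are placeholder values, not recommendations.

```python
from collections import deque


class RollingCircuitBreaker:
    """Trip when the error rate over the last `window` files exceeds `threshold`."""

    def __init__(self, window: int = 1000, threshold: float = 0.02,
                 min_sample: int = 100) -> None:
        # True = the file failed to hash, False = it hashed cleanly.
        # maxlen makes the deque drop the oldest outcome automatically.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.min_sample = min_sample

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    @property
    def error_rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    @property
    def tripped(self) -> bool:
        # Require a minimum sample so one early bad file cannot halt the run.
        enough = len(self.outcomes) >= self.min_sample
        return enough and self.error_rate > self.threshold
```

A windowed rate reacts to a share going stale late in a long run, where a cumulative rate would be diluted by the millions of files that hashed cleanly beforehand.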

The diagram below traces the per-file decision path from digest computation through suppression, registration, and fallback routing.

Reference Implementation: Async Deduplication Engine

The following module implements an async deduplication engine with structured JSON logging, memory-constrained batching, composite-key registry lookup, explicit dead-letter routing for I/O failures, and a circuit breaker on the rolling error rate. It adheres to the Python hashlib documentation for secure digest generation and uses aiofiles for non-blocking disk I/O. For the deterministic-chunking failure mode that surfaces when this engine runs across a distributed worker pool, see Synchronizing MD5 and SHA-256 hashes across processing nodes.

python

import asyncio
import hashlib
import json
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import AsyncIterator, List, Optional, Set, Tuple

import aiofiles


# Structured JSON logging so every suppression, registration, and dead-letter
# event lands in one immutable audit stream a reviewer can replay.
class JSONFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "module": record.module,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


logger = logging.getLogger("edisc.hash_dedup")
logger.setLevel(logging.INFO)
_handler = logging.StreamHandler()
_handler.setFormatter(JSONFormatter())
logger.addHandler(_handler)


@dataclass
class FileDigest:
    path: str
    md5: str
    sha256: str
    custodian_id: str
    status: str  # "unique" | "duplicate" | "error"
    family_id: Optional[str] = None
    error_msg: Optional[str] = None


@dataclass
class DedupStats:
    processed: int = 0
    duplicates: int = 0
    errors: int = 0
    dead_letter: List[str] = field(default_factory=list)

    @property
    def error_rate(self) -> float:
        seen = self.processed + self.duplicates + self.errors
        return self.errors / seen if seen else 0.0


class CircuitBreakerTripped(RuntimeError):
    """Raised when the run's cumulative error rate exceeds the defensible threshold."""


class DeduplicationEngine:
    CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB read window bounds per-file memory
    BATCH_SIZE = 250

    def __init__(
        self,
        max_concurrency: int = 12,
        registry: Optional[Set[str]] = None,
        error_threshold: float = 0.02,
        min_sample: int = 100,
    ) -> None:
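        # In production, back `seen_hashes` with Redis/RocksDB so the view is
        # shared across workers and survives restarts; a set keeps the demo runnable.
        # Test against None, not truthiness: an empty set passed in by the caller
        # must be kept so the caller still sees the keys registered here.
        self.seen_hashes: Set[str] = registry if registry is not None else set()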
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.error_threshold = error_threshold
        self.min_sample = min_sample
        self.stats = DedupStats()

    async def _compute_digests(self, path: Path) -> Optional[Tuple[str, str]]:
        """Stream MD5 and SHA-256 in fixed windows to bound memory usage."""
        md5, sha256 = hashlib.md5(), hashlib.sha256()
        try:
            async with aiofiles.open(path, "rb") as fh:
                while chunk := await fh.read(self.CHUNK_SIZE):
                    md5.update(chunk)
                    sha256.update(chunk)
            return md5.hexdigest(), sha256.hexdigest()
        except OSError as exc:
            logger.error(f"hash_io_error path={path} err={exc}")
            return None

    def _check_breaker(self) -> None:
        seen = self.stats.processed + self.stats.duplicates + self.stats.errors
        if seen >= self.min_sample and self.stats.error_rate > self.error_threshold:
            raise CircuitBreakerTripped(
                f"error_rate={self.stats.error_rate:.3f} exceeds "
                f"threshold={self.error_threshold:.3f}"
            )

    async def _process_single(self, path: Path, custodian_id: str) -> FileDigest:
        async with self.semaphore:
            digests = await self._compute_digests(path)
            if digests is None:
                self.stats.errors += 1
                self.stats.dead_letter.append(str(path))
                self._check_breaker()
                return FileDigest(str(path), "", "", custodian_id, "error",
                                  error_msg="hash computation failed")

            md5, sha256 = digests
            key = f"{md5}:{sha256}"  # composite key: MD5 collision alone cannot suppress

            if key in self.seen_hashes:
                self.stats.duplicates += 1
                logger.info(f"suppress_duplicate path={path} key={key}")
                return FileDigest(str(path), md5, sha256, custodian_id,
                                  "duplicate", family_id=key)

            self.seen_hashes.add(key)
            self.stats.processed += 1
            logger.info(f"register_master path={path} key={key}")
            return FileDigest(str(path), md5, sha256, custodian_id,
                              "unique", family_id=key)

    async def run_pipeline(
        self, files: AsyncIterator[Path], custodian_id: str
    ) -> AsyncIterator[FileDigest]:
        """Stream files through bounded batches, yielding each result as it finishes."""
        batch: List[asyncio.Task] = []
        async for path in files:
            batch.append(asyncio.create_task(self._process_single(path, custodian_id)))
            if len(batch) >= self.BATCH_SIZE:
                # Draining the batch before pulling more paths is the backpressure.
                for done in asyncio.as_completed(batch):
                    yield await done
                batch.clear()
                await asyncio.sleep(0)  # scheduling point between batches
        for done in asyncio.as_completed(batch):
            yield await done

The engine never deletes. A suppressed duplicate is recorded, not discarded, and it inherits the family_id of the canonical master so that downstream family grouping can still find every instance if a privilege or production question later requires reconstructing the full set.

Distributed Synchronization and Cross-Matter Scaling

When the engine runs across a worker pool, the in-memory set above must become a centralized, append-only registry backed by Redis or a distributed key-value store. Each node publishes its composite digest alongside metadata — custodian id, source path, ingestion timestamp — and the suppress decision becomes an atomic SETNX (or HSETNX) so that a race between two workers hashing identical files resolves to exactly one master and one suppressed duplicate rather than two masters. Without atomic insert semantics, network partitions and clock skew let mismatched or duplicate digests commit silently, which is the divergence failure mode covered in depth on the synchronization page linked above.

Cross-matter deduplication across multi-case litigation extends the same architecture by partitioning registries by matter id while maintaining a global enterprise index. This dual-index design lets each matter deduplicate within its own scope — the legally correct default, since a document’s responsiveness is matter-specific — while still eliminating redundant storage across overlapping custodian populations, which is where the storage savings on large corporate collections actually come from.
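The atomic claim and the per-matter partitioning described above can be sketched together. The function below takes any client exposing a `set(name, value, nx=True)` call, which is the shape of redis-py's `Redis.set`; the `dedup:{matter_id}:{key}` naming scheme, the metadata fields, and the function name are assumptions made for illustration.

```python
import json
from typing import Any, Protocol


class KeyValueClient(Protocol):
    """The single call this sketch needs; redis-py's Redis.set accepts it."""

    def set(self, name: str, value: str, *, nx: bool = False) -> Any: ...


def claim_master(client: KeyValueClient, matter_id: str, key: str,
                 custodian_id: str, path: str, ingested_at: str) -> bool:
    """Atomically claim `key` within a matter. True = master, False = duplicate."""
    record = json.dumps(
        {"custodian_id": custodian_id, "path": path, "ingested_at": ingested_at},
        sort_keys=True,
    )
    # SET ... NX writes only when the key is absent, so of two workers racing
    # on identical content exactly one gets a truthy reply.
    won = client.set(f"dedup:{matter_id}:{key}", record, nx=True)
    return bool(won)
```

Prefixing the key with the matter id keeps each matter's suppression decisions independent. A global index for storage-level savings would be a second key space maintained alongside it, which this sketch leaves out.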

Observability and Compliance Metrics

A deduplication run that cannot be measured cannot be defended. Three KPIs give operations and counsel the signal they need, and each maps to a specific defensibility question:

Hashing throughput (GB/hr and files/sec): the scaling signal; a sustained drop flags a slow storage tier or an oversized semaphore thrashing a shared filesystem.
Integrity rate (successfully hashed ÷ total enumerated): the coverage signal; anything below 100% means files were dead-lettered, and every one of them is a document a human must adjudicate before production.
Dead-letter velocity (unhashable files/min): the health signal; a rising rate is the early warning the circuit breaker will act on, and it usually means a systemic fault rather than isolated bad files.

Instrumenting them is a small wrapper over the same DedupStats the engine already maintains, emitted on the same structured logger so the metrics share the audit stream with every suppression event:

python

import time
from dataclasses import dataclass, field


@dataclass
class DedupMetrics:
    started: float = field(default_factory=time.monotonic)
    bytes_hashed: int = 0

    def snapshot(self, stats: "DedupStats") -> dict:
        elapsed = max(time.monotonic() - self.started, 1e-9)
        enumerated = stats.processed + stats.duplicates + stats.errors
        integrity = (enumerated - stats.errors) / enumerated if enumerated else 1.0
        return {
            "throughput_gb_hr": round(self.bytes_hashed / 1e9 / (elapsed / 3600), 2),
            "integrity_rate": round(integrity, 4),
            "dead_letter_velocity_min": round(stats.errors / elapsed * 60, 2),
        }


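metrics = DedupMetrics()
# Assumes a DeduplicationEngine instance named `engine`, plus the `logger`
# and `json` import from the reference module above.
# ... accumulate metrics.bytes_hashed per file, then once per batch: ...
logger.info(json.dumps(metrics.snapshot(engine.stats)))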

These local controls are the concrete expression of the matter-wide rules defined in the project’s Production Compliance Frameworks; deduplication inherits its retention, logging, and reproducibility obligations from that layer rather than inventing its own. Because the registry insert is content-addressed and atomic, replaying the same custodial set against a registry in the same starting state yields the same set of canonical keys and the same suppression count, which is the reproducibility property opposing counsel will probe first. Which instance of a duplicated file is named master depends on the order in which files reach the registry, so that choice is reproducible only when enumeration order is fixed.

Downstream Integration

Once deduplication completes, the pipeline emits a deterministic manifest of unique paths, composite digests, and family ids. That manifest feeds directly into Email Threading Algorithms so that only the canonical instance of each message participates in conversation reconstruction — the reason canonical-instance selection must happen here, before threading, and not after. It simultaneously informs Attachment & Parent-Child Mapping, which uses the inherited family_id to preserve relational context so that suppressing a duplicate binary never orphans a privileged child document from the parent it belongs to.

Conclusion

By pairing an interoperable MD5 with a collision-resistant SHA-256 in a single composite key, bounding memory with fixed-window streaming, externalizing the registry for atomic cross-worker suppression, and routing every unhashable or near-duplicate file to an explicit review path rather than a silent skip, legal automation engineers get an exact-match filter that is both fast and defensible. The compliance guarantee it provides is narrow but load-bearing: every suppression is content-addressed, individually logged, and reproducible from the audit trail, and no responsive document ever leaves the collection without a record of why. Its scaling limit is set not by CPU but by the throughput and reliability of the slowest storage tier in the matter — which is why integrity rate and dead-letter velocity, not raw speed, are the metrics that decide whether a deduplication result is ready to hand downstream.

Frequently Asked Questions

Why hash with both MD5 and SHA-256 instead of just SHA-256?

Because they answer different questions. MD5 is the deduplication key that Relativity, Nuix, and Concordance already use, so emitting a matching MD5 keeps the receiving platform from re-duplicating your collection. SHA-256 supplies the collision-resistant integrity anchor for chain of custody. Keying the registry on the composite md5:sha256 tuple means an MD5 collision alone can never suppress a genuinely distinct document, because the SHA-256 half of the key would still differ, which is exactly the argument that defuses a collision challenge.

How do I keep memory flat when hashing multi-gigabyte PST files?

Never read a file fully into RAM. Stream it in fixed-size windows (8 MB works well) and feed each block to the hash object’s update() method, so peak per-file memory equals the window size regardless of file size. Pair that with an externalized registry (Redis/RocksDB) instead of an in-memory set, so the seen-hash structure does not itself grow into a multi-gigabyte object that dies on restart.

What should happen to a file that fails to hash?

It must be routed to a dead-letter manifest with its path, error class, and timestamp — never silently skipped. A dropped file is a responsive document that vanished from the production, which is the precise failure chain-of-custody exists to catch. Track dead-letter velocity as a KPI and trip a circuit breaker when the rolling error rate exceeds roughly 2%, because that signals a systemic fault rather than isolated bad files.

How do I deduplicate email so the same message from two mailboxes collapses?

Do not hash the raw MSG/EML bytes — different exports of the same message carry different envelope bytes. Hash a normalized projection of the message fields (From, To, CC, BCC, Subject, SentOn, and the body) so that two copies of the identical communication produce the identical key. Loose files, by contrast, are hashed over their raw bytes; the routing contract selects the digest input by ESI class before hashing.

Similarity Threshold Configuration — fuzzy-hash near-duplicate scoring for the files exact-match filtering routes to fallback.
Attachment & Parent-Child Mapping — using the inherited family id so suppression never orphans a child document.
Email Threading Algorithms — the downstream stage that consumes only canonical instances of each message.
Synchronizing MD5 and SHA-256 hashes across processing nodes — resolving digest divergence across a distributed worker pool.
Cryptographic Hash Generation — where the ingestion-time SHA-256 anchor this stage propagates is first computed.
