Async Batch Processing Design for eDiscovery ESI Workflows

Deterministic throughput and unbroken chain of custody are the two non-negotiable properties of any production ingestion tier, and both live or die on how batches are scheduled. This subsystem owns the concurrency layer of the ESI Ingestion & Processing Workflows pipeline: it sits between raw custodial media landing on disk and the review-ready records that downstream indexing consumes, and it is where multi-terabyte collections either flow at a predictable rate or collapse into out-of-memory kills, half-written state, and unaccountable gaps in the audit trail. Async batch processing decouples I/O-bound extraction, cryptographic hash generation, and metadata normalization from synchronous request cycles so that a single oversized PST can never stall the thousands of small documents queued behind it. This guide details a memory-aware, backpressure-driven batch processor that enforces compliance boundaries, routes failures deterministically to a dead-letter queue, and scales horizontally without ever letting a file advance on an unverified digest.

Architecture Overview

The processor is a bounded producer/consumer graph. A single producer walks the ingest tree and enqueues records; a fixed pool of consumers drains the queue under a concurrency semaphore. Every item crosses the same ordered boundaries — discovery, hash-first validation, async extraction, schema normalization, and audit — and any item that exhausts its retry budget branches to dead-letter routing before the audit manifest is written. Because the queue is fixed-capacity, a fast producer cannot outrun slow consumers: the put() call blocks, propagating backpressure all the way back to directory traversal.

Figure: a single item traced through the six pipeline stages, including the dead-letter branch taken on repeated failure. Hashing precedes extraction so the digest anchors identity before any transformation; a repeatedly failing item branches to the dead-letter queue but is still reconciled into the audit manifest.

The ordering is not cosmetic. Hashing precedes extraction because the digest is the item’s immutable identity for the remainder of its lifecycle; a hash computed after any transformation proves only that the transformed copy is self-consistent, not that it faithfully represents what the custodian produced. Normalization precedes the audit write because the manifest must record the final, schema-conformant metadata, not an intermediate view.
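
A minimal sketch of that ordering, under stated assumptions: `sha256_of`, `process_hash_first`, and the `extract` callable are hypothetical names used only for illustration. The digest is taken before extraction and checked again afterwards, so an extraction step that touched the original is detected rather than trusted.

```python
import hashlib
from pathlib import Path
from typing import Callable, Tuple


def sha256_of(path: Path, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Stream the file through SHA-256 without loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def process_hash_first(path: Path, extract: Callable[[Path], str]) -> Tuple[str, str]:
    """Hash, then extract, then confirm the original is byte-identical."""
    digest_before = sha256_of(path)   # identity anchored before any work
    text = extract(path)              # hypothetical extraction step
    if sha256_of(path) != digest_before:
        raise RuntimeError(f"original altered during extraction: {path}")
    return digest_before, text
```

The second hash doubles the read cost, so whether to re-verify every item or only a sample is a policy choice this sketch does not settle.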

Memory-Aware Batch Orchestration

ESI volumes routinely exceed terabyte-scale thresholds, which makes naive glob() or os.walk()-into-a-list approaches untenable in production. Materializing an entire directory tree of a large collection into a Python list can consume gigabytes of resident memory before a single file is processed, and it defeats streaming entirely. A compliant pipeline must bound memory consumption through backpressure-aware generators and fixed-capacity queues, yielding file paths or byte streams in controlled increments rather than all at once.
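
As an illustration of incremental traversal, the sketch below walks a tree with os.scandir and yields one path at a time. The function name `iter_files` is an assumption for this example; the only state it keeps is the stack of directories still to visit, never the full file listing.

```python
import os
from pathlib import Path
from typing import Iterator


def iter_files(root: Path) -> Iterator[Path]:
    """Yield regular files under root one at a time.

    Only the stack of pending directories is held in memory; symlinks
    are not followed, so a link loop cannot trap the walk.
    """
    pending = [root]
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    pending.append(Path(entry.path))
                elif entry.is_file(follow_symlinks=False):
                    yield Path(entry.path)
```

Because it is a generator, a consumer that stops pulling paths also stops the walk, which is what lets queue backpressure reach the filesystem.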

Two ceilings govern the design. The first is the queue depth: a bounded asyncio.Queue caps how many pending records can exist in flight, so a producer that races ahead simply blocks on put() until consumers catch up. The second is the concurrency limit: an asyncio.Semaphore caps how many records are being actively worked at once, which in turn bounds the number of open file descriptors and in-flight extraction buffers. Together they convert an unbounded firehose into a steady-state rhythm whose peak memory is a function of two tunable constants rather than of collection size.
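
The two ceilings can be observed directly. The following sketch (the name `run_bounded` and the 1 ms stand-in for real work are assumptions) pushes items through a bounded queue and records the highest queue depth and the highest number of simultaneously active items it ever saw.

```python
import asyncio


async def run_bounded(total: int, queue_depth: int, concurrency: int) -> dict:
    """Push `total` items through a bounded queue and report observed peaks."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_depth)
    semaphore = asyncio.Semaphore(concurrency)
    stats = {"peak_queued": 0, "peak_active": 0, "done": 0}
    active = 0

    async def producer() -> None:
        for item in range(total):
            await queue.put(item)  # suspends while the queue is full
            stats["peak_queued"] = max(stats["peak_queued"], queue.qsize())

    async def consumer() -> None:
        nonlocal active
        while True:
            await queue.get()
            try:
                async with semaphore:
                    active += 1
                    stats["peak_active"] = max(stats["peak_active"], active)
                    await asyncio.sleep(0.001)  # stand-in for real work
                    active -= 1
                    stats["done"] += 1
            finally:
                queue.task_done()

    # Start more consumers than semaphore slots on purpose, so the
    # semaphore rather than the task count is what limits active work.
    consumers = [asyncio.create_task(consumer()) for _ in range(concurrency * 2)]
    await producer()
    await queue.join()
    for task in consumers:
        task.cancel()
    await asyncio.gather(*consumers, return_exceptions=True)
    return stats
```

Calling `asyncio.run(run_bounded(50, queue_depth=5, concurrency=3))` should report a peak queue depth no greater than 5 and a peak active count no greater than 3, whatever the value of `total`.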

Batch sizing should be dynamic, calculated per worker from available heap, file-type distribution, and downstream service latency. Large container formats — PST, OST, ZIP, EML bundles — require stream-based parsing rather than full in-memory extraction, and they should be sized onto their own track so the occasional multi-gigabyte outlier does not cause head-of-line blocking for the small documents behind it. The same fixed-size streaming discipline that keeps native file ingestion memory-flat applies here: read in 4–8 MiB chunks, never load a whole file to hash it, and release each buffer before requesting the next.
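
One way to express the separate track and the per-worker batch size is sketched below. Every constant here is a hypothetical tuning value, not something the design prescribes: the suffix set, the 512 MiB cutoff, the 50% headroom, and the ceiling of 256 items are placeholders to be replaced with measured numbers.

```python
from pathlib import Path

# Hypothetical tuning values chosen only for illustration.
CONTAINER_SUFFIXES = {".pst", ".ost", ".zip"}
LARGE_FILE_BYTES = 512 * 1024 * 1024


def choose_lane(path: Path, size_bytes: int) -> str:
    """Return "container" for archives and oversized files, else "standard"."""
    if path.suffix.lower() in CONTAINER_SUFFIXES or size_bytes >= LARGE_FILE_BYTES:
        return "container"
    return "standard"


def batch_size(free_bytes: int, avg_item_bytes: int, workers: int,
               headroom: float = 0.5, ceiling: int = 256) -> int:
    """Items per worker batch so buffered items fit in a share of free memory."""
    per_worker_budget = int(free_bytes * headroom) // max(workers, 1)
    return max(1, min(ceiling, per_worker_budget // max(avg_item_bytes, 1)))
```

With 8 GiB free, 8 workers, and an average of 8 MiB buffered per item, `batch_size` yields 64 items per worker; downstream latency is not modeled here and would need its own term.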

Async Execution & Concurrency Model

Once records are enqueued, task dispatch must remain non-blocking, and the concurrency model has to respect the fundamental split between the two kinds of work the pipeline performs. Cryptographic hashing is CPU- and disk-bound; text extraction is I/O- and latency-bound. Mixing them naively stalls the event loop, because a synchronous hashlib read of a multi-gigabyte file blocks every other coroutine for the duration. Production deployments therefore offload the streaming digest to a thread executor (asyncio.to_thread or a dedicated ThreadPoolExecutor) while awaiting extraction as a true coroutine. This hash-first model guarantees that integrity verification, the foundation of a defensible chain of custody for ESI produced under FRCP Rule 34, completes before any transformation alters the original binary.
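
A small sketch of the offload, assuming Python 3.9 or later for asyncio.to_thread. The names `sha256_stream` and `hash_then_extract` are illustrative, and the extraction step is a placeholder rather than a real engine call.

```python
import asyncio
import hashlib
from pathlib import Path
from typing import Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB reads keep memory flat


def sha256_stream(path: Path) -> str:
    """Blocking, chunked SHA-256; intended to run in a worker thread."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()


async def hash_then_extract(path: Path) -> Tuple[str, str]:
    """Hash off the event loop, then await a stand-in extraction step."""
    digest = await asyncio.to_thread(sha256_stream, path)
    await asyncio.sleep(0)  # placeholder for the real extraction await
    return digest, f"extracted_text_{path.stem}"
```

While the worker thread reads and hashes, the event loop stays free to service other coroutines. A dedicated ThreadPoolExecutor passed to loop.run_in_executor would give explicit control over how many hashing threads exist.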

Semaphore sizing is the single most consequential tuning decision. Size it too high and workers thrash the disk and exhaust file descriptors; too low and expensive extraction latency dominates while cores sit idle. A workable starting point for a mixed I/O and CPU workload is to bound in-flight concurrency at roughly the core count multiplied by one plus the ratio of wait time to service time. For a stage that spends most of its time awaiting an extraction service, that pushes concurrency well above the core count, whereas a purely CPU-bound hashing stage, whose wait time is near zero, stays near it. Measure event-loop latency in production: if asyncio.sleep(0) round-trips or I/O waits begin exceeding tens of milliseconds, the pool is oversubscribed and concurrency should drop or the work should shift horizontally.
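
The starting point can be written down as a small helper. This is a heuristic sketch, not a rule from the design: it assumes the common pool-sizing form cores × (1 + wait / service), and the names `suggested_concurrency` and `event_loop_lag` are hypothetical. Treat the result as a first guess to be corrected by measurement.

```python
import asyncio
import os
import time
from typing import Optional


def suggested_concurrency(wait_seconds: float, service_seconds: float,
                          cores: Optional[int] = None) -> int:
    """cores * (1 + wait/service): near core count when CPU-bound, higher when waiting."""
    cores = cores or os.cpu_count() or 1
    if service_seconds <= 0:
        raise ValueError("service_seconds must be positive")
    return max(1, round(cores * (1 + wait_seconds / service_seconds)))


async def event_loop_lag() -> float:
    """Seconds taken for one asyncio.sleep(0) round-trip through the loop."""
    start = time.perf_counter()
    await asyncio.sleep(0)
    return time.perf_counter() - start
```

On 8 cores, a stage with no wait time comes out at 8, while a stage that waits 0.9 s for every 0.1 s of service comes out at 80. Sampling `event_loop_lag()` periodically from inside the running pipeline gives the latency signal used to back that number off.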

Text extraction introduces significant latency variance, so integrating PDF & text extraction engines requires connection pooling, per-file timeout enforcement via asyncio.wait_for, and graceful degradation when encountering corrupted or password-protected documents. Awaiting each extraction call lets the event loop keep processing unrelated files while a slow one is in flight, all while maintaining strict per-batch isolation so one poisoned document cannot corrupt a neighbor’s state.
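
A per-file timeout with graceful degradation might look like the sketch below. The `extract` callable stands in for a real engine client, and mapping ValueError and PermissionError to "corrupted" and "password-protected" is an assumption made for the example; a real engine will raise its own exception types.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


async def extract_with_timeout(
    extract: Callable[[], Awaitable[str]],
    timeout_seconds: float,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, reason) on a degraded outcome."""
    try:
        return await asyncio.wait_for(extract(), timeout=timeout_seconds), None
    except asyncio.TimeoutError:
        return None, f"timeout after {timeout_seconds}s"
    except (ValueError, PermissionError) as exc:
        # Hypothetical stand-ins for "corrupted" and "password-protected".
        return None, f"{type(exc).__name__}: {exc}"
```

Returning a reason instead of raising keeps one bad document from unwinding the worker, and the reason string can be carried into the exceptions report.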

Resilience, Failure Routing & Compliance Boundaries

Async systems inevitably encounter transient failures, corrupted archives, and downstream service degradation. Production-grade eDiscovery pipelines must distinguish between failure classes and route each deterministically, because “retry everything three times” both wastes budget on unrecoverable errors and masks systemic degradation behind noisy backoff loops.

Failure class	Representative signatures	Routing policy	Compliance implication
Transient	Network timeout, temporary file lock, queue backpressure	Retry with exponential backoff and jitter	None if it eventually succeeds; logged for observability
Structural	Corrupted header, unreadable sector, password-protected archive	Route to DLQ after retry budget; preserve original unaltered	Item must appear in the exceptions report, never silently dropped
Compliance	Schema mismatch, missing custodian field, digest divergence	Quarantine and halt advancement; flag for forensic review	Chain of custody is broken until reconciled by a human
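
The table can be encoded as a classifier that every worker consults before deciding what to do with an error. The exception types `DigestMismatch` and `SchemaViolation` and the routing labels are hypothetical; the choice to treat unknown errors as structural is also an assumption, made so that nothing unrecognized is retried forever.

```python
import asyncio
from enum import Enum


class FailureClass(str, Enum):
    TRANSIENT = "transient"
    STRUCTURAL = "structural"
    COMPLIANCE = "compliance"


class DigestMismatch(Exception):
    """Hypothetical error raised when a digest diverges from the manifest."""


class SchemaViolation(Exception):
    """Hypothetical error raised when required metadata is missing or malformed."""


ROUTING = {
    FailureClass.TRANSIENT: "retry_with_backoff",
    FailureClass.STRUCTURAL: "dead_letter_after_retries",
    FailureClass.COMPLIANCE: "quarantine",
}


def classify(exc: BaseException) -> FailureClass:
    """Map an exception to a failure class; unknown errors default to structural."""
    if isinstance(exc, (DigestMismatch, SchemaViolation)):
        return FailureClass.COMPLIANCE
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError,
                        ConnectionError, BlockingIOError)):
        return FailureClass.TRANSIENT
    return FailureClass.STRUCTURAL
```

Compliance errors are checked first so that a digest divergence is never downgraded to a retryable condition.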

Retries must be driven by an iterative loop rather than recursion, and the backoff sleep must happen outside the concurrency semaphore. Recursing while holding a semaphore slot, or sleeping through the backoff while holding it, starves the worker pool: a handful of failing items in exponential backoff would pin every concurrency slot and deadlock throughput even though the CPU is idle. Exponential backoff with jitter spreads retry storms so that a downstream service recovering from an outage is not immediately re-saturated by every worker retrying in lockstep.
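
A compact sketch of that retry shape follows. It assumes the "full jitter" variant of exponential backoff (a uniform draw between zero and the capped exponential delay); the names `backoff_delay` and `retry_outside_slot` and the default values are illustrative only.

```python
import asyncio
import random
from typing import Awaitable, Callable, Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**(attempt - 1))]."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))


async def retry_outside_slot(
    work: Callable[[], Awaitable[str]],
    semaphore: asyncio.Semaphore,
    max_retries: int = 3,
    base: float = 0.5,
) -> str:
    """Run `work` under the semaphore; sleep between attempts without it."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        async with semaphore:         # slot held only while working
            try:
                return await work()
            except Exception as exc:  # narrow to transient errors in practice
                last_error = exc
        if attempt < max_retries:     # slot already released here
            await asyncio.sleep(backoff_delay(attempt, base))
    raise RuntimeError("retry budget exhausted") from last_error
```

The `async with` block exits before the sleep, so an item waiting out its backoff occupies no concurrency slot.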

When a specific endpoint or extraction backend fails repeatedly, a circuit breaker should trip to halt dispatch to that dependency while allowing unaffected batches to proceed. This prevents cascading resource exhaustion — the classic failure mode where every worker piles up waiting on a dead service until the whole pool is blocked. Every state transition, hash computation, and routing decision is logged to an append-only audit trail, and the compliance boundary is absolute: no file is marked complete until its cryptographic signature matches the original ingest manifest and its extracted metadata conforms to the target review platform schema. Dead-letter records are persisted as self-describing manifests so the exceptions population is auditable and reconcilable rather than a black hole.
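
A circuit breaker of the kind described can be very small. The sketch below is one possible shape, not the pipeline's actual component: the class name, the consecutive-failure threshold, and the cooldown-then-probe behavior are assumptions, and the injectable `clock` exists only to make the logic testable.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a probe after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call to the dependency may be dispatched right now."""
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let a probe request through.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        """A success closes the breaker and clears the failure streak."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

A worker would check `allow()` before dispatching to a given backend and hold or reroute the item when it returns False, keeping one breaker instance per dependency so an outage in one extraction backend does not pause the others.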

Production Implementation

The following module is a self-contained, runnable async batch processor. It enforces hash-first validation, applies backpressure via a bounded queue, releases the semaphore between retry attempts, and routes exhausted items to a dead-letter queue with a preserved failure manifest. All operations emit structured logs for auditability and align with EDRM Processing-stage expectations. When the in-process model reaches its ceiling, the same record contract carries over to a distributed broker — see implementing Celery for async eDiscovery batching for the task-idempotency and result-backend variant of this design.

python

import asyncio
import hashlib
import json
import logging
import os
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import AsyncGenerator, Optional, Dict, Any

# Configure structured audit logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s"
)
logger = logging.getLogger("ediscovery.async_batch")

class ProcessingStatus(str, Enum):
    PENDING = "PENDING"
    HASHED = "HASHED"
    EXTRACTED = "EXTRACTED"
    FAILED = "FAILED"
    DLQ = "DLQ"

@dataclass
class ESIRecord:
    file_path: Path
    sha256: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    status: ProcessingStatus = ProcessingStatus.PENDING
    error: Optional[str] = None

class AsyncBatchProcessor:
    def __init__(
        self,
        max_concurrency: int = 8,
        queue_capacity: int = 128,
        max_retries: int = 3,
        dlq_path: Path = Path("./dlq")
    ):
        self.max_concurrency = max_concurrency
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=queue_capacity)
        self.max_retries = max_retries
        self.dlq_path = dlq_path
        self.dlq_path.mkdir(parents=True, exist_ok=True)
        self.retry_counts: Dict[str, int] = {}

    async def file_generator(self, root_dir: Path) -> AsyncGenerator[Path, None]:
        """Yields file paths with backpressure-aware traversal."""
        for dirpath, _, filenames in os.walk(root_dir):
            for fname in filenames:
                yield Path(dirpath) / fname
                # Yield control to event loop to prevent queue starvation
                await asyncio.sleep(0)

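    def compute_sha256(self, file_path: Path, chunk_size: int = 4 * 1024 * 1024) -> str:
        """Synchronous, streaming SHA-256 computation for chain-of-custody.

        Reads in 4 MiB chunks so memory stays flat regardless of file size.
        This call blocks; run it in a worker thread from coroutine code.
        """
        sha256 = hashlib.sha256()
        with open(file_path, "rb") as f:
            while chunk := f.read(chunk_size):
                sha256.update(chunk)
        return sha256.hexdigest()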

    async def extract_text_async(self, record: ESIRecord) -> str:
        """Simulates async I/O-bound extraction. Wrap the real engine call in
        asyncio.wait_for() to enforce a per-file timeout in production."""
        await asyncio.sleep(0.1)  # Placeholder for actual engine call
        # In production: integrate with an async HTTP client or subprocess pool
        return f"extracted_text_{record.file_path.stem}"

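    async def route_to_dlq(self, record: ESIRecord) -> None:
        """Deterministic failure routing with audit trail preservation."""
        record.status = ProcessingStatus.DLQ
        # Prefix the manifest name with a digest of the full path so two files
        # that share a basename in different directories never overwrite each other.
        path_key = hashlib.sha256(os.fsencode(record.file_path)).hexdigest()[:16]
        dlq_file = self.dlq_path / f"{path_key}_{record.file_path.name}.json"
        logger.warning(f"Routing {record.file_path} to DLQ: {record.error}")
        # Persist failure manifest for legal hold/compliance review. json.dump
        # escapes quotes and newlines in error strings so the manifest stays valid.
        with open(dlq_file, "w", encoding="utf-8") as f:
            json.dump(
                {
                    "path": str(record.file_path),
                    "sha256": record.sha256,
                    "attempts": self.retry_counts.get(str(record.file_path), 0),
                    "error": record.error,
                    "status": record.status.value,
                },
                f,
            )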

    async def process_record(self, record: ESIRecord) -> ESIRecord:
        """Core async execution graph with hash-first enforcement.

        Retries are driven by an iterative loop rather than recursion so the
        concurrency semaphore is released between attempts; recursing while
        holding the semaphore would deadlock the worker pool under contention.
        """
        retry_key = str(record.file_path)
        for attempt in range(1, self.max_retries + 1):
            async with self.semaphore:
                try:
                    # 1. Cryptographic hash, streamed in a worker thread so the
                    #    blocking read never stalls the event loop (Python 3.9+).
                    record.sha256 = await asyncio.to_thread(
                        self.compute_sha256, record.file_path
                    )
                    record.status = ProcessingStatus.HASHED
                    logger.info(f"Hashed: {record.file_path.name} | SHA256: {record.sha256[:12]}...")

                    # 2. Async text/metadata extraction.
                    record.metadata["extracted_text"] = await self.extract_text_async(record)
                    record.status = ProcessingStatus.EXTRACTED
                    logger.info(f"Extracted: {record.file_path.name}")
                    return record
                except Exception as e:
                    record.error = str(e)
                    record.status = ProcessingStatus.FAILED
                    self.retry_counts[retry_key] = attempt
                    if attempt < self.max_retries:
                        logger.warning(
                            f"Retry {attempt}/{self.max_retries} for {record.file_path.name}: {e}"
                        )

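            # Backoff happens outside the semaphore so a sleeping retry does not
            # hold a concurrency slot hostage. The delay doubles per attempt
            # (0.5 s, 1 s, 2 s, ...); add random jitter in production so workers
            # do not retry in lockstep.
            if attempt < self.max_retries:
                await asyncio.sleep(0.5 * 2 ** (attempt - 1))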

        await self.route_to_dlq(record)
        return record

    async def run(self, ingest_root: Path) -> None:
        """Pipeline orchestrator: starts the streaming producer, dispatches workers, drains the queue."""
        producer_task = asyncio.create_task(self._producer(ingest_root))
        consumer_tasks = [
            asyncio.create_task(self._consumer()) for _ in range(self.max_concurrency)
        ]

        await producer_task
        await self.queue.join()

        for task in consumer_tasks:
            task.cancel()
        await asyncio.gather(*consumer_tasks, return_exceptions=True)
        logger.info("Batch processing complete. All workers drained.")

    async def _producer(self, root: Path) -> None:
        async for path in self.file_generator(root):
            record = ESIRecord(file_path=path)
            await self.queue.put(record)

    async def _consumer(self) -> None:
        while True:
            try:
                record = await self.queue.get()
            except asyncio.CancelledError:
                break
            try:
                await self.process_record(record)
            finally:
                # Always acknowledge the item so queue.join() can complete even
                # if processing raises an unexpected error.
                self.queue.task_done()

The producer blocks on a full queue so backpressure reaches directory traversal; a Semaphore(N) caps in-flight work, each worker hashes before it extracts, and the retry backoff sleeps outside the slot so failing items never pin the pool.

Horizontal Scaling & Distributed Execution

The in-process async model above is ideal for a single-node ingestion server, where all concurrency lives in one event loop and one memory space. For distributed litigation-support environments spanning many machines, workload partitioning shifts to a message-broker architecture, but the record contract — hash-first, immutable digest, deterministic DLQ routing — remains identical. Implementing Celery for async eDiscovery batching covers how to decouple the producer/consumer pattern across Kubernetes pods or VM fleets while preserving task idempotency and result-backend consistency, so a redelivered task never double-writes an audit record or re-hashes an already-verified file.

The scaling boundary is worth naming explicitly. In-process async is bounded by a single host’s cores, file descriptors, and memory; a broker-backed model trades that ceiling for network hops, broker durability guarantees, and the operational cost of a result backend. The right cutover point is usually the moment a single node can no longer meet the throughput SLA for a matter’s deadline, or when fault isolation across physical hosts becomes a defensibility requirement in its own right.

Observability & Compliance Metrics

Instrumentation is not optional in a defensible pipeline; the metrics are part of the audit story. Three core KPIs localize any regression to a specific boundary and give litigation support an early warning before a deadline is at risk:

Throughput (files/sec and GB/hr): validates SLA adherence during tight discovery windows and reveals when a stage has become the bottleneck.
Integrity verification rate: the proportion of items whose recomputed digest matches the ingest manifest; any sustained non-100% reading signals filesystem corruption, partial reads, or a broken chain of custody that must halt the run.
DLQ accumulation velocity: the rate at which items enter the dead-letter queue — the earliest indicator of systemic extraction failure or malformed custodial media, long before a human notices the exceptions report is growing.

Export these via OpenTelemetry or Prometheus and forward structured logs to append-only storage so the audit trail survives platform migration. A minimal, runnable instrumentation layer:

python

from prometheus_client import Counter, Gauge, Histogram

# STAGE_LATENCY is labelled by pipeline stage so a latency regression localizes
# to intake, hashing, extraction, or normalization from a single dashboard;
# the counters are labelled by outcome.
BATCH_ITEMS = Counter(
    "esi_batch_items_total", "Records completed", ["status"]
)
INTEGRITY_CHECKS = Counter(
    "esi_batch_integrity_total", "Digest verifications", ["result"]
)
DLQ_DEPTH = Gauge(
    "esi_batch_dlq_depth", "Current dead-letter queue depth"
)
STAGE_LATENCY = Histogram(
    "esi_batch_stage_seconds", "Per-stage processing latency", ["stage"]
)


def record_stage(stage: str, elapsed: float, status: str) -> None:
    """Emit throughput and latency for one completed stage transition."""
    BATCH_ITEMS.labels(status=status).inc()
    STAGE_LATENCY.labels(stage=stage).observe(elapsed)


def record_integrity(matched: bool) -> None:
    """Record whether a recomputed digest matched the ingest manifest."""
    INTEGRITY_CHECKS.labels(result="match" if matched else "mismatch").inc()

Alerting thresholds should trip before memory exhaustion or queue saturation, so throughput and DLQ velocity alarms buy time to scale horizontally rather than firing after a worker has already been OOM-killed mid-batch. Integrity mismatches deserve a page, not a dashboard tile: a single mismatch is a potential chain-of-custody break, and its investigation must meet the same evidentiary standards enforced by the site’s production compliance frameworks.

Conclusion

Async batch processing for eDiscovery ESI workflows eliminates synchronous bottlenecks while enforcing strict chain-of-custody boundaries. By combining memory-aware generators, a bounded queue and semaphore that convert an unbounded firehose into a steady-state rhythm, hash-first execution that anchors integrity before any transformation, and deterministic failure routing that never drops an item silently, legal engineering teams scale ingestion throughput without sacrificing auditability. The compliance guarantee this subsystem provides is precise: every item that reaches the audit manifest has a verified, immutable digest, and every item that could not is accounted for in a reconcilable dead-letter population. Its scaling limit is equally precise — the in-process model is bounded by a single host, and crossing that boundary means adopting a broker-backed distributed design that preserves the same record contract.

Frequently Asked Questions

How do I size the concurrency semaphore for a mixed hashing and extraction workload?

Start from the split between CPU-bound and I/O-bound work. A hashing stage that reads from disk and burns CPU should keep in-flight concurrency near the core count, because oversubscription just thrashes the disk and exhausts file descriptors. An extraction stage that spends most of its time awaiting a downstream service can safely run concurrency well above the core count, at roughly the core count multiplied by one plus the ratio of wait time to service time. Measure event-loop latency in production: once asyncio.sleep(0) round-trips or I/O waits exceed tens of milliseconds, the pool is oversubscribed and you should lower concurrency or move work to another node.

Why must the retry backoff happen outside the semaphore?

Because a slot held during a sleep is a slot no other item can use. If retries recursed while holding the semaphore, or slept through exponential backoff inside the async with self.semaphore block, a handful of failing items in backoff would pin every concurrency slot and deadlock the whole pool while the CPU sits idle. Releasing the semaphore before the backoff sleep — as the iterative loop in the implementation does — keeps healthy items flowing while failing items wait their turn to retry.

What belongs in a dead-letter record, and why persist it as a manifest?

A DLQ record must be self-describing enough to reconcile without re-running the pipeline: the original file path, the final error, the status, and ideally the last-known digest and attempt count. Persisting it as a JSON manifest (written with json.dump, so quotes and newlines in error strings never corrupt the file) turns the exceptions population into an auditable, reconcilable set rather than a black hole. That manifest is what lets a legal team demonstrate that no item was silently lost — every file either reached the audit manifest with a verified digest or appears in the dead-letter set with a documented reason.

When should I move from in-process asyncio to a distributed broker?

Cut over when a single node can no longer meet the throughput SLA for a matter’s deadline, or when fault isolation across physical hosts becomes a defensibility requirement. The in-process model is bounded by one host’s cores, file descriptors, and memory; a broker-backed model trades that ceiling for network hops and the operational cost of a durable result backend. Keep the record contract identical across the boundary so idempotent redelivery never double-writes an audit record.

Native File Ingestion Pipelines — content-signature MIME detection and format-family routing that feeds this processor.
Cryptographic Hash Generation — the streaming SHA-256/MD5 digest logic this design invokes hash-first.
PDF & Text Extraction Engines — the latency-variable extraction stage awaited inside each worker.
Implementing Celery for Async eDiscovery Batching — the distributed, broker-backed variant of this record contract.
Production Compliance Frameworks — the evidentiary standards the integrity and DLQ guarantees must satisfy.
