Implementing Celery for Async eDiscovery Batching: Resolving OOM & Hash Verification Failures

A Celery worker pool running --concurrency=8 --pool=prefork over a 500-file native batch starts logging MemoryError, then signal 9 (SIGKILL), and within seconds the audit trail fills with Hash verification failed mismatches. This failure lands in the Processing stage of the EDRM pipeline — the concurrency tier owned by Async Batch Processing Design — and it breaks the one property that stage exists to guarantee: that every file advances only on a finalized, reproducible digest. When the Linux OOM killer terminates a worker mid-task, cryptographic hash generation never finalizes, the result backend records a partial state, and byte-identical files that should collapse in deduplication instead register as unverified. This page isolates the root cause (unbounded in-memory reads under a fixed cgroup limit, not a broken algorithm) and gives a minimal, deployable fix that restores bounded memory and defensible chain of custody.

Diagnostic Log Signatures

The failure is deterministic, not stochastic: it reproduces whenever a prefork pool with eight children runs against a batch exceeding roughly 500 native files that contains multi-gigabyte containers, deeply nested email archives, or forensic image formats. Worker logs carry a recognizable signature:

text

[2024-03-15 14:22:01,112: ERROR/MainProcess] Task ediscovery.tasks.process_native_file[uuid-1] raised unexpected: MemoryError
[2024-03-15 14:22:01,115: WARNING/MainProcess] Process 'Worker-1' pid:10482 exited with 'signal 9 (SIGKILL)'
[2024-03-15 14:22:01,120: INFO/Worker-2] Hash verification failed for /mnt/esi/case_004/batch_12/IMG_8842.HEIC: expected SHA-256 mismatch (computed vs. manifest)

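Exit code 137 (128 + signal 9) shows the child was killed by SIGKILL rather than exiting on its own. SIGKILL can come from other senders, so confirm the kernel OOM killer with a matching kill record in dmesg or a rising oom_kill counter in the cgroup's memory.events. Symptom checklist: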

Workers exit with code 137 / SIGKILL immediately after a MemoryError, correlated with the largest files in the batch.
The Celery result backend shows tasks stuck in STARTED or flipped to FAILURE with no finalized digest recorded.
Hash verification failed mismatches appear only for files that were in flight when a child died — the digest was never finalized.
Total resident memory tracks the sum of concurrently loaded file sizes, not a bounded per-worker constant.
Restarting with reduced concurrency makes the batch pass, proving the fault is memory pressure, not corruption.
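
One way to separate an OOM kill from any other SIGKILL is to read the cgroup's own counters. The sketch below assumes a cgroup v2 host and a memory.events file at /sys/fs/cgroup/memory.events; that path is an assumption and varies by container runtime. The helper names are hypothetical, not part of Celery.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumed location inside a cgroup v2 container; adjust for the actual runtime.
MEMORY_EVENTS = Path("/sys/fs/cgroup/memory.events")


def parse_memory_events(text: str) -> Dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def oom_kill_count(events_file: Path = MEMORY_EVENTS) -> Optional[int]:
    """Return the cgroup's oom_kill counter, or None if it cannot be read."""
    try:
        text = events_file.read_text()
    except OSError:
        return None  # e.g. a cgroup v1 host: no evidence either way
    return parse_memory_events(text).get("oom_kill", 0)
```

Sampling oom_kill_count() before and after the failing batch, and seeing it increase, ties the exit code 137 lines to the kernel rather than to an operator or supervisor sending SIGKILL.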

Root-Cause Breakdown

The mismatch is a memory-management problem that surfaces as a cryptographic one. Three contributing factors compound:

Unbounded memory allocation. Loading an entire file with pathlib.Path.read_bytes() (or any read-the-whole-thing pattern) before hashing places the full file in resident memory. Eight children each holding a multi-gigabyte container simultaneously exceed the per-worker cgroup ceiling — commonly 4 GB — and the kernel reaps the process.
Stream interruption and descriptor leaks. Extraction libraries such as textract, pdfplumber, and libmagic can retain open file descriptors and intermediate buffers across tasks, so a long-lived child's footprint grows over the batch. When the kernel kills the process mid-task it reclaims that memory, but the in-progress hashlib context is discarded with it, so no digest is finalized. The same fixed-size streaming discipline that keeps native file ingestion memory-flat is what this stage is missing.
Partial-state writes. By default Celery acknowledges a message just before the task starts executing, so a task whose worker dies is not redelivered and the backend is left with a half-finished record. Downstream deduplication then treats an unverified file as authoritative, corrupting the family and violating the chain-of-custody promise that the same input always yields the same digest.
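
The first factor can be observed directly. The following is a minimal sketch, not the production task: it hashes a throwaway file once with a whole-file read and once in fixed-size chunks, and records the peak Python allocation of each with tracemalloc. The function names and the 8 MB test size are illustrative assumptions, and tracemalloc reports Python-level allocations rather than the resident set size the kernel enforces.

```python
import hashlib
import os
import tempfile
import tracemalloc
from pathlib import Path


def peak_bytes(func, path: Path):
    """Run func(path) and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def hash_whole_file(path: Path) -> str:
    # Anti-pattern: the entire file is in memory before hashing starts.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def hash_streamed(path: Path, chunk_size: int = 64 * 1024) -> str:
    # Bounded pattern: only one chunk is held at a time.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def compare(size_bytes: int = 8 * 1024 * 1024):
    """Hash a temporary file both ways; return (digests_equal, peaks...)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(size_bytes))
    path = Path(tmp.name)
    try:
        whole, whole_peak = peak_bytes(hash_whole_file, path)
        streamed, streamed_peak = peak_bytes(hash_streamed, path)
    finally:
        path.unlink()
    return whole == streamed, whole_peak, streamed_peak
```

The whole-file peak scales with the file, while the streamed peak stays near the chunk size; multiplied by eight concurrent children, that difference is what pushes the pool past its ceiling.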

Remediation Architecture

The digest of a stream is independent of how it is chunked only if every byte reaches hashlib.update() exactly once, in source order, and the process survives to finalize it. The fix combines three controls: bounded streaming reads so no single file can spike resident memory, worker recycling that restarts children before they approach the cgroup ceiling, and late acknowledgement so an OOM kill returns the task to the queue instead of committing a partial state.

1. Configure the worker for bounded memory and fault tolerance

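Set worker_max_memory_per_child below the cgroup ceiling so Celery recycles a child before the kernel kills it. The value is in kilobytes and applies to each child separately, and Celery replaces the child only after its current task completes, so it bounds growth across tasks rather than a spike inside one task; the streaming reads in step 2 handle the spike. Cap worker_max_tasks_per_child to release accumulated fragmentation and leaked descriptors, and enable task_acks_late together with task_reject_on_worker_lost so an unacknowledged task returns to the broker on worker loss rather than vanishing.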
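
If the cgroup ceiling covers the whole pool rather than each child, the per-child threshold has to be derived from it. The helper below is a hypothetical sizing sketch, not a Celery API: it splits a pool-wide limit across the children and reserves headroom for the parent process and for growth between Celery's post-task checks. The 15% default headroom is an assumption to tune against observed memory.

```python
def per_child_limit_kib(
    cgroup_limit_bytes: int, concurrency: int, headroom_pct: int = 15
) -> int:
    """Split a pool-wide memory ceiling into a per-child recycle threshold.

    Returns kilobytes, the unit worker_max_memory_per_child expects.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    usable_bytes = cgroup_limit_bytes * (100 - headroom_pct) // 100
    return usable_bytes // concurrency // 1024


# Example: a 4 GiB pool-wide ceiling shared by eight prefork children.
limit_kib = per_child_limit_kib(4 * 1024**3, concurrency=8)
```

For a 4 GiB ceiling and eight children this yields about 435 MiB per child, far below a threshold sized as if each child had the full ceiling to itself.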

2. Hash by streaming fixed-size chunks

Open each file in binary mode and feed fixed 1 MB blocks to hashlib.sha256() until EOF. Peak memory becomes a function of the chunk size, not the file size, so a 40 GB PST hashes in the same footprint as a 40 KB email. The routine keeps all of its state in local variables, so it is safe to run concurrently across the async batch processing worker pool.

python

import hashlib
import logging
from pathlib import Path
from typing import Optional

from celery import Celery

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("ediscovery.batch")

# Fault-tolerant routing plus memory safeguards. The soft memory limit sits
# below the 4 GB cgroup ceiling so Celery recycles a child *before* the kernel
# OOM killer can reap it mid-task and orphan an unfinalized digest.
app = Celery("ediscovery_worker")
app.conf.update(
    task_acks_late=True,                 # ack only after the task completes
    task_reject_on_worker_lost=True,     # requeue if a child dies mid-task
    worker_max_tasks_per_child=50,       # recycle to release leaked FDs/buffers
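    # Per child, in KB (~3.5 GB). If the 4 GB ceiling covers the whole pool
    # rather than each child, divide the budget by the concurrency instead.
    worker_max_memory_per_child=3_500_000,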
)

CHUNK_SIZE = 1024 * 1024  # 1 MB streaming buffer: peak memory is O(chunk), not O(file)


def compute_sha256_streamed(file_path: Path) -> Optional[str]:
    """Compute SHA-256 in bounded memory, streaming fixed-size blocks."""
    sha256 = hashlib.sha256()
    try:
        with open(file_path, "rb") as fh:
            while True:
                chunk = fh.read(CHUNK_SIZE)
                if not chunk:
                    break
                sha256.update(chunk)
        return sha256.hexdigest()
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        logger.error("stream read failed for %s: %s", file_path, exc)
        return None

3. Verify against the manifest and reject partial state

The task validates existence and non-zero size, streams the digest, and compares it to the manifest hash. A transient I/O failure retries with backoff; a genuine mismatch is flagged for review rather than silently committed, so no unverified file enters the production set. This is the ingestion-time counterpart to resolving MD5 and SHA-256 hash divergence across nodes.

python

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def process_native_file(
    self, file_path: str, expected_hash: Optional[str] = None
) -> dict:
    path = Path(file_path)

    # Validation: existence, type, and non-zero size before any hashing.
    if not path.exists() or not path.is_file():
        raise FileNotFoundError(f"native file not found: {file_path}")
    if path.stat().st_size == 0:
        raise ValueError(f"zero-byte file rejected: {file_path}")

    computed_hash = compute_sha256_streamed(path)
    if computed_hash is None:
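        # Transient I/O (e.g. a flaky network share): retry with exponential
        # backoff (60 s, 120 s, 240 s) before surfacing a hard failure to the
        # result backend. Without countdown, every retry waits a fixed 60 s.
        raise self.retry(
            exc=RuntimeError("hash computation aborted on I/O failure"),
            countdown=60 * 2 ** self.request.retries,
        )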

    # Validate against the ingestion manifest when a digest was supplied.
    if expected_hash and computed_hash.lower() != expected_hash.lower():
        logger.warning(
            "hash mismatch for %s: expected=%s computed=%s",
            file_path, expected_hash, computed_hash,
        )
        return {
            "status": "hash_mismatch",
            "file": file_path,
            "computed_hash": computed_hash,
            "expected_hash": expected_hash,
            "requires_review": True,
        }

    return {
        "status": "success",
        "file": file_path,
        "computed_hash": computed_hash,
        "size_bytes": path.stat().st_size,
    }

4. Recover a batch that already OOM-ed

When workers have already died mid-batch, recovery must preserve evidentiary integrity without reprocessing verified assets. The flow below traces triage from an OOM or SIGKILL termination through to a reconciled audit trail.

Isolate partial states. Query the result backend for STARTED, FAILURE, and REVOKED tasks and cross-reference the ingestion manifest to find files that carry no finalized hash.
Re-queue with bounded concurrency. Restart the pool with --pool=solo or --concurrency=2 to isolate memory-heavy files. task_acks_late=True together with task_reject_on_worker_lost=True returns unacknowledged tasks to the broker on worker loss; with late acknowledgement alone, a task whose child is killed is still acknowledged and not redelivered.
Reconcile the audit trail. Log every hash, retry, and mismatch to an append-only or WORM store, keyed by task_id, file_path, computed_hash, and processing_timestamp, so the corrected run is reproducible from the record.
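
The isolation step reduces to a set difference between the manifest and the finalized results. The sketch below assumes the result backend has been exported to a list of dictionaries with file, state, and computed_hash keys; that record shape and the function name are hypothetical, so adapt them to however results are actually stored.

```python
from typing import Iterable, List, Mapping


def files_needing_requeue(
    manifest_paths: Iterable[str], task_records: Iterable[Mapping]
) -> List[str]:
    """Return manifest paths that have no SUCCESS record carrying a digest.

    Anything else (STARTED, FAILURE, REVOKED, or no record at all) is
    treated as unverified and must be reprocessed.
    """
    finalized = {
        rec["file"]
        for rec in task_records
        if rec.get("state") == "SUCCESS" and rec.get("computed_hash")
    }
    # Preserve manifest order so the requeue list is reproducible.
    return [path for path in manifest_paths if path not in finalized]


# Illustrative data only.
manifest = ["/case/a.pst", "/case/b.heic", "/case/c.pdf"]
records = [
    {"file": "/case/a.pst", "state": "SUCCESS", "computed_hash": "ab" * 32},
    {"file": "/case/b.heic", "state": "STARTED", "computed_hash": None},
]
pending = files_needing_requeue(manifest, records)
```

Only the files in pending are resubmitted, which keeps already verified assets out of the re-run.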

Configure the recycle thresholds against the official Celery worker memory-management guidance, and validate the streaming digest against Python’s hashlib documentation for FIPS 180-4 conformance.

Verification Checklist

compute_sha256_streamed reads fixed CHUNK_SIZE blocks; no read_bytes() or whole-file load remains in any task.
worker_max_memory_per_child sits below the cgroup ceiling, and workers recycle gracefully instead of exiting with code 137.
task_acks_late=True and task_reject_on_worker_lost=True are set; a killed child requeues its task instead of committing partial state.
Resident memory per worker stays flat across a full batch, independent of the largest container’s size.
Every file’s computed digest matches its manifest hash, or is routed to review with requires_review=True — never silently committed.
No MemoryError or signal 9 (SIGKILL) lines appear in worker logs across a full re-run of the failing batch.
Each digest event is written to the immutable audit store with task_id, file_path, computed_hash, and timestamp.
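
The last checklist item can be sketched as one JSON line per digest event. This is a minimal illustration under stated assumptions: a local file opened in append mode stands in for the append-only or WORM store, which it does not replace, and the function name and status field are hypothetical.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


def append_audit_event(
    log_path: Path, task_id: str, file_path: str, computed_hash: str, status: str
) -> dict:
    """Append one digest event as a JSON line and flush it to disk."""
    event = {
        "task_id": task_id,
        "file_path": file_path,
        "computed_hash": computed_hash,
        "status": status,
        "processing_timestamp": datetime.now(timezone.utc).isoformat(),
    }
    line = json.dumps(event, sort_keys=True) + "\n"
    # Append mode never rewrites earlier records; fsync pushes the line to
    # stable storage before the task reports its result.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line)
        fh.flush()
        os.fsync(fh.fileno())
    return event
```

Writing the event before the task returns means a worker lost after the write leaves a record that reconciliation can match against the manifest.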

Conclusion

The OOM-plus-mismatch failure is almost never a broken hash. It is a whole-file read that the kernel punished under a fixed memory ceiling. Streaming fixed-size chunks makes peak memory independent of file size, recycling children below the cgroup limit lets Celery restart them gracefully instead of having them reaped mid-digest, and late acknowledgement with reject-on-worker-lost returns an interrupted task to the queue rather than committing an unverified record. With those three controls in place the batch runs at steady-state memory, every digest is finalized before the file advances, and each verification decision is reproducible from an append-only audit trail, which is the defensibility guarantee the batching stage exists to provide.

Frequently Asked Questions

Why does streaming fix the hash mismatch when the algorithm never changed?

Because the mismatch was never a hashing defect — it was an interrupted computation. The whole-file read pushed a worker past its memory ceiling, the kernel killed it mid-task, and the hashlib context was discarded before hexdigest() ran, so the backend recorded a partial or absent digest. Streaming fixed 1 MB chunks holds peak memory at roughly the chunk size regardless of file size, so the worker survives to finalize the digest. The bytes fed to hashlib.update() are identical either way; the difference is that the process now lives long enough to finish.

Should I switch from the prefork pool to gevent or eventlet to avoid OOM kills?

Not for this failure. A greenlet pool changes concurrency semantics but not the per-file memory footprint — a whole-file read still spikes resident memory whichever pool schedules it. Fix the memory pattern first with streaming reads and worker_max_memory_per_child. Prefork remains the right choice for the CPU- and disk-bound hashing this stage performs; gevent helps only when the bottleneck is high-latency I/O waits, such as calling out to a remote extraction service.

Does lowering CHUNK_SIZE change the final digest or hurt throughput?

No to the digest, marginally to throughput. The digest of a stream is mathematically independent of how it is split, provided every byte reaches hashlib.update() exactly once and in order — 1 MB and 8 KB blocks produce identical output. Very small chunks add syscall overhead, and very large ones raise peak memory back toward the ceiling. 1 MB is a practical middle ground; keep it a fixed constant across every worker image so audit logs and reproducibility stay clean.
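
The chunk-size independence is easy to check empirically. The sketch below hashes the same in-memory payload with three chunk sizes and with a single one-shot call; the helper name and the payload are illustrative.

```python
import hashlib
import io


def sha256_in_chunks(stream, chunk_size: int) -> str:
    """Hash a binary stream by feeding fixed-size chunks in source order."""
    digest = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


# Deterministic payload of about 5 MB; its length is not a multiple of the
# larger chunk sizes, so the final short read is exercised too.
payload = bytes(range(256)) * 20_000

reference = hashlib.sha256(payload).hexdigest()
digests = {
    size: sha256_in_chunks(io.BytesIO(payload), size)
    for size in (8 * 1024, 64 * 1024, 1024 * 1024)
}
all_match = all(value == reference for value in digests.values())
```

Every chunked digest equals the one-shot reference, so changing CHUNK_SIZE affects only memory and syscall count, never the recorded hash.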

Async Batch Processing Design — the concurrency tier, backpressure model, and dead-letter routing this worker fits into.
Cryptographic Hash Generation — the chain-of-custody digest requirement an interrupted worker fails to finalize.
Synchronizing MD5 and SHA-256 Across Nodes — the distributed variant of the same streaming-integrity defect.
Native File Ingestion Pipelines — the fixed-size streaming discipline that keeps ingestion memory-flat upstream of this stage.
Debugging pdfplumber Truncation and OOM at Scale — the extraction-stage OOM this batching layer must not propagate.
