Native File Ingestion Pipelines: Implementation Architecture & Production Patterns

Native file ingestion represents the foundational stage of any defensible ESI Ingestion & Processing Workflows pipeline. Unlike converted or normalized formats, native files preserve original filesystem metadata, embedded objects, and application-specific structures that must be captured without modification. A production-grade ingestion pipeline must enforce strict chain-of-custody boundaries, implement memory-aware asynchronous processing, and route files deterministically based on type, size, and schema compliance. This resource details the implementation architecture for native file ingestion within the EDRM Processing stage, emphasizing production-ready Python patterns, structured observability, and deterministic fallback routing so that every artifact remains traceable from raw intake through the handoff to extraction.

Native intake as one deterministic spine: an item is hashed before any transformation, classified on content, then processed and handed to extraction — while ambiguous containers re-gate, disallowed types quarantine, and every failure lands in an append-only dead-letter manifest.

Memory-Aware Async Architecture & Backpressure Control

Native file processing demands strict control over memory footprint and I/O concurrency. Litigation datasets routinely contain multi-terabyte volumes with highly variable file sizes, from kilobyte text files to multi-gigabyte CAD or PST containers. Blocking I/O and unbounded buffering will trigger out-of-memory (OOM) conditions or stall worker threads. The pipeline must leverage asyncio for non-blocking disk operations, enforce explicit memory ceilings per batch, and apply backpressure through bounded queues.

Naive approaches fail predictably at ESI scale for three reasons. First, open(path).read() on a 40 GB PST resolves to a single allocation that no container memory limit can absorb. Second, unbounded task fan-out lets thousands of coroutines each hold an open file descriptor, exhausting the process FD table long before RAM is the binding constraint. Third, synchronous hashing on the event loop blocks every other coroutine while a single large file is digested. The design below addresses all three: files stream in fixed-size chunks (typically 4–8 MB) rather than loading into RAM, a shared byte counter enforces a hard memory ceiling, and an explicit asyncio.Lock serializes the accounting so concurrent workers cannot race past the limit.

Each chunk is processed sequentially for integrity verification, MIME classification, and metadata extraction before being flushed to downstream stages. Batching is governed by a sliding window that respects both memory limits and worker concurrency caps. When memory pressure exceeds thresholds, the pipeline pauses intake, drains pending tasks, and resumes only when resources normalize.

python

import asyncio
import aiofiles
import hashlib
import logging
from pathlib import Path
from typing import AsyncGenerator, Dict, Any
from dataclasses import dataclass

logger = logging.getLogger("native_ingestion")

@dataclass(frozen=True)
class PipelineConfig:
    chunk_size: int = 4 * 1024 * 1024  # 4 MB
    max_queue_depth: int = 50
    memory_ceiling_mb: int = 2048
    worker_concurrency: int = 8

class AsyncIngestionWorker:
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.queue: asyncio.Queue[Path] = asyncio.Queue(maxsize=config.max_queue_depth)
        self.active_bytes = 0
        self._lock = asyncio.Lock()

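    async def _check_memory_pressure(self, file_size: int) -> bool:
        ceiling = self.config.memory_ceiling_mb * 1024 * 1024
        async with self._lock:
            # A file larger than the whole budget is admitted only when nothing
            # else is in flight; otherwise it could never fit and would wait forever.
            if self.active_bytes > 0 and self.active_bytes + file_size > ceiling:
                return True
            self.active_bytes += file_size
            return False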

    async def _release_memory(self, file_size: int):
        async with self._lock:
            self.active_bytes -= file_size

    async def stream_file_chunks(self, file_path: Path) -> AsyncGenerator[bytes, None]:
        async with aiofiles.open(file_path, mode="rb") as f:
            while chunk := await f.read(self.config.chunk_size):
                yield chunk

    async def process_file(self, file_path: Path) -> Dict[str, Any]:
        # stat() is a blocking syscall; run it off the event loop.
        file_size = (await asyncio.to_thread(file_path.stat)).st_size

        # Apply backpressure with an iterative wait loop rather than recursion,
        # which would otherwise risk stack exhaustion under sustained pressure.
        while await self._check_memory_pressure(file_size):
            logger.warning("Memory ceiling reached. Backpressure applied. Waiting for drain.")
            await asyncio.sleep(0.5)

        hasher = hashlib.sha256()
        try:
            async for chunk in self.stream_file_chunks(file_path):
                # Digest each chunk in a worker thread so hashing never blocks the loop.
                await asyncio.to_thread(hasher.update, chunk)
        finally:
            await self._release_memory(file_size)

        return {
            "path": str(file_path),
            "size_bytes": file_size,
            "sha256": hasher.hexdigest(),
            "status": "ingested"
        }
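
The worker above declares a bounded `asyncio.Queue` but does not show how that bound throttles intake. The following is a minimal sketch of the idea, not the pipeline's actual intake loop: `run_bounded_intake`, `_consume`, and the `handle` callback are hypothetical names introduced here for illustration. The producer suspends on `put()` whenever the queue is full, so enumeration can never run ahead of the consumers by more than `max_queue_depth` items.

```python
import asyncio
from pathlib import Path
from typing import Awaitable, Callable, Iterable, List, Optional

# Hypothetical handler type: any coroutine that processes one path.
Handler = Callable[[Path], Awaitable[None]]


async def _consume(
    queue: "asyncio.Queue[Optional[Path]]", handle: Handler, done: List[Path]
) -> None:
    # Pull until the None sentinel arrives; each get() frees one queue slot.
    while (path := await queue.get()) is not None:
        await handle(path)
        done.append(path)


async def run_bounded_intake(
    paths: Iterable[Path],
    handle: Handler,
    *,
    max_queue_depth: int = 50,
    n_consumers: int = 8,
) -> List[Path]:
    queue: "asyncio.Queue[Optional[Path]]" = asyncio.Queue(maxsize=max_queue_depth)
    done: List[Path] = []
    consumers = [
        asyncio.create_task(_consume(queue, handle, done)) for _ in range(n_consumers)
    ]
    for path in paths:
        # put() suspends while the queue is full: this is the backpressure.
        await queue.put(path)
    for _ in range(n_consumers):
        await queue.put(None)  # one shutdown sentinel per consumer
    await asyncio.gather(*consumers)
    return done
```

Error handling is deliberately omitted: if `handle` raised, that consumer would exit, so a real implementation would catch and route the failure inside `_consume` rather than let the pool shrink.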

Concurrency Model, Semaphore Sizing & Retry Policy

Memory accounting caps how many bytes are in flight; a semaphore caps how many files are open concurrently. The two limits are independent and both are required. A single 8-worker semaphore paired with a 2 GB ceiling means the pipeline will admit up to eight small files at once but will admit only one multi-gigabyte container until it drains. This is the Async Batch Processing Design primitive applied to native intake: bound the fan-out, let the shared counter absorb size variance, and gather results deterministically.

Size the semaphore to the I/O profile rather than treating the CPU count as a hard limit. Ingestion is I/O-bound (the worker spends its wall-clock time waiting on disk or network shares), so a starting concurrency of 2–4× the physical core count, tuned against measured storage throughput, keeps the storage layer saturated without thrashing the event loop. Watch event-loop latency: if scheduling delay exceeds roughly 50 ms, the loop is oversubscribed and concurrency should drop or the work should shard across processes.
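
One way to observe the scheduling delay mentioned above is to sleep for a known interval and measure how late the coroutine wakes up. This is a rough sketch under that assumption; `measure_loop_lag`, `is_oversubscribed`, and `LAG_BUDGET_SECONDS` are hypothetical names, and the 50 ms budget simply mirrors the figure in the text.

```python
import asyncio
import time

LAG_BUDGET_SECONDS = 0.050  # the roughly 50 ms threshold from the text


async def measure_loop_lag(interval: float = 0.1, samples: int = 5) -> float:
    """Return the worst observed wake-up delay, in seconds, over several samples."""
    worst = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything beyond the requested interval is time spent waiting to be scheduled.
        lag = time.perf_counter() - start - interval
        worst = max(worst, lag)
    return worst


def is_oversubscribed(worst_lag: float, budget: float = LAG_BUDGET_SECONDS) -> bool:
    """True when the loop is scheduling late enough to justify lowering concurrency."""
    return worst_lag > budget
```

Running `measure_loop_lag` as a background task alongside the ingestion workers gives a signal that can be exported as a metric or used to lower the semaphore bound.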

python

import asyncio
import random
from pathlib import Path
from typing import AsyncIterator, Awaitable, Callable, Dict, Any, List, TypeVar

T = TypeVar("T")

async def with_retry(
    op: Callable[[], Awaitable[T]],
    *,
    attempts: int = 4,
    base_delay: float = 0.25,
) -> T:
    """Retry a transient async operation with exponential backoff and full jitter."""
    last_exc: Exception | None = None
    for attempt in range(attempts):
        try:
            return await op()
        except (TimeoutError, BlockingIOError, ConnectionError) as exc:
            last_exc = exc
            if attempt == attempts - 1:
                break  # no retry follows, so do not sleep after the final attempt
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(random.uniform(0, delay))  # full jitter
    raise RuntimeError("transient retries exhausted") from last_exc

async def bounded_ingest(
    paths: AsyncIterator[Path],
    worker: AsyncIngestionWorker,
    concurrency: int,
) -> List[Dict[str, Any]]:
    """Ingest a stream of paths under a fixed concurrency bound."""
    semaphore = asyncio.Semaphore(concurrency)

    async def _guarded(path: Path) -> Dict[str, Any]:
        async with semaphore:
            return await with_retry(lambda: worker.process_file(path))

    tasks = [asyncio.create_task(_guarded(p)) async for p in paths]
    # gather() returns results in submission order, not completion order,
    # so the output sequence is the same on every run.
    return list(await asyncio.gather(*tasks))

Only genuinely transient errors — network timeouts on a mounted share, temporary file locks, brief queue backpressure — belong on the retry path. Structural errors such as a corrupt header must never be retried; a retry loop around a deterministically failing read wastes throughput and inflates latency percentiles without ever succeeding. The routing section below draws that line explicitly.

MIME Detection & Deterministic Routing

Accurate file classification precedes all downstream processing. Extension-based detection cannot support a defensible classification; binary signature analysis is needed so that the recorded type reflects the file's actual content. Implementing accurate MIME type detection with libmagic ensures that files are routed to appropriate handlers based on actual content rather than superficial naming conventions. A file renamed from payload.exe to invoice.pdf must be caught by its MZ header, not trusted by its extension. Misclassified files are quarantined immediately to prevent pipeline corruption or downstream extraction failures.

Routing decisions are encoded as a deterministic state machine. Supported MIME types proceed to hashing and metadata extraction. Unsupported or encrypted formats trigger a fallback route to a secure quarantine directory with structured audit logging. Ambiguous files — chiefly containers, which may legitimately hold discoverable material or may be a disguised binary — undergo secondary signature scanning before final disposition. Determinism here is a defensibility property: given the same bytes, the pipeline must always reach the same disposition, because a routing decision that varies run-to-run cannot survive scrutiny of the process.

The routing table below fixes the disposition and ordering constraint for each format family the intake stage recognizes.

Format family	Representative MIME types	Ordering constraint	Disposition
Documents	`application/pdf`, `application/vnd.openxmlformats-officedocument.wordprocessingml.document`	hash, then extract text	PROCESS
Email stores	`application/vnd.ms-outlook`, `message/rfc822`	hash, then expand into family	PROCESS (expand)
Archives / containers	`application/zip`, `application/x-7z-compressed`	signature recheck, then recurse	SECONDARY_SCAN
Executables	`application/x-dosexec`, `application/x-executable`	none	QUARANTINE
Encrypted / unknown	`application/octet-stream`	none	QUARANTINE

The flowchart below shows how a classified MIME type is routed to one of the three terminal dispositions.

Routing is a deterministic cascade: allowlisted types are processed, ambiguous containers are re-scanned and re-gated, and everything else is quarantined — the same bytes always reach the same disposition.

python

import magic
from enum import Enum, auto

class RouteDecision(Enum):
    PROCESS = auto()
    QUARANTINE = auto()
    SECONDARY_SCAN = auto()

ALLOWED_MIMES = {
    "application/pdf", "text/plain", "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "message/rfc822", "application/vnd.ms-outlook"
}

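CONTAINER_MIMES = {"application/zip", "application/x-7z-compressed"}

def classify_and_route(file_path: Path) -> RouteDecision:
    mime = magic.from_file(str(file_path), mime=True)
    if mime in ALLOWED_MIMES:
        return RouteDecision.PROCESS
    if mime in CONTAINER_MIMES:
        # Containers are ambiguous: re-check the signature, then recurse.
        return RouteDecision.SECONDARY_SCAN
    # Executables, encrypted payloads, and unknown types all quarantine,
    # matching the routing table above.
    return RouteDecision.QUARANTINE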
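
The secondary signature scan for ambiguous containers is described above but not shown. As a rough illustration, the sketch below re-checks the leading magic bytes with the standard library only, independently of libmagic. The `SIGNATURES` table, `sniff_family`, `secondary_disposition`, and the `"RECURSE"` label are hypothetical names for this example, and the signature list is intentionally short rather than complete.

```python
from pathlib import Path
from typing import Optional

# Well-known leading byte signatures; a real table would be far longer.
SIGNATURES = (
    (b"MZ", "executable"),                 # Windows PE / DOS executable
    (b"\x7fELF", "executable"),            # ELF binary
    (b"PK\x03\x04", "container"),          # ZIP local file header
    (b"7z\xbc\xaf\x27\x1c", "container"),  # 7z archive
    (b"%PDF-", "document"),                # PDF
)


def sniff_family(file_path: Path, probe_bytes: int = 16) -> Optional[str]:
    """Return the format family implied by the first bytes, or None if unrecognized."""
    with file_path.open("rb") as fh:
        head = fh.read(probe_bytes)
    for magic_bytes, family in SIGNATURES:
        if head.startswith(magic_bytes):
            return family
    return None


def secondary_disposition(file_path: Path) -> str:
    """Re-gate a file flagged for secondary scan: confirmed containers recurse, all else quarantines."""
    if sniff_family(file_path) == "container":
        return "RECURSE"
    return "QUARANTINE"
```

Because the decision depends only on the file's bytes, the same input always yields the same disposition, which is the determinism property the routing section relies on. Each member extracted during recursion would pass through the same classification gate again.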

Chain-of-Custody Hashing & Schema Validation

Every native file must be cryptographically hashed at the point of ingestion to establish an immutable chain of custody. Streaming digest computation, as detailed in cryptographic hash generation, guarantees that memory constraints do not compromise digest accuracy. SHA-256 is the industry standard for litigation holds and production exports, and the hash must be anchored before any transformation so it certifies what the custodian produced rather than a self-consistent derivative.

Following hash generation, extracted metadata and structural attributes are validated against recognized legal data models. Validating each record against the Electronic Discovery Reference Model (EDRM) XML schema ensures that custodian, date, and file-attribute fields conform to a defensible standard. Schema validation failures do not halt the pipeline; instead, they are logged as non-fatal compliance warnings and routed to a reconciliation queue, so a single malformed record never blocks throughput while still leaving an auditable trace.

python

import xmlschema
from lxml import etree

# In production, cache the XSD locally to avoid network dependencies.
EDRM_XSD_PATH = Path("/opt/edrm/edrm_xml_v1.0.xsd")

# Compile the schema once at module load; it is immutable and reusable.
_EDRM_SCHEMA = xmlschema.XMLSchema(str(EDRM_XSD_PATH))

def _metadata_to_xml(metadata: Dict[str, Any]) -> etree._Element:
    """Serialize a flat metadata mapping into an EDRM-compatible XML element."""
    root = etree.Element("Document")
    for key, value in metadata.items():
        field = etree.SubElement(root, "Field", name=str(key))
        field.text = "" if value is None else str(value)
    return root

def validate_esi_metadata(metadata: Dict[str, Any]) -> bool:
    # xmlschema validates XML sources, not raw dicts; serialize first.
    document = _metadata_to_xml(metadata)
    try:
        _EDRM_SCHEMA.validate(document)
        return True
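    except xmlschema.XMLSchemaValidationError as e:
        # Non-fatal by design: log a compliance warning and let the caller
        # route the record to the reconciliation queue.
        logger.warning("Schema validation failed: %s", e)
        return False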

Downstream Handoff & Metadata Reconciliation

Once native files pass classification, hashing, and schema validation, they are serialized into a normalized payload and dispatched to extraction services. PDF & text extraction engines consume these payloads to generate searchable text layers, OCR outputs, and embedded object inventories. The ingestion stage guarantees idempotent delivery: duplicate payloads are detected via the SHA-256 digest and silently acknowledged to prevent redundant processing, which keeps retries safe and stops a broker redelivery from double-counting an item.
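
The idempotent-delivery guarantee can be illustrated with a small digest-keyed guard. This is a sketch only: `IdempotentDispatcher` and its `send` callback are hypothetical, and the in-memory set stands in for whatever durable store a real deployment would need to survive restarts.

```python
from typing import Any, Callable, Dict, Set


class IdempotentDispatcher:
    """Forward each payload at most once, keyed by its intake SHA-256 digest."""

    def __init__(self, send: Callable[[Dict[str, Any]], None]) -> None:
        self._send = send
        self._seen: Set[str] = set()  # assumption: in-memory only, for illustration

    def dispatch(self, payload: Dict[str, Any]) -> bool:
        """Return True if the payload was sent, False if it was a duplicate."""
        digest = payload["sha256"]
        if digest in self._seen:
            return False  # duplicate: acknowledge without re-sending
        self._send(payload)
        # Record the digest only after a successful send, so a failed send
        # can be retried without being mistaken for a duplicate.
        self._seen.add(digest)
        return True
```

Keying on the digest rather than the path means a broker redelivery of the same item, or the same bytes arriving under a different name, is acknowledged without triggering extraction twice.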

Post-extraction, metadata drift is common because of parser inconsistencies or timezone normalization. A reconciliation step closes the loop by comparing the original filesystem attributes against the extracted values, flagging discrepancies for legal review, and updating the central case database. The reconciliation record itself carries the intake digest, so any downstream mutation can always be traced back to the unaltered source.
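
A reconciliation pass of this kind reduces to a field-by-field comparison that carries the intake digest along. The sketch below assumes flat metadata dictionaries with string keys; the `reconcile` function and the shape of its result are hypothetical, not a fixed schema.

```python
from typing import Any, Dict, List


def reconcile(
    original: Dict[str, Any], extracted: Dict[str, Any], sha256: str
) -> Dict[str, Any]:
    """Compare intake attributes with extracted values and report every difference."""
    drift: List[Dict[str, Any]] = []
    # Sorting the union of keys keeps the report order stable across runs.
    for field in sorted(original.keys() | extracted.keys()):
        before = original.get(field)
        after = extracted.get(field)
        if before != after:
            drift.append({"field": field, "original": before, "extracted": after})
    return {
        "sha256": sha256,  # ties the record back to the unaltered source
        "drift": drift,
        "needs_review": bool(drift),
    }
```

Comparing raw strings will flag a timestamp that was merely re-expressed in another timezone. Whether that counts as drift or should be normalized before comparison is a policy choice this sketch leaves open.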

Resilience & Failure Routing

A defensible pipeline is defined as much by how it fails as by how it succeeds. Errors are categorized into three severity tiers, and the tier determines the route:

Transient (retryable): Network timeouts, temporary file locks, or queue backpressure. Handled via exponential backoff with jitter, as shown in the retry helper above.
Structural (non-retryable): Corrupted file headers, unreadable sectors, or invalid encodings. Routed to a dead-letter manifest with a forensic snapshot and never retried.
Compliance (audit-only): Schema mismatches, missing custodian fields, or policy exceptions. Logged for legal-hold review without halting throughput.

The following flowchart maps each error tier to its handling path.

The error tier picks the route: only transient faults are retried; structural faults are dead-lettered with a forensic snapshot and never retried; compliance faults are logged and the item continues.
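
The tier assignment itself can be expressed as a small classifier over exception types. This is an illustrative sketch: `ErrorTier`, `ComplianceError`, and `classify_error` are hypothetical names, and the transient set simply reuses the exception types caught by the retry helper. Treating every unrecognized exception as structural is an assumption made here so that unknown failures are never retried blindly.

```python
from enum import Enum


class ErrorTier(Enum):
    TRANSIENT = "transient"
    STRUCTURAL = "structural"
    COMPLIANCE = "compliance"


class ComplianceError(Exception):
    """Hypothetical marker for schema mismatches and policy exceptions."""


# The same exception types the retry helper treats as retryable.
TRANSIENT_TYPES = (TimeoutError, BlockingIOError, ConnectionError)


def classify_error(exc: BaseException) -> ErrorTier:
    """Map an exception to the tier that decides its route."""
    if isinstance(exc, ComplianceError):
        return ErrorTier.COMPLIANCE
    if isinstance(exc, TRANSIENT_TYPES):
        return ErrorTier.TRANSIENT
    # Default: corrupt headers, bad encodings, and anything unknown are
    # dead-lettered rather than retried.
    return ErrorTier.STRUCTURAL
```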

The dead-letter path is not a discard path. Every structurally failed item is written to an append-only manifest keyed by a correlation ID and, where the read got far enough, its partial digest. Because the manifest is newline-delimited JSON on a write-once store, it doubles as evidence that the pipeline accounted for every input — a completeness guarantee that matters as much under audit as the successfully processed set. A circuit breaker complements the manifest: when the structural-failure rate on a source crosses a threshold (a corrupt network share, a failing disk), intake from that source is halted rather than grinding every file into the dead-letter queue.

python

import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DeadLetterRecord:
    file_path: str
    error_class: str
    error_tier: str
    sha256: str | None
    correlation_id: str
    recorded_at: float

def write_dead_letter(record: DeadLetterRecord, manifest_dir: Path) -> None:
    """Append an immutable dead-letter entry as newline-delimited JSON."""
    manifest_dir.mkdir(parents=True, exist_ok=True)
    manifest = manifest_dir / "dead_letter_manifest.jsonl"
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), sort_keys=True) + "\n")
    logger.error(json.dumps({
        "event": "dead_letter",
        "correlation_id": record.correlation_id,
        "error_tier": record.error_tier,
        "error_class": record.error_class,
    }))
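
The circuit breaker that complements the manifest is described above without code. One possible shape, under stated assumptions, is a per-source sliding window over recent outcomes. `SourceCircuitBreaker` and its default thresholds are hypothetical values chosen for illustration, not tuned recommendations.

```python
from collections import deque
from typing import Deque


class SourceCircuitBreaker:
    """Halt intake from one source when its recent structural-failure rate is too high."""

    def __init__(
        self, window: int = 100, max_failure_rate: float = 0.2, min_samples: int = 20
    ) -> None:
        self._outcomes: Deque[bool] = deque(maxlen=window)  # True = structural failure
        self._max_failure_rate = max_failure_rate
        self._min_samples = min_samples
        self.is_open = False

    def record(self, structural_failure: bool) -> None:
        self._outcomes.append(structural_failure)
        # Wait for a minimum sample count so one early failure cannot trip the breaker.
        if len(self._outcomes) >= self._min_samples:
            rate = sum(self._outcomes) / len(self._outcomes)
            if rate > self._max_failure_rate:
                self.is_open = True

    def allow(self) -> bool:
        """False once the breaker has tripped; intake from this source should stop."""
        return not self.is_open

    def reset(self) -> None:
        """Manual re-arm after an operator has inspected the source."""
        self._outcomes.clear()
        self.is_open = False
```

The breaker stays open until `reset()` is called. Requiring a deliberate re-arm is a design choice in this sketch, on the reasoning that a failing disk or share should be inspected before intake resumes.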

Observability & Compliance Metrics

A defensible pipeline requires structured, machine-readable logging. Each ingestion event emits a JSON-formatted record containing a trace ID, file path, MIME type, hash digest, routing decision, and latency. On top of that event stream, three KPIs tell an operator whether the stage is healthy and whether it remains defensible:

Throughput — files fully ingested per unit time, labeled by disposition, so a spike in quarantines is visible immediately.
Integrity rate — the rolling share of files whose independent re-hash matches the digest recorded at intake. Anything below 100% is a chain-of-custody incident, not a performance note.
Dead-letter velocity — the rate at which files enter the dead-letter manifest. A rising slope signals source corruption before it swamps the queue.

python

from prometheus_client import Counter, Gauge, Histogram

INGEST_THROUGHPUT = Counter(
    "ingest_files_total", "Files fully ingested", ["disposition"]
)
INTEGRITY_RATE = Gauge(
    "ingest_integrity_ratio", "Rolling share of files whose re-hash matches intake"
)
DLQ_VELOCITY = Counter(
    "ingest_dead_letter_total", "Files routed to the dead-letter manifest", ["error_tier"]
)
HASH_LATENCY = Histogram(
    "ingest_hash_seconds", "Wall-clock seconds to stream-hash one file"
)

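_INTEGRITY_ALPHA = 0.05  # smoothing factor; an assumed starting point, tune per workload
_integrity_ewma = 1.0

def record_disposition(disposition: str, integrity_ok: bool, seconds: float) -> None:
    global _integrity_ewma
    INGEST_THROUGHPUT.labels(disposition=disposition).inc()
    HASH_LATENCY.observe(seconds)
    # Exponentially weighted moving average: any mismatch pulls the gauge
    # below 1.0, and later matches only gradually restore it.
    sample = 1.0 if integrity_ok else 0.0
    _integrity_ewma = _INTEGRITY_ALPHA * sample + (1 - _INTEGRITY_ALPHA) * _integrity_ewma
    INTEGRITY_RATE.set(_integrity_ewma)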
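
The JSON event record described at the start of this section could be assembled along the following lines. This is a sketch using only the standard library; `build_ingestion_event` and its field names are hypothetical and simply follow the fields listed in the text (trace ID, file path, MIME type, hash digest, routing decision, latency).

```python
import json
import time
import uuid
from typing import Optional


def build_ingestion_event(
    *,
    file_path: str,
    mime_type: str,
    sha256: str,
    route: str,
    started_at: float,
    trace_id: Optional[str] = None,
) -> str:
    """Serialize one ingestion event as a single JSON line."""
    event = {
        "event": "ingested",
        "trace_id": trace_id or uuid.uuid4().hex,
        "file_path": file_path,
        "mime_type": mime_type,
        "sha256": sha256,
        "route": route,
        # started_at is expected to come from time.monotonic() at intake.
        "latency_seconds": round(time.monotonic() - started_at, 6),
    }
    # sort_keys keeps the serialized key order stable across runs.
    return json.dumps(event, sort_keys=True)
```

A caller would take `started_at = time.monotonic()` before hashing begins and pass the returned string to the logger, so each line in the log stream is independently parseable.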

Alerting thresholds must be calibrated to trigger before memory exhaustion or queue saturation. Track ingestion_queue_depth, hash_compute_latency, mime_classification_accuracy, and quarantine_rate, whether the transport is Prometheus scrapes or OpenTelemetry spans, and page on the integrity rate the moment it leaves 1.0. For authoritative implementation references, consult the official Python asyncio documentation for event-loop tuning, NIST SP 800-107 Rev 1 for cryptographic hash compliance, and the Python hashlib reference for secure digest generation.

Conclusion

Native file ingestion is where defensibility is won or lost: it is the first point at which raw custodian data enters an automated system, and every downstream guarantee inherits from the integrity anchored here. By streaming files in bounded chunks, capping concurrency and in-flight bytes independently, classifying on content signatures rather than names, hashing before any transformation, and routing failures to an immutable dead-letter manifest, the pipeline delivers a process that is deterministic, reproducible, and auditable end to end. The honest scaling limit is the storage layer — throughput is ultimately bounded by how fast the source media can be read — but within that bound the design holds memory flat regardless of file size and keeps a complete, per-artifact record that survives scrutiny of the method itself.

Frequently Asked Questions

How do I keep a single multi-gigabyte PST from stalling the whole queue?

Cap in-flight bytes and open files with two independent limits. The shared memory counter admits only one very large container at a time while still letting small files flow, and the semaphore bounds the total number of concurrent reads. Because hashing streams in fixed 4–8 MB chunks, the worker's memory stays flat whether the file is 4 KB or 40 GB. Once large containers are decoupled through the task broker, the oversized item occupies its own slot without blocking the small documents queued behind it.

Why is a file extension not enough to route a native file?

Extensions are trivially spoofed and carry no evidentiary weight. A binary renamed to .pdf still begins with its true signature — MZ for a Windows executable, %PDF for a real PDF — and only content-based detection with libmagic reads those bytes. Routing on the extension would send an executable into a text extractor and, worse, produce a disposition that cannot be defended, because the classification would not reflect what the file actually is.

What happens to a file that fails EDRM schema validation?

It is not dropped. The record is logged as a non-fatal compliance warning with a correlation ID and routed to a reconciliation queue, while the original file is preserved unaltered. This keeps throughput moving — one malformed metadata field never blocks the batch — and leaves a complete audit trail, so a reviewer can later repair the field and the item rejoins normal processing without any gap in the ledger.

Should transient and structural errors share a retry path?

No. Retries belong only to transient faults (share timeouts, temporary locks, brief backpressure) where a second attempt can plausibly succeed, and those use exponential backoff with full jitter to avoid thundering herds. Structural faults such as a corrupt header fail deterministically, so retrying them only burns throughput and inflates latency percentiles; they go straight to the dead-letter manifest with a forensic snapshot, and a circuit breaker halts a source whose structural-failure rate spikes.

Related

Cryptographic Hash Generation — memory-aware streaming SHA-256/MD5 and audit-ledger registration that anchors intake integrity.
Accurate MIME Type Detection with libmagic — the content-signature classifier that drives routing.
PDF & Text Extraction Engines — the downstream consumers of the normalized ingestion payload.
Async Batch Processing Design — semaphore-bounded worker pools, backpressure, and dead-letter routing.
Production Compliance Frameworks — the logging and chain-of-custody controls this stage must satisfy.
