PDF & Text Extraction Engines: Deterministic Tiered Extraction for eDiscovery

Deterministic text extraction is the transformation step that turns opaque PDFs, scanned images, and embedded text layers into the searchable, review-ready payloads the rest of the ESI Ingestion & Processing Workflows pipeline depends on. It owns the EDRM Processing stage boundary between raw native files and indexed review content: get it wrong and privilege review, deduplication, and production all inherit silent gaps. This engine solves one precise problem — extract every recoverable character from every document, or fail loudly with a structured manifest — without ever altering the underlying evidence or breaking chain of custody. It is built around bounded async batching, strict per-worker memory ceilings, and a deterministic fallback ladder that escalates from native text parsing to cross-reference repair to rasterized OCR before it will quarantine a file.

Memory-Aware Batching & Design Rationale

PDF extraction at enterprise scale fails in two predictable ways: synchronous I/O that blocks the event loop, and unbounded object allocation that drives garbage-collection thrashing until the kernel OOM-killer terminates the worker. Naive code that opens every document eagerly and holds page objects for the life of a batch cannot survive a multi-terabyte matter, because pdfminer.six — the parser under pdfplumber — loads the entire cross-reference table, font dictionaries, and resource streams into the Python heap per document. The engine therefore runs as a stateless worker that consumes file paths emitted by the upstream native file ingestion pipeline and emits normalized JSONL, holding no state between documents.

Memory pressure is contained by three disciplines working together. First, documents are grouped into micro-batches (typically 10–25 files) so peak resident set size is a function of batch size, not collection size. Second, every PDF handle, page object, and intermediate string buffer is scoped inside a context manager and explicitly dereferenced before the next document is opened. Third, each worker process is capped at 2 GB RSS, and batch sizing is tuned down for complex documents (high-resolution /XObject scans, overlapping /Form text layers) that inflate heap usage. This is what lets long-running workers hold a stable RSS across multi-day processing windows on datasets exceeding 10 TB, rather than sawtoothing toward an OOM kill.
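The first discipline can be sketched in a few lines. This is a minimal illustration rather than the engine's actual batching code: `micro_batches` is a hypothetical helper, and its default of 15 is only an assumption that sits inside the 10–25 range quoted above.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def micro_batches(paths: Iterable[Path], batch_size: int = 15) -> Iterator[List[Path]]:
    """Yield lists of at most batch_size paths without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        # islice pulls only the next batch_size items from the shared iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because the generator consumes its input lazily, a listing for a very large matter never has to exist as one list; only the current batch is resident at any time.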

Async Execution Model & Backpressure

The pipeline enforces a strict producer–consumer model using asyncio.Queue with bounded capacity, as documented in the official Python asyncio queue guidelines. A bounded queue is itself the backpressure mechanism: when consumers fall behind and the queue fills, await queue.put() suspends the producer coroutine until a slot frees up, instead of letting an in-memory backlog grow without limit. This mirrors the concurrency contract established in async batch processing design, so an extraction worker can be dropped into either an in-process event loop or a broker-backed fleet without changing its record contract.
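The backpressure behaviour is easy to observe in isolation. The sketch below is a toy, not part of the engine: `run_backpressure_demo` is a hypothetical function, and the sleep stands in for a slow extraction step. It records the queue depth after every put to show that depth never exceeds the bound.

```python
import asyncio
from typing import List


async def run_backpressure_demo(total: int = 6, maxsize: int = 2) -> List[int]:
    """Return the queue depth seen after each put; it never exceeds maxsize."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    depths: List[int] = []

    async def producer() -> None:
        for item in range(total):
            await queue.put(item)  # suspends here while the queue is full
            depths.append(queue.qsize())

    async def consumer() -> None:
        for _ in range(total):
            await asyncio.sleep(0.01)  # simulate a slow extraction step
            await queue.get()
            queue.task_done()

    await asyncio.gather(producer(), consumer())
    return depths
```

The producer is throttled to the consumer's pace by the queue itself; no separate rate limiter is involved.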

Two operations in this engine are blocking rather than cooperatively async: pdf2image rasterization and pytesseract OCR. Both wrap external native tooling (Poppler's pdftoppm/pdftocairo and the Tesseract binary, respectively) and block the calling thread until that work finishes, so they never yield to the event loop. Running them directly inside a coroutine would stall every other document sharing that loop. The implementation therefore dispatches them through asyncio.to_thread, which keeps the loop responsive while a worker thread waits on the native call. Hashing is streamed in fixed 8 KB reads so digest computation never materializes a whole file in memory, and the micro-batch drain issues a single explicit gc.collect() at the end of each batch to break the reference cycles that pdfminer objects tend to form in long-lived loops.
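Streamed hashing is also blocking file I/O, so one option is to give it the same off-loop treatment. The following is a sketch under that assumption; `hash_file` and `hash_file_async` are illustrative names, not the engine's API.

```python
import asyncio
import hashlib
from pathlib import Path
from typing import Dict

CHUNK_SIZE = 8192  # 8 KB reads, matching the figure quoted in the text


def hash_file(path: Path) -> Dict[str, str]:
    """Compute SHA-256 and MD5 in one pass without loading the whole file."""
    sha256 = hashlib.sha256()
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        while chunk := handle.read(CHUNK_SIZE):
            sha256.update(chunk)
            md5.update(chunk)
    return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}


async def hash_file_async(path: Path) -> Dict[str, str]:
    """Run the blocking read loop in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(hash_file, path)
```

Peak memory for hashing stays at one chunk regardless of file size, which is what keeps this step compatible with the per-worker ceiling.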

Primary Extraction & Deterministic Fallback Routing

The primary path leverages pdfplumber for high-fidelity text-layer parsing, preserving spatial coordinates, font metadata, and embedded annotations. Detailed strategies for scaling this library — and the silent-truncation and OOM failure modes it exhibits — are covered in extracting embedded text with pdfplumber at scale. When the primary engine hits structural corruption, a missing cross-reference table, or a non-standard compression filter, the document escalates to a secondary tier. That tier first uses pikepdf to rebuild the cross-reference table and re-parse the recovered copy; if no text layer survives repair, it rasterizes the pages and applies OCR via pytesseract. Only when every tier fails is the file quarantined with a structured error manifest — never silently dropped.

Routing is deterministic because failures are mapped to a controlled vocabulary rather than raw exception text. Each category (MALFORMED_XREF, MISSING_TEXT_LAYER, ENCRYPTED_NO_KEY, STREAM_DECOMPRESSION_FAIL) selects a specific fallback strategy and logging payload, and every routing decision is recorded so the escalation path a document took is reconstructable months later during a defensibility challenge. The table below shows the routing contract each category triggers.

Error category	Detected at	Fallback action	Terminal status if unrecovered
`MALFORMED_XREF`	pdfplumber open / PDFSyntaxError	`pikepdf` repair + reparse	Rasterize + OCR, then quarantine
`MISSING_TEXT_LAYER`	Empty text after primary parse	Rasterize + OCR	Quarantine
`STREAM_DECOMPRESSION_FAIL`	Filter decode error	`pikepdf` repair + reparse	Quarantine
`ENCRYPTED_NO_KEY`	Password/permission wall	None — key required	Quarantine (await decryption key)
`UNKNOWN`	Uncategorized exception	Rasterize + OCR	Quarantine with raw trace
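The routing contract in the table can be expressed as data rather than branching logic. This is one possible encoding, offered as a sketch: `Action`, `ROUTES`, and `plan_for` are hypothetical names, and the enum is repeated here only so the example stands alone.

```python
from enum import Enum
from typing import Dict, Tuple


class ErrorCategory(str, Enum):
    MALFORMED_XREF = "MALFORMED_XREF"
    MISSING_TEXT_LAYER = "MISSING_TEXT_LAYER"
    ENCRYPTED_NO_KEY = "ENCRYPTED_NO_KEY"
    STREAM_DECOMPRESSION_FAIL = "STREAM_DECOMPRESSION_FAIL"
    UNKNOWN = "UNKNOWN"


class Action(str, Enum):
    REPAIR_REPARSE = "REPAIR_REPARSE"
    RASTERIZE_OCR = "RASTERIZE_OCR"
    QUARANTINE = "QUARANTINE"


# Ordered steps per category, transcribed from the routing table.
ROUTES: Dict[ErrorCategory, Tuple[Action, ...]] = {
    ErrorCategory.MALFORMED_XREF: (Action.REPAIR_REPARSE, Action.RASTERIZE_OCR, Action.QUARANTINE),
    ErrorCategory.MISSING_TEXT_LAYER: (Action.RASTERIZE_OCR, Action.QUARANTINE),
    ErrorCategory.STREAM_DECOMPRESSION_FAIL: (Action.REPAIR_REPARSE, Action.QUARANTINE),
    ErrorCategory.ENCRYPTED_NO_KEY: (Action.QUARANTINE,),
    ErrorCategory.UNKNOWN: (Action.RASTERIZE_OCR, Action.QUARANTINE),
}


def plan_for(category: ErrorCategory) -> Tuple[Action, ...]:
    """Return the ordered fallback steps for a category."""
    return ROUTES[category]
```

Keeping the contract in one mapping means a routing change is a one-line diff that can be reviewed and versioned alongside the engine version.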

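The tiered fallback runs in a fixed order: primary pdfplumber parse, then pikepdf repair and re-parse, then rasterized OCR. Each failed tier escalates to the next, and only a document that exhausts all of them is quarantined.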
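That escalation order can be sketched as a generic ladder runner. It is a simplified model, not the worker shown later: `run_ladder` and the `Tier` alias are hypothetical, and each tier is assumed to be a callable that returns text or raises.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A tier is a (name, extractor) pair; the extractor returns text or raises.
Tier = Tuple[str, Callable[[str], Optional[str]]]


def run_ladder(path: str, tiers: Sequence[Tier]) -> Tuple[Optional[str], List[str]]:
    """Try tiers in order; return (text, tiers_attempted). Text is None if all fail."""
    attempted: List[str] = []
    for name, extract in tiers:
        attempted.append(name)
        try:
            text = extract(path)
        except Exception:
            continue  # a failed tier escalates to the next one
        if text and text.strip():
            return text, attempted
    return None, attempted
```

Returning the list of attempted tiers alongside the text is what makes the escalation path reconstructable later; a `None` result is the signal to quarantine.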

Resilience, Quarantine & Dead-Letter Manifests

A defensible engine treats an unextractable document as a first-class outcome, not an exception to swallow. When the fallback ladder is exhausted, the file is routed to a quarantine set with a self-describing manifest: the original SHA-256 and MD5 digests, the final error category, the tiers attempted, a UTC timestamp, and the engine version. That manifest is what lets a legal team demonstrate that no item was silently lost — every file either reached the review index with a validated payload or appears in the dead-letter set with a documented reason. This is the same accountability model the parent async batch processing layer applies to its dead-letter queue, and it is what production compliance auditors expect to see reconciled against the ingestion count.
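A manifest with the fields listed above might look like the following. The class and field names are assumptions made for illustration; the source specifies the contents of the manifest, not its schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class DeadLetterManifest:
    """Self-describing record for one quarantined file (hypothetical schema)."""
    file_path: str
    sha256: str
    md5: str
    error_category: str
    tiers_attempted: List[str]
    engine_version: str
    # Timestamp is captured in UTC at construction time.
    quarantined_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_jsonl(self) -> str:
        """Serialize to a single JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```

One line per quarantined file keeps reconciliation simple: the count of manifest lines plus the count of validated payloads should equal the ingestion count.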

Quarantine also enforces a circuit boundary. An ENCRYPTED_NO_KEY document, for example, must not loop through rasterization and OCR forever — OCR of an encrypted page yields noise, not evidence — so its category short-circuits directly to the dead-letter set to await a decryption key rather than burning worker time and emitting garbage text that would pollute privilege review. Transient failures (a locked file on a network share, a momentary I/O error) are the only class eligible for retry; structural failures are terminal by design so the same corrupt document cannot pin a worker across the whole batch.
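The split between retryable and terminal failures can be sketched as a small retry wrapper. Which exception types count as transient is an assumption here (the tuple below is illustrative, not taken from the engine), and `with_retry` is a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these OSError subclasses approximate "locked file" and
# "momentary I/O error". Tune the tuple for the storage layer in use.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError, BlockingIOError, PermissionError)


def with_retry(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Retry only transient errors, with exponential backoff between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # retries exhausted: surface the transient error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Any exception outside the transient tuple, such as a parser error on a corrupt file, propagates on the first attempt, so a structurally broken document costs exactly one try.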

Chain-of-Custody & Cryptographic Verification

Before any text is extracted, the engine computes SHA-256 and MD5 hashes to establish a baseline for evidentiary integrity. Those digests are cross-referenced against the upstream manifest to detect bit-rot or unauthorized modification in transit, and this step is formally integrated into the cryptographic hash generation protocol that anchors chain of custody across the whole pipeline. Because the same digest travels with the payload, every extracted character — native, repaired, or OCR’d — remains traceable to the unaltered source bitstream, and the SHA-256 value doubles as the deduplication key consumed downstream by hash-based deduplication strategies. Each emitted payload therefore carries the original hashes, processing timestamp, engine version, and extraction status to form an immutable audit trail aligned with EDRM guidance and the site’s production compliance frameworks.
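The cross-reference against the upstream manifest reduces to a lookup and a comparison. In this sketch the manifest is assumed to be a mapping from a file identifier to its expected SHA-256 hex digest; that shape, and the name `verify_against_manifest`, are illustrative assumptions.

```python
from typing import Mapping


def verify_against_manifest(file_id: str, sha256_hex: str, upstream: Mapping[str, str]) -> bool:
    """Return True only if the upstream manifest lists this file with the same digest."""
    expected = upstream.get(file_id)
    if expected is None:
        return False  # a file missing from the manifest cannot be verified
    # Hex digests are case-insensitive, so normalize before comparing.
    return expected.strip().lower() == sha256_hex.strip().lower()
```

A `False` result here is the integrity failure described above and should stop extraction for that file rather than be logged and ignored.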

Production-Grade Implementation

The following implementation demonstrates a complete, auditable extraction worker. It features bounded async queues, explicit resource cleanup, Pydantic schema validation, structured JSON logging, and deterministic fallback routing.

python

import asyncio
import hashlib
import json
import logging
import gc
from pathlib import Path
from typing import Dict, Any, Optional
from enum import Enum

import pdfplumber
import pikepdf
import pytesseract
from pdf2image import convert_from_path
from pdfminer.pdfparser import PDFSyntaxError
from pydantic import BaseModel, ValidationError
from datetime import datetime, timezone

# Structured logging configuration for SIEM ingestion
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger("pdf_extraction_engine")

class ExtractionStatus(str, Enum):
    SUCCESS = "SUCCESS"
    FALLBACK_OCR = "FALLBACK_OCR"
    QUARANTINED = "QUARANTINED"

class ErrorCategory(str, Enum):
    MALFORMED_XREF = "MALFORMED_XREF"
    MISSING_TEXT_LAYER = "MISSING_TEXT_LAYER"
    ENCRYPTED_NO_KEY = "ENCRYPTED_NO_KEY"
    STREAM_DECOMPRESSION_FAIL = "STREAM_DECOMPRESSION_FAIL"
    UNKNOWN = "UNKNOWN"

class ExtractionPayload(BaseModel):
    file_path: str
    sha256: str
    md5: str
    status: ExtractionStatus
    extracted_text: str
    page_count: int
    error_category: Optional[ErrorCategory] = None
    processing_timestamp: str
    engine_version: str = "1.0.0"

class ExtractionWorker:
    def __init__(self, queue_size: int = 50, batch_size: int = 15):
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
        self.batch_size = batch_size
        self._shutdown = False

    async def compute_hashes(self, file_path: Path) -> Dict[str, str]:
        """Deterministic cryptographic verification before extraction."""
        sha256 = hashlib.sha256()
        md5 = hashlib.md5()
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                sha256.update(chunk)
                md5.update(chunk)
        return {"sha256": sha256.hexdigest(), "md5": md5.hexdigest()}

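    async def extract_primary(self, file_path: Path) -> Dict[str, Any]:
        """High-fidelity text layer extraction via pdfplumber."""
        try:
            with pdfplumber.open(file_path) as pdf:
                pages = [page.extract_text() for page in pdf.pages]
                page_count = len(pdf.pages)
        except PDFSyntaxError as e:
            return {"error": str(e), "category": ErrorCategory.MALFORMED_XREF}
        except Exception as e:
            return {"error": str(e), "category": ErrorCategory.UNKNOWN}

        text = "\n".join(filter(None, pages))
        if not text.strip():
            # An empty result is a routing signal (image-only or scanned PDF),
            # not a success: report it so the OCR tier gets a chance.
            return {"error": "empty text layer", "category": ErrorCategory.MISSING_TEXT_LAYER}
        return {"text": text, "page_count": page_count, "status": ExtractionStatus.SUCCESS}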

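    async def extract_fallback(self, file_path: Path) -> Dict[str, Any]:
        """Secondary tier: repair-and-reparse, then rasterized OCR."""
        import tempfile  # local import keeps this method self-contained

        # First, attempt to recover a damaged cross-reference table with pikepdf,
        # then re-run the high-fidelity text-layer extractor on the repaired copy.
        # The copy lives in a scratch directory so nothing is written next to
        # the evidence file, and it is removed when the block exits.
        try:
            with tempfile.TemporaryDirectory() as scratch:
                repaired = Path(scratch) / "repaired.pdf"
                with pikepdf.open(file_path) as pdf:
                    pdf.save(repaired)
                with pdfplumber.open(repaired) as pdf:
                    pages = [page.extract_text() for page in pdf.pages]
            text = "\n".join(filter(None, pages))
            if text.strip():
                return {
                    "text": text,
                    "page_count": len(pages),
                    # The schema has no separate "repaired" status, so a
                    # recovered text layer is reported under the fallback status.
                    "status": ExtractionStatus.FALLBACK_OCR,
                }
        except pikepdf.PasswordError as e:
            # No key available: short-circuit to quarantine instead of OCR-ing noise.
            return {"error": str(e), "category": ErrorCategory.ENCRYPTED_NO_KEY}
        except Exception as e:
            # Record the failed repair so the escalation path stays reconstructable.
            logger.warning(f"Repair tier failed for {file_path.name}: {e}")

        # Final tier: rasterize each page and OCR it. pytesseract operates on
        # images, so the PDF must be converted to page images first. Both calls
        # block on external native tooling, so dispatch them off the event loop.
        try:
            images = await asyncio.to_thread(convert_from_path, str(file_path))
            ocr_pages = [
                await asyncio.to_thread(pytesseract.image_to_string, image, lang="eng")
                for image in images
            ]
            return {
                "text": "\n".join(ocr_pages),
                "page_count": len(ocr_pages),
                "status": ExtractionStatus.FALLBACK_OCR,
            }
        except Exception as e:
            # An OCR-tier failure is not necessarily a stream decode error,
            # so it is reported as UNKNOWN with the raw message attached.
            return {"error": str(e), "category": ErrorCategory.UNKNOWN}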

    def build_payload(self, file_path: Path, hashes: Dict[str, str], result: Dict[str, Any]) -> ExtractionPayload:
        """Schema validation and payload normalization."""
        return ExtractionPayload(
            file_path=str(file_path),
            sha256=hashes["sha256"],
            md5=hashes["md5"],
            status=result.get("status", ExtractionStatus.QUARANTINED),
            extracted_text=result.get("text", ""),
            page_count=result.get("page_count", 0),
            error_category=result.get("category"),
            processing_timestamp=datetime.now(timezone.utc).isoformat()
        )

    async def process_file(self, file_path: Path) -> Optional[ExtractionPayload]:
        """Orchestrates extraction with explicit memory boundaries."""
        hashes = await self.compute_hashes(file_path)
        primary = await self.extract_primary(file_path)

        if primary.get("status") == ExtractionStatus.SUCCESS:
            return self.build_payload(file_path, hashes, primary)

        logger.warning(f"Primary extraction failed for {file_path.name}. Routing to fallback tier.")
        fallback = await self.extract_fallback(file_path)

        if fallback.get("status") == ExtractionStatus.FALLBACK_OCR:
            return self.build_payload(file_path, hashes, fallback)

        logger.error(f"Extraction exhausted all tiers for {file_path.name}. Quarantining.")
        return self.build_payload(file_path, hashes, {
            "status": ExtractionStatus.QUARANTINED,
            "category": fallback.get("category", ErrorCategory.UNKNOWN)
        })

    async def worker_loop(self):
        """Stateless consumer loop with bounded concurrency."""
        while not self._shutdown:
            batch = []
            for _ in range(self.batch_size):
                try:
                    item = await asyncio.wait_for(self.queue.get(), timeout=2.0)
                    batch.append(item)
                except asyncio.TimeoutError:
                    break

            if not batch:
                continue

            for file_path in batch:
                try:
                    payload = await self.process_file(file_path)
                    if payload:
                        logger.info(json.dumps(payload.model_dump()))
                except ValidationError as ve:
                    logger.error(f"Schema validation failed: {ve}")
                except Exception as e:
                    logger.exception(f"Unhandled worker exception: {e}")
                finally:
                    self.queue.task_done()

            # Explicit memory reclamation
            gc.collect()

    async def enqueue(self, file_path: Path):
        await self.queue.put(file_path)

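    async def shutdown(self):
        # Drain first: join() returns only once every queued item is marked
        # done. Setting the stop flag before that would let the worker exit
        # with items still queued, and join() would then wait forever.
        await self.queue.join()
        self._shutdown = True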

Observability & Compliance Metrics

An extraction engine is only defensible if its behaviour is measurable at run time. Three KPIs capture whether the subsystem is healthy without drowning operators in raw logs:

Extraction throughput — documents finalized per second per worker. A sustained drop signals oversized documents, disk contention, or a fallback tier being hit far more often than baseline.
Integrity rate — the fraction of documents whose SHA-256 matches the upstream manifest digest. Anything below 100% is a chain-of-custody incident, not a performance blip, and must page a human.
Quarantine velocity — dead-letter documents accumulated per minute. A rising slope usually means a whole custodian’s export shares one defect (a bad export profile, a uniform encryption wall) rather than isolated corruption.

Export these as counters and gauges so they can be scraped by Prometheus or an OpenTelemetry collector alongside the structured JSON logs:

python

from dataclasses import dataclass, field
from typing import Dict

# ExtractionStatus is the enum defined in the worker implementation above.

@dataclass
class ExtractionMetrics:
    """Snapshot of the three core KPIs for one worker interval."""
    extracted: int = 0
    integrity_verified: int = 0
    quarantined: int = 0
    _last_flush: float = field(default=0.0)

    def record(self, status: ExtractionStatus, hash_ok: bool) -> None:
        if status == ExtractionStatus.QUARANTINED:
            self.quarantined += 1
        else:
            self.extracted += 1
        if hash_ok:
            self.integrity_verified += 1

    def as_labels(self) -> Dict[str, float]:
        total = max(self.extracted + self.quarantined, 1)
        return {
            "extraction_throughput_docs": float(self.extracted),
            "integrity_rate": self.integrity_verified / total,
            "quarantine_velocity_docs": float(self.quarantined),
        }

Alerting thresholds should trip before memory exhaustion rather than after: watch worker RSS against the 2 GB ceiling, queue depth against its bound, and integrity rate against 1.0. When any of the three crosses its threshold, the correct response is to pause intake and drain in-flight work, exactly as the ingestion layer does under backpressure.
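A threshold check along these lines could drive the pause decision. The warning ratios are assumptions chosen for illustration (the source fixes only the 2 GB ceiling, the queue bound, and the 1.0 integrity target), and `WorkerHealth` and `breached_thresholds` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

RSS_CEILING_BYTES = 2 * 1024 ** 3  # the 2 GB per-worker ceiling from the text


@dataclass(frozen=True)
class WorkerHealth:
    rss_bytes: int
    queue_depth: int
    queue_bound: int
    integrity_rate: float


def breached_thresholds(
    health: WorkerHealth,
    rss_warn_ratio: float = 0.85,    # assumption: warn at 85% of the ceiling
    queue_warn_ratio: float = 0.90,  # assumption: warn at 90% of the bound
) -> List[str]:
    """Return the names of the signals that have crossed their warning level."""
    breaches: List[str] = []
    if health.rss_bytes >= rss_warn_ratio * RSS_CEILING_BYTES:
        breaches.append("rss")
    if health.queue_depth >= queue_warn_ratio * health.queue_bound:
        breaches.append("queue_depth")
    if health.integrity_rate < 1.0:
        breaches.append("integrity_rate")
    return breaches
```

A non-empty result is the cue to pause intake and drain in-flight work; warning below the hard limits leaves room to react before the ceiling is actually reached.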

Conclusion

A production PDF and text extraction engine earns its place in a defensible pipeline by making three guarantees simultaneously: identical input yields identical output for a fixed engine version, worker memory stays bounded across multi-day runs, and every document ends in exactly one place — a validated payload with its original hashes, or a dead-letter manifest with a documented reason. Bounded async batching keeps the event loop responsive, the tiered fallback ladder recovers text that naive extraction would drop, and the KPI instrumentation turns those guarantees into signals an auditor can verify. The scaling limit is deliberate and known: a single worker is bounded by one host’s cores and 2 GB ceiling, so throughput grows by adding stateless workers behind a shared broker, never by relaxing the memory or integrity controls.

Frequently Asked Questions

Why not just OCR every PDF instead of maintaining a tiered fallback?

Because OCR discards fidelity and defensibility that native extraction preserves. A PDF with a real text layer yields exact, coordinate-accurate characters; rasterizing and OCR-ing it introduces recognition error, drops font and position metadata, and costs an order of magnitude more compute per page. OCR is the last resort for image-only or unrecoverable documents, not the default. Running the primary pdfplumber path first — then pikepdf repair — means the vast majority of documents never touch OCR, which keeps both accuracy and throughput high while reserving expensive recognition for the files that genuinely need it.

How do I keep worker RSS under the 2 GB ceiling when a single document is enormous?

The ceiling applies to the worker process, so it is defended at two levels: the batch and the individual document. Size batches down for known-heavy inputs, scope every PDF handle inside a context manager so it is released before the next open, and run the explicit gc.collect() at each batch boundary to break pdfminer reference cycles. For a pathological single document (thousands of pages, or embedded high-resolution scans), process it in a dedicated small-batch worker. If it still breaches the ceiling, treat it as a fallback candidate and rasterize page ranges incrementally rather than loading the whole object graph at once.
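Incremental rasterization can be sketched as a page-window loop. The helper names are hypothetical, and the rasterizer and recognizer are passed in as callables so the example stays dependency-free; in practice `rasterize` could wrap pdf2image's `convert_from_path` with its `first_page` and `last_page` arguments, and `recognize` could wrap `pytesseract.image_to_string`.

```python
from typing import Callable, Iterator, List, Tuple


def page_windows(page_count: int, window: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed, inclusive (first_page, last_page) ranges."""
    if window < 1:
        raise ValueError("window must be at least 1")
    for first in range(1, page_count + 1, window):
        yield first, min(first + window - 1, page_count)


def ocr_in_windows(
    page_count: int,
    rasterize: Callable[[int, int], List[object]],
    recognize: Callable[[object], str],
    window: int = 10,
) -> str:
    """OCR a document one window at a time so only `window` images are alive at once."""
    parts: List[str] = []
    for first, last in page_windows(page_count, window):
        images = rasterize(first, last)
        parts.extend(recognize(image) for image in images)
        del images  # drop the page images before rasterizing the next window
    return "\n".join(parts)
```

Peak image memory then scales with the window size rather than the page count, at the cost of one rasterizer invocation per window.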

What should happen to an encrypted PDF with no available key?

It must short-circuit straight to quarantine with the ENCRYPTED_NO_KEY category — never loop through rasterization and OCR. OCR of an encrypted or blank-rendered page produces noise that would pollute privilege review and inflate false-positive text. The dead-letter manifest records the digest and category so the document can be re-fed once a decryption key is supplied, keeping the ingestion count reconcilable in the meantime.

How do I guarantee identical output across reprocessing runs for a litigation hold?

Pin the engine version, the fallback thresholds, and the OCR language/model, and record all three in every payload. Determinism holds when the same input bytes, the same code path, and the same OCR configuration produce the same text; a version bump to pdfplumber, pikepdf, or the Tesseract model can legitimately change output, which is exactly why the engine_version field travels with each record. Reprocessing a hold therefore means reprocessing under the recorded version, and the SHA-256 on the source guarantees you are re-extracting the identical bitstream.
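One way to make the pinned configuration travel with each record is to snapshot dependency versions at startup. This is a sketch, not the engine's payload schema: `dependency_versions` and `build_provenance` are hypothetical, and the package list is an assumption based on the libraries named in this article.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable

DEFAULT_PACKAGES = ("pdfplumber", "pikepdf", "pytesseract", "pdf2image")


def dependency_versions(packages: Iterable[str] = DEFAULT_PACKAGES) -> Dict[str, str]:
    """Look up installed versions; missing packages are recorded, not raised."""
    versions: Dict[str, str] = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = "not-installed"
    return versions


def build_provenance(
    engine_version: str,
    ocr_lang: str,
    packages: Iterable[str] = DEFAULT_PACKAGES,
) -> Dict[str, object]:
    """Bundle the settings that determine output so they can be stored per record."""
    return {
        "engine_version": engine_version,
        "ocr_lang": ocr_lang,
        "dependencies": dependency_versions(packages),
    }
```

Comparing the stored provenance against the current environment before reprocessing a hold makes a silent library upgrade visible instead of letting it change the extracted text unnoticed.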

Native File Ingestion Pipelines — content-signature MIME detection and format-family routing that feeds classified PDFs into this engine.
Cryptographic Hash Generation — the streaming SHA-256/MD5 digest protocol this engine runs before any extraction.
Async Batch Processing Design — the bounded-queue, dead-letter concurrency contract this worker plugs into.
Extracting Embedded Text with pdfplumber at Scale — deep dive on silent truncation and OOM failure modes of the primary extraction path.
Production Compliance Frameworks — the evidentiary standards the integrity and quarantine guarantees must satisfy.
