Debugging Silent Text Truncation and OOM Exhaustion in pdfplumber at Scale

When pdfplumber scales across high-volume PDF & Text Extraction Engines, the two failures that break production almost never surface as a clean Python exception. The first is an unhandled SIGKILL (exit code 137, no traceback) when a worker’s resident set size breaches its cgroup memory limit and the kernel OOM-killer reaps it mid-document. The second is silent text truncation: extract_text() returns fewer characters than the page actually contains, or skips pages entirely, and the payload still passes schema validation. Both hit the EDRM Processing stage where native files become searchable review content, and both violate the same compliance boundary: forensic completeness. Downstream privilege review, indexing, and hash-based deduplication strategies all assume every recoverable character was captured; a partial extract leaves an indefensible gap in a litigation hold, and it is the gap opposing counsel will probe first.

Diagnostic Log Signatures

Neither failure announces itself in the extraction library. The OOM kill happens in kernel space, and the truncation happens inside pdfminer.six with warnings suppressed by default. You confirm them from the surrounding evidence — the kernel ring buffer, the shell’s exit status, and the audit record — not from a stack trace:

text

# Kernel OOM-killer fired because the worker breached its cgroup memory limit
$ dmesg | tail -n 3
[184521.339912] Memory cgroup out of memory: Killed process 4471 (python)
                total-vm:3418112kB, anon-rss:2094880kB, file-rss:0kB
[184521.341002] oom_reaper: reaped process 4471 (python), now anon-rss:0kB

# The worker exits with a signal, not an exception — no Python traceback at all
$ echo $?
137                         # 128 + SIGKILL(9): the process was killed, never raised

# pdfminer stalls on one heavy object, then the process simply vanishes
DEBUG pdfminer.pdfinterp: processing /Page 812 /XObject Im3 (14.7 MB inline stream)
DEBUG pdfminer.cmapdb:   get_cmap: bad /ToUnicode CMap -> defaulting to last glyph
WARN  esi.extractor:     page 812 extract_text() -> 0 chars (page 811 was 3140 chars)

# Silent truncation only surfaces later, in the structured audit JSON
{"source_file":"CUST-0442.pdf","pages_processed":811,"truncation_detected":true,
 "extraction_status":"complete"}

Two lines are diagnostic. Exit 137 with no traceback is always a kill signal, not application logic — pair it with the dmesg Memory cgroup out of memory line and the cause is unambiguous. A page whose character count collapses to zero between two adjacent, similar pages (3140 chars then 0 chars) alongside a bad /ToUnicode CMap note is truncation, not a genuinely blank page. Symptom checklist:

Worker RSS climbs linearly across a batch and never falls after a garbage-collection cycle.
A worker disappears with exit code 137 and no Python exception in the application log.
dmesg / container orchestrator logs show oom-kill, oom_reaper, or cgroup memory-pressure events.
The audit JSON reports pages_processed below the document’s real page count with truncation_detected: true.
pdfminer debug output stalls at a specific /XObject, /Stream, or /ToUnicode object before the process ends.
Reprocessing the same file reproduces the same short character count — deterministic, which rules out transient I/O.
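
The exit-status arithmetic in the log excerpt (128 plus the signal number) can be checked mechanically. The sketch below assumes a supervisor that sees either a shell-style status such as 137 or the negative return code that Python's subprocess module reports for a signal death; the function names killing_signal and was_sigkilled are illustrative, not part of any library.

```python
import signal
from typing import Optional


def killing_signal(returncode: int) -> Optional[signal.Signals]:
    """Return the signal that ended a process, or None for an ordinary exit.

    subprocess reports death-by-signal as a negative return code (-9),
    while a POSIX shell reports 128 + signal number (137).
    """
    if returncode < 0:
        signum = -returncode
    elif returncode > 128:
        signum = returncode - 128
    else:
        return None
    try:
        return signal.Signals(signum)
    except ValueError:
        return None  # not a known signal number on this platform


def was_sigkilled(returncode: int) -> bool:
    """True when the worker was killed outright and could not have logged anything."""
    return killing_signal(returncode) == signal.SIGKILL
```

A status above 128 is only a convention, since a program can also call exit(137) itself, so treat a positive result as a prompt to check dmesg rather than as proof of an OOM kill.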

Root-Cause Breakdown

pdfplumber delegates parsing to pdfminer.six, and both failure modes trace to how that parser holds and decodes objects. Four contributing factors turn a working script into an unstable worker at ESI scale:

Whole-document heap loading with cross-page retention. On pdfplumber.open(), the library reads the entire cross-reference table, font dictionaries, and resource streams into the Python heap. Iterating for page in pdf.pages: retains references to prior page objects to preserve color-space and glyph mappings, so resident memory grows monotonically instead of releasing per page.
Synchronous decode of heavy embedded streams. /XObject image streams, high-resolution embedded scans, and /Form objects with overlapping text layers are decompressed synchronously into memory. A handful of multi-megabyte inline streams fragment the heap and push RSS past the cgroup ceiling between garbage-collection cycles, at which point the OOM-killer intervenes with no chance for a Python-level handler to run.
Malformed /ToUnicode CMaps causing silent truncation. When a font’s character-to-Unicode map is damaged or absent, the extraction engine cannot resolve glyphs and quietly yields an empty or partial string for the page rather than raising. Overlapping text operators that share an identical bounding box compound this: the engine defaults to the last rendered string and drops the rest. In redacted exhibits, underlying text may be visually masked but not stripped, so the omission is legally material yet passes validation.
Suppressed PDFSyntaxError warnings. pdfminer logs structural defects at a level most pipelines never surface, so a corrupted object graph degrades extraction silently. Without explicit page-count reconciliation, a run that dropped pages looks identical to a clean one.
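
One way to stop parser warnings from disappearing is to collect them per document. The sketch below uses only the standard logging module and assumes pdfminer.six emits its records through loggers named under "pdfminer" (for example "pdfminer.pdfinterp"); the WarningCollector class and attach_parser_warning_collector helper are hypothetical names introduced for illustration.

```python
import logging
from collections import Counter
from typing import List


class WarningCollector(logging.Handler):
    """Collect WARNING-or-higher records so a run can report parser defects."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.counts: Counter = Counter()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        # Count by logger name so the audit record can say which module complained.
        self.counts[record.name] += 1
        self.messages.append(record.getMessage())


def attach_parser_warning_collector(logger_name: str = "pdfminer") -> WarningCollector:
    """Attach a collector to the parser's logger hierarchy and return it."""
    collector = WarningCollector()
    parser_logger = logging.getLogger(logger_name)
    # Make sure warnings are not filtered out by a stricter inherited level.
    parser_logger.setLevel(logging.WARNING)
    parser_logger.addHandler(collector)
    return collector
```

After a document finishes, a non-empty collector.counts could be copied into the audit record so a degraded parse is distinguishable from a clean one.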

Remediation Architecture

The fix is not a bigger memory limit; it is a worker that caps its own memory, validates completeness, and records what it did. Five controls close the gaps above: a soft RLIMIT_AS ceiling that converts an uncatchable OOM kill into a catchable MemoryError; a per-page tolerance fallback for pages that return empty; explicit dereferencing plus periodic gc.collect() to break pdfminer reference cycles; a post-run page-count reconciliation that sets truncation_detected; and a streaming SHA-256 anchored to the cryptographic hash generation protocol so every audit record ties back to the unaltered source bitstream.

python

import json
import hashlib
import resource
import logging
import traceback
import gc
import pdfplumber
from typing import Dict, List, Tuple, Optional

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("esi.extractor")


def compute_sha256(file_path: str) -> str:
    """Stream a chain-of-custody digest without materializing the file in memory."""
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def extract_with_audit(
    pdf_path: str,
    max_rss_mb: int = 1024,
    min_expected_pages: Optional[int] = None,
) -> Tuple[Dict[str, object], str]:
    """Extract text under a soft memory cap, validate completeness, and emit an audit record."""
    audit: Dict[str, object] = {
        "source_file": str(pdf_path),
        "sha256": compute_sha256(pdf_path),
        "pages_processed": 0,
        "pages_failed": [],
        "truncation_detected": False,
        "extraction_status": "pending",
        "memory_limit_mb": max_rss_mb,
    }
    extracted_text: List[str] = []

    # Convert an uncatchable OOM kill into a catchable MemoryError. RLIMIT_AS caps
    # virtual address space (not RSS), so max_rss_mb is an approximate ceiling.
    # Only the soft limit is set; the cgroup hard limit remains the backstop.
    try:
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        resource.setrlimit(resource.RLIMIT_AS, (max_rss_mb * 1024 * 1024, hard))
    except (ValueError, OSError) as e:
        logger.warning("Memory limit enforcement skipped: %s", e)

    try:
        with pdfplumber.open(pdf_path) as pdf:
            total_pages = len(pdf.pages)

            for idx, page in enumerate(pdf.pages, start=1):
                try:
                    page_text = page.extract_text()

                    # Empty return often means a malformed CMap, not a blank page.
                    # Retry with tolerances relaxed beyond pdfplumber's default of 3
                    # before conceding the page.
                    if page_text is None or not page_text.strip():
                        logger.warning("page %d empty; applying tolerance fallback", idx)
                        page_text = page.extract_text(x_tolerance=5, y_tolerance=5)

                    if page_text and page_text.strip():
                        extracted_text.append(page_text)
                    else:
                        audit["pages_failed"].append(idx)
                        logger.warning("page %d yielded no text after fallback", idx)

                    audit["pages_processed"] += 1

                    # Break pdfminer's cross-page reference retention explicitly.
                    page.flush_cache()
                    del page
                    if idx % 50 == 0:
                        gc.collect()

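                except MemoryError:
                    # MemoryError subclasses Exception; re-raise it so the outer
                    # handler records oom_terminated instead of skipping the page.
                    raise
                except Exception as e:
                    logger.error("extraction failed on page %d: %s", idx, e)
                    audit["pages_failed"].append(idx)
                    traceback.print_exc()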

        # Reconcile against the real page count and the failed-page list; a page
        # that returned no text counts as a gap even though it did not raise.
        audit["extraction_status"] = "complete"
        audit["truncation_detected"] = (
            audit["pages_processed"] != total_pages or bool(audit["pages_failed"])
        )
        if min_expected_pages and audit["pages_processed"] < min_expected_pages:
            audit["extraction_status"] = "validation_failed"
            logger.critical(
                "page count mismatch: expected >=%d, got %d",
                min_expected_pages, audit["pages_processed"],
            )

    except MemoryError:
        audit["extraction_status"] = "oom_terminated"
        logger.critical("soft memory cap breached; halting to preserve pipeline stability")
    except Exception as e:
        audit["extraction_status"] = "fatal_error"
        logger.critical("pipeline failure: %s", e)

    logger.info(json.dumps(audit))
    return audit, "\n".join(extracted_text)

The RLIMIT_AS soft cap is the pivot: instead of the kernel silently reaping the process, an allocation that would push the process’s address space past max_rss_mb raises MemoryError, which the worker catches, marks oom_terminated, and routes to a fallback tier, preserving the audit record an exit-137 kill would have destroyed. Because RLIMIT_AS counts virtual address space rather than resident memory, the cap trips earlier than an RSS limit of the same size, so tune max_rss_mb against observed worker behavior. This mirrors the dead-letter accountability model in async batch processing design: a document either reaches the index with a validated payload or lands in a dead-letter set with a documented reason, never vanishing between the two.
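
A soft limit set with setrlimit stays in force for the rest of the process, which matters when one worker handles many documents with different ceilings. The sketch below is one possible way to scope the cap to a single extraction and restore the previous soft limit afterwards; soft_address_space_limit is a hypothetical helper name, and the behavior assumes a Linux host where RLIMIT_AS is enforced.

```python
import resource
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def soft_address_space_limit(limit_mb: int) -> Iterator[bool]:
    """Apply a soft RLIMIT_AS for the duration of the block, then restore it.

    Yields True if the limit was applied, False if the platform refused it.
    RLIMIT_AS caps virtual address space, which is usually larger than RSS.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = limit_mb * 1024 * 1024
    applied = False
    try:
        resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
        applied = True
    except (ValueError, OSError):
        pass  # e.g. the requested soft limit exceeds the hard limit
    try:
        yield applied
    finally:
        if applied:
            # Raising the soft limit back up to its old value is permitted
            # because it never exceeds the unchanged hard limit.
            resource.setrlimit(resource.RLIMIT_AS, (old_soft, hard))
```

Wrapping the per-document extraction in this context manager keeps the yielded flag available for the audit record, so a run where the cap could not be applied is documented rather than assumed.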

Defensible Recovery & Validation Protocol

When truncation or an OOM event fires, recovery prioritizes audit preservation over throughput. The flowchart below shows how the memory-capped extractor detects failure and drives the recovery path.

Isolate and hash. Quarantine the affected PDF and verify its SHA-256 matches the ingestion manifest to rule out bit-rot or corruption in transit before blaming the extractor.
Reconstruct the audit trail. Parse the structured JSON record and cross-reference pages_failed and truncation_detected against the original page count; flag any discrepancy for manual review before it reaches downstream indexing.
Route to a fallback tier. Re-run failed documents through an alternative parser — PyMuPDF for a stubborn text layer, rasterize-plus-OCR for image-only pages — under strict version control so the recovered output remains reproducible.
Document for compliance. Attach the audit JSON to the processing manifest, recording the OOM threshold, memory-limit configuration, and fallback path so the run satisfies EDRM defensibility guidance during a privilege-log challenge.
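
The recovery steps above can be sketched as a small triage function over one audit record. The disposition labels and the triage name are hypothetical, and the sketch assumes the manifest digest and the true page count are supplied by the caller.

```python
from typing import Any, Mapping


def triage(audit: Mapping[str, Any], manifest_sha256: str, total_pages: int) -> str:
    """Map one audit record to a recovery disposition (labels are illustrative)."""
    # Step 1: a hash mismatch means the file changed after ingestion, so the
    # extractor is not the first suspect.
    if audit.get("sha256") != manifest_sha256:
        return "quarantine_hash_mismatch"
    # The worker hit its memory cap or crashed: send to the alternative parser tier.
    if audit.get("extraction_status") in {"oom_terminated", "fatal_error"}:
        return "fallback_parser"
    # Steps 2 and 3: any failed page or page-count shortfall needs a second pass
    # and a human check before the text reaches downstream indexing.
    shortfall = audit.get("pages_processed", 0) != total_pages
    if audit.get("pages_failed") or audit.get("truncation_detected") or shortfall:
        return "fallback_parser_then_manual_review"
    return "index"
```

Returning a label instead of acting directly keeps the decision itself auditable: the label can be written into the processing manifest next to the audit JSON.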

For containerized deployments, set --memory / --memory-swap in Docker, or the equivalent Kubernetes resource limits, slightly above max_rss_mb so the soft cap fires before the cgroup limit does, then monitor RSS through cgroup metrics and trip a circuit breaker before the kernel OOM-killer ever engages. See the Python resource module documentation for platform-specific limit behavior and the pdfplumber documentation for layout-analysis tuning.
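
A minimal sketch of the circuit-breaker check follows. It assumes a cgroup v2 unified hierarchy, where the container's usage and limit are exposed as memory.current and memory.max under /sys/fs/cgroup; cgroup v1 hosts use different file names. The 0.85 threshold and the function names are illustrative assumptions.

```python
from pathlib import Path
from typing import Optional

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumption: cgroup v2 unified hierarchy


def parse_cgroup_bytes(raw: str) -> Optional[int]:
    """Parse memory.current / memory.max content; 'max' means no limit."""
    value = raw.strip()
    return None if value == "max" else int(value)


def memory_pressure(current: int, limit: Optional[int]) -> float:
    """Fraction of the cgroup limit in use; 0.0 when the cgroup is unlimited."""
    if not limit:
        return 0.0
    return current / limit


def should_trip(root: Path = CGROUP_ROOT, threshold: float = 0.85) -> bool:
    """True when usage crosses the threshold, so the worker can stop taking work."""
    try:
        current = parse_cgroup_bytes((root / "memory.current").read_text())
        limit = parse_cgroup_bytes((root / "memory.max").read_text())
    except (OSError, ValueError):
        return False  # metrics unavailable: leave the decision to other controls
    if current is None:
        return False
    return memory_pressure(current, limit) >= threshold
```

A worker could call should_trip() between documents and refuse new work while it returns True, which leaves the in-flight document's audit record intact.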

Verification Checklist

A soft RLIMIT_AS cap is set so an over-allocation raises a catchable MemoryError instead of an exit-137 kill.
Every page that returns empty is retried with relaxed x_tolerance/y_tolerance before it is recorded as failed.
Page objects are explicitly dereferenced and gc.collect() runs on a fixed interval to break pdfminer reference cycles.
pages_processed is reconciled against the true page count and truncation_detected is set whenever they differ.
The audit JSON carries the source SHA-256, failed-page list, status, and memory-limit configuration for every document.
Container memory limits sit just above max_rss_mb so the soft cap fires first, with RSS scraped from cgroup metrics and a circuit breaker tripping before the kernel intervenes.
A full re-run shows no exit-137 kills and no truncation_detected: true records outside the documented quarantine set.
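
The last checklist item can be automated over the collected audit records. The sketch below assumes the records have been gathered as one JSON object per line and that the quarantine set is a set of source_file values; both the storage format and the function name are assumptions for illustration.

```python
import json
from typing import Iterable, List, Set


def unquarantined_problems(jsonl_lines: Iterable[str], quarantine: Set[str]) -> List[str]:
    """Return source files that are truncated or incomplete but not quarantined."""
    offenders: List[str] = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        name = record.get("source_file", "<unknown>")
        flagged = (
            bool(record.get("truncation_detected"))
            or record.get("extraction_status") != "complete"
        )
        if flagged and name not in quarantine:
            offenders.append(name)
    return offenders
```

An empty result means every flagged document is accounted for in the quarantine set; any other result names the documents that still need a disposition.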

Conclusion

Silent truncation and OOM exhaustion are defensibility failures, not just reliability bugs: one drops legally material text, the other destroys the audit record that proves nothing was dropped. Capping worker memory with a soft RLIMIT_AS turns an uncatchable kill into a catchable, loggable event; the per-page tolerance fallback rescues text that a malformed CMap would silently discard; and reconciling processed pages against the real count converts an invisible gap into an explicit truncation_detected flag. With every outcome anchored to a source SHA-256 and written to an immutable audit trail, pdfplumber scales across multi-terabyte matters while keeping the forensic completeness the Processing stage exists to guarantee.

Frequently Asked Questions

Why does my worker die with exit code 137 but no Python traceback?

Exit 137 is 128 + 9 — the process received SIGKILL. No Python code runs after a SIGKILL, which is why there is never a traceback: the kernel’s OOM-killer terminated the worker because its resident set breached the cgroup memory limit while pdfminer decoded a heavy /XObject or font stream. Confirm it with dmesg, where you will see a Memory cgroup out of memory: Killed process line matching the worker PID. The fix is to set a soft RLIMIT_AS below the cgroup hard limit so the over-allocation raises a catchable MemoryError your handler can log and route, rather than letting the kernel reap the process silently.

How do I detect truncation when extract_text() returns without raising?

Never trust a run that did not raise — reconcile counts instead. Capture len(pdf.pages) before iterating and compare it to the pages that actually produced text; any shortfall sets truncation_detected. A page whose character count collapses to zero between two similar adjacent pages, especially alongside a bad /ToUnicode CMap debug note, is truncation rather than a genuinely blank page. Retry those pages with relaxed x_tolerance and y_tolerance before conceding them, and record the failed page numbers in the audit JSON so a reviewer can verify completeness against the original document.
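
The character-count collapse described above can be screened for with a simple heuristic over per-page counts. This is a sketch under stated assumptions: the 500-character threshold is arbitrary, the function name is hypothetical, and a flagged page is a candidate for review rather than proof of truncation.

```python
from typing import List, Sequence


def collapsed_pages(char_counts: Sequence[int], min_neighbor_chars: int = 500) -> List[int]:
    """Return 1-indexed pages with zero characters next to a text-heavy page."""
    suspects: List[int] = []
    for i, count in enumerate(char_counts):
        if count != 0:
            continue
        neighbours = []
        if i > 0:
            neighbours.append(char_counts[i - 1])
        if i + 1 < len(char_counts):
            neighbours.append(char_counts[i + 1])
        # An empty page beside a dense page is suspicious; an empty page
        # among other sparse or empty pages is more likely a real blank.
        if any(n >= min_neighbor_chars for n in neighbours):
            suspects.append(i + 1)
    return suspects
```

Because only immediate neighbours are compared, pages in the middle of a long empty run are not flagged; the page-count reconciliation remains the primary control.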

Will lowering the batch size alone stop the OOM kills?

It reduces peak RSS but does not make the worker defensible on its own. A single pathological document — thousands of pages or embedded high-resolution scans — can still breach the ceiling inside one batch. Combine smaller batches with explicit per-page dereferencing, a periodic gc.collect() to break pdfminer reference cycles, and the soft RLIMIT_AS cap so the worst case degrades into a logged fallback instead of a SIGKILL. For a document that still breaches the cap in isolation, process it in a dedicated small-batch worker and rasterize page ranges incrementally rather than loading the whole object graph at once.
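
Processing a pathological document in page ranges needs a way to split the page numbers into bounded batches. The helper below is a minimal sketch; page_batches is a hypothetical name, and page numbers are 1-indexed to match the page numbering used in the audit record.

```python
from typing import Iterator, List


def page_batches(total_pages: int, batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of 1-indexed page numbers, batch_size at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_pages + 1, batch_size):
        yield list(range(start, min(start + batch_size, total_pages + 1)))
```

Each batch can then be handed to a short-lived worker process so memory is returned to the operating system when the worker exits. pdfplumber.open() is understood to accept a pages argument listing the 1-indexed pages to parse, which would be one way to pass a batch through, but confirm that against the installed version.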

PDF & Text Extraction Engines — the tiered extraction subsystem whose primary pdfplumber path these failure modes affect.
Cryptographic Hash Generation — the streaming SHA-256 anchor recorded in every extraction audit record.
Async Batch Processing Design — the bounded-queue and dead-letter model that catches an oom_terminated document instead of losing it.
Hash-Based Deduplication Strategies — the downstream stage that inherits any silent gap a truncated extract leaves behind.
Production Compliance Frameworks — the forensic-completeness and reproducibility standards these controls must satisfy.
