Slug: fast-pdf-model
Version: 1.0.0
Infer: Yes

Overview

PDF to KDDB Model

The PDF to KDDB model converts PDF documents into the Kodexa Document Database (KDDB) format, extracting text and structure for further processing. It is optimized for speed and memory efficiency, especially with large documents.

How It Works

  1. The model reads the input PDF document
  2. It processes the PDF in memory-efficient chunks, handling 10 pages at a time
  3. For each chunk, it extracts:
    • Text content with precise positioning
    • Page structure and layout
    • Word and line relationships
  4. The extracted content is organized into a hierarchical Kodexa document structure
  5. Original document metadata is preserved in the resulting KDDB document
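
The chunking in step 2 can be illustrated with a minimal sketch. This is not the model's actual code: the function name `page_chunks` and the zero-based, end-exclusive ranges are assumptions made for illustration; only the chunk size of 10 pages comes from the description above.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 10  # pages per chunk, as stated in the steps above


def page_chunks(total_pages: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) page ranges, end exclusive, covering every page once.

    Hypothetical helper: the real model's chunking code is not shown in this page.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_pages)


if __name__ == "__main__":
    # A 25-page document is split into three chunks.
    for start, end in page_chunks(25):
        print(f"processing pages {start + 1}-{end}")
```

Only one chunk of pages needs to be held in memory at a time, which is the property the steps above rely on for large documents.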

Options Configuration

Option                     Description
should_copy_external_data  When enabled, preserves any external data attached to the original document
should_copy_labels         When enabled, preserves document labels from the original document

Document Structure

The resulting KDDB document has the following hierarchical structure:

document
└── page
    └── content-area
        └── line
            └── word

Each node includes:

  • Bounding box coordinates: Precise positioning information
  • Text content: For word nodes, the extracted text
  • Spatial mixin: Enhanced spatial information for layout analysis
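
To make the hierarchy concrete, here is a small self-contained sketch that models it with plain Python dataclasses. It does not use the Kodexa SDK: the `Node` class, the `(x0, y0, x1, y1)` bounding-box layout, and the idea that a parent's box is the union of its children's boxes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


@dataclass
class Node:
    """Hypothetical stand-in for a KDDB content node."""
    node_type: str
    bbox: Optional[BBox] = None
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def union_bbox(boxes: List[BBox]) -> BBox:
    """Smallest box that encloses every box in a non-empty list."""
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def build_line(words: List[Tuple[str, BBox]]) -> Node:
    """Create a line node whose children are word nodes carrying the text."""
    line = Node("line", bbox=union_bbox([box for _, box in words]))
    for text, box in words:
        line.add(Node("word", bbox=box, content=text))
    return line


def build_document(pages: List[List[List[Tuple[str, BBox]]]]) -> Node:
    """Assemble document -> page -> content-area -> line -> word."""
    document = Node("document")
    for page_lines in pages:
        page = document.add(Node("page"))
        lines = [build_line(words) for words in page_lines if words]
        if not lines:
            continue  # an empty page gets no content-area in this sketch
        area = page.add(Node("content-area", bbox=union_bbox([l.bbox for l in lines])))
        area.children.extend(lines)
    return document


def first_branch(node: Node) -> List[str]:
    """Node types found by following the first child at each level."""
    path = [node.node_type]
    while node.children:
        node = node.children[0]
        path.append(node.node_type)
    return path
```

Calling `first_branch` on a document built this way returns the five levels of the tree shown above, from `document` down to `word`.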

Process Flow

[Diagram: process flow]

Performance Considerations

The Fast PDF model is optimized for performance in several ways:

  • Chunked Processing: Divides large documents into manageable chunks
  • Parallel Processing: Leverages multiprocessing for faster extraction
  • Memory Efficiency: Uses streaming approaches to minimize memory usage
  • Optimized Text Extraction: Focused on extracting only needed information
  • Multi-threading: Utilizes multiple CPU cores when available
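
The combination of chunking and parallel extraction might look roughly like the sketch below. It is an illustration under stated assumptions, not the model's implementation: `process_in_parallel` and `fake_extract` are hypothetical names, and a standard-library thread pool stands in for whatever worker mechanism the model actually uses. `ProcessPoolExecutor` offers the same `map` interface if process-based workers are preferred.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def process_in_parallel(
    total_pages: int,
    extract_chunk: Callable[[int, int], List[str]],
    chunk_size: int = 10,
    max_workers: int = 4,
) -> List[str]:
    """Split pages into chunks, extract each chunk on a worker, keep page order."""
    starts = list(range(0, total_pages, chunk_size))
    ends = [min(start + chunk_size, total_pages) for start in starts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so pages stay sorted
        # even if chunks finish out of order.
        chunk_results = list(pool.map(extract_chunk, starts, ends))
    return [item for chunk in chunk_results for item in chunk]


def fake_extract(start: int, end: int) -> List[str]:
    """Placeholder for real per-chunk extraction (hypothetical)."""
    return [f"page-{number + 1}" for number in range(start, end)]


if __name__ == "__main__":
    pages = process_in_parallel(25, fake_extract)
    print(len(pages), pages[0], pages[-1])
```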

Use Cases

This model is particularly useful for:

  • Document Processing Pipelines: Converting PDFs for further processing and analysis
  • Text Extraction: Extracting text content while preserving layout information
  • OCR Results Processing: Processing PDFs with embedded OCR text
  • Document Structure Analysis: Preparing documents for structure-aware processing
  • Large Document Handling: Efficiently processing large PDF files
  • Batch Processing: Converting multiple PDFs in high-volume scenarios

Example Usage

To convert PDF documents while preserving both external data and labels:

should_copy_external_data: true
should_copy_labels: true

Technical Details

  • The model uses pdfplumber for text extraction with positioning
  • Processing is optimized for multicore systems
  • Temporary files are used to minimize memory usage
  • The model handles large PDFs through chunking
  • Spatial information is added to enable layout analysis
  • Original document metadata is preserved in the conversion
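
As a rough sketch of positioned text extraction with pdfplumber, the code below reads words from a range of pages and groups them into lines. The grouping rule (words whose `top` values differ by at most a small tolerance share a line) and the names `group_words_into_lines` and `extract_lines` are assumptions for illustration; the model's own grouping logic is not documented here. pdfplumber is a third-party package and is imported only when `extract_lines` is called.

```python
from typing import Dict, List


def group_words_into_lines(words: List[dict], tolerance: float = 2.0) -> List[List[dict]]:
    """Group word dicts into lines by similar 'top', then order each line by 'x0'.

    Expects the keys produced by pdfplumber's Page.extract_words(),
    of which only 'top' and 'x0' are used here.
    """
    lines: List[List[dict]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Compare against the first word of the current line.
        if lines and abs(lines[-1][0]["top"] - word["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]


def extract_lines(pdf_path: str, start: int, end: int) -> Dict[int, List[List[dict]]]:
    """Extract grouped lines for pages[start:end] (zero-based, end exclusive)."""
    import pdfplumber  # third-party; install separately

    result: Dict[int, List[List[dict]]] = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[start:end]:
            # page_number is 1-based in pdfplumber.
            result[page.page_number] = group_words_into_lines(page.extract_words())
    return result
```

Each word dict returned by `extract_words()` also carries `x1` and `bottom`, which together with `x0` and `top` give the bounding box used for spatial information.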

Inference Options

The following options can be configured when using this model for inference:

Name                       Label                      Type     Description                                          Default  Required
should_copy_external_data  Should Copy External Data  boolean  Copy the existing external data to the new document  False    No
should_copy_labels         Should Copy Labels         boolean  Copy the existing labels to the new document         False    No
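
The defaults listed above can be expressed as a small helper that merges caller-supplied values with them. This is a sketch only: `resolve_options` is a hypothetical function, not part of the Kodexa platform, and the validation rules (reject unknown names, require booleans) are assumptions based on the types listed for these options.

```python
from typing import Dict, Mapping, Optional

# Defaults taken from the inference options listed above.
DEFAULT_OPTIONS: Dict[str, bool] = {
    "should_copy_external_data": False,
    "should_copy_labels": False,
}


def resolve_options(overrides: Optional[Mapping[str, bool]] = None) -> Dict[str, bool]:
    """Return the effective options: defaults updated with validated overrides."""
    options = dict(DEFAULT_OPTIONS)
    for name, value in (overrides or {}).items():
        if name not in DEFAULT_OPTIONS:
            raise KeyError(f"unknown option: {name}")
        if not isinstance(value, bool):
            raise TypeError(f"option {name} must be a boolean")
        options[name] = value
    return options


if __name__ == "__main__":
    # Both options are optional, so an empty call yields the defaults.
    print(resolve_options())
    print(resolve_options({"should_copy_labels": True}))
```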

Model Details

  • Provider: Kodexa