Slug: fast-pdf-model
Version: 1.0.0
Infer: Yes

Overview

PDF to KDDB Model

The PDF to KDDB model converts PDF documents into the Kodexa Document Database (KDDB) format, extracting text and structure for further processing. It is optimized for speed and memory efficiency, especially with large documents.

How It Works

1. The model reads the input PDF document.
2. It processes the PDF in memory-efficient chunks of 10 pages at a time.
3. For each chunk, it extracts:
   - Text content with precise positioning
   - Page structure and layout
   - Word and line relationships
4. The extracted content is organized into a hierarchical Kodexa document structure.
5. Original document metadata is preserved in the resulting KDDB document.
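
The 10-page chunking step can be illustrated with a minimal sketch. This is not the model's actual code: the function name `page_chunks` and the zero-based, end-exclusive ranges are assumptions made for illustration; only the chunk size of 10 comes from the description above.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 10  # pages per chunk, as stated in the steps above


def page_chunks(total_pages: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) page index ranges: zero-based, end exclusive.

    Hypothetical helper; the real model's chunking code is not shown in the source.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_pages, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, total_pages)


if __name__ == "__main__":
    # A 25-page document is split into three chunks.
    for start, end in page_chunks(25):
        print(f"pages {start + 1}-{end}")
```

Because each range is produced lazily, only one chunk's worth of pages needs to be held in memory at a time.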

Options Configuration

Option	Description
should_copy_external_data	When enabled, preserves any external data attached to the original document
should_copy_labels	When enabled, preserves document labels from the original document

Document Structure

The resulting KDDB document has the following hierarchical structure:

document
└── page
    └── content-area
        └── line
            └── word

Each node includes:

- Bounding box coordinates: Precise positioning information
- Text content: For word nodes, the extracted text
- Spatial mixin: Enhanced spatial information for layout analysis
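
To make the hierarchy concrete, the sketch below models it with a plain Python dataclass. It does not use the Kodexa SDK: the `Node` class, its fields, and the sample coordinates are all hypothetical and only mirror the node types listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1); layout is an assumption


@dataclass
class Node:
    """Hypothetical stand-in for a KDDB content node (not the Kodexa API)."""
    node_type: str
    bbox: Optional[BBox] = None
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def all_text(self) -> str:
        """Join the text of every word node beneath this node, in order."""
        if self.node_type == "word":
            return self.text or ""
        parts = [child.all_text() for child in self.children]
        return " ".join(part for part in parts if part)


def build_example() -> Node:
    """Build document -> page -> content-area -> line -> word with made-up coordinates."""
    document = Node("document")
    page = document.add(Node("page", bbox=(0, 0, 612, 792)))
    area = page.add(Node("content-area", bbox=(72, 72, 540, 720)))
    line = area.add(Node("line", bbox=(72, 72, 160, 84)))
    line.add(Node("word", bbox=(72, 72, 110, 84), text="Hello"))
    line.add(Node("word", bbox=(115, 72, 160, 84), text="world"))
    return document


def first_path(node: Node) -> List[str]:
    """Return the node types met when following the first child at each level."""
    path = [node.node_type]
    while node.children:
        node = node.children[0]
        path.append(node.node_type)
    return path
```

Only word nodes carry text in this sketch, so the text of a line or page is derived by walking down to its words.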

Performance Considerations

The Fast PDF model is optimized for performance in several ways:

- Chunked Processing: Divides large documents into manageable chunks
- Parallel Processing: Leverages multiprocessing for faster extraction
- Memory Efficiency: Uses streaming approaches to minimize memory usage
- Optimized Text Extraction: Focused on extracting only needed information
- Multi-threading: Utilizes multiple CPU cores when available
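
As a rough sketch of how chunks might be processed concurrently while keeping pages in order, the example below uses a thread pool from the standard library. The source does not show the model's actual concurrency code, so `process_chunks`, `fake_extract`, and the worker count are assumptions; threads are used here only to keep the example portable.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
PageRange = Tuple[int, int]  # (start, end): zero-based, end exclusive


def process_chunks(
    ranges: Sequence[PageRange],
    extract: Callable[[PageRange], List[T]],
    max_workers: int = 4,
) -> List[T]:
    """Run `extract` on each page range concurrently; return results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever order the workers finish in.
        per_chunk = list(pool.map(extract, ranges))
    # Flatten the per-chunk lists into one list covering the whole document.
    return [item for chunk in per_chunk for item in chunk]


def fake_extract(page_range: PageRange) -> List[str]:
    """Placeholder extractor that just labels each page (hypothetical)."""
    start, end = page_range
    return [f"page-{n + 1}" for n in range(start, end)]


if __name__ == "__main__":
    print(process_chunks([(0, 10), (10, 20), (20, 25)], fake_extract))
```

For CPU-bound extraction in CPython, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and would be the closer match to the multiprocessing approach mentioned above.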

Use Cases

This model is particularly useful for:

- Document Processing Pipelines: Converting PDFs for further processing and analysis
- Text Extraction: Extracting text content while preserving layout information
- OCR Results Processing: Processing PDFs with embedded OCR text
- Document Structure Analysis: Preparing documents for structure-aware processing
- Large Document Handling: Efficiently processing large PDF files
- Batch Processing: Converting multiple PDFs in high-volume scenarios

Example Usage

To convert PDF documents while preserving both external data and labels:

should_copy_external_data: true
should_copy_labels: true

Technical Details

- The model uses pdfplumber for text extraction with positioning
- Processing is optimized for multicore systems
- Temporary files are used to minimize memory usage
- The model handles large PDFs through chunking
- Spatial information is added to enable layout analysis
- Original document metadata is preserved in the conversion
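
The following sketch shows one way positioned words from pdfplumber could be grouped into lines. It is an illustration under stated assumptions, not the model's implementation: `group_words_into_lines`, `extract_lines`, and the 3-point `tolerance` are hypothetical, while `pdfplumber.open`, `pdf.pages`, `page.page_number`, and `page.extract_words()` (which returns dicts with keys such as `text`, `x0`, `x1`, `top`, and `bottom`) are part of pdfplumber's public API.

```python
from typing import Dict, List


def group_words_into_lines(words: List[Dict], tolerance: float = 3.0) -> List[List[Dict]]:
    """Group word dicts into lines by vertical position, each line sorted left to right.

    `tolerance` (in PDF points) is an assumed threshold, not a documented setting.
    """
    lines: List[List[Dict]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Compare against the first (topmost) word of the current line.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    for line in lines:
        line.sort(key=lambda w: w["x0"])
    return lines


def extract_lines(pdf_path: str, start: int, end: int) -> Dict[int, List[List[Dict]]]:
    """Extract grouped lines for pages[start:end], keyed by 1-based page number."""
    import pdfplumber  # third-party; imported lazily so the helper above has no dependencies

    result: Dict[int, List[List[Dict]]] = {}
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[start:end]:
            result[page.page_number] = group_words_into_lines(page.extract_words())
    return result
```

A call such as `extract_lines("input.pdf", 0, 10)` would cover the first 10-page chunk; the file name is a placeholder.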

Inference Options

The following options can be configured when using this model for inference:

Name	Label	Type	Description	Default	Required
`should_copy_external_data`	Should Copy External Data	boolean	Copy the existing external data to the new document	False	No
`should_copy_labels`	Should Copy Labels	boolean	Copy the existing labels to the new document	False	No
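
A small sketch of how these options and their defaults could be resolved before inference is given below. The option names and `False` defaults come from the table above; the `resolve_options` function and its validation rules are assumptions for illustration and are not part of the Kodexa API.

```python
from typing import Any, Dict, Mapping, Optional

# Option names and defaults as listed in the table above.
DEFAULT_OPTIONS: Dict[str, bool] = {
    "should_copy_external_data": False,
    "should_copy_labels": False,
}


def resolve_options(overrides: Optional[Mapping[str, Any]] = None) -> Dict[str, bool]:
    """Merge user-supplied options over the defaults (hypothetical helper)."""
    resolved = dict(DEFAULT_OPTIONS)
    for name, value in (overrides or {}).items():
        if name not in DEFAULT_OPTIONS:
            raise KeyError(f"unknown option: {name}")
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean, got {type(value).__name__}")
        resolved[name] = value
    return resolved


if __name__ == "__main__":
    # Both options are optional; anything omitted falls back to False.
    print(resolve_options({"should_copy_labels": True}))
```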

Model Details

Provider: Kodexa
