PDF to KDDB Model
Creates a Kodexa Document from the given PDF
Slug: fast-pdf-model
Version: 1.0.0
Infer: Yes
Overview
The PDF to KDDB model converts PDF documents into the Kodexa Document Database (KDDB) format, extracting text and structure for further processing. It uses an optimized high-performance approach designed for speed and efficiency, especially with large documents.
How It Works
- The model reads the input PDF document
- It processes the PDF in memory-efficient chunks, handling 10 pages at a time
- For each chunk, it extracts:
  - Text content with precise positioning
  - Page structure and layout
  - Word and line relationships
- The extracted content is organized into a hierarchical Kodexa document structure
- Original document metadata is preserved in the resulting KDDB document
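The chunking strategy above can be sketched as follows. This is an illustrative sketch, not the model's actual implementation: `extract_page` is a hypothetical stand-in for the per-chunk extraction of text, layout, and word/line relationships.

```python
CHUNK_SIZE = 10  # pages processed per chunk, as described above

def chunk_pages(pages, chunk_size=CHUNK_SIZE):
    """Yield successive fixed-size chunks of pages."""
    for start in range(0, len(pages), chunk_size):
        yield pages[start:start + chunk_size]

def extract_page(page):
    # Hypothetical placeholder for per-page extraction of text content,
    # page layout, and word/line relationships.
    return {"page": page, "words": [], "lines": []}

pages = list(range(25))  # e.g. a 25-page document
results = [extract_page(p) for chunk in chunk_pages(pages) for p in chunk]
# A 25-page document yields three chunks of 10, 10, and 5 pages.
```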
Options Configuration
| Option | Description |
|---|---|
| `should_copy_external_data` | When enabled, preserves any external data attached to the original document |
| `should_copy_labels` | When enabled, preserves document labels from the original document |
Document Structure
The resulting KDDB document has a hierarchical structure in which page nodes contain line nodes, which in turn contain word nodes.
Each node includes:
- Bounding box coordinates: Precise positioning information
- Text content: For word nodes, the extracted text
- Spatial mixin: Enhanced spatial information for layout analysis
Process Flow
The model reads the input PDF, splits it into 10-page chunks, extracts text and layout from each chunk in parallel, and assembles the results into a single KDDB document.
Performance Considerations
The Fast PDF model is optimized for performance in several ways:
- Chunked Processing: Divides large documents into manageable chunks
- Parallel Processing: Leverages multiprocessing for faster extraction
- Memory Efficiency: Uses streaming approaches to minimize memory usage
- Optimized Text Extraction: Focused on extracting only needed information
- Multi-threading: Utilizes multiple CPU cores when available
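The fan-out across workers can be sketched with the standard library. This is a simplified illustration of the parallelism described above, not the model's code: `process_chunk` is a hypothetical stand-in for per-chunk extraction, and the real model also leverages multiprocessing.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Hypothetical extraction: return one result per page in the chunk
    return [f"page-{p}" for p in chunk]

# Split a 25-page document into 10-page chunks
chunks = [list(range(i, min(i + 10, 25))) for i in range(0, 25, 10)]

# map() preserves chunk order, so the document structure stays intact
with ThreadPoolExecutor() as pool:
    per_chunk = list(pool.map(process_chunk, chunks))

# Flatten the per-chunk results back into page order
pages = [page for chunk_result in per_chunk for page in chunk_result]
```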
Use Cases
This model is particularly useful for:
- Document Processing Pipelines: Converting PDFs for further processing and analysis
- Text Extraction: Extracting text content while preserving layout information
- OCR Results Processing: Processing PDFs with embedded OCR text
- Document Structure Analysis: Preparing documents for structure-aware processing
- Large Document Handling: Efficiently processing large PDF files
- Batch Processing: Converting multiple PDFs in high-volume scenarios
Example Usage
To convert PDF documents while preserving both external data and labels:
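A minimal sketch is shown below. The option names come from this model's Inference Options; the commented-out pipeline call is a hypothetical illustration, not the exact Kodexa SDK API.

```python
# Enable both copy options so external data and labels survive conversion
options = {
    "should_copy_external_data": True,
    "should_copy_labels": True,
}

# Hypothetical invocation (exact SDK call may differ):
# document = pipeline.run_model("fast-pdf-model", "invoice.pdf", options=options)
```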
Technical Details
- The model uses pdfplumber for text extraction with positioning
- Processing is optimized for multicore systems
- Temporary files are used to minimize memory usage
- The model handles large PDFs through chunking
- Spatial information is added to enable layout analysis
- Original document metadata is preserved in the conversion
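The temporary-file strategy noted above can be sketched as follows: each chunk's extracted content is written to disk rather than held in memory, then merged at the end. The details here are illustrative, not the model's implementation.

```python
import os
import tempfile

def extract_chunks(num_pages, chunk_size=10):
    # Hypothetical stand-in yielding one block of extracted text per chunk
    for start in range(0, num_pages, chunk_size):
        end = min(start + chunk_size, num_pages) - 1
        yield f"content for pages {start}-{end}\n"

# Write each chunk's output to its own temporary file
paths = []
for chunk_text in extract_chunks(25):
    fd, path = tempfile.mkstemp(suffix=".part")
    with os.fdopen(fd, "w") as f:
        f.write(chunk_text)
    paths.append(path)

# Merge the per-chunk files into the final result, then clean up
merged = ""
for path in paths:
    with open(path) as f:
        merged += f.read()
    os.remove(path)
```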
Inference Options
The following options can be configured when using this model for inference:
| Name | Label | Type | Description | Default | Required |
|---|---|---|---|---|---|
| `should_copy_external_data` | Should Copy External Data | boolean | Copy the existing external data to the new document | False | No |
| `should_copy_labels` | Should Copy Labels | boolean | Copy the existing labels to the new document | False | No |
Model Details
- Provider: Kodexa