Slug: tif-to-pdf-model
Version: 1.0.0
Infer: Yes

Overview

TIF to PDF Model

The TIF to PDF model converts TIFF image files to PDF documents while preserving the visual content and enabling text extraction. This model is particularly useful for integrating scanned TIFF documents into PDF-based workflows and making their content searchable and processable.

How It Works

  1. The model reads the input TIFF file (supporting both single and multi-page TIFFs)
  2. For each page in the TIFF file:
    • The image is extracted as a separate frame
    • Image quality is preserved during conversion
  3. All frames are combined into a single PDF document
  4. The PDF is processed with pdfplumber to extract text and structure
  5. A fully structured Kodexa document is created with:
    • Page nodes representing each page
    • Content area nodes containing the text
    • Line and word nodes capturing the text content and positioning
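
Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, assuming Pillow and img2pdf are installed; the function name tiff_to_pdf_bytes and the choice to re-encode each frame as PNG are assumptions made for the example and are not taken from the model's source.

```python
import io


def tiff_to_pdf_bytes(tiff_bytes: bytes) -> bytes:
    """Convert a single- or multi-page TIFF (as bytes) into one PDF (as bytes).

    Hypothetical helper for illustration; not the model's actual code.
    """
    # Third-party imports are local so the module loads without them.
    import img2pdf
    from PIL import Image, ImageSequence

    frames = []
    with Image.open(io.BytesIO(tiff_bytes)) as tiff:
        # ImageSequence.Iterator yields each frame (page) of the TIFF in turn.
        for frame in ImageSequence.Iterator(tiff):
            buffer = io.BytesIO()
            # Carry the resolution over when the TIFF declares one.
            save_kwargs = {"dpi": frame.info["dpi"]} if "dpi" in frame.info else {}
            # PNG is lossless, so re-encoding the frame does not degrade it.
            # Assumption: the frame mode is one PNG can store (CMYK is not).
            frame.save(buffer, format="PNG", **save_kwargs)
            frames.append(buffer.getvalue())

    # img2pdf places each input image on its own PDF page.
    return img2pdf.convert(*frames)


if __name__ == "__main__":
    # "scan.tif" and "scan.pdf" are placeholder file names.
    with open("scan.tif", "rb") as source:
        pdf_bytes = tiff_to_pdf_bytes(source.read())
    with open("scan.pdf", "wb") as target:
        target.write(pdf_bytes)
```

Working on bytes rather than file paths keeps the sketch easy to slot into a pipeline where the TIFF arrives from storage or an upload.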

Document Structure

The resulting document will have the following structure:

document
└── page (for each TIFF frame)
    └── content-area
        └── line
            └── word

Each node carries:

  • Bounding box coordinates: The node's position on the page
  • Text content: The extracted text (word nodes only)
  • PDF mixin: PDF-specific features and capabilities
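
To make the hierarchy concrete, the sketch below models it with plain Python dataclasses. The Node class, its field names, and the (x0, y0, x1, y1) bounding-box layout are hypothetical stand-ins chosen for illustration; they are not the Kodexa SDK's classes or API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

# Assumed bounding-box layout: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass
class Node:
    """Hypothetical stand-in for a document node; not the Kodexa class."""

    node_type: str
    bbox: Optional[BBox] = None
    content: Optional[str] = None  # only set on word nodes
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Node"]]:
        """Yield (depth, node) pairs in document order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def build_example() -> Node:
    """Build a one-page document with a single two-word line."""
    document = Node("document")
    page = document.add(Node("page", bbox=(0, 0, 612, 792)))
    area = page.add(Node("content-area", bbox=(72, 72, 540, 90)))
    line = area.add(Node("line", bbox=(72, 72, 200, 90)))
    line.add(Node("word", bbox=(72, 72, 120, 90), content="Invoice"))
    line.add(Node("word", bbox=(126, 72, 200, 90), content="Number"))
    return document


if __name__ == "__main__":
    # Print the tree with one indentation level per depth.
    for depth, node in build_example().walk():
        label = f"{node.node_type} {node.content!r}" if node.content else node.node_type
        print("    " * depth + label)
```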

Use Cases

This model is particularly useful for:

  • Legacy Document Conversion: Converting TIFF archives to the more widely usable PDF format
  • Document Standardization: Standardizing mixed-format documents to PDF
  • OCR Integration: Preparing scanned documents for OCR processing
  • Workflow Integration: Incorporating TIFF-based documents into PDF workflows
  • Document Processing Pipelines: Enabling further processing of TIFF-based content

Technical Details

  • The model uses img2pdf for high-quality TIFF to PDF conversion
  • PIL/Pillow is used for TIFF frame extraction and processing
  • The conversion preserves the original image quality and resolution
  • Text extraction is performed using pdfplumber on the converted PDF
  • The model works with both single-page and multi-page TIFF files
  • Processing is optimized for memory efficiency even with large TIFF files
  • The original document metadata is preserved in the resulting PDF
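
As an illustration of how extracted words can become line nodes with positions, the sketch below groups word dictionaries into lines. The dictionaries use the text, x0, x1, top, and bottom keys that pdfplumber's extract_words() returns; the grouping rule, the tolerance value, and the function name are assumptions for the example, since the model's own grouping logic is not documented here.

```python
from typing import Dict, List


def group_words_into_lines(words: List[Dict], tolerance: float = 3.0) -> List[Dict]:
    """Group word dicts into lines by vertical position.

    Hypothetical helper: words whose `top` lies within `tolerance` of the
    first word of the current line are treated as part of that line.
    """
    lines: List[List[Dict]] = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    result = []
    for line in lines:
        line.sort(key=lambda w: w["x0"])  # left-to-right reading order
        result.append(
            {
                "text": " ".join(w["text"] for w in line),
                # The line's bounding box is the union of its word boxes.
                "x0": min(w["x0"] for w in line),
                "top": min(w["top"] for w in line),
                "x1": max(w["x1"] for w in line),
                "bottom": max(w["bottom"] for w in line),
                "words": line,
            }
        )
    return result


if __name__ == "__main__":
    # With pdfplumber installed, real input could come from:
    #   with pdfplumber.open("scan.pdf") as pdf:   # placeholder file name
    #       words = pdf.pages[0].extract_words()
    sample = [
        {"text": "world", "x0": 50, "x1": 90, "top": 101, "bottom": 112},
        {"text": "Hello", "x0": 10, "x1": 45, "top": 100, "bottom": 112},
        {"text": "Next", "x0": 10, "x1": 40, "top": 120, "bottom": 132},
    ]
    for line in group_words_into_lines(sample):
        print(line["text"], (line["x0"], line["top"], line["x1"], line["bottom"]))
```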

Limitations

  • Text extraction quality depends on the clarity of the original TIFF image
  • Very large TIFF files may require additional processing time
  • Compression artifacts in the original TIFF may affect text extraction quality

Model Details

  • Provider: Kodexa