Introduction

The cookie-cutter-kodexa-transformer-model is a Cookiecutter template that scaffolds a new Kodexa transformer model project with the right structure and dependencies. Transformer models in Kodexa process documents and transform the extracted data to fit your business needs. The template produces a model that can be deployed to a Kodexa platform and integrated into document processing pipelines.

This documentation will guide you through:

  • Installing the prerequisites
  • Creating a new project from the template
  • Understanding the project structure
  • Setting up your development environment in VS Code
  • Example usage scenarios

Prerequisites

Before using this cookiecutter template, ensure you have the following installed:

  1. Python 3.11+: The template is designed to work with Python 3.11 or higher
  2. Cookiecutter: The templating tool that will create your project
  3. Git: For version control
  4. Visual Studio Code: For development (recommended)
  5. Poetry: For dependency management
  6. Kodexa CLI: For deploying models and generating data classes

Installing Required Tools

You can install the required tools using pip:

# Install cookiecutter
pip install cookiecutter

# Install poetry
pip install poetry

# Install Kodexa CLI
pip install kodexa-cli
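After installation, you can sanity-check that everything landed on your PATH. Here is a minimal stdlib sketch; the command names are assumptions based on the pip packages above (in particular, it assumes the Kodexa CLI installs a kodexa command), so adjust them if your install differs:

```python
import shutil


def find_missing_tools(tools):
    """Return the command names that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Command names are assumptions based on the pip packages above
required = ["git", "cookiecutter", "poetry", "kodexa"]
missing = find_missing_tools(required)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found")
```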

Creating a New Project

Once you have the prerequisites installed, you can create a new project from the template by running:

cookiecutter https://github.com/kodexa-labs/cookie-cutter-kodexa-transformer-model

You’ll be prompted to provide several configuration values defined in the cookiecutter.json file:

project_name [My Kodexa Transformer Model]: Invoice Data Transformer
project_slug [invoice-data-transformer]: 
pkg_name [invoice_data_transformer]: 
project_short_description [Skeleton project created by Cookiecutter Kodexa Transformer Model]: A model that transforms extracted invoice data
full_name [Kodexa Support]: Jane Smith
email [support@kodexa.com]: jane.smith@example.com
github_username [kodexa-ai]: janesmith
version [0.1.0]: 
org_slug [my-org]: janes-org
taxonomy_ref []: invoice-taxonomy/v1

These values will be used to customize your project. Here’s what each prompt means:

  • project_name: The human-readable name of your project
  • project_slug: The slug for your model (automatically derived from project_name)
  • pkg_name: The Python package name (automatically derived from project_name)
  • project_short_description: A short description of what your model does
  • full_name: Your name or your organization’s name
  • email: Contact email for the project
  • github_username: Your GitHub username or organization
  • version: The initial version of your model
  • org_slug: The Kodexa organization slug where your model will be hosted
  • taxonomy_ref: The reference to the Kodexa taxonomy that defines the data structure your transformer will use
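The bracketed defaults for project_slug and pkg_name come from Jinja2 expressions in cookiecutter.json. Those exact expressions are not reproduced here, but the typical derivation looks like this in plain Python (a hypothetical reconstruction that matches the defaults in the prompt transcript above):

```python
def derive_project_slug(project_name: str) -> str:
    """Lower-case the name and join words with hyphens (URL-friendly)."""
    return project_name.lower().replace(" ", "-")


def derive_pkg_name(project_name: str) -> str:
    """Lower-case the name and join words with underscores (importable)."""
    return project_name.lower().replace(" ", "_")


print(derive_project_slug("Invoice Data Transformer"))  # invoice-data-transformer
print(derive_pkg_name("Invoice Data Transformer"))      # invoice_data_transformer
```

Accepting the defaults at the prompt is usually the right choice; override them only if you need a slug or package name that differs from the project name.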

Project Structure

After running the cookiecutter command, a new directory with your project_slug name will be created with the following structure:

invoice-data-transformer/                # Root directory (project_slug)
├── invoice_data_transformer/            # Python package (pkg_name)
│   ├── __init__.py                      # Package initialization
│   ├── model.py                         # Main model entry point
│   ├── transformer.py                   # Transformer implementation
│   └── data_classes.py                  # Generated data classes (empty initially)
├── .editorconfig                        # Editor configuration
├── .gitignore                           # Git ignore file
├── makefile                             # Makefile with common tasks
├── model.yml                            # Kodexa model deployment configuration
├── pyproject.toml                       # Poetry project configuration
└── README.md                            # Project readme

Key Files

model.py

This is the main entry point for your transformer model. It contains the infer function that:

  • Receives a Kodexa Document as input
  • Initializes the Transformer class
  • Processes the document using the transformer
  • Returns the transformed document
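A minimal sketch of that flow is shown below, with stub classes standing in for the real Kodexa imports; the actual model.py imports Document from the kodexa package and Transformer from transformer.py, and the real infer signature may differ:

```python
class Document:
    """Stub for kodexa.Document, for illustration only."""


class Transformer:
    """Stub for the Transformer class in transformer.py."""

    def process_document(self, document, assistant):
        # The real implementation transforms extracted data and applies labels
        return document


def infer(document, assistant=None):
    """Entry point invoked by the Kodexa runtime (signature is illustrative)."""
    transformer = Transformer()
    return transformer.process_document(document, assistant)


doc = Document()
print(infer(doc) is doc)  # True
```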

transformer.py

This is where the main transformation logic is implemented. The Transformer class:

  • Processes documents and their extracted data
  • Works with the generated data classes
  • Applies labels to the document based on the transformed data
  • Handles exceptions and error logging

data_classes.py

This file will be generated using the Kodexa CLI based on the provided taxonomy reference. It will contain:

  • Pydantic models representing the data structure
  • Methods for applying labels to documents
  • Utility functions for working with the data

model.yml

This file defines how your model will be deployed to the Kodexa platform, including:

  • Model metadata
  • Runtime configuration
  • Content to include in the deployment package

makefile

The makefile includes several useful commands:

  • make format: Format code using isort and black
  • make lint: Lint code using flake8 and mypy
  • make test: Run formatting, linting, and unit tests
  • make deploy: Deploy the model to Kodexa platform
  • make undeploy: Undeploy the model from Kodexa platform
  • make generate-data-classes: Generate data classes from the taxonomy

Setting Up in Visual Studio Code

To set up your new project in Visual Studio Code:

  1. Open VS Code
  2. Choose “File > Open Folder” and select your newly created project directory
  3. Open a terminal in VS Code (Terminal > New Terminal)
  4. Install dependencies using Poetry:
    poetry install
    
  5. Activate the Poetry virtual environment:
    poetry shell
    
    Note: on Poetry 2.x, poetry shell is provided by the poetry-plugin-shell plugin; poetry env activate is the built-in alternative.
    
  6. Generate the data classes from your taxonomy:
    make generate-data-classes
    

For the best development experience, install these VS Code extensions:

  1. Python: The official Python extension
  2. Pylance: Enhanced language support for Python
  3. Python Test Explorer: For running tests
  4. YAML: For editing YAML files like model.yml
  5. Docker: For containerization if needed
  6. Markdown All in One: For editing documentation

Understanding Transformer Models

In Kodexa, transformer models are used to process and transform data that has been extracted from documents. The typical workflow is:

  1. Document Processing: The document is processed and data is extracted according to a taxonomy
  2. Data Transformation: The transformer model takes the extracted data and transforms it
  3. Label Application: The transformed data is applied back to the document as labels
  4. Pipeline Integration: The transformed document continues through the pipeline
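The four stages can be sketched end to end with plain dictionaries standing in for Kodexa objects; everything below is a toy illustration of the data flow, not the platform API:

```python
def extract(raw_text):
    """Stage 1: pretend a taxonomy-driven extractor produced this data."""
    return {"Invoice": [{"number": "INV-001", "total": "100.00"}]}


def transform(external_data):
    """Stage 2: the transformer model reshapes the extracted data."""
    for invoice in external_data.get("Invoice", []):
        invoice["total"] = float(invoice["total"])
    return external_data


def apply_labels(document, data):
    """Stage 3: write the transformed data back onto the document."""
    document["labels"] = data
    return document


# Stage 4: the labeled document continues through the pipeline
document = {"text": "Invoice INV-001 total 100.00"}
labeled = apply_labels(document, transform(extract(document["text"])))
print(labeled["labels"]["Invoice"][0]["total"])  # 100.0
```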

The Taxonomy and Data Classes

The taxonomy defines the structure of the data that your transformer will work with. When you run make generate-data-classes, the Kodexa CLI will:

  1. Fetch the taxonomy from your Kodexa organization
  2. Generate Pydantic models in data_classes.py
  3. Create all the necessary methods for working with this data

These generated classes allow you to work with strongly-typed data in your transformer.

Implementing Your Transformer

The template creates a basic transformer implementation in transformer.py. The main class is Transformer with a process_document method:

def process_document(self, document: Document, assistant):
    external_data = document.get_external_data()
    try:
        # Clean up the labels/tags
        tagged_nodes = document.select('//word[hasTag()]')
        for node in tagged_nodes:
            for tag in list(node.get_tags()):
                node.remove_tag(tag)

        # Import the data_classes module
        from . import data_classes
        from kodexa.model.model import ContentException

        # Go through the external data and dynamically create instances
        data_objects = []

        for class_name, instances in external_data.items():
            if hasattr(data_classes, class_name):
                DataClass = getattr(data_classes, class_name)
                for instance in instances:
                    data_objects.append(DataClass(**instance))
            else:
                logger.warning(f"Class {class_name} not found in data_classes module")

        logger.info(f"Found {len(data_objects)} data objects")

        # Implement the logic to transform the data objects
        transformed_objects = []
        for obj in data_objects:
            transformed_objects.append(obj)

        # Label the document based on the transformed objects
        llm_document_wrapper = KodexaDocumentLLMWrapper(document)
        for obj in transformed_objects:
            obj.apply_labels(llm_document_wrapper, assistant=assistant)

    except Exception as e:
        # Log and attach the error to the document rather than failing the pipeline
        error_message = f"An unexpected error occurred: {e}"
        logger.error(error_message)
        document.add_exception(ContentException("Processing Error", error_message))

    return document

You should modify this method to implement your specific transformation logic. In particular, focus on the section with the comment # Implement the logic to transform the data objects.

Example: Implementing an Invoice Data Transformer

Here’s an example of how you might implement a transformer for invoice data:

1. Generate the data classes

First, make sure you’ve generated the data classes from your taxonomy:

make generate-data-classes

This will create the necessary data classes in data_classes.py. Let’s assume the taxonomy includes classes like Invoice, LineItem, and Vendor.
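The generated file contains Pydantic models; the stdlib-dataclass sketch below only illustrates the kind of shape to expect, and every field name here is hypothetical rather than taken from a real taxonomy:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedValue:
    """A single extracted value plus its normalized form (hypothetical)."""
    value: Optional[str] = None
    normalized_text: Optional[str] = None


@dataclass
class LineItem:
    invoice_id: Optional[str] = None
    description: Optional[str] = None
    amount: ExtractedValue = field(default_factory=ExtractedValue)


@dataclass
class Invoice:
    id: Optional[str] = None
    number: Optional[str] = None
    total: ExtractedValue = field(default_factory=ExtractedValue)


item = LineItem(invoice_id="INV-001", amount=ExtractedValue(value="125.50"))
print(item.amount.value)  # 125.50
```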

2. Modify the transformer to process the data

def process_document(self, document: Document, assistant):
    external_data = document.get_external_data()
    try:
        # Clean up existing tags
        tagged_nodes = document.select('//word[hasTag()]')
        for node in tagged_nodes:
            for tag in list(node.get_tags()):
                node.remove_tag(tag)

        # Import the data_classes module
        from . import data_classes
        from kodexa.model.model import ContentException

        # Go through the external data and dynamically create instances
        data_objects = []
        for class_name, instances in external_data.items():
            if hasattr(data_classes, class_name):
                DataClass = getattr(data_classes, class_name)
                for instance in instances:
                    data_objects.append(DataClass(**instance))
            else:
                logger.warning(f"Class {class_name} not found in data_classes module")

        logger.info(f"Found {len(data_objects)} data objects")

        # Transform the data objects
        transformed_objects = []
        
        for obj in data_objects:
            # Example transformation: Calculate total for each invoice
            if isinstance(obj, data_classes.Invoice):
                # Find line items associated with this invoice
                line_items = [item for item in data_objects
                              if isinstance(item, data_classes.LineItem)
                              and item.invoice_id == obj.id]
                # Process line items and check for credit (CR) values
                for item in line_items:
                    amount_value = item.amount.value
                    # Check if the amount has a CR suffix indicating a credit
                    if amount_value and str(amount_value).strip().endswith('CR'):
                        # Extract the numeric part by removing the 'CR' suffix
                        numeric_str = str(amount_value).strip()[:-2].strip()
                        try:
                            # Convert to float and make negative
                            numeric_value = float(numeric_str)
                            if numeric_value > 0:
                                logger.info(f"Converting CR value {amount_value} to negative: {-numeric_value}")
                                amount_value = -numeric_value
                            else:
                                amount_value = numeric_value
                        except ValueError:
                            logger.warning(f"Could not convert {numeric_str} to a number")
                    # Amounts that are already numeric need no conversion
                    item.amount.normalized_text = str(amount_value)
                
            # Add the transformed object
            transformed_objects.append(obj)

        # Label the document based on the transformed objects
        llm_document_wrapper = KodexaDocumentLLMWrapper(document)
        for obj in transformed_objects:
            obj.apply_labels(llm_document_wrapper, assistant=assistant)

    except Exception as e:
        error_message = f"An unexpected error occurred: {e}"
        logger.error(error_message)
        document.add_exception(ContentException("Processing Error", error_message))

    return document

This example:

  1. Loads the extracted data using the generated data classes
  2. Transforms the line items of each invoice:
    • Detects amounts with a trailing “CR” (credit) marker
    • Converts those credit amounts to negative numbers
    • Writes the normalized value back to each line item
  3. Applies the updated data back to the document as labels
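The CR-handling block in the loop above is a good candidate for a small, independently testable helper. Here is a sketch of such a refactor (hypothetical, not part of the generated template):

```python
def normalize_amount(raw):
    """Parse an amount, treating a trailing 'CR' (credit) as negative.

    Returns a float, or None when the value cannot be parsed.
    """
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    text = str(raw).strip()
    is_credit = text.upper().endswith("CR")
    if is_credit:
        # Drop the 'CR' suffix before parsing the numeric part
        text = text[:-2].strip()
    try:
        value = float(text.replace(",", ""))
    except ValueError:
        return None
    return -abs(value) if is_credit else value


print(normalize_amount("1,234.56CR"))  # -1234.56
print(normalize_amount("99.99"))       # 99.99
print(normalize_amount("n/a"))         # None
```

Pulling the logic out this way keeps process_document focused on orchestration and lets you unit-test the edge cases (missing values, commas, already-numeric amounts) in isolation.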

Deploying Your Transformer Model

When your transformer model is ready, you can deploy it to the Kodexa platform:

make deploy

This will use the Kodexa CLI to deploy your model according to the configuration in model.yml.

Troubleshooting

Common Issues

“Missing data classes” errors

If you see errors about missing data classes:

  • Make sure you’ve run make generate-data-classes
  • Check that your taxonomy reference is correct
  • Verify that the taxonomy exists in your Kodexa organization

“No data found” issues

If your transformer doesn’t find any data to transform:

  • Check that the document has been processed by an extractor first
  • Verify that the extractor is using the same taxonomy as your transformer
  • Look for any exceptions in the document

Deployment failures

If your model fails to deploy:

  • Verify that your Kodexa CLI is configured correctly
  • Check if the org_slug in model.yml is correct
  • Look for syntax errors in your Python code