Introduction

The cookie-cutter-kodexa-transformer-model is a Cookiecutter template that scaffolds a new Kodexa transformer model project with the right structure and dependencies. Transformer models in Kodexa process documents and transform the extracted data to fit your business needs. The template produces a model that can be deployed to a Kodexa platform and integrated into document processing pipelines.

This documentation will guide you through:

  • Installing the prerequisites
  • Creating a new project from the template
  • Understanding the project structure
  • Setting up your development environment in VS Code
  • Example usage scenarios

Prerequisites

Before using this cookiecutter template, ensure you have the following installed:

  1. Python 3.11+: The template is designed to work with Python 3.11 or higher
  2. Cookiecutter: The templating tool that will create your project
  3. Git: For version control
  4. Visual Studio Code: For development (recommended)
  5. Poetry: For dependency management
  6. Kodexa CLI: For deploying models and generating data classes

Installing Required Tools

You can install the required tools using pip:

# Install cookiecutter
pip install cookiecutter

# Install poetry
pip install poetry

# Install Kodexa CLI
pip install kodexa-cli
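After installation, you can sanity-check that everything landed on your PATH. Here is a minimal stdlib sketch; the command names are assumptions based on the pip packages above (in particular, it assumes the Kodexa CLI installs a kodexa command), so adjust them if your install differs:

```python
import shutil


def find_missing_tools(tools):
    """Return the command names that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Command names are assumptions based on the pip packages above
required = ["git", "cookiecutter", "poetry", "kodexa"]
missing = find_missing_tools(required)
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found")
```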

Creating a New Project

Once you have the prerequisites installed, you can create a new project from the template by running:

cookiecutter https://github.com/kodexa-labs/cookie-cutter-kodexa-transformer-model

You’ll be prompted to provide several configuration values defined in the cookiecutter.json file:

project_name [My Kodexa Transformer Model]: Invoice Data Transformer
project_slug [invoice-data-transformer]: 
pkg_name [invoice_data_transformer]: 
project_short_description [Skeleton project created by Cookiecutter Kodexa Transformer Model]: A model that transforms extracted invoice data
full_name [Kodexa Support]: Jane Smith
email [support@kodexa.com]: jane.smith@example.com
github_username [kodexa-ai]: janesmith
version [0.1.0]: 
org_slug [my-org]: janes-org
taxonomy_ref []: invoice-taxonomy/v1

These values will be used to customize your project. Here’s what each prompt means:

  • project_name: The human-readable name of your project
  • project_slug: The slug for your model (automatically derived from project_name)
  • pkg_name: The Python package name (automatically derived from project_name)
  • project_short_description: A short description of what your model does
  • full_name: Your name or your organization’s name
  • email: Contact email for the project
  • github_username: Your GitHub username or organization
  • version: The initial version of your model
  • org_slug: The Kodexa organization slug where your model will be hosted
  • taxonomy_ref: The reference to the Kodexa taxonomy that defines the data structure your transformer will use
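The bracketed defaults for project_slug and pkg_name come from Jinja2 expressions in cookiecutter.json. Those exact expressions are not reproduced here, but the typical derivation looks like this in plain Python (a hypothetical reconstruction that matches the defaults in the prompt transcript above):

```python
def derive_project_slug(project_name: str) -> str:
    """Lower-case the name and join words with hyphens (URL-friendly)."""
    return project_name.lower().replace(" ", "-")


def derive_pkg_name(project_name: str) -> str:
    """Lower-case the name and join words with underscores (importable)."""
    return project_name.lower().replace(" ", "_")


print(derive_project_slug("Invoice Data Transformer"))  # invoice-data-transformer
print(derive_pkg_name("Invoice Data Transformer"))      # invoice_data_transformer
```

Accepting the defaults at the prompt is usually the right choice; override them only if you need a slug or package name that differs from the project name.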

Project Structure

After running the cookiecutter command, a new directory with your project_slug name will be created with the following structure:

invoice-data-transformer/                # Root directory (project_slug)
├── invoice_data_transformer/            # Python package (pkg_name)
│   ├── __init__.py                      # Package initialization
│   ├── model.py                         # Main model entry point
│   ├── transformer.py                   # Transformer implementation
│   └── data_classes.py                  # Generated data classes (empty initially)
├── .editorconfig                        # Editor configuration
├── .gitignore                           # Git ignore file
├── makefile                             # Makefile with common tasks
├── model.yml                            # Kodexa model deployment configuration
├── pyproject.toml                       # Poetry project configuration
└── README.md                            # Project readme

Key Files

model.py

This is the main entry point for your transformer model. It contains the infer function that:

  • Receives a Kodexa Document as input
  • Initializes the Transformer class
  • Processes the document using the transformer
  • Returns the transformed document
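A minimal sketch of that flow is shown below, with stub classes standing in for the real Kodexa imports; the actual model.py imports Document from the kodexa package and Transformer from transformer.py, and the real infer signature may differ:

```python
class Document:
    """Stub for kodexa.Document, for illustration only."""


class Transformer:
    """Stub for the Transformer class in transformer.py."""

    def process_document(self, document, assistant):
        # The real implementation transforms extracted data and applies labels
        return document


def infer(document, assistant=None):
    """Entry point invoked by the Kodexa runtime (signature is illustrative)."""
    transformer = Transformer()
    return transformer.process_document(document, assistant)


doc = Document()
print(infer(doc) is doc)  # True
```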

transformer.py

This is where the main transformation logic is implemented. The Transformer class:

  • Processes documents and their extracted data
  • Works with the generated data classes
  • Applies labels to the document based on the transformed data
  • Handles exceptions and error logging

data_classes.py

This file will be generated using the Kodexa CLI based on the provided taxonomy reference. It will contain:

  • Pydantic models representing the data structure
  • Methods for applying labels to documents
  • Utility functions for working with the data

model.yml

This file defines how your model will be deployed to the Kodexa platform, including:

  • Model metadata
  • Runtime configuration
  • Content to include in the deployment package

makefile

The makefile includes several useful commands:

  • make format: Format code using isort and black
  • make lint: Lint code using flake8 and mypy
  • make test: Run formatting, linting, and unit tests
  • make deploy: Deploy the model to Kodexa platform
  • make undeploy: Undeploy the model from Kodexa platform
  • make generate-data-classes: Generate data classes from the taxonomy

Setting Up in Visual Studio Code

To set up your new project in Visual Studio Code:

  1. Open VS Code
  2. Choose “File > Open Folder” and select your newly created project directory
  3. Open a terminal in VS Code (Terminal > New Terminal)
  4. Install dependencies using Poetry:
    poetry install
    
  5. Activate the Poetry virtual environment:
    poetry shell
    
    Note: on Poetry 2.x, poetry shell is provided by the poetry-plugin-shell plugin; poetry env activate is the built-in alternative.
    
  6. Generate the data classes from your taxonomy:
    make generate-data-classes
    

For the best development experience, install these VS Code extensions:

  1. Python: The official Python extension
  2. Pylance: Enhanced language support for Python
  3. Python Test Explorer: For running tests
  4. YAML: For editing YAML files like model.yml
  5. Docker: For containerization if needed
  6. Markdown All in One: For editing documentation

Understanding Transformer Models

In Kodexa, transformer models are used to process and transform data that has been extracted from documents. The typical workflow is:

  1. Document Processing: The document is processed and data is extracted according to a taxonomy
  2. Data Transformation: The transformer model takes the extracted data and transforms it
  3. Label Application: The transformed data is applied back to the document as labels
  4. Pipeline Integration: The transformed document continues through the pipeline
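The four stages can be sketched end to end with plain dictionaries standing in for Kodexa objects; everything below is a toy illustration of the data flow, not the platform API:

```python
def extract(raw_text):
    """Stage 1: pretend a taxonomy-driven extractor produced this data."""
    return {"Invoice": [{"number": "INV-001", "total": "100.00"}]}


def transform(external_data):
    """Stage 2: the transformer model reshapes the extracted data."""
    for invoice in external_data.get("Invoice", []):
        invoice["total"] = float(invoice["total"])
    return external_data


def apply_labels(document, data):
    """Stage 3: write the transformed data back onto the document."""
    document["labels"] = data
    return document


# Stage 4: the labeled document continues through the pipeline
document = {"text": "Invoice INV-001 total 100.00"}
labeled = apply_labels(document, transform(extract(document["text"])))
print(labeled["labels"]["Invoice"][0]["total"])  # 100.0
```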

The Taxonomy and Data Classes

The taxonomy defines the structure of the data that your transformer will work with. When you run make generate-data-classes, the Kodexa CLI will:

  1. Fetch the taxonomy from your Kodexa organization
  2. Generate Pydantic models in data_classes.py
  3. Create all the necessary methods for working with this data

These generated classes allow you to work with strongly-typed data in your transformer.

Implementing Your Transformer

The template creates a basic transformer implementation in transformer.py. The main class is Transformer with a process_document method:

def process_document(self, document: Document, assistant):
    external_data = document.get_external_data()
    try:
        # Clean up the labels/tags
        tagged_nodes = document.select('//word[hasTag()]')
        for node in tagged_nodes:
            for tag in list(node.get_tags()):
                node.remove_tag(tag)

        # Import the data_classes module
        from . import data_classes
        from kodexa.model.model import ContentException

        # Go through the external data and dynamically create instances
        data_objects = []

        for class_name, instances in external_data.items():
            if hasattr(data_classes, class_name):
                DataClass = getattr(data_classes, class_name)
                for instance in instances:
                    data_objects.append(DataClass(**instance))
            else:
                logger.warning(f"Class {class_name} not found in data_classes module")

        logger.info(f"Found {len(data_objects)} data objects")

        # Implement the logic to transform the data objects
        transformed_objects = []
        for obj in data_objects:
            transformed_objects.append(obj)

        # Label the document based on the transformed objects
        llm_document_wrapper = KodexaDocumentLLMWrapper(document)
        for obj in transformed_objects:
            obj.apply_labels(llm_document_wrapper, assistant=assistant)

    except Exception as e:
        # Log and attach the error to the document rather than failing the pipeline
        error_message = f"An unexpected error occurred: {e}"
        logger.error(error_message)
        document.add_exception(ContentException("Processing Error", error_message))

    return document

You should modify this method to implement your specific transformation logic. In particular, focus on the section with the comment # Implement the logic to transform the data objects.

Example: Implementing an Invoice Data Transformer

Here’s an example of how you might implement a transformer for invoice data:

1. Generate the data classes

First, make sure you’ve generated the data classes from your taxonomy:

make generate-data-classes

This will create the necessary data classes in data_classes.py. Let’s assume the taxonomy includes classes like Invoice, LineItem, and Vendor.
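The generated file contains Pydantic models; the stdlib-dataclass sketch below only illustrates the kind of shape to expect, and every field name here is hypothetical rather than taken from a real taxonomy:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedValue:
    """A single extracted value plus its normalized form (hypothetical)."""
    value: Optional[str] = None
    normalized_text: Optional[str] = None


@dataclass
class LineItem:
    invoice_id: Optional[str] = None
    description: Optional[str] = None
    amount: ExtractedValue = field(default_factory=ExtractedValue)


@dataclass
class Invoice:
    id: Optional[str] = None
    number: Optional[str] = None
    total: ExtractedValue = field(default_factory=ExtractedValue)


item = LineItem(invoice_id="INV-001", amount=ExtractedValue(value="125.50"))
print(item.amount.value)  # 125.50
```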

2. Modify the transformer to process the data

def process_document(self, document: Document, assistant):
    external_data = document.get_external_data()
    try:
        # Clean up existing tags
        tagged_nodes = document.select('//word[hasTag()]')
        for node in tagged_nodes:
            for tag in list(node.get_tags()):
                node.remove_tag(tag)

        # Import the data_classes module
        from . import data_classes
        from kodexa.model.model import ContentException

        # Go through the external data and dynamically create instances
        data_objects = []
        for class_name, instances in external_data.items():
            if hasattr(data_classes, class_name):
                DataClass = getattr(data_classes, class_name)
                for instance in instances:
                    data_objects.append(DataClass(**instance))
            else:
                logger.warning(f"Class {class_name} not found in data_classes module")

        logger.info(f"Found {len(data_objects)} data objects")

        # Transform the data objects
        transformed_objects = []
        
        for obj in data_objects:
            # Example transformation: Calculate total for each invoice
            if isinstance(obj, data_classes.Invoice):
                # Find line items associated with this invoice
                line_items = [item for item in data_objects
                              if isinstance(item, data_classes.LineItem)
                              and item.invoice_id == obj.id]
                # Process line items and check for credit (CR) values
                for item in line_items:
                    amount_value = item.amount.value
                    # Check if the amount has a CR suffix indicating a credit
                    if amount_value and str(amount_value).strip().endswith('CR'):
                        # Extract the numeric part by removing the 'CR' suffix
                        numeric_str = str(amount_value).strip()[:-2].strip()
                        try:
                            # Convert to float and make negative
                            numeric_value = float(numeric_str)
                            if numeric_value > 0:
                                logger.info(f"Converting CR value {amount_value} to negative: {-numeric_value}")
                                amount_value = -numeric_value
                            else:
                                amount_value = numeric_value
                        except ValueError:
                            logger.warning(f"Could not convert {numeric_str} to a number")
                    # Amounts that are already numeric need no conversion
                    item.amount.normalized_text = str(amount_value)
                
            # Add the transformed object
            transformed_objects.append(obj)

        # Label the document based on the transformed objects
        llm_document_wrapper = KodexaDocumentLLMWrapper(document)
        for obj in transformed_objects:
            obj.apply_labels(llm_document_wrapper, assistant=assistant)

    except Exception as e:
        error_message = f"An unexpected error occurred: {e}"
        logger.error(error_message)
        document.add_exception(ContentException("Processing Error", error_message))

    return document

This example:

  1. Loads the extracted data using the generated data classes
  2. Transforms the line items of each invoice:
    • Detects amounts with a trailing “CR” (credit) marker
    • Converts those credit amounts to negative numbers
    • Writes the normalized value back to each line item
  3. Applies the updated data back to the document as labels
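The CR-handling block in the loop above is a good candidate for a small, independently testable helper. Here is a sketch of such a refactor (hypothetical, not part of the generated template):

```python
def normalize_amount(raw):
    """Parse an amount, treating a trailing 'CR' (credit) as negative.

    Returns a float, or None when the value cannot be parsed.
    """
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    text = str(raw).strip()
    is_credit = text.upper().endswith("CR")
    if is_credit:
        # Drop the 'CR' suffix before parsing the numeric part
        text = text[:-2].strip()
    try:
        value = float(text.replace(",", ""))
    except ValueError:
        return None
    return -abs(value) if is_credit else value


print(normalize_amount("1,234.56CR"))  # -1234.56
print(normalize_amount("99.99"))       # 99.99
print(normalize_amount("n/a"))         # None
```

Pulling the logic out this way keeps process_document focused on orchestration and lets you unit-test the edge cases (missing values, commas, already-numeric amounts) in isolation.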

Deploying Your Transformer Model

When your transformer model is ready, you can deploy it to the Kodexa platform:

make deploy

This will use the Kodexa CLI to deploy your model according to the configuration in model.yml.

Troubleshooting

Common Issues

“Missing data classes” errors

If you see errors about missing data classes:

  • Make sure you’ve run make generate-data-classes
  • Check that your taxonomy reference is correct
  • Verify that the taxonomy exists in your Kodexa organization

“No data found” issues

If your transformer doesn’t find any data to transform:

  • Check that the document has been processed by an extractor first
  • Verify that the extractor is using the same taxonomy as your transformer
  • Look for any exceptions in the document

Deployment failures

If your model fails to deploy:

  • Verify that your Kodexa CLI is configured correctly
  • Check if the org_slug in model.yml is correct
  • Look for syntax errors in your Python code