Transformer Model
Getting started with Kodexa using the Transformer Model Cookie Cutter
Introduction
The cookie-cutter-kodexa-transformer-model is a project template that helps you quickly set up a new Kodexa transformer model project with the right structure and dependencies. Transformer models in Kodexa are designed to process documents and transform the extracted data according to your business needs. The template creates a model that can be deployed to a Kodexa platform and integrated into document processing pipelines.
This documentation will guide you through:
- Installing the prerequisites
- Creating a new project from the template
- Understanding the project structure
- Setting up your development environment in VS Code
- Example usage scenarios
Prerequisites
Before using this cookiecutter template, ensure you have the following installed:
- Python 3.11+: The template is designed to work with Python 3.11 or higher
- Cookiecutter: The templating tool that will create your project
- Git: For version control
- Visual Studio Code: For development (recommended)
- Poetry: For dependency management
- Kodexa CLI: For deploying models and generating data classes
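To see at a glance which of these prerequisites are already present, a small script can probe the interpreter version and the PATH. This is a minimal sketch and not part of the template; the executable names below (in particular kodexa for the Kodexa CLI) are assumptions and may differ on your system.

```python
import shutil
import sys

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["cookiecutter", "git", "poetry", "kodexa"]
MIN_PYTHON = (3, 11)


def check_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each requirement to True if it looks available."""
    results = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

A "missing" result only means the executable was not found on PATH under that name, so treat it as a hint rather than proof that the tool is absent.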
Installing Required Tools
You can install the Python-based tools, such as Cookiecutter and Poetry, using pip (for example, pip install cookiecutter poetry).
Creating a New Project
Once you have the prerequisites installed, you can create a new project by running the cookiecutter command against the template.
You’ll be prompted to provide several configuration values defined in the cookiecutter.json file. These values will be used to customize your project. Here’s what each prompt means:
- project_name: The human-readable name of your project
- project_slug: The slug for your model (automatically derived from project_name)
- pkg_name: The Python package name (automatically derived from project_name)
- project_short_description: A short description of what your model does
- full_name: Your name or your organization’s name
- email: Contact email for the project
- github_username: Your GitHub username or organization
- version: The initial version of your model
- org_slug: The Kodexa organization slug where your model will be hosted
- taxonomy_ref: The reference to the Kodexa taxonomy that defines the data structure your transformer will use
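The two derived values can be pictured with a short sketch. The exact rules live in the template's cookiecutter.json and are not reproduced here, so the functions below are only an assumption about a typical derivation: lowercase text, with hyphens for the slug and underscores for the package name.

```python
import re


def derive_project_slug(project_name: str) -> str:
    """Hypothetical slug rule: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", project_name.lower()).strip("-")


def derive_pkg_name(project_name: str) -> str:
    """Hypothetical package rule: same as the slug, but with underscores."""
    return derive_project_slug(project_name).replace("-", "_")


print(derive_project_slug("My Invoice Transformer"))  # my-invoice-transformer
print(derive_pkg_name("My Invoice Transformer"))      # my_invoice_transformer
```

If the defaults offered at the prompt differ from these, the template's own rule takes precedence; you can also type a different value at either prompt.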
Project Structure
After running the cookiecutter command, a new directory named after your project_slug will be created, containing the key files described below.
Key Files
model.py
This is the main entry point for your transformer model. It contains the infer function that:
- Receives a Kodexa Document as input
- Initializes the Transformer class
- Processes the document using the transformer
- Returns the transformed document
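The flow described above can be sketched in a few lines. The Document and Transformer classes here are simplified stand-ins written for this illustration, not the Kodexa types, and the infer signature generated by the template may take additional arguments.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the Kodexa Document type (hypothetical, for illustration)."""
    labels: list = field(default_factory=list)


class Transformer:
    """Stand-in for the class defined in transformer.py."""

    def process_document(self, document: Document) -> Document:
        # Real transformation logic would go here.
        document.labels.append("transformed")
        return document


def infer(document: Document) -> Document:
    """Entry point: build the transformer, run it, return the document."""
    transformer = Transformer()
    return transformer.process_document(document)
```

Keeping infer this thin means the transformation logic stays in one place (the Transformer class) and can be unit tested without going through the entry point.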
transformer.py
This is where the main transformation logic is implemented. The Transformer class:
- Processes documents and their extracted data
- Works with the generated data classes
- Applies labels to the document based on the transformed data
- Handles exceptions and error logging
data_classes.py
This file will be generated using the Kodexa CLI based on the provided taxonomy reference. It will contain:
- Pydantic models representing the data structure
- Methods for applying labels to documents
- Utility functions for working with the data
model.yml
This file defines how your model will be deployed to the Kodexa platform, including:
- Model metadata
- Runtime configuration
- Content to include in the deployment package
makefile
The makefile includes several useful commands:
- make format: Format code using isort and black
- make lint: Lint code using flake8 and mypy
- make test: Run formatting, linting, and unit tests
- make deploy: Deploy the model to the Kodexa platform
- make undeploy: Undeploy the model from the Kodexa platform
- make generate-data-classes: Generate data classes from the taxonomy
Setting Up in Visual Studio Code
To set up your new project in Visual Studio Code:
- Open VS Code
- Choose “File > Open Folder” and select your newly created project directory
- Open a terminal in VS Code (Terminal > New Terminal)
- Install dependencies using Poetry: poetry install
- Activate the Poetry virtual environment: poetry shell
- Generate the data classes from your taxonomy: make generate-data-classes
Recommended VS Code Extensions
For the best development experience, install these VS Code extensions:
- Python: The official Python extension
- Pylance: Enhanced language support for Python
- Python Test Explorer: For running tests
- YAML: For editing YAML files like model.yml
- Docker: For containerization if needed
- Markdown All in One: For editing documentation
Understanding Transformer Models
In Kodexa, transformer models are used to process and transform data that has been extracted from documents. The typical workflow is:
- Document Processing: The document is processed and data is extracted according to a taxonomy
- Data Transformation: The transformer model takes the extracted data and transforms it
- Label Application: The transformed data is applied back to the document as labels
- Pipeline Integration: The transformed document continues through the pipeline
The Taxonomy and Data Classes
The taxonomy defines the structure of the data that your transformer will work with. When you run make generate-data-classes, the Kodexa CLI will:
- Fetch the taxonomy from your Kodexa organization
- Generate Pydantic models in data_classes.py
- Create all the necessary methods for working with this data
These generated classes allow you to work with strongly-typed data in your transformer.
Implementing Your Transformer
The template creates a basic transformer implementation in transformer.py. The main class is Transformer, with a process_document method.
You should modify this method to implement your specific transformation logic. In particular, focus on the section with the comment # Implement the logic to transform the data objects.
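To show where that comment sits relative to loading, label application, and error handling, here is a rough skeleton of such a method. It is a sketch under stated assumptions, not the template's generated code: the Document stand-in, its data_objects, labels, and exceptions fields, and the transform hook are all hypothetical names chosen for this illustration.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class Document:
    """Stand-in for the Kodexa Document type (hypothetical)."""
    data_objects: list = field(default_factory=list)
    labels: list = field(default_factory=list)
    exceptions: list = field(default_factory=list)


class Transformer:
    def process_document(self, document: Document) -> Document:
        try:
            # Load the extracted data (the template would use the generated data classes).
            data_objects = list(document.data_objects)

            # Implement the logic to transform the data objects
            transformed = [self.transform(obj) for obj in data_objects]

            # Apply the transformed data back to the document as labels.
            document.labels.extend(transformed)
        except Exception as exc:
            # Log the failure and record it on the document instead of crashing.
            logger.exception("Transformation failed")
            document.exceptions.append(str(exc))
        return document

    def transform(self, obj):
        """Hypothetical per-object hook; here it upper-cases string values."""
        return obj.upper()
```

Because labels are only applied after every object has been transformed, a failure partway through leaves the document's labels untouched and records the error instead.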
Example: Implementing an Invoice Data Transformer
Here’s an example of how you might implement a transformer for invoice data:
1. Generate the data classes
First, make sure you’ve generated the data classes from your taxonomy by running make generate-data-classes.
This will create the necessary data classes in data_classes.py. Let’s assume the taxonomy includes classes like Invoice, LineItem, and Vendor.
2. Modify the transformer to process the data
For this example, the transformer:
- Loads the extracted data using the generated data classes
- Performs transformations on invoices:
  - Calculates the total from line items
  - Corrects the invoice total if necessary
  - Adds a status based on the amount
- Applies the updated data back to the document as labels
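The invoice transformations listed above can be sketched as follows. Plain dataclasses stand in for the generated Pydantic models so the snippet runs on its own; the field names, the status values, and the review threshold are assumptions made for illustration, since the real ones come from your taxonomy and business rules.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional

# Assumed threshold for illustration; the real value is a business decision.
REVIEW_THRESHOLD = Decimal("10000")


@dataclass
class LineItem:
    description: str
    amount: Decimal


@dataclass
class Invoice:
    total: Decimal
    line_items: List[LineItem] = field(default_factory=list)
    status: Optional[str] = None


def transform_invoice(invoice: Invoice) -> Invoice:
    # Calculate the total from the line items.
    calculated = sum((item.amount for item in invoice.line_items), Decimal("0"))

    # Correct the invoice total if it disagrees with the line items.
    if invoice.line_items and invoice.total != calculated:
        invoice.total = calculated

    # Add a status based on the amount.
    invoice.status = "NEEDS_REVIEW" if invoice.total >= REVIEW_THRESHOLD else "APPROVED"
    return invoice


invoice = Invoice(
    total=Decimal("100.00"),
    line_items=[LineItem("Widgets", Decimal("60.00")), LineItem("Shipping", Decimal("45.50"))],
)
transform_invoice(invoice)
print(invoice.total, invoice.status)  # 105.50 APPROVED
```

Decimal is used rather than float so that currency sums compare exactly. An invoice with no line items keeps its extracted total, because there is nothing to check it against.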
Deploying Your Transformer Model
When your transformer model is ready, you can deploy it to the Kodexa platform by running make deploy.
This will use the Kodexa CLI to deploy your model according to the configuration in model.yml.
Troubleshooting
Common Issues
“Missing data classes” errors
If you see errors about missing data classes:
- Make sure you’ve run make generate-data-classes
- Check that your taxonomy reference is correct
- Verify that the taxonomy exists in your Kodexa organization
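One quick way to confirm that generation actually produced something is to list the classes defined in the generated file. The helper below is a hypothetical sketch, not part of the template; it parses the file without importing it, so it also works when the file's dependencies are not installed.

```python
import ast
from pathlib import Path


def generated_class_names(data_classes_path) -> list:
    """Return the top-level class names defined in a data_classes.py file.

    An empty list means the file is missing or defines no classes.
    """
    path = Path(data_classes_path)
    if not path.is_file():
        return []
    tree = ast.parse(path.read_text(encoding="utf-8"))
    return [node.name for node in tree.body if isinstance(node, ast.ClassDef)]
```

Pass it the path to data_classes.py inside your package directory; an empty result suggests the generation step did not run or the taxonomy reference resolved to nothing.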
“No data found” issues
If your transformer doesn’t find any data to transform:
- Check that the document has been processed by an extractor first
- Verify that the extractor is using the same taxonomy as your transformer
- Look for any exceptions in the document
Deployment failures
If your model fails to deploy:
- Verify that your Kodexa CLI is configured correctly
- Check if the org_slug in model.yml is correct
- Look for syntax errors in your Python code
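For the last item, syntax errors can be found before deploying by parsing every Python file in the project. This is a minimal sketch using only the standard library; the function name is hypothetical, and the check reflects the grammar of the interpreter you run it with, which may differ from the deployment runtime.

```python
import ast
from pathlib import Path


def find_syntax_errors(project_dir) -> dict:
    """Map each Python file that fails to parse to its error message."""
    errors = {}
    for path in sorted(Path(project_dir).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors
```

An empty dictionary means every file parsed cleanly. It does not catch runtime problems such as missing imports, which the linting and test commands in the makefile are better suited to find.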