Introduction

The cookie-cutter-kodexa-infer-model is a project template that helps you quickly set up a new Kodexa inference model project with the right structure and dependencies. The generated project is a model that can be deployed to the Kodexa platform for document processing and data extraction.

This documentation will guide you through:

  • Installing the prerequisites
  • Creating a new project from the template
  • Understanding the project structure
  • Setting up your development environment in VS Code
  • Example usage scenarios

Prerequisites

Before using this cookiecutter template, ensure you have the following installed:

  1. Python 3.11+: The generated project requires Python 3.11 or higher
  2. Cookiecutter: The templating tool that will create your project
  3. Git: For version control
  4. Visual Studio Code: For development (recommended)
  5. Poetry: For dependency management (recommended)
  6. Kodexa CLI: For deploying models to Kodexa platform

Installing Required Tools

You can install the required tools using pip:

# Install cookiecutter
pip install cookiecutter

# Install poetry
pip install poetry

# Install Kodexa CLI
pip install kodexa-cli
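
Once installed, you can sanity-check that each tool is on your PATH. Note that the kodexa command name here is assumed from the kodexa-cli package; if it isn’t available, check that package’s documentation:

# Confirm the tools are on your PATH
cookiecutter --version
poetry --version
kodexa --version   # command name assumed from the kodexa-cli package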

Creating a New Project

Once you have the prerequisites installed, you can create a new project from the template by running:

cookiecutter https://github.com/kodexa-labs/cookie-cutter-kodexa-infer-model

You’ll be prompted to provide several configuration values defined in the cookiecutter.json file:

project_name [My Kodexa Infer Model]: Document Classifier
project_slug [document-classifier]: 
pkg_name [document_classifier]: 
project_short_description [Skeleton project created by Cookiecutter Kodexa Event Model]: A model that classifies documents by type
full_name [Kodexa Support]: Jane Smith
email [support@kodexa.com]: jane.smith@example.com
github_username [kodexa-ai]: janesmith
version [0.1.0]: 
org_slug [my-org]: janes-org

These values will be used to customize your project. Here’s what each prompt means:

  • project_name: The human-readable name of your project
  • project_slug: The slug for your model (automatically derived from project_name)
  • pkg_name: The Python package name (automatically derived from project_name)
  • project_short_description: A short description of what your model does
  • full_name: Your name or your organization’s name
  • email: Contact email for the project
  • github_username: Your GitHub username or organization
  • version: The initial version of your model
  • org_slug: The Kodexa organization slug where your model will be hosted
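
If you’re scripting project creation (for example in CI), cookiecutter can also run non-interactively: pass --no-input and supply values as key=value arguments, and anything you omit falls back to the defaults in cookiecutter.json. A sketch, with illustrative values:

cookiecutter https://github.com/kodexa-labs/cookie-cutter-kodexa-infer-model \
  --no-input \
  project_name="Document Classifier" \
  org_slug="janes-org"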

Project Structure

After running the cookiecutter command, a new directory named after your project_slug will be created with the following structure:

document-classifier/                  # Root directory (project_slug)
├── document_classifier/              # Python package (pkg_name)
│   ├── __init__.py                   # Package initialization
│   └── model.py                      # Main model implementation
├── .editorconfig                     # Editor configuration
├── .gitignore                        # Git ignore file
├── makefile                          # Makefile with common tasks
├── model.yml                         # Kodexa model deployment configuration
├── pyproject.toml                    # Poetry project configuration
└── README.md                         # Project readme

Key Files

model.py

This is the main file where you’ll implement your inference model. It comes with a sample implementation that:

  • Receives a Kodexa Document as input
  • Has access to the project, pipeline context, and assistant
  • Can add labels to the document
  • Can access the document’s source bytes
  • Returns the processed document

model.yml

This file defines how your model will be deployed to the Kodexa platform, including:

  • Model metadata
  • Runtime configuration
  • Access settings
  • Content to include in the deployment package

pyproject.toml

This file contains your project’s metadata and dependencies managed by Poetry, including:

  • Project information
  • Python version requirements
  • Dependencies (including Kodexa)
  • Development tools configuration (black, isort, flake8, mypy)
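
The exact contents are generated by the template; as a rough orientation, a Poetry pyproject.toml of this shape would contain sections like the following (names, grouping, and version constraints are illustrative, not copied from the template):

[tool.poetry]
name = "document-classifier"
version = "0.1.0"
description = "A model that classifies documents by type"
authors = ["Jane Smith <jane.smith@example.com>"]

[tool.poetry.dependencies]
python = "^3.11"
kodexa = "*"          # version constraint is illustrative

[tool.poetry.group.dev.dependencies]
black = "*"           # formatting
isort = "*"           # import sorting
flake8 = "*"          # linting
mypy = "*"            # type checking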

makefile

The makefile includes several useful commands:

  • make format: Format code using isort and black
  • make lint: Lint code using flake8 and mypy
  • make test: Run formatting, linting, and unit tests
  • make deploy: Deploy the model to Kodexa platform
  • make undeploy: Undeploy the model from Kodexa platform

Setting Up in Visual Studio Code

To set up your new project in Visual Studio Code:

  1. Open VS Code
  2. Choose “File > Open Folder” and select your newly created project directory
  3. Open a terminal in VS Code (Terminal > New Terminal)
  4. Install dependencies using Poetry:
    poetry install
    
  5. Activate the Poetry virtual environment:
    poetry shell
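
VS Code also needs to know which interpreter to use. You can print the path of the Poetry-managed virtual environment and then select it via “Python: Select Interpreter” in the Command Palette:

# Print the path of the Poetry virtual environment
poetry env info --path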
    

For the best development experience, install these VS Code extensions:

  1. Python: The official Python extension
  2. Pylance: Enhanced language support for Python
  3. Python Test Explorer: For running tests
  4. YAML: For editing YAML files like model.yml
  5. Docker: For containerization if needed
  6. Markdown All in One: For editing documentation

Implementing Your Model

The template creates a basic model implementation in pkg_name/model.py. The main entry point is the infer function:

def infer(document: Document, project: ProjectEndpoint, pipeline_context: PipelineContext, assistant: Assistant):
    # Your model implementation here
    document.add_label("my_first_model")
    return document

You should modify this function to implement your specific document processing logic. The function receives:

  • document: The Kodexa Document to process
  • project: The Kodexa project endpoint
  • pipeline_context: Context information about the current pipeline
  • assistant: The Kodexa assistant for interaction with large language models

Example: Implementing a Document Classifier

Here’s an example of how you might implement a simple document classifier:

1. Modify the model.py file

import logging

logger = logging.getLogger(__name__)


def infer(document: Document, project: ProjectEndpoint, pipeline_context: PipelineContext, assistant: Assistant):
    """
    Classify a document based on its content
    """
    logger.info(f"Processing document: {document.uuid}")

    # Guard against documents with no parsed content
    if document.content_node is None:
        logger.warning("Document has no content node; skipping classification")
        return document

    # Get the document text once, lower-cased for keyword matching
    all_text = document.content_node.get_all_content().lower()

    # Simple keyword-based classification rules
    document_type = "unknown"

    if "invoice" in all_text and ("total" in all_text or "amount due" in all_text):
        document_type = "invoice"
    elif "agreement" in all_text and "parties" in all_text:
        document_type = "contract"
    elif "resume" in all_text or "curriculum vitae" in all_text:
        document_type = "resume"

    # Add the classification as a label
    document.add_label(document_type)
    logger.info(f"Classified document as: {document_type}")

    return document
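
Before deploying, you can smoke-test the classifier locally. This sketch assumes kodexa’s Document.from_text helper and passes None for the endpoints this classifier never touches; treat it as a rough harness, not something the template generates:

# Rough local smoke test (not part of the template)
from kodexa import Document

from document_classifier.model import infer

doc = Document.from_text("Invoice #1042 - amount due: $500.00")
result = infer(doc, project=None, pipeline_context=None, assistant=None)
print(result.labels)  # expect ["invoice"], assuming labels behave as in add_label above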

2. Deploy your model

Once you’re satisfied with your model, you can deploy it to the Kodexa platform:

make deploy

This will use the Kodexa CLI to deploy your model according to the configuration in model.yml.

Working with the Kodexa Platform

Deploying Your Model

The template includes commands to deploy and undeploy your model:

# Deploy the model
make deploy

# Undeploy the model
make undeploy

These commands use the Kodexa CLI and the configuration in model.yml to manage your model on the Kodexa platform.

Using Your Model in Data Flow

Once the model is deployed, you can open Kodexa Studio and add it to an Assistant in your project’s data flow.

Troubleshooting

Common Issues

“Cannot find module” errors

If you encounter module import errors, make sure:

  • Your Poetry environment is activated (poetry shell)
  • The package is installed in development mode (poetry install)
  • Your import statements use the correct package name

Deployment failures

If your model fails to deploy:

  • Check if your Kodexa CLI is configured correctly
  • Verify that the org_slug in model.yml is correct
  • Look for syntax errors in your Python code
  • Check if your model.yml is properly formatted

Model not working as expected

If your deployed model doesn’t work as expected:

  • Add more logging in your infer function to understand what’s happening (see the sketch after this list)
  • Check if your model is receiving the correct document format
  • Verify that you’re returning the document object from your infer function
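
As a minimal sketch of that kind of defensive logging (the logger setup mirrors the classifier example above; the guard conditions are illustrative):

import logging

logger = logging.getLogger(__name__)


def infer(document, project, pipeline_context, assistant):
    logger.info(f"Received document: {document.uuid}")

    if document.content_node is None:
        logger.warning("No content node on document; returning unchanged")
        return document

    # ... your processing logic here ...

    logger.info(f"Finished processing document: {document.uuid}")
    # Always return the document so downstream pipeline steps receive it
    return document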