The LLM module provides a unified interface for calling large language models from your Python modules. All LLM calls are routed through the Kodexa AI Gateway, which handles provider routing, credential management, rate limiting, and cost tracking centrally.

Architecture

Instead of calling provider APIs directly (OpenAI, Bedrock, Gemini, etc.), all LLM requests go through a single gateway:
Your Module Code
        ↓
ModelManager (discovers available models)
        ↓
POST /api/ai/chat/completions (X-API-Key auth)
        ↓
Kodexa AI Gateway (routes to correct provider)
        ↓
Response (OpenAI-compatible format)
This means your module code never needs provider-specific API keys or SDKs. The platform manages all credentials centrally.

Quick Start

from kodexa_document.llm import ModelManager, ChatMessage

# Get the singleton manager
manager = ModelManager()

# Fetch a model by name
model = manager.get_model("gpt-4o")

# Build messages
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is 2+2?"),
]

# Invoke
response_text, thinking_output, metrics = model.invoke(messages)

print(f"Response: {response_text}")
print(f"Tokens: {metrics.input_tokens} in, {metrics.output_tokens} out")

ModelManager

ModelManager is a singleton that discovers available models from the platform at runtime. On first use, it queries the platform’s cloud-models API and creates a GatewayModelProvider for each model.

Getting Models

from kodexa_document.llm import ModelManager

manager = ModelManager()

# Get a specific model by gateway name
model = manager.get_model("gpt-4o")

# Or by provider model ID (backward compatible)
model = manager.get_model("anthropic.claude-sonnet-4-20250514-v1:0")

# List all available models
for m in manager.get_models():
    print(f"{m.name} ({m.get_model_id()})")
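
Note that get_model returns None when the requested name is not available on the platform (the complete module example below checks for this). A hedged sketch of a preference-ordered fallback, where the model names are examples rather than guaranteed entries:

```python
# The model names here are illustrative; availability depends on your platform.
PREFERRED_MODELS = ["gpt-4o", "anthropic.claude-sonnet-4-20250514-v1:0"]

def pick_model(manager, names=PREFERRED_MODELS):
    """Return the first available model, or raise if none resolve."""
    for name in names:
        model = manager.get_model(name)
        if model is not None:
            return model
    raise RuntimeError(f"No available model among {names}")
```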

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `KODEXA_URL` | Yes | Platform base URL (e.g., `https://platform.kodexa.ai`) |
| `KODEXA_ACCESS_TOKEN` | Yes | API key or access token |
| `KODEXA_LLM_READ_TIMEOUT` | No | HTTP read timeout in seconds (default: 300) |
| `KODEXA_LLM_CONNECT_TIMEOUT` | No | HTTP connect timeout in seconds (default: 30) |
When running inside a Kodexa module execution, KODEXA_URL and KODEXA_ACCESS_TOKEN are automatically set by the platform. You do not need to configure them manually.
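
For local development you export these variables yourself. A minimal sketch of how a client might resolve the timeout variables with the documented defaults (the actual kodexa_document internals may differ):

```python
import os

def resolve_llm_timeouts(env=None):
    """Read the timeout variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    read_timeout = float(env.get("KODEXA_LLM_READ_TIMEOUT", "300"))
    connect_timeout = float(env.get("KODEXA_LLM_CONNECT_TIMEOUT", "30"))
    return read_timeout, connect_timeout
```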

ChatMessage

Represents a message in a conversation with an LLM.
from kodexa_document.llm import ChatMessage

# Text message
msg = ChatMessage(role="user", content="Analyze this document")

# System instruction
system = ChatMessage(role="system", content="You are a document analyst")

# Image message (multimodal)
with open("page.jpg", "rb") as f:
    image_msg = ChatMessage(
        role="user",
        content=f.read(),
        media_type="image/jpeg"
    )

Fields

| Field | Type | Description |
| --- | --- | --- |
| `role` | str | Message role: `"user"`, `"assistant"`, or `"system"` |
| `content` | str \| bytes | Text content or image bytes |
| `media_type` | str | MIME type for image content (e.g., `"image/jpeg"`) |
| `page_key` | int | Optional page reference for image content |
| `cache_control` | dict | Optional cache control directives |

Invoking Models

Basic Invocation

invoke() sends messages and returns the response synchronously.
response_text, thinking_output, metrics = model.invoke(
    messages=[
        ChatMessage(role="user", content="Summarize this text: ...")
    ],
    note="Document summary for invoice processing"
)
Parameters:
| Parameter | Type | Description |
| --- | --- | --- |
| `messages` | List[ChatMessage] | The conversation messages |
| `note` | str | Optional label for cost tracking |
| `enable_thinking_mode` | bool | Enable extended thinking (default: False) |
Returns: Tuple[str, Optional[str], LLMUsageMetrics]
  • str — The response text
  • Optional[str] — Thinking output (if thinking mode enabled and supported)
  • LLMUsageMetrics — Token usage and timing

Async Invocation

response_text, thinking_output, metrics = await model.ainvoke(
    messages=[ChatMessage(role="user", content="Hello")],
    note="Async greeting"
)
Async invocation requires the httpx package. Install it with: pip install httpx
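When you need several completions, async invocation lets you fan the requests out concurrently. A sketch of the pattern with a stand-in coroutine in place of model.ainvoke (the real call takes messages and returns the same (text, thinking, metrics) tuple shape):

```python
import asyncio

async def fake_ainvoke(prompt):
    # Stand-in for model.ainvoke(messages, note=...); mirrors the
    # (response_text, thinking_output, metrics) return shape.
    await asyncio.sleep(0)
    return f"echo: {prompt}", None, None

async def invoke_all(prompts):
    # gather() runs the requests concurrently instead of one at a time.
    return await asyncio.gather(*(fake_ainvoke(p) for p in prompts))

results = asyncio.run(invoke_all(["Hello", "World"]))
```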

Streaming

stream_invoke() yields text chunks as they arrive from the model, useful for real-time display.
for chunk in model.stream_invoke(
    messages=[ChatMessage(role="user", content="Write a long analysis...")],
    note="Streaming analysis"
):
    print(chunk, end="", flush=True)
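
If you also need the full text after streaming finishes, accumulate the chunks as you display them. A small helper (chunks can be any iterable of strings, such as the output of stream_invoke):

```python
def collect_stream(chunks, echo=False):
    """Join streamed text chunks, optionally echoing them as they arrive."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)
```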

Thinking Mode

Some models (Claude 3.7+, Gemini 2.5+) support extended thinking, where the model shows its reasoning process.
response, thinking, metrics = model.invoke(
    messages=[ChatMessage(role="user", content="Solve this complex problem...")],
    enable_thinking_mode=True
)

if thinking:
    print(f"Reasoning: {thinking}")
print(f"Answer: {response}")

Function Calling / Structured Output

Use invoke_function() to extract structured data using a JSON schema. The model is instructed to call a function with arguments matching your schema.
schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string", "description": "The vendor's name"},
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        }
    },
    "required": ["vendor_name", "invoice_number", "total_amount"]
}

result, metrics = model.invoke_function(
    messages=[
        ChatMessage(role="user", content=f"Extract invoice data from:\n{document_text}")
    ],
    schema=schema,
    note="Invoice data extraction"
)

# result is a parsed dict matching your schema
print(f"Vendor: {result['vendor_name']}")
print(f"Total: ${result['total_amount']:.2f}")
An async version is also available:
result, metrics = await model.ainvoke_function(messages, schema)
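
Function calling normally returns arguments matching the schema, but a light validation pass before trusting the result is prudent. A minimal sketch that checks the schema's required keys:

```python
def check_required(result, schema):
    """Raise if the model omitted any field the schema marks as required."""
    missing = [key for key in schema.get("required", []) if key not in result]
    if missing:
        raise ValueError(f"Model output missing required fields: {missing}")
    return result
```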

Multimodal Input

Send images alongside text for visual document analysis:
import base64

# From file
with open("invoice_page1.jpg", "rb") as f:
    image_bytes = f.read()

messages = [
    ChatMessage(role="system", content="You are a document analysis expert."),
    ChatMessage(
        role="user",
        content=image_bytes,
        media_type="image/jpeg"
    ),
    ChatMessage(role="user", content="Extract the total amount from this invoice page."),
]

response, _, metrics = model.invoke(messages, note="Visual invoice extraction")
Images are automatically base64-encoded and sent in the OpenAI multimodal format.
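As a rough illustration of what that encoding produces, here is a sketch of an OpenAI-style image content part built from raw bytes (the gateway's exact payload may differ):

```python
import base64

def to_image_part(image_bytes, media_type):
    """Build an OpenAI-style image_url content part from raw image bytes."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{encoded}"},
    }
```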

LLMUsageMetrics

Every invocation returns usage metrics for cost tracking and monitoring.
@dataclass
class LLMUsageMetrics:
    model_id: str       # Provider model ID
    input_tokens: int   # Input token count
    output_tokens: int  # Output token count
    duration_ms: int    # Total invocation duration

    def to_dict(self) -> dict:
        """Convert to dictionary for ProcessingStep metadata."""

Cost Tracking

Token usage is automatically recorded in the platform’s model interaction system. Use the note parameter to label interactions for billing visibility:
response, _, metrics = model.invoke(
    messages,
    note=f"Classification for task {task_id}"
)

# Metrics are also recorded by the platform for billing
print(f"Used {metrics.input_tokens + metrics.output_tokens} total tokens")

Capability Checking

Check what a model supports before calling specialized methods:
model = manager.get_model("gpt-4o")

if model.is_function_capable():
    result, metrics = model.invoke_function(messages, schema)

if model.is_thinking_mode_capable():
    response, thinking, metrics = model.invoke(messages, enable_thinking_mode=True)

# Model properties
print(f"Max tokens: {model.get_max_tokens()}")
print(f"Context window: {model.context_window}")
print(f"Model ID: {model.get_model_id()}")

Complete Module Example

Here’s how to use the LLM module in a Kodexa processing module:
import logging
from kodexa_document.llm import ModelManager, ChatMessage

logger = logging.getLogger(__name__)


def infer(document, document_family=None, status_reporter=None):
    """Classify and summarize a document using the AI Gateway."""

    if status_reporter:
        status_reporter.update("Loading AI model", status_type="thinking")

    # Get an LLM model
    manager = ModelManager()
    model = manager.get_model("gpt-4o")

    if not model:
        logger.error("No LLM model available")
        return document

    # Extract document text
    root = document.content_node
    text = root.get_all_content(separator=" ") if root else ""

    if not text:
        logger.warning("Document has no text content")
        return document

    if status_reporter:
        status_reporter.update("Classifying document", status_type="analyzing")

    # Classify the document
    classification_schema = {
        "type": "object",
        "properties": {
            "document_type": {
                "type": "string",
                "enum": ["invoice", "contract", "receipt", "letter", "other"]
            },
            "confidence": {"type": "number"},
            "summary": {"type": "string"}
        },
        "required": ["document_type", "confidence", "summary"]
    }

    result, metrics = model.invoke_function(
        messages=[
            ChatMessage(
                role="system",
                content="Classify this document and provide a brief summary."
            ),
            ChatMessage(
                role="user",
                content=text[:5000]  # First 5000 chars
            ),
        ],
        schema=classification_schema,
        note=f"Document classification: {document_family.name if document_family else 'unknown'}"
    )

    # Store results in document metadata
    document.set_metadata("ai_document_type", result["document_type"])
    document.set_metadata("ai_confidence", result["confidence"])
    document.set_metadata("ai_summary", result["summary"])
    document.add_label(f"type-{result['document_type']}")

    logger.info(
        f"Classified as {result['document_type']} "
        f"(confidence: {result['confidence']:.2f}, "
        f"tokens: {metrics.input_tokens + metrics.output_tokens})"
    )

    return document