The LLM module provides a unified interface for calling large language models from your Python modules. All LLM calls are routed through the Kodexa AI Gateway, which handles provider routing, credential management, rate limiting, and cost tracking centrally.
Architecture
Instead of calling provider APIs directly (OpenAI, Bedrock, Gemini, etc.), all LLM requests go through a single gateway:
Your Module Code
↓
ModelManager (discovers available models)
↓
POST /api/ai/chat/completions (X-API-Key auth)
↓
Kodexa AI Gateway (routes to correct provider)
↓
Response (OpenAI-compatible format)
This means your module code never needs provider-specific API keys or SDKs. The platform manages all credentials centrally.
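To make the flow concrete, here is a sketch of what a gateway request could look like if assembled by hand. The endpoint path and `X-API-Key` header come from the diagram above; the payload follows the standard OpenAI chat format, and the URL and token values are placeholders:

```python
import json

def build_gateway_request(base_url: str, api_key: str, model: str, messages: list) -> tuple:
    """Assemble the URL, headers, and JSON body for a gateway chat call."""
    url = f"{base_url}/api/ai/chat/completions"
    headers = {
        "X-API-Key": api_key,  # platform credential, never a provider key
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages})
    return url, headers, body

url, headers, body = build_gateway_request(
    "https://platform.kodexa.ai", "my-token", "gpt-4o",
    [{"role": "user", "content": "What is 2+2?"}],
)
```

In practice you never build this request yourself; `ModelManager` and the model providers do it for you.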
Quick Start
from kodexa_document.llm import ModelManager, ChatMessage
# Get the singleton manager
manager = ModelManager()
# Fetch a model by name
model = manager.get_model("gpt-4o")
# Build messages
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="What is 2+2?"),
]
# Invoke
response_text, thinking_output, metrics = model.invoke(messages)
print(f"Response: {response_text}")
print(f"Tokens: {metrics.input_tokens} in, {metrics.output_tokens} out")
ModelManager
ModelManager is a singleton that discovers available models from the platform at runtime. On first use, it queries the platform’s cloud-models API and creates a GatewayModelProvider for each model.
Getting Models
from kodexa_document.llm import ModelManager
manager = ModelManager()
# Get a specific model by gateway name
model = manager.get_model("gpt-4o")
# Or by provider model ID (backward compatible)
model = manager.get_model("anthropic.claude-sonnet-4-20250514-v1:0")
# List all available models
for m in manager.get_models():
    print(f"{m.name} ({m.get_model_id()})")
Environment Variables
| Variable | Required | Description |
|---|---|---|
| KODEXA_URL | Yes | Platform base URL (e.g., https://platform.kodexa.ai) |
| KODEXA_ACCESS_TOKEN | Yes | API key or access token |
| KODEXA_LLM_READ_TIMEOUT | No | HTTP read timeout in seconds (default: 300) |
| KODEXA_LLM_CONNECT_TIMEOUT | No | HTTP connect timeout in seconds (default: 30) |
When running inside a Kodexa module execution, KODEXA_URL and KODEXA_ACCESS_TOKEN are automatically set by the platform. You do not need to configure them manually.
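The defaults in the table can be expressed as a small resolution helper. This is a sketch of the documented behavior, not the library's actual internals; the function name is illustrative:

```python
import os

def resolve_gateway_config(env=None) -> dict:
    """Read gateway settings from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    url = env.get("KODEXA_URL")
    token = env.get("KODEXA_ACCESS_TOKEN")
    if not url or not token:
        raise RuntimeError("KODEXA_URL and KODEXA_ACCESS_TOKEN must be set")
    return {
        "url": url.rstrip("/"),
        "token": token,
        "read_timeout": float(env.get("KODEXA_LLM_READ_TIMEOUT", "300")),
        "connect_timeout": float(env.get("KODEXA_LLM_CONNECT_TIMEOUT", "30")),
    }

cfg = resolve_gateway_config({
    "KODEXA_URL": "https://platform.kodexa.ai/",
    "KODEXA_ACCESS_TOKEN": "my-token",
})
```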
ChatMessage
Represents a message in a conversation with an LLM.
from kodexa_document.llm import ChatMessage
# Text message
msg = ChatMessage(role="user", content="Analyze this document")
# System instruction
system = ChatMessage(role="system", content="You are a document analyst")
# Image message (multimodal)
with open("page.jpg", "rb") as f:
    image_msg = ChatMessage(
        role="user",
        content=f.read(),
        media_type="image/jpeg"
    )
Fields
| Field | Type | Description |
|---|---|---|
| role | str | Message role: "user", "assistant", or "system" |
| content | str \| bytes | Text content or image bytes |
| media_type | str | MIME type for image content (e.g., "image/jpeg") |
| page_key | int | Optional page reference for image content |
| cache_control | dict | Optional cache control directives |
Invoking Models
Basic Invocation
invoke() sends messages and returns the response synchronously.
response_text, thinking_output, metrics = model.invoke(
    messages=[
        ChatMessage(role="user", content="Summarize this text: ...")
    ],
    note="Document summary for invoice processing"
)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| messages | List[ChatMessage] | The conversation messages |
| note | str | Optional label for cost tracking |
| enable_thinking_mode | bool | Enable extended thinking (default: False) |
Returns: Tuple[str, Optional[str], LLMUsageMetrics]
- str — The response text
- Optional[str] — Thinking output (if thinking mode enabled and supported)
- LLMUsageMetrics — Token usage and timing
Async Invocation
response_text, thinking_output, metrics = await model.ainvoke(
    messages=[ChatMessage(role="user", content="Hello")],
    note="Async greeting"
)
Async invocation requires the httpx package. Install it with: pip install httpx
Streaming
stream_invoke() yields text chunks as they arrive from the model, useful for real-time display.
for chunk in model.stream_invoke(
    messages=[ChatMessage(role="user", content="Write a long analysis...")],
    note="Streaming analysis"
):
    print(chunk, end="", flush=True)
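If you also need the complete response afterwards, accumulate the chunks as you display them. A minimal sketch, with a stand-in generator in place of `stream_invoke()`:

```python
def fake_stream():
    """Stand-in for model.stream_invoke(), yielding text chunks."""
    yield from ["The total ", "amount is ", "$1,250.00."]

chunks = []
for chunk in fake_stream():
    print(chunk, end="", flush=True)  # real-time display
    chunks.append(chunk)              # keep for later use

full_text = "".join(chunks)
```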
Thinking Mode
Some models (Claude 3.7+, Gemini 2.5+) support extended thinking, where the model shows its reasoning process.
response, thinking, metrics = model.invoke(
    messages=[ChatMessage(role="user", content="Solve this complex problem...")],
    enable_thinking_mode=True
)
if thinking:
    print(f"Reasoning: {thinking}")
print(f"Answer: {response}")
Function Calling / Structured Output
Use invoke_function() to extract structured data using a JSON schema. The model is instructed to call a function with arguments matching your schema.
schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string", "description": "The vendor's name"},
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"}
                }
            }
        }
    },
    "required": ["vendor_name", "invoice_number", "total_amount"]
}
result, metrics = model.invoke_function(
    messages=[
        ChatMessage(role="user", content=f"Extract invoice data from:\n{document_text}")
    ],
    schema=schema,
    note="Invoice data extraction"
)
# result is a parsed dict matching your schema
print(f"Vendor: {result['vendor_name']}")
print(f"Total: ${result['total_amount']:.2f}")
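Because the parsed dict ultimately comes from a model, it can be worth checking the schema's `required` fields before trusting it. A hedged sketch (the helper name is illustrative, not part of the library):

```python
def check_required(result: dict, schema: dict) -> list:
    """Return the names of required schema properties missing from the result."""
    return [key for key in schema.get("required", []) if key not in result]

schema = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["vendor_name", "total_amount"],
}

# A result missing a required field is flagged before downstream use
missing = check_required({"vendor_name": "Acme"}, schema)
```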
An async version is also available:
result, metrics = await model.ainvoke_function(messages, schema)
Multimodal Messages
Send images alongside text for visual document analysis:
# From file
with open("invoice_page1.jpg", "rb") as f:
    image_bytes = f.read()
messages = [
    ChatMessage(role="system", content="You are a document analysis expert."),
    ChatMessage(
        role="user",
        content=image_bytes,
        media_type="image/jpeg"
    ),
    ChatMessage(role="user", content="Extract the total amount from this invoice page."),
]
response, _, metrics = model.invoke(messages, note="Visual invoice extraction")
Images are automatically base64-encoded and sent in the OpenAI multimodal format.
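The encoding step the library performs for you looks roughly like this. A sketch of the OpenAI multimodal wire format; the exact serialization inside `ChatMessage` may differ:

```python
import base64

def to_multimodal_part(image_bytes: bytes, media_type: str) -> dict:
    """Encode raw image bytes as an OpenAI-style image_url content part."""
    data = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{media_type};base64,{data}"},
    }

# Tiny placeholder bytes stand in for real JPEG data
part = to_multimodal_part(b"\xff\xd8\xff", "image/jpeg")
```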
LLMUsageMetrics
Every invocation returns usage metrics for cost tracking and monitoring.
@dataclass
class LLMUsageMetrics:
    model_id: str       # Provider model ID
    input_tokens: int   # Input token count
    output_tokens: int  # Output token count
    duration_ms: int    # Total invocation duration

    def to_dict(self) -> dict:
        """Convert to dictionary for ProcessingStep metadata."""
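Given the fields above, `to_dict()` can be thought of as a plain field-to-key mapping. A self-contained sketch using `dataclasses.asdict` (the real implementation may differ):

```python
from dataclasses import dataclass, asdict

@dataclass
class LLMUsageMetrics:
    model_id: str
    input_tokens: int
    output_tokens: int
    duration_ms: int

    def to_dict(self) -> dict:
        """Convert to dictionary for ProcessingStep metadata."""
        return asdict(self)

metrics = LLMUsageMetrics("gpt-4o", 120, 45, 830)
```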
Cost Tracking
Token usage is automatically recorded in the platform’s model interaction system. Use the note parameter to label interactions for billing visibility:
response, _, metrics = model.invoke(
    messages,
    note=f"Classification for task {task_id}"
)
# Metrics are also recorded by the platform for billing
print(f"Used {metrics.input_tokens + metrics.output_tokens} total tokens")
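For pipelines that make several calls, you can also tally usage locally before logging it. An illustrative helper (the platform records usage regardless, so this is purely for your own reporting); `SimpleNamespace` stands in for real `LLMUsageMetrics` objects:

```python
from types import SimpleNamespace

class UsageTally:
    """Accumulate token counts across multiple model invocations."""
    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0

    def add(self, metrics) -> None:
        """Fold one invocation's metrics into the running totals."""
        self.input_tokens += metrics.input_tokens
        self.output_tokens += metrics.output_tokens

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens

tally = UsageTally()
tally.add(SimpleNamespace(input_tokens=120, output_tokens=45))
tally.add(SimpleNamespace(input_tokens=80, output_tokens=30))
```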
Capability Checking
Check what a model supports before calling specialized methods:
model = manager.get_model("gpt-4o")

if model.is_function_capable():
    result, metrics = model.invoke_function(messages, schema)

if model.is_thinking_mode_capable():
    response, thinking, metrics = model.invoke(messages, enable_thinking_mode=True)

# Model properties
print(f"Max tokens: {model.get_max_tokens()}")
print(f"Context window: {model.context_window}")
print(f"Model ID: {model.get_model_id()}")
Complete Module Example
Here’s how to use the LLM module in a Kodexa processing module:
import logging
from kodexa_document.llm import ModelManager, ChatMessage
logger = logging.getLogger(__name__)
def infer(document, document_family=None, status_reporter=None):
    """Classify and summarize a document using the AI Gateway."""
    if status_reporter:
        status_reporter.update("Loading AI model", status_type="thinking")

    # Get an LLM model
    manager = ModelManager()
    model = manager.get_model("gpt-4o")
    if not model:
        logger.error("No LLM model available")
        return document

    # Extract document text
    root = document.content_node
    text = root.get_all_content(separator=" ") if root else ""
    if not text:
        logger.warning("Document has no text content")
        return document

    if status_reporter:
        status_reporter.update("Classifying document", status_type="analyzing")

    # Classify the document
    classification_schema = {
        "type": "object",
        "properties": {
            "document_type": {
                "type": "string",
                "enum": ["invoice", "contract", "receipt", "letter", "other"]
            },
            "confidence": {"type": "number"},
            "summary": {"type": "string"}
        },
        "required": ["document_type", "confidence", "summary"]
    }

    result, metrics = model.invoke_function(
        messages=[
            ChatMessage(
                role="system",
                content="Classify this document and provide a brief summary."
            ),
            ChatMessage(
                role="user",
                content=text[:5000]  # First 5000 chars
            ),
        ],
        schema=classification_schema,
        note=f"Document classification: {document_family.name if document_family else 'unknown'}"
    )

    # Store results in document metadata
    document.set_metadata("ai_document_type", result["document_type"])
    document.set_metadata("ai_confidence", result["confidence"])
    document.set_metadata("ai_summary", result["summary"])
    document.add_label(f"type-{result['document_type']}")

    logger.info(
        f"Classified as {result['document_type']} "
        f"(confidence: {result['confidence']:.2f}, "
        f"tokens: {metrics.input_tokens + metrics.output_tokens})"
    )

    return document