The extraction subsystem converts tagged document content into structured data objects and attributes using taxonomy definitions. It runs in the Go core via CFFI for performance.
Core Concepts
- Taxonomy — Defines the structure of data to extract (groups, fields, types)
- ExtractionEngine — Processes a document against taxonomies to produce data objects
- DataObject — A structured record extracted from the document (e.g., an invoice header)
- DataAttribute — A field within a data object (e.g., invoice number, date, total)
- DataException — A validation error on a data object or attribute
Taxonomy
A Taxonomy wraps a Go-side taxonomy handle for use with the extraction engine. Create one from a dict or a JSON file path:
from kodexa_document import Taxonomy

# From a dictionary
taxonomy = Taxonomy(taxonomy_data={
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"}
            ]
        }
    ]
})

# From a JSON file
taxonomy = Taxonomy(taxonomy_path="/path/to/taxonomy.json")
Taxonomies implement the context manager protocol and should be closed when no longer needed:
with Taxonomy(taxonomy_data=data) as taxonomy:
    # Use taxonomy
    is_valid = taxonomy.validate()
    json_str = taxonomy.to_json()
# Automatically freed
If you don’t use a context manager, the Go handle is still cleaned up automatically via weakref.finalize. A with block is still recommended because it releases the handle deterministically, at the moment the block exits.
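Before handing a taxonomy dictionary to the engine, it can help to list which tags it expects. The sketch below walks the plain dictionary shape shown earlier in this section (taxons, children, tag, taxonType) and does not call the library at all. The helper name iter_tagged_taxons and the '/'-joined path it builds are illustrative assumptions, not part of kodexa_document.

```python
from typing import Iterator, List, Tuple


def iter_tagged_taxons(taxons: List[dict], prefix: str = "") -> Iterator[Tuple[str, str, str]]:
    """Yield (path, tag, taxonType) for every taxon that carries a tag.

    The path is built by joining taxon names with '/'. That format is an
    assumption made for this sketch, not necessarily the library's own.
    """
    for taxon in taxons:
        path = f"{prefix}/{taxon['name']}" if prefix else taxon["name"]
        if "tag" in taxon:
            yield path, taxon["tag"], taxon.get("taxonType", "")
        # Groups nest their fields under "children"; recurse into them.
        yield from iter_tagged_taxons(taxon.get("children", []), path)


taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ],
        }
    ],
}

tagged = list(iter_tagged_taxons(taxonomy_data["taxons"]))
for path, tag, taxon_type in tagged:
    print(f"{path}: tag={tag} type={taxon_type}")
```

A listing like this is a quick way to confirm that the tags applied to a document match the tags the taxonomy is looking for.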
ExtractionEngine
The ExtractionEngine processes a document against one or more taxonomies to extract structured data.
Constructor
from kodexa_document import Document, ExtractionEngine, Taxonomy

# taxonomy_dict is a taxonomy definition dict, like the one shown in the Taxonomy section
with Document.from_kddb("tagged_invoice.kddb") as doc:
    taxonomy = Taxonomy(taxonomy_data=taxonomy_dict)
    engine = ExtractionEngine(
        document=doc,
        taxonomies=[taxonomy],
        owner_uri="model://my-org/invoice-model"  # optional
    )
| Parameter | Type | Description |
|---|---|---|
| document | Document | The document to extract from |
| taxonomies | List[Union[Taxonomy, dict]] | One or more taxonomies (dicts are auto-wrapped) |
| owner_uri | Optional[str] | URI identifying the extraction owner (defaults to "model://default") |
process_and_save
Runs extraction and persists the results (data objects, attributes, exceptions) into the document’s SQLite store:
with ExtractionEngine(doc, [taxonomy]) as engine:
    count = engine.process_and_save()
    print(f"Extracted {count} data objects")
Content Exceptions
After extraction, retrieve any content-level exceptions:
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    exceptions = engine.get_content_exceptions()
    for exc in exceptions:
        print(f"Exception: {exc.message} at {exc.tag}")
Document Taxon Validations
Check which taxons were found or missing in the document:
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    validations = engine.get_document_taxon_validations()
    for v in validations:
        print(f"{v.taxon_path}: {v.validation}")
DataObject
Represents a structured record extracted from the document. Data objects are created by the extraction engine and accessed via DataObjectAccessor.
| Field | Type | Description |
|---|---|---|
| id | int | Database ID |
| uuid | str | Unique identifier |
| name | str | Display name |
| type | str | Object type |
| taxonomy_ref | str | Reference to the source taxonomy |
| group_path | str | Path in the taxonomy group hierarchy |
| group_uuid | str | UUID of the group this belongs to |
| parent_group_uuid | str | UUID of the parent group |
| parent_id | int | Parent data object ID |
| attributes | List[DataAttribute] | Child attributes |
| data_exceptions | List[DataException] | Validation errors |
| children | List[DataObject] | Child data objects |
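Because each data object can hold further data objects in children, consumers usually need a recursive walk. The following is a minimal sketch of a depth-first traversal. DataObjectStub is a hypothetical stand-in that mirrors only three of the fields listed above, and the '/'-separated group_path values are illustrative assumptions rather than the format the library produces.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DataObjectStub:
    """Hypothetical stand-in exposing a few fields of a data object."""
    name: str
    group_path: str
    children: List["DataObjectStub"] = field(default_factory=list)


def walk(obj: DataObjectStub, depth: int = 0) -> Iterator[Tuple[int, DataObjectStub]]:
    """Yield (depth, object) pairs depth-first, parent before children."""
    yield depth, obj
    for child in obj.children:
        yield from walk(child, depth + 1)


# Illustrative data: one invoice with two nested line-item objects.
invoice = DataObjectStub(
    name="Invoice",
    group_path="Invoice",
    children=[
        DataObjectStub(name="Line Item", group_path="Invoice/Line Item"),
        DataObjectStub(name="Line Item", group_path="Invoice/Line Item"),
    ],
)

visited = [(depth, obj.group_path) for depth, obj in walk(invoice)]
for depth, group_path in visited:
    print("  " * depth + group_path)
```

The same pattern applies to any structure that exposes a children list, so it should carry over to real data objects once they are loaded.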
DataAttribute
A field value within a data object.
| Field | Type | Description |
|---|---|---|
| uuid | str | Unique identifier |
| data_object_id | int | Parent data object ID |
| path | str | Taxonomy path of this attribute |
| tag | str | Tag name used for extraction |
| tag_uuid | str | UUID of the source tag |
| value | Any | The extracted value |
| confidence | float | Extraction confidence (0.0 - 1.0) |
| taxonomy_ref | str | Source taxonomy reference |
| taxon_type | str | Data type (STRING, DECIMAL, DATE, etc.) |
| node_uuid | str | UUID of the content node this was extracted from |
| source | str | Source identifier |
| manual | bool | Whether this was manually entered |
| data_exceptions | List[DataException] | Validation errors |
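The confidence and manual fields lend themselves to a simple review queue: flag attributes that were extracted automatically and fall below a chosen confidence threshold. The sketch below works on plain dictionaries keyed by the field names above. The helper needs_review, the 0.8 threshold, and the sample values are all assumptions made for illustration, not library behavior.

```python
from typing import Iterable, List


def needs_review(attributes: Iterable[dict], threshold: float = 0.8) -> List[dict]:
    """Return automatically extracted attributes whose confidence is missing or below threshold."""
    flagged = []
    for attr in attributes:
        if attr.get("manual"):
            continue  # manually entered values are treated as trusted in this sketch
        confidence = attr.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(attr)
    return flagged


# Illustrative records; paths and values are made up for the example.
sample_attributes = [
    {"path": "Invoice/Invoice Number", "value": "INV-001", "confidence": 0.97, "manual": False},
    {"path": "Invoice/Date", "value": "2024-01-15", "confidence": 0.55, "manual": False},
    {"path": "Invoice/Total", "value": "120.00", "confidence": None, "manual": True},
]

flagged = needs_review(sample_attributes)
for attr in flagged:
    print(f"Review {attr['path']}: {attr['value']} (confidence: {attr['confidence']})")
```

Treating a missing confidence as "needs review" is a conservative design choice; a pipeline that trusts unscored values would drop the None check.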
DataException
A validation error attached to a data object or attribute.
| Field | Type | Description |
|---|---|---|
| id | int | Database ID |
| uuid | str | Unique identifier |
| message | str | Human-readable error message |
| exception_type | str | Type classification |
| severity | str | Severity level |
| path | str | Taxonomy path where the error occurred |
| open | bool | Whether this exception is still open |
| data_object_id | int | Related data object |
| data_attribute_id | int | Related data attribute |
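For reporting, it is often enough to group the exceptions that are still open by severity. This is a minimal sketch over plain dictionaries that use the field names above. The helper name, the "UNKNOWN" fallback, and the severity strings "ERROR" and "WARNING" are hypothetical placeholders, since the actual set of severity values is not listed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def open_exceptions_by_severity(exceptions: Iterable[dict]) -> Dict[str, List[str]]:
    """Group the messages of open exceptions under their severity value."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for exc in exceptions:
        if exc.get("open"):
            grouped[exc.get("severity", "UNKNOWN")].append(exc["message"])
    return dict(grouped)


# Illustrative records; severity values are placeholders, not documented constants.
sample_exceptions = [
    {"message": "Total is missing", "severity": "ERROR", "path": "Invoice/Total", "open": True},
    {"message": "Date format is ambiguous", "severity": "WARNING", "path": "Invoice/Date", "open": True},
    {"message": "Invoice number too short", "severity": "ERROR", "path": "Invoice/Invoice Number", "open": False},
]

summary = open_exceptions_by_severity(sample_exceptions)
for severity, messages in summary.items():
    print(f"{severity}: {len(messages)} open")
```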
DocumentTaxonValidation
Reports whether a specific taxon was found during extraction.
| Field | Type | Description |
|---|---|---|
| taxonomy_ref | str | The taxonomy reference |
| taxon_path | str | Path of the taxon |
| validation | str | Validation result |
Complete Example
from kodexa_document import (
    Document, ExtractionEngine, Taxonomy,
    DataObjectAccessor, DataAttributeAccessor
)

# Define taxonomy
taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Invoice",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ]
        }
    ]
}

with Document.from_kddb("tagged_invoice.kddb") as doc:
    # Create taxonomy and run extraction
    with Taxonomy(taxonomy_data=taxonomy_data) as taxonomy:
        with ExtractionEngine(doc, [taxonomy], owner_uri="model://my-org/invoice-model") as engine:
            count = engine.process_and_save()
            print(f"Extracted {count} data objects")

            # Check for exceptions
            exceptions = engine.get_content_exceptions()
            for exc in exceptions:
                print(f"Warning: {exc.message}")

            # Check validations
            validations = engine.get_document_taxon_validations()
            for v in validations:
                print(f" {v.taxon_path}: {v.validation}")

    # Access the extracted data
    obj_accessor = DataObjectAccessor(doc)
    attr_accessor = DataAttributeAccessor(doc)
    for obj in obj_accessor.get_all():
        print(f"\nData Object: {obj.get('name')} ({obj.get('uuid')})")
        attrs = attr_accessor.get_for_data_object(obj["id"])
        for attr in attrs:
            print(f" {attr.get('path')}: {attr.get('value')} "
                  f"(confidence: {attr.get('confidence', 'N/A')})")

    # Save the document with extracted data
    doc.save("extracted_invoice.kddb")