The extraction subsystem converts tagged document content into structured data objects and attributes using taxonomy definitions. The extraction itself runs in the Go core, which Python calls through CFFI for performance.
Core Concepts
- Taxonomy — Defines the structure of data to extract (groups, fields, types)
- ExtractionEngine — Processes a document against taxonomies to produce data objects
- DataObject — A structured record extracted from the document (e.g., an invoice header)
- DataAttribute — A field within a data object (e.g., invoice number, date, total)
- DataException — A validation error on a data object or attribute
Taxonomy
A Taxonomy wraps a Go-side taxonomy handle for use with the extraction engine. Create one from a dict or a JSON file path:
from kodexa_document import Taxonomy

# From a dictionary
taxonomy = Taxonomy(taxonomy_data={
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"}
            ]
        }
    ]
})

# From a JSON file
taxonomy = Taxonomy(taxonomy_path="/path/to/taxonomy.json")
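Before handing a taxonomy dict to the engine, it can help to list which tags it expects and what type each one maps to. The following is a minimal sketch that works on the plain dict only and does not call the library; the helper name `iter_fields` and the slash-joined path format are assumptions made for illustration, not part of `kodexa_document`.

```python
from typing import Iterator, Tuple


def iter_fields(taxons: list, prefix: str = "") -> Iterator[Tuple[str, str, str]]:
    """Yield (path, tag, taxonType) for every non-GROUP taxon, depth first."""
    for taxon in taxons:
        # Slash-joined names are an illustrative path format, not the library's.
        path = f"{prefix}/{taxon['name']}" if prefix else taxon["name"]
        if taxon.get("taxonType") == "GROUP":
            yield from iter_fields(taxon.get("children", []), path)
        else:
            yield path, taxon.get("tag", ""), taxon.get("taxonType", "")


taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ],
        }
    ],
}

fields = list(iter_fields(taxonomy_data["taxons"]))
tag_types = {tag: taxon_type for _, tag, taxon_type in fields}

for path, tag, taxon_type in fields:
    print(f"{path}: tag={tag} type={taxon_type}")
```

A listing like this makes it easy to compare the tags in the taxonomy against the tags actually applied to the document before running extraction.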
Taxonomies implement the context manager protocol and should be closed when no longer needed:
with Taxonomy(taxonomy_data=data) as taxonomy:
    # Use taxonomy
    is_valid = taxonomy.validate()
    json_str = taxonomy.to_json()
# Automatically freed
If you don't use a context manager, the Go handle is still released automatically via weakref.finalize, but a with block is recommended because it frees the handle at a predictable point.
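The general pattern of combining a context manager with `weakref.finalize` can be sketched with the standard library alone. This is not the library's implementation: `NativeResource`, `_free_handle` and the `released` list are hypothetical stand-ins for a wrapper around a native handle.

```python
import gc
import weakref

released = []  # records which fake handles were freed, for demonstration only


def _free_handle(handle_id: int) -> None:
    """Stand-in for the native call that frees a handle."""
    released.append(handle_id)


class NativeResource:
    """Hypothetical wrapper around a native handle (not the Taxonomy class)."""

    def __init__(self, handle_id: int) -> None:
        self.handle_id = handle_id
        # The callback must not reference self, otherwise the object
        # could never become unreachable.
        self._finalizer = weakref.finalize(self, _free_handle, handle_id)

    def close(self) -> None:
        # Calling the finalizer runs the callback at most once.
        self._finalizer()

    @property
    def closed(self) -> bool:
        return not self._finalizer.alive

    def __enter__(self) -> "NativeResource":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


with NativeResource(1) as res:
    print("inside block, closed:", res.closed)
print("after block, closed:", res.closed)

res.close()  # safe to call again; the handle is not freed twice

other = NativeResource(2)
del other     # no with block: freed whenever the object is collected
gc.collect()  # collection timing depends on the interpreter
```

The with block gives a known release point, while the finalizer acts as a safety net for objects that are simply dropped.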
ExtractionEngine
The ExtractionEngine processes a document against one or more taxonomies to extract structured data.
Constructor
from kodexa_document import Document, ExtractionEngine, Taxonomy

with Document.from_kddb("tagged_invoice.kddb") as doc:
    taxonomy = Taxonomy(taxonomy_data=taxonomy_dict)
    engine = ExtractionEngine(
        document=doc,
        taxonomies=[taxonomy],
        owner_uri="model://my-org/invoice-model"  # optional
    )
| Parameter | Type | Description |
|---|---|---|
| document | Document | The document to extract from |
| taxonomies | List[Taxonomy \| dict] | One or more taxonomies (dicts are auto-wrapped) |
| owner_uri | str \| None | URI identifying the extraction owner (defaults to "model://default") |
process_and_save
Runs extraction and persists the results (data objects, attributes, exceptions) into the document’s SQLite store:
with ExtractionEngine(doc, [taxonomy]) as engine:
    count = engine.process_and_save()
    print(f"Extracted {count} data objects")
Content Exceptions
After extraction, retrieve any content-level exceptions:
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    exceptions = engine.get_content_exceptions()
    for exc in exceptions:
        print(f"Exception: {exc.message} at {exc.tag}")
Document Taxon Validations
Check which taxons were found or missing in the document:
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    validations = engine.get_document_taxon_validations()
    for v in validations:
        print(f"{v.taxon_path}: {v.validation}")
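When a taxonomy has many taxons, grouping the validation results by outcome is often easier to read than a flat listing. The sketch below uses `SimpleNamespace` objects as stand-ins for the returned validations; the `"found"` and `"missing"` strings and the path values are placeholders chosen for illustration, since the actual values of the `validation` field are not specified here.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-in results; real ones would come from
# engine.get_document_taxon_validations().
validations = [
    SimpleNamespace(taxonomy_ref="invoice", taxon_path="Header/Invoice Number", validation="found"),
    SimpleNamespace(taxonomy_ref="invoice", taxon_path="Header/Date", validation="missing"),
    SimpleNamespace(taxonomy_ref="invoice", taxon_path="Header/Total", validation="found"),
]

# Group taxon paths by whatever validation value each result carries.
by_result = defaultdict(list)
for v in validations:
    by_result[v.validation].append(v.taxon_path)

for result, paths in sorted(by_result.items()):
    print(f"{result}: {', '.join(paths)}")
```

Because the grouping keys come from the data itself, the same loop works unchanged whatever values the engine reports.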
DataObject
Represents a structured record extracted from the document. Created by the extraction engine, accessed via DataObjectAccessor.
| Field | Type | Description |
|---|---|---|
| id | int | Database ID |
| uuid | str | Unique identifier |
| name | str | Display name |
| type | str | Object type |
| taxonomy_ref | str | Reference to the source taxonomy |
| group_path | str | Path in the taxonomy group hierarchy |
| group_uuid | str | UUID of the group this belongs to |
| parent_group_uuid | str | UUID of the parent group |
| parent_id | int | Parent data object ID |
| attributes | List[DataAttribute] | Child attributes |
| data_exceptions | List[DataException] | Validation errors |
| children | List[DataObject] | Child data objects |
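
Because each data object can hold child data objects, a result set forms a tree. A minimal sketch of a depth-first walk over that tree follows; `DataObjectStub` is an illustrative stand-in that carries only a few of the fields above, and the "Line Item" group and its path format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class DataObjectStub:
    """Illustrative stand-in with a subset of the DataObject fields."""
    name: str
    group_path: str
    children: List["DataObjectStub"] = field(default_factory=list)


def walk(obj: DataObjectStub, depth: int = 0) -> Iterator[Tuple[int, DataObjectStub]]:
    """Yield (depth, object) pairs, visiting each parent before its children."""
    yield depth, obj
    for child in obj.children:
        yield from walk(child, depth + 1)


invoice = DataObjectStub(
    name="Invoice",
    group_path="Invoice",
    children=[
        DataObjectStub(name="Line Item", group_path="Invoice/Line Item"),
        DataObjectStub(name="Line Item", group_path="Invoice/Line Item"),
    ],
)

for depth, obj in walk(invoice):
    print("  " * depth + f"{obj.name} ({obj.group_path})")
```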
DataAttribute
A field value within a data object.
| Field | Type | Description |
|---|---|---|
| uuid | str | Unique identifier |
| data_object_id | int | Parent data object ID |
| path | str | Taxonomy path of this attribute |
| tag | str | Tag name used for extraction |
| tag_uuid | str | UUID of the source tag |
| value | Any | The extracted value |
| confidence | float | Extraction confidence (0.0 - 1.0) |
| taxonomy_ref | str | Source taxonomy reference |
| taxon_type | str | Data type (STRING, DECIMAL, DATE, etc.) |
| node_uuid | str | UUID of the content node this was extracted from |
| source | str | Source identifier |
| manual | bool | Whether this was manually entered |
| data_exceptions | List[DataException] | Validation errors |
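
The confidence and manual fields lend themselves to a simple review filter. The following sketch assumes attributes are available as plain dicts with the field names above; the helper `low_confidence`, the 0.8 threshold, and the sample values are illustrative choices, not library defaults.

```python
from typing import Any, Dict, List


def low_confidence(attrs: List[Dict[str, Any]], threshold: float = 0.8) -> List[Dict[str, Any]]:
    """Return non-manual attributes whose confidence is missing or below threshold."""
    flagged = []
    for attr in attrs:
        if attr.get("manual"):
            continue  # manually entered values are treated as trusted in this sketch
        confidence = attr.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(attr)
    return flagged


# Sample records; paths and values are made up for the example.
attrs = [
    {"path": "Invoice/Invoice Number", "value": "INV-001", "confidence": 0.97, "manual": False},
    {"path": "Invoice/Date", "value": "2024-01-31", "confidence": 0.55, "manual": False},
    {"path": "Invoice/Total", "value": "120.00", "confidence": None, "manual": True},
]

needs_review = [a["path"] for a in low_confidence(attrs)]
print(needs_review)
```

Treating a missing confidence as "needs review" is a conservative choice; a pipeline that trusts unscored values would skip that branch.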
DataException
A validation error attached to a data object or attribute.
| Field | Type | Description |
|---|---|---|
| id | int | Database ID |
| uuid | str | Unique identifier |
| message | str | Human-readable error message |
| exception_type | str | Type classification |
| severity | str | Severity level |
| path | str | Taxonomy path where the error occurred |
| open | bool | Whether this exception is still open |
| data_object_id | int | Related data object |
| data_attribute_id | int | Related data attribute |
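
A common follow-up is to count the exceptions that are still open, grouped by severity. This sketch assumes exceptions are available as plain dicts using the field names above; the severity strings `"ERROR"` and `"WARNING"` and the messages are placeholders, since the actual severity values are not listed here.

```python
from collections import Counter
from typing import Any, Dict, List


def open_by_severity(exceptions: List[Dict[str, Any]]) -> Counter:
    """Count exceptions whose 'open' flag is set, keyed by severity."""
    return Counter(exc["severity"] for exc in exceptions if exc.get("open"))


# Sample records; severities and messages are made up for the example.
exceptions = [
    {"message": "Total is not a valid decimal", "severity": "ERROR", "path": "Invoice/Total", "open": True},
    {"message": "Date could not be parsed", "severity": "WARNING", "path": "Invoice/Date", "open": True},
    {"message": "Invoice number missing", "severity": "ERROR", "path": "Invoice/Invoice Number", "open": False},
]

summary = open_by_severity(exceptions)
for severity, count in sorted(summary.items()):
    print(f"{severity}: {count} open")
```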
DocumentTaxonValidation
Reports whether a specific taxon was found during extraction.
| Field | Type | Description |
|---|---|---|
| taxonomy_ref | str | The taxonomy reference |
| taxon_path | str | Path of the taxon |
| validation | str | Validation result |
Complete Example
from kodexa_document import (
    Document, ExtractionEngine, Taxonomy,
    DataObjectAccessor, DataAttributeAccessor
)

# Define taxonomy
taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Invoice",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ]
        }
    ]
}

with Document.from_kddb("tagged_invoice.kddb") as doc:
    # Create taxonomy and run extraction
    with Taxonomy(taxonomy_data=taxonomy_data) as taxonomy:
        with ExtractionEngine(doc, [taxonomy], owner_uri="model://my-org/invoice-model") as engine:
            count = engine.process_and_save()
            print(f"Extracted {count} data objects")

            # Check for exceptions
            exceptions = engine.get_content_exceptions()
            for exc in exceptions:
                print(f"Warning: {exc.message}")

            # Check validations
            validations = engine.get_document_taxon_validations()
            for v in validations:
                print(f"  {v.taxon_path}: {v.validation}")

    # Access the extracted data
    obj_accessor = DataObjectAccessor(doc)
    attr_accessor = DataAttributeAccessor(doc)
    for obj in obj_accessor.get_all():
        print(f"\nData Object: {obj.get('name')} ({obj.get('uuid')})")
        attrs = attr_accessor.get_for_data_object(obj["id"])
        for attr in attrs:
            print(f"  {attr.get('path')}: {attr.get('value')} "
                  f"(confidence: {attr.get('confidence', 'N/A')})")

    # Save the document with extracted data
    doc.save("extracted_invoice.kddb")
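
Once the objects and attributes have been read, they can be reshaped into nested records for export. The sketch below assumes both come back as dict-like records, as the example above accesses them; the helper `to_records` and the sample data are hypothetical and stand in for accessor output.

```python
import json
from typing import Any, Dict, List


def to_records(objects: List[Dict[str, Any]], attributes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group attribute values under their parent object, keyed by attribute path."""
    by_object: Dict[Any, Dict[str, Any]] = {}
    for attr in attributes:
        by_object.setdefault(attr["data_object_id"], {})[attr["path"]] = attr.get("value")
    return [
        {"name": obj.get("name"), "uuid": obj.get("uuid"), "fields": by_object.get(obj["id"], {})}
        for obj in objects
    ]


# Sample records standing in for accessor output; values are made up.
objects = [{"id": 1, "uuid": "obj-1", "name": "Invoice"}]
attributes = [
    {"data_object_id": 1, "path": "Invoice/Invoice Number", "value": "INV-001"},
    {"data_object_id": 1, "path": "Invoice/Total", "value": "120.00"},
]

records = to_records(objects, attributes)
print(json.dumps(records, indent=2))
```

If the same path can appear more than once under one object, the last value wins in this sketch; collecting values into a list per path would preserve repeats.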