The extraction subsystem converts tagged document content into structured data objects and attributes, using taxonomy definitions to decide what to extract. Extraction runs in the Go core, which is called through CFFI for performance.

Core Concepts

  • Taxonomy — Defines the structure of data to extract (groups, fields, types)
  • ExtractionEngine — Processes a document against taxonomies to produce data objects
  • DataObject — A structured record extracted from the document (e.g., an invoice header)
  • DataAttribute — A field within a data object (e.g., invoice number, date, total)
  • DataException — A validation error on a data object or attribute

Taxonomy

A Taxonomy wraps a Go-side taxonomy handle for use with the extraction engine. Create one from a dict or a JSON file path:
from kodexa_document import Taxonomy

# From a dictionary
taxonomy = Taxonomy(taxonomy_data={
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"}
            ]
        }
    ]
})

# From a JSON file
taxonomy = Taxonomy(taxonomy_path="/path/to/taxonomy.json")
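Because a taxonomy definition is plain nested data, it can be inspected before it is handed to Taxonomy. The following is a minimal sketch that assumes only the dict shape shown above (taxons, children, tag). The collect_tags helper is hypothetical and is not part of kodexa_document.

```python
def collect_tags(taxons):
    """Return the tag of every taxon that defines one, depth-first."""
    tags = []
    for taxon in taxons:
        if "tag" in taxon:
            tags.append(taxon["tag"])
        # GROUP taxons carry their fields under "children"
        tags.extend(collect_tags(taxon.get("children", [])))
    return tags


taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ],
        }
    ],
}

print(collect_tags(taxonomy_data["taxons"]))
# ['invoice_number', 'invoice_date', 'invoice_total']
```

A list like this is a quick way to confirm that the tags in the taxonomy match the tags applied to the document.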
Taxonomies implement the context manager protocol and should be closed when no longer needed:
with Taxonomy(taxonomy_data=data) as taxonomy:
    # Use taxonomy
    is_valid = taxonomy.validate()
    json_str = taxonomy.to_json()
# Automatically freed
If you do not use a context manager, the Go handle is still released automatically through weakref.finalize, but a with block is recommended because it makes cleanup deterministic.
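The interplay between a with block and weakref.finalize can be illustrated with the standard library alone. This is a sketch of the general pattern, not the kodexa_document implementation: NativeHandle, _free_handle, and the released list are hypothetical stand-ins for a wrapper around a Go-side handle.

```python
import weakref

released = []  # records freed handle ids (stand-in for the Go side)


def _free_handle(handle_id):
    released.append(handle_id)


class NativeHandle:
    """Hypothetical wrapper around a native handle id."""

    def __init__(self, handle_id):
        self.handle_id = handle_id
        # The callback receives only the id, never self, so it does not
        # keep the wrapper alive. It runs at most once.
        self._finalizer = weakref.finalize(self, _free_handle, handle_id)

    def close(self):
        # Calling the finalizer runs the callback the first time only.
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


with NativeHandle(1) as h:
    pass  # the handle is usable here

print(released)  # [1]
h.close()        # a second close is a no-op
print(released)  # [1]
```

With the with block, the handle is freed at a known point. Without it, the same callback would run whenever the wrapper is garbage collected or the interpreter exits.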

ExtractionEngine

The ExtractionEngine processes a document against one or more taxonomies to extract structured data.

Constructor

from kodexa_document import Document, ExtractionEngine, Taxonomy

with Document.from_kddb("tagged_invoice.kddb") as doc:
    # taxonomy_dict is a taxonomy definition dict, such as the one shown in the Taxonomy section
    taxonomy = Taxonomy(taxonomy_data=taxonomy_dict)

    engine = ExtractionEngine(
        document=doc,
        taxonomies=[taxonomy],
        owner_uri="model://my-org/invoice-model"  # optional
    )
Parameter    Type                     Description
document     Document                 The document to extract from
taxonomies   List[Taxonomy | dict]    One or more taxonomies (dicts are auto-wrapped)
owner_uri    str | None               URI identifying the extraction owner (defaults to "model://default")

process_and_save

Runs extraction and persists the results (data objects, attributes, exceptions) into the document’s SQLite store:
with ExtractionEngine(doc, [taxonomy]) as engine:
    count = engine.process_and_save()
    print(f"Extracted {count} data objects")

Content Exceptions

After extraction, retrieve any content-level exceptions:
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    exceptions = engine.get_content_exceptions()
    for exc in exceptions:
        print(f"Exception: {exc.message} at {exc.tag}")
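When a document produces many exceptions, grouping them by tag makes the output easier to review. The sketch below assumes only the message and tag attributes used above. The SimpleNamespace records and their messages are made-up stand-ins for whatever get_content_exceptions() returns, and group_by_tag is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace

# Illustrative stand-ins; real exceptions come from engine.get_content_exceptions()
exceptions = [
    SimpleNamespace(message="Value is not a valid date", tag="invoice_date"),
    SimpleNamespace(message="Value is missing", tag="invoice_total"),
    SimpleNamespace(message="Value is ambiguous", tag="invoice_date"),
]


def group_by_tag(excs):
    """Map each tag to the list of exception messages raised for it."""
    grouped = defaultdict(list)
    for exc in excs:
        grouped[exc.tag].append(exc.message)
    return dict(grouped)


by_tag = group_by_tag(exceptions)
for tag, messages in by_tag.items():
    print(f"{tag}: {len(messages)} exception(s)")
```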

Document Taxon Validations

Check which taxons were found or missing in the document:
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    validations = engine.get_document_taxon_validations()
    for v in validations:
        print(f"{v.taxon_path}: {v.validation}")

DataObject

A DataObject represents a structured record extracted from the document. Data objects are created by the extraction engine and accessed via DataObjectAccessor.
Field               Type                  Description
id                  int                   Database ID
uuid                str                   Unique identifier
name                str                   Display name
type                str                   Object type
taxonomy_ref        str                   Reference to the source taxonomy
group_path          str                   Path in the taxonomy group hierarchy
group_uuid          str                   UUID of the group this belongs to
parent_group_uuid   str                   UUID of the parent group
parent_id           int                   Parent data object ID
attributes          List[DataAttribute]   Child attributes
data_exceptions     List[DataException]   Validation errors
children            List[DataObject]      Child data objects
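Since a data object can hold child data objects, the result of an extraction forms a tree. The following sketch models a small subset of the fields above with a dataclass and walks the tree recursively. DataObjectSketch and walk are hypothetical names, and the "Invoice/Line Item" path format is an assumption made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataObjectSketch:
    """Illustrative subset of the DataObject fields listed above."""
    name: str
    group_path: str
    children: List["DataObjectSketch"] = field(default_factory=list)


def walk(obj, depth=0):
    """Yield (depth, object) pairs for obj and all of its descendants."""
    yield depth, obj
    for child in obj.children:
        yield from walk(child, depth + 1)


invoice = DataObjectSketch(
    name="Invoice",
    group_path="Invoice",
    children=[
        # The path separator used here is assumed, not taken from the library
        DataObjectSketch(name="Line Item", group_path="Invoice/Line Item"),
        DataObjectSketch(name="Line Item", group_path="Invoice/Line Item"),
    ],
)

for depth, obj in walk(invoice):
    print("  " * depth + f"{obj.name} ({obj.group_path})")
```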

DataAttribute

A field value within a data object.
Field             Type                  Description
uuid              str                   Unique identifier
data_object_id    int                   Parent data object ID
path              str                   Taxonomy path of this attribute
tag               str                   Tag name used for extraction
tag_uuid          str                   UUID of the source tag
value             Any                   The extracted value
confidence        float                 Extraction confidence (0.0 to 1.0)
taxonomy_ref      str                   Source taxonomy reference
taxon_type        str                   Data type (STRING, DECIMAL, DATE, etc.)
node_uuid         str                   UUID of the content node this was extracted from
source            str                   Source identifier
manual            bool                  Whether this was manually entered
data_exceptions   List[DataException]   Validation errors
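The confidence and taxon_type fields are often used together downstream: low-confidence values are dropped and the rest are converted to native Python types. This is a sketch under stated assumptions, namely that attributes are available as dicts, that values arrive as strings, and that dates are ISO 8601. The CONVERTERS table, the typed_values helper, and the 0.8 threshold are all hypothetical.

```python
from datetime import date
from decimal import Decimal

# Hypothetical converters keyed by taxon_type; unlisted types pass through unchanged
CONVERTERS = {
    "STRING": str,
    "DECIMAL": lambda v: Decimal(str(v)),
    "DATE": lambda v: date.fromisoformat(v),  # assumes ISO 8601 date strings
}


def typed_values(attributes, min_confidence=0.8):
    """Return {tag: converted value} for attributes at or above the threshold."""
    result = {}
    for attr in attributes:
        if attr["confidence"] < min_confidence:
            continue
        convert = CONVERTERS.get(attr["taxon_type"], lambda v: v)
        result[attr["tag"]] = convert(attr["value"])
    return result


# Made-up attribute records for illustration
attributes = [
    {"tag": "invoice_number", "taxon_type": "STRING", "value": "INV-001", "confidence": 0.97},
    {"tag": "invoice_date", "taxon_type": "DATE", "value": "2024-03-15", "confidence": 0.91},
    {"tag": "invoice_total", "taxon_type": "DECIMAL", "value": "1250.00", "confidence": 0.42},
]

values = typed_values(attributes)
print(values)
```

Here invoice_total is omitted because its confidence falls below the threshold. Lowering min_confidence to 0.4 would include it as a Decimal.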

DataException

A validation error attached to a data object or attribute.
Field               Type   Description
id                  int    Database ID
uuid                str    Unique identifier
message             str    Human-readable error message
exception_type      str    Type classification
severity            str    Severity level
path                str    Taxonomy path where the error occurred
open                bool   Whether this exception is still open
data_object_id      int    Related data object
data_attribute_id   int    Related data attribute
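A review step typically cares only about exceptions that are still open, and whether each one concerns a whole data object or a single attribute. The sketch below assumes exceptions are available as dicts and that object-level exceptions carry no data_attribute_id. Both points are assumptions, and split_open_exceptions is a hypothetical helper.

```python
def split_open_exceptions(exceptions):
    """Return (object_level, attribute_level) lists of open exceptions."""
    object_level, attribute_level = [], []
    for exc in exceptions:
        if not exc["open"]:
            continue  # already resolved
        if exc.get("data_attribute_id") is None:
            object_level.append(exc)
        else:
            attribute_level.append(exc)
    return object_level, attribute_level


# Made-up exception records for illustration
exceptions = [
    {"message": "Total is missing", "open": True,
     "data_object_id": 1, "data_attribute_id": None},
    {"message": "Date could not be parsed", "open": True,
     "data_object_id": 1, "data_attribute_id": 7},
    {"message": "Number format corrected", "open": False,
     "data_object_id": 1, "data_attribute_id": 5},
]

obj_level, attr_level = split_open_exceptions(exceptions)
print(len(obj_level), len(attr_level))  # 1 1
```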

DocumentTaxonValidation

Reports whether a specific taxon was found during extraction.
Field          Type   Description
taxonomy_ref   str    The taxonomy reference
taxon_path     str    Path of the taxon
validation     str    Validation result

Complete Example

from kodexa_document import (
    Document, ExtractionEngine, Taxonomy,
    DataObjectAccessor, DataAttributeAccessor
)

# Define taxonomy
taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Invoice",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ]
        }
    ]
}

with Document.from_kddb("tagged_invoice.kddb") as doc:
    # Create taxonomy and run extraction
    with Taxonomy(taxonomy_data=taxonomy_data) as taxonomy:
        with ExtractionEngine(doc, [taxonomy], owner_uri="model://my-org/invoice-model") as engine:
            count = engine.process_and_save()
            print(f"Extracted {count} data objects")

            # Check for exceptions
            exceptions = engine.get_content_exceptions()
            for exc in exceptions:
                print(f"Warning: {exc.message}")

            # Check validations
            validations = engine.get_document_taxon_validations()
            for v in validations:
                print(f"  {v.taxon_path}: {v.validation}")

    # Access the extracted data
    obj_accessor = DataObjectAccessor(doc)
    attr_accessor = DataAttributeAccessor(doc)

    for obj in obj_accessor.get_all():
        print(f"\nData Object: {obj.get('name')} ({obj.get('uuid')})")
        attrs = attr_accessor.get_for_data_object(obj["id"])
        for attr in attrs:
            print(f"  {attr.get('path')}: {attr.get('value')} "
                  f"(confidence: {attr.get('confidence', 'N/A')})")

    # Save the document with extracted data
    doc.save("extracted_invoice.kddb")