Extraction - Kodexa Developer Portal

The extraction subsystem converts tagged document content into structured data objects and attributes using taxonomy definitions. For performance, the work runs in the Go core, which is accessed via CFFI.

Core Concepts

Taxonomy — Defines the structure of data to extract (groups, fields, types)
ExtractionEngine — Processes a document against taxonomies to produce data objects
DataObject — A structured record extracted from the document (e.g., an invoice header)
DataAttribute — A field within a data object (e.g., invoice number, date, total)
DataException — A validation error on a data object or attribute

Taxonomy

A Taxonomy wraps a Go-side taxonomy handle for use with the extraction engine. Create one from a dict or a JSON file path:

from kodexa_document import Taxonomy

# From a dictionary
taxonomy = Taxonomy(taxonomy_data={
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"}
            ]
        }
    ]
})

# From a JSON file
taxonomy = Taxonomy(taxonomy_path="/path/to/taxonomy.json")
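
Before handing a dictionary to Taxonomy, it can help to check which tags it declares. The following is a minimal sketch in plain Python that assumes only the dictionary layout shown above (taxons, children, tag, taxonType). collect_tags is a hypothetical helper written for this illustration and is not part of kodexa_document.

```python
def collect_tags(taxons):
    """Return (tag, taxonType) pairs for every taxon that declares a tag."""
    found = []
    for taxon in taxons:
        if "tag" in taxon:
            found.append((taxon["tag"], taxon["taxonType"]))
        # Groups carry their fields under "children"; recurse into them.
        found.extend(collect_tags(taxon.get("children", [])))
    return found


# Same structure as the dictionary example above.
taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ],
        }
    ],
}

tags = collect_tags(taxonomy_data["taxons"])
for tag, taxon_type in tags:
    print(f"{tag}: {taxon_type}")
```

The group taxon has no tag of its own in this example, so only its three child fields are reported.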

Taxonomies implement the context manager protocol and should be closed when no longer needed:

with Taxonomy(taxonomy_data=data) as taxonomy:
    # Use taxonomy
    is_valid = taxonomy.validate()
    json_str = taxonomy.to_json()
# Automatically freed

If you don’t use a context manager, the Go handle is still cleaned up automatically via weakref.finalize. A with block is still recommended because it frees the handle at a known point rather than whenever the object happens to be garbage collected.
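
The general pattern of pairing a context manager with weakref.finalize can be sketched using only the standard library. This is an illustration of the technique, not Kodexa's actual implementation: HandleWrapper, _free_handle, and the integer handle are all hypothetical stand-ins for a native resource.

```python
import weakref

released = []  # records which handle ids the stand-in "free" call has seen


def _free_handle(handle_id):
    """Stand-in for a native free call (hypothetical, not a Kodexa function)."""
    released.append(handle_id)


class HandleWrapper:
    """Illustrative wrapper around a native handle (hypothetical class)."""

    def __init__(self, handle_id):
        self.handle_id = handle_id
        # The finalizer runs _free_handle at most once: on close(), on
        # garbage collection, or at interpreter exit, whichever comes first.
        self._finalizer = weakref.finalize(self, _free_handle, handle_id)

    def close(self):
        # Calling a finalizer that has already run does nothing.
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False


with HandleWrapper(1) as wrapper:
    pass  # the handle is released as soon as the block exits

wrapper.close()  # a second close is harmless
print(released)
```

Because the finalizer can fire only once, an explicit close followed by garbage collection does not free the same handle twice.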

ExtractionEngine

The ExtractionEngine processes a document against one or more taxonomies to extract structured data.

Constructor

from kodexa_document import Document, ExtractionEngine, Taxonomy

with Document.from_kddb("tagged_invoice.kddb") as doc:
    # taxonomy_dict is a taxonomy definition such as the dictionary shown in the Taxonomy section
    taxonomy = Taxonomy(taxonomy_data=taxonomy_dict)

    engine = ExtractionEngine(
        document=doc,
        taxonomies=[taxonomy],
        owner_uri="model://my-org/invoice-model"  # optional
    )

Parameter	Type	Description
`document`	`Document`	The document to extract from
`taxonomies`	`List[Taxonomy | dict]`	One or more taxonomies (dicts are auto-wrapped)
`owner_uri`	`str | None`	URI identifying the extraction owner (defaults to `"model://default"`)

process_and_save

Runs extraction, persists the results (data objects, attributes, exceptions) into the document’s SQLite store, and returns the number of data objects extracted:

with ExtractionEngine(doc, [taxonomy]) as engine:
    count = engine.process_and_save()
    print(f"Extracted {count} data objects")

Content Exceptions

After extraction, retrieve any content-level exceptions:

with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    exceptions = engine.get_content_exceptions()
    for exc in exceptions:
        print(f"Exception: {exc.message} at {exc.tag}")

Document Taxon Validations

Check which taxons were found or missing in the document:

with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    validations = engine.get_document_taxon_validations()
    for v in validations:
        print(f"{v.taxon_path}: {v.validation}")

DataObject

Represents a structured record extracted from the document. Data objects are created by the extraction engine and read back through DataObjectAccessor.

Field	Type	Description
`id`	`int`	Database ID
`uuid`	`str`	Unique identifier
`name`	`str`	Display name
`type`	`str`	Object type
`taxonomy_ref`	`str`	Reference to the source taxonomy
`group_path`	`str`	Path in the taxonomy group hierarchy
`group_uuid`	`str`	UUID of the group this belongs to
`parent_group_uuid`	`str`	UUID of the parent group
`parent_id`	`int`	Parent data object ID
`attributes`	`List[DataAttribute]`	Child attributes
`data_exceptions`	`List[DataException]`	Validation errors
`children`	`List[DataObject]`	Child data objects
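
Because a data object can hold child data objects, a recursive walk is a natural way to print the hierarchy. The sketch below uses hand-written dictionaries that borrow the name and children field names from the table above. The sample records (including the "Line Item" children) and the walk helper are illustrative assumptions, and whether an accessor returns children already populated is not confirmed here.

```python
def walk(data_object, depth=0):
    """Yield (depth, name) for a data object and all of its descendants."""
    yield depth, data_object["name"]
    for child in data_object.get("children", []):
        yield from walk(child, depth + 1)


# Hand-written sample standing in for extracted records.
invoice = {
    "name": "Invoice",
    "children": [
        {"name": "Line Item", "children": []},
        {"name": "Line Item", "children": []},
    ],
}

outline = list(walk(invoice))
for depth, name in outline:
    print("  " * depth + name)
```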

DataAttribute

A field value within a data object.

Field	Type	Description
`uuid`	`str`	Unique identifier
`data_object_id`	`int`	Parent data object ID
`path`	`str`	Taxonomy path of this attribute
`tag`	`str`	Tag name used for extraction
`tag_uuid`	`str`	UUID of the source tag
`value`	`Any`	The extracted value
`confidence`	`float`	Extraction confidence (0.0 - 1.0)
`taxonomy_ref`	`str`	Source taxonomy reference
`taxon_type`	`str`	Data type (STRING, DECIMAL, DATE, etc.)
`node_uuid`	`str`	UUID of the content node this was extracted from
`source`	`str`	Source identifier
`manual`	`bool`	Whether this was manually entered
`data_exceptions`	`List[DataException]`	Validation errors
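
The confidence and manual fields lend themselves to a simple review filter. The following sketch works on plain dictionaries keyed by the field names in the table above. The sample values, the 0.8 threshold, and the needs_review helper are assumptions chosen for illustration, not library defaults.

```python
def needs_review(attributes, threshold=0.8):
    """Return non-manual attributes whose confidence is missing or below threshold."""
    flagged = []
    for attr in attributes:
        if attr.get("manual", False):
            continue  # manually entered values are treated as already reviewed
        confidence = attr.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(attr)
    return flagged


# Hand-written sample records, not real extraction output.
sample_attributes = [
    {"tag": "invoice_number", "value": "INV-001", "confidence": 0.97, "manual": False},
    {"tag": "invoice_date", "value": "2024-01-15", "confidence": 0.62, "manual": False},
    {"tag": "invoice_total", "value": "100.00", "confidence": 0.40, "manual": True},
]

flagged = needs_review(sample_attributes)
for attr in flagged:
    print(f"Review {attr['tag']}: {attr['value']} ({attr['confidence']})")
```

Treating a missing confidence as "needs review" is a conservative choice for this sketch; a pipeline could equally decide to skip such attributes.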

DataException

A validation error attached to a data object or attribute.

Field	Type	Description
`id`	`int`	Database ID
`uuid`	`str`	Unique identifier
`message`	`str`	Human-readable error message
`exception_type`	`str`	Type classification
`severity`	`str`	Severity level
`path`	`str`	Taxonomy path where the error occurred
`open`	`bool`	Whether this exception is still open
`data_object_id`	`int`	Related data object
`data_attribute_id`	`int`	Related data attribute
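
When triaging results, it is often useful to look only at exceptions that are still open, grouped by severity. This sketch again uses plain dictionaries with the field names from the table above. The severity strings "ERROR" and "WARNING", the sample messages, and the open_by_severity helper are placeholders, since the actual severity values are not listed here.

```python
from collections import defaultdict


def open_by_severity(exceptions):
    """Group the messages of still-open exceptions by their severity."""
    grouped = defaultdict(list)
    for exc in exceptions:
        if exc.get("open"):
            grouped[exc.get("severity")].append(exc["message"])
    return dict(grouped)


# Hand-written sample records; severity values are placeholders.
sample_exceptions = [
    {"message": "Total is missing", "severity": "ERROR", "open": True},
    {"message": "Date format unusual", "severity": "WARNING", "open": True},
    {"message": "Number too short", "severity": "ERROR", "open": False},
]

summary = open_by_severity(sample_exceptions)
for severity, messages in summary.items():
    print(f"{severity}: {len(messages)} open")
```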

DocumentTaxonValidation

Reports whether a specific taxon was found during extraction.

Field	Type	Description
`taxonomy_ref`	`str`	The taxonomy reference
`taxon_path`	`str`	Path of the taxon
`validation`	`str`	Validation result

Complete Example

from kodexa_document import (
    Document, ExtractionEngine, Taxonomy,
    DataObjectAccessor, DataAttributeAccessor
)

# Define taxonomy
taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Invoice",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ]
        }
    ]
}

with Document.from_kddb("tagged_invoice.kddb") as doc:
    # Create taxonomy and run extraction
    with Taxonomy(taxonomy_data=taxonomy_data) as taxonomy:
        with ExtractionEngine(doc, [taxonomy], owner_uri="model://my-org/invoice-model") as engine:
            count = engine.process_and_save()
            print(f"Extracted {count} data objects")

            # Check for exceptions
            exceptions = engine.get_content_exceptions()
            for exc in exceptions:
                print(f"Warning: {exc.message}")

            # Check validations
            validations = engine.get_document_taxon_validations()
            for v in validations:
                print(f"  {v.taxon_path}: {v.validation}")

    # Access the extracted data
    obj_accessor = DataObjectAccessor(doc)
    attr_accessor = DataAttributeAccessor(doc)

    for obj in obj_accessor.get_all():
        print(f"\nData Object: {obj.get('name')} ({obj.get('uuid')})")
        attrs = attr_accessor.get_for_data_object(obj["id"])
        for attr in attrs:
            print(f"  {attr.get('path')}: {attr.get('value')} "
                  f"(confidence: {attr.get('confidence', 'N/A')})")

    # Save the document with extracted data
    doc.save("extracted_invoice.kddb")
