# Extraction

> Use the Kodexa Python SDK extraction subsystem to convert tagged document content into structured data objects and attributes via taxonomy definitions.

Extraction is driven by taxonomy definitions: the engine reads the tags already applied to a document and turns the tagged content into data objects and attributes. The work runs in the Go core, which the Python SDK calls through CFFI for performance.

## Core Concepts

* **Taxonomy** — Defines the structure of data to extract (groups, fields, types)
* **ExtractionEngine** — Processes a document against taxonomies to produce data objects
* **DataObject** — A structured record extracted from the document (e.g., an invoice header)
* **DataAttribute** — A field within a data object (e.g., invoice number, date, total)
* **DataException** — A validation error on a data object or attribute

## Taxonomy

A `Taxonomy` wraps a Go-side taxonomy handle for use with the extraction engine. Create one from a dict or a JSON file path:

```python
from kodexa_document import Taxonomy

# From a dictionary
taxonomy = Taxonomy(taxonomy_data={
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"}
            ]
        }
    ]
})

# From a JSON file
taxonomy = Taxonomy(taxonomy_path="/path/to/taxonomy.json")
```
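
Because the dictionary form is plain Python data, it can be inspected before it is passed to `Taxonomy`. The following is a minimal sketch that does not use the SDK: `collect_tags` is a hypothetical helper, and it assumes only the `taxons`, `children`, `tag`, and `taxonType` keys shown in the example above.

```python
from typing import Any, Dict, List


def collect_tags(taxons: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map every tag found in a taxon tree to its taxonType (hypothetical helper)."""
    found: Dict[str, str] = {}
    for taxon in taxons:
        # Groups in the example carry no "tag" key, so they are skipped here.
        if "tag" in taxon:
            found[taxon["tag"]] = taxon.get("taxonType", "")
        # Recurse into nested taxons, if any.
        found.update(collect_tags(taxon.get("children", [])))
    return found


taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Header",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ],
        }
    ],
}

tags = collect_tags(taxonomy_data["taxons"])
print(tags)
```

A listing like this is a quick way to compare the tags a taxonomy expects against the tags actually present on a document.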

Taxonomies implement the context manager protocol and should be closed when no longer needed:

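```python
from kodexa_document import Taxonomy

# `data` is a taxonomy dictionary like the one shown above
with Taxonomy(taxonomy_data=data) as taxonomy:
    is_valid = taxonomy.validate()  # check the taxonomy definition
    json_str = taxonomy.to_json()   # serialize the taxonomy to JSON
# The Go handle is freed when the block exits
```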

<Note>
  If you don't use a context manager, the Go handle is still cleaned up automatically via `weakref.finalize`, but using `with` is recommended for deterministic cleanup.
</Note>
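
The difference between the two cleanup paths can be illustrated with the standard library alone. The sketch below is not the SDK's implementation: `FakeHandle` and the `released` list are hypothetical stand-ins that show how a `weakref.finalize` callback combines with the context manager protocol, so that cleanup runs at most once whichever path triggers it.

```python
import weakref

released = []  # records the ids of stand-in handles that have been freed


class FakeHandle:
    """Stand-in for an object that owns a native handle (not the real Taxonomy)."""

    def __init__(self, handle_id: int) -> None:
        self.handle_id = handle_id
        # The callback must not reference self, otherwise the object
        # could never become unreachable.
        self._finalizer = weakref.finalize(self, released.append, handle_id)

    def close(self) -> None:
        # Calling a finalizer runs its callback at most once.
        self._finalizer()

    def __enter__(self) -> "FakeHandle":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


with FakeHandle(1) as handle:
    pass

print(released)   # the handle was released when the block exited
handle.close()    # a second close is a no-op
print(released)
```

With `with`, the release happens at a known point in the program. Without it, the same callback runs whenever the object is garbage collected or at interpreter exit, which is harder to reason about.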

## ExtractionEngine

The `ExtractionEngine` processes a document against one or more taxonomies to extract structured data.

### Constructor

```python
from kodexa_document import Document, ExtractionEngine, Taxonomy

# taxonomy_dict is a taxonomy dictionary like the one in the Taxonomy section
with Document.from_kddb("tagged_invoice.kddb") as doc:
    taxonomy = Taxonomy(taxonomy_data=taxonomy_dict)

    engine = ExtractionEngine(
        document=doc,
        taxonomies=[taxonomy],
        owner_uri="model://my-org/invoice-model",  # optional
    )
```

| Parameter    | Type                     | Description                                                            |
| ------------ | ------------------------ | ---------------------------------------------------------------------- |
| `document`   | `Document`               | The document to extract from                                           |
| `taxonomies` | `List[Taxonomy \| dict]` | One or more taxonomies (dicts are auto-wrapped)                        |
| `owner_uri`  | `str \| None`            | URI identifying the extraction owner (defaults to `"model://default"`) |

### process\_and\_save

Runs extraction and persists the results (data objects, attributes, exceptions) into the document's SQLite store:

```python
with ExtractionEngine(doc, [taxonomy]) as engine:
    count = engine.process_and_save()
    print(f"Extracted {count} data objects")
```

### Content Exceptions

After extraction, retrieve any content-level exceptions:

```python
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    exceptions = engine.get_content_exceptions()
    for exc in exceptions:
        print(f"Exception: {exc.message} at {exc.tag}")
```

### Document Taxon Validations

Check which taxons were found or missing in the document:

```python
with ExtractionEngine(doc, [taxonomy]) as engine:
    engine.process_and_save()
    validations = engine.get_document_taxon_validations()
    for v in validations:
        print(f"{v.taxon_path}: {v.validation}")
```
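
When a taxonomy has many taxons, it can help to group the validation results rather than print them one by one. The following sketch runs without the SDK: `ValidationRecord` is a hypothetical stand-in exposing the `taxonomy_ref`, `taxon_path`, and `validation` attributes, and the validation values and paths in the sample are placeholders, since the actual strings the engine returns are not listed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class ValidationRecord(NamedTuple):
    """Stand-in with the three documented fields; not the SDK class."""
    taxonomy_ref: str
    taxon_path: str
    validation: str


def group_by_validation(records: Iterable[ValidationRecord]) -> Dict[str, List[str]]:
    """Group taxon paths by their validation result."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        grouped[record.validation].append(record.taxon_path)
    return dict(grouped)


# Placeholder values for illustration only.
sample = [
    ValidationRecord("invoice", "Header/Invoice Number", "RESULT_A"),
    ValidationRecord("invoice", "Header/Date", "RESULT_B"),
    ValidationRecord("invoice", "Header/Total", "RESULT_A"),
]

summary = group_by_validation(sample)
for result, paths in summary.items():
    print(f"{result}: {paths}")
```

Because the helper only reads attributes, it should also accept the objects returned by `get_document_taxon_validations()`, assuming they expose the same attribute names.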

## DataObject

A `DataObject` is a structured record extracted from the document. Data objects are created by the extraction engine and read back through `DataObjectAccessor`.

| Field               | Type                  | Description                          |
| ------------------- | --------------------- | ------------------------------------ |
| `id`                | `int`                 | Database ID                          |
| `uuid`              | `str`                 | Unique identifier                    |
| `name`              | `str`                 | Display name                         |
| `type`              | `str`                 | Object type                          |
| `taxonomy_ref`      | `str`                 | Reference to the source taxonomy     |
| `group_path`        | `str`                 | Path in the taxonomy group hierarchy |
| `group_uuid`        | `str`                 | UUID of the group this belongs to    |
| `parent_group_uuid` | `str`                 | UUID of the parent group             |
| `parent_id`         | `int`                 | Parent data object ID                |
| `attributes`        | `List[DataAttribute]` | Child attributes                     |
| `data_exceptions`   | `List[DataException]` | Validation errors                    |
| `children`          | `List[DataObject]`    | Child data objects                   |
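
Since `children` holds further data objects, a set of extracted records forms a tree. The sketch below shows a depth-first walk over such a tree. It uses a hypothetical `ObjectStub` dataclass carrying only `name` and `children`, not the SDK's `DataObject`, and the object names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ObjectStub:
    """Stand-in with two of the documented fields; not the SDK class."""
    name: str
    children: List["ObjectStub"] = field(default_factory=list)


def walk(obj: ObjectStub, depth: int = 0) -> Iterator[Tuple[int, ObjectStub]]:
    """Yield (depth, object) pairs in depth-first, parent-before-child order."""
    yield depth, obj
    for child in obj.children:
        yield from walk(child, depth + 1)


invoice = ObjectStub("Invoice", [ObjectStub("Line Item"), ObjectStub("Line Item")])

for depth, node in walk(invoice):
    print("  " * depth + node.name)
```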

## DataAttribute

A field value within a data object.

| Field             | Type                  | Description                                      |
| ----------------- | --------------------- | ------------------------------------------------ |
| `uuid`            | `str`                 | Unique identifier                                |
| `data_object_id`  | `int`                 | Parent data object ID                            |
| `path`            | `str`                 | Taxonomy path of this attribute                  |
| `tag`             | `str`                 | Tag name used for extraction                     |
| `tag_uuid`        | `str`                 | UUID of the source tag                           |
| `value`           | `Any`                 | The extracted value                              |
| `confidence`      | `float`               | Extraction confidence (0.0 - 1.0)                |
| `taxonomy_ref`    | `str`                 | Source taxonomy reference                        |
| `taxon_type`      | `str`                 | Data type (STRING, DECIMAL, DATE, etc.)          |
| `node_uuid`       | `str`                 | UUID of the content node this was extracted from |
| `source`          | `str`                 | Source identifier                                |
| `manual`          | `bool`                | Whether this was manually entered                |
| `data_exceptions` | `List[DataException]` | Validation errors                                |
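
A common use of `confidence` and `manual` is to decide which values need human review. The following is a sketch under stated assumptions rather than SDK behavior: attributes are represented as plain dictionaries keyed by the field names above, `needs_review` is a hypothetical helper, the `0.8` threshold is arbitrary, and the sample paths and values are invented.

```python
from typing import Any, Dict, Iterable, List


def needs_review(
    attributes: Iterable[Dict[str, Any]], threshold: float = 0.8
) -> List[Dict[str, Any]]:
    """Return attributes that are not manual and lack sufficient confidence."""
    flagged = []
    for attr in attributes:
        if attr.get("manual"):
            continue  # manually entered values are treated as already reviewed
        confidence = attr.get("confidence")
        if confidence is None or confidence < threshold:
            flagged.append(attr)
    return flagged


# Invented records for illustration only.
sample = [
    {"path": "Header/Invoice Number", "value": "INV-001", "confidence": 0.97, "manual": False},
    {"path": "Header/Date", "value": "2024-01-05", "confidence": 0.42, "manual": False},
    {"path": "Header/Total", "value": "120.00", "confidence": 0.10, "manual": True},
    {"path": "Header/Currency", "value": "USD"},
]

flagged_paths = [attr["path"] for attr in needs_review(sample)]
print(flagged_paths)
```

Treating a missing confidence as "needs review" is a conservative choice; a different pipeline might reasonably skip such attributes instead.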

## DataException

A validation error attached to a data object or attribute.

| Field               | Type   | Description                            |
| ------------------- | ------ | -------------------------------------- |
| `id`                | `int`  | Database ID                            |
| `uuid`              | `str`  | Unique identifier                      |
| `message`           | `str`  | Human-readable error message           |
| `exception_type`    | `str`  | Type classification                    |
| `severity`          | `str`  | Severity level                         |
| `path`              | `str`  | Taxonomy path where the error occurred |
| `open`              | `bool` | Whether this exception is still open   |
| `data_object_id`    | `int`  | Related data object                    |
| `data_attribute_id` | `int`  | Related data attribute                 |
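
The `open` and `severity` fields make it straightforward to summarize outstanding problems. The sketch below assumes exceptions are available as plain dictionaries keyed by the field names above. `open_exception_counts` is a hypothetical helper, and the severity strings and messages in the sample are placeholders, not values defined by the SDK.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def open_exception_counts(exceptions: Iterable[Dict[str, Any]]) -> Counter:
    """Count exceptions whose `open` flag is true, keyed by `severity`."""
    return Counter(
        exc.get("severity", "") for exc in exceptions if exc.get("open")
    )


# Placeholder records for illustration only.
sample = [
    {"message": "Missing value", "severity": "ERROR", "open": True},
    {"message": "Unparseable date", "severity": "ERROR", "open": True},
    {"message": "Low confidence", "severity": "WARNING", "open": False},
]

counts = open_exception_counts(sample)
print(dict(counts))
```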

## DocumentTaxonValidation

Reports whether a specific taxon was found during extraction.

| Field          | Type  | Description            |
| -------------- | ----- | ---------------------- |
| `taxonomy_ref` | `str` | The taxonomy reference |
| `taxon_path`   | `str` | Path of the taxon      |
| `validation`   | `str` | Validation result      |

## Complete Example

```python
from kodexa_document import (
    Document, ExtractionEngine, Taxonomy,
    DataObjectAccessor, DataAttributeAccessor
)

# Define taxonomy
taxonomy_data = {
    "name": "Invoice",
    "slug": "invoice",
    "taxons": [
        {
            "name": "Invoice",
            "taxonType": "GROUP",
            "children": [
                {"name": "Invoice Number", "tag": "invoice_number", "taxonType": "STRING"},
                {"name": "Date", "tag": "invoice_date", "taxonType": "DATE"},
                {"name": "Total", "tag": "invoice_total", "taxonType": "DECIMAL"},
            ]
        }
    ]
}

with Document.from_kddb("tagged_invoice.kddb") as doc:
    # Create taxonomy and run extraction
    with Taxonomy(taxonomy_data=taxonomy_data) as taxonomy:
        with ExtractionEngine(doc, [taxonomy], owner_uri="model://my-org/invoice-model") as engine:
            count = engine.process_and_save()
            print(f"Extracted {count} data objects")

            # Check for exceptions
            exceptions = engine.get_content_exceptions()
            for exc in exceptions:
                print(f"Warning: {exc.message}")

            # Check validations
            validations = engine.get_document_taxon_validations()
            for v in validations:
                print(f"  {v.taxon_path}: {v.validation}")

    # Access the extracted data
    obj_accessor = DataObjectAccessor(doc)
    attr_accessor = DataAttributeAccessor(doc)

    for obj in obj_accessor.get_all():
        print(f"\nData Object: {obj.get('name')} ({obj.get('uuid')})")
        attrs = attr_accessor.get_for_data_object(obj["id"])
        for attr in attrs:
            print(f"  {attr.get('path')}: {attr.get('value')} "
                  f"(confidence: {attr.get('confidence', 'N/A')})")

    # Save the document with extracted data
    doc.save("extracted_invoice.kddb")
```