Using Data Extraction Starter
Understanding how to use the Data Extraction Starter in Kodexa
Introduction
The Data Extraction Starter is a pre-built project template that gives you a quick way to get started with Kodexa. It can be used to extract data from a wide range of documents.
What's the High-Level Approach?
The aim of the template is to provide a starting point for data extraction.
It comes as a pre-configured project with the following features:
- A Processing Document Store for holding documents
- An Extracted Document Store for holding extracted data in a structured form
- An empty Data Definition that you can use to define the data you want to capture from documents
- An empty Data Form you can use to build a form based on your data definition
- A default task configuration that can be used with a simple task status lifecycle
- A default document status lifecycle
- A dataflow configured to use OCR, LLM-based Data Extraction, Validation and Task Creation
What is the basic flow?
A newly uploaded document will be processed by the following steps:
- OCR is used to extract text from the document
- The text is then passed to an LLM-based data extraction model
- The extracted data is then validated
- If the data is invalid, a task is created to review the data
- The task is then reviewed by a human
- If the task is approved, the changed data is saved to the Extracted Document Store
- If the task is rejected, the changed data is saved to the Extracted Document Store
- The document is then marked as processed
Document Statuses
Each document can exist in one of the following statuses:
Status | Slug | Stage
---|---|---
Failed | failed | Error
Pending Review | pending-review | Review
Reviewed | reviewed | Complete
Labeled | labeled | Processing
Rejected | rejected | Error
Transformed | transformed | Processing
Completed | completed | Complete
Prepared | prepared | Processing
Status Flow in Processing Pipeline
Default Data Flow
The default data flow is pre-configured with everything needed to get started.
However, often you will want to add to the data flow. A typical example is being able to publish to external systems based on the completion (or failure) of a document.
To do this, we will walk through how to create two new models and show where to add them to the data flow.
Adding a custom model to publish to an external system
In this example we will add a custom model to publish to an external system.
Publishing Success
To publish on success, we can create a very simple model and add it to the Task Assistant.
Start by using the cookiecutter for an infer model to create the new model; you can find the cookiecutter here.
Once you have created your model, you can add it to the Task Assistant at the end of the data flow.
In the model's code you can then read the extracted information from the document and publish it to an external system, as sketched below.
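Here is a minimal sketch of what that infer model's code might look like, assuming a `requests`-based HTTP call to the external system. The `infer` entry point, the `EXTERNAL_API_URL` endpoint, and the payload fields are illustrative placeholders rather than part of the starter template; the real entry point and document API come from the infer model cookiecutter and the Kodexa SDK.

```python
import requests

# Hypothetical endpoint for the external system -- replace with your own.
EXTERNAL_API_URL = "https://example.com/api/documents"


def infer(document):
    """Entry point for the infer model.

    The exact signature comes from the infer model cookiecutter; 'document'
    here stands in for the Kodexa document passed to your model.
    """
    # Collect whatever extracted data you want to publish. How you read it
    # depends on your data definition; this flat dict is purely illustrative.
    payload = {
        "document_uuid": getattr(document, "uuid", None),
        "status": "completed",
    }

    # Publish to the external system and fail loudly if it rejects the call,
    # so the problem is visible in the model's execution logs.
    response = requests.post(EXTERNAL_API_URL, json=payload, timeout=30)
    response.raise_for_status()

    return document
```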
Publishing Failure
To publish failures, we would usually use a scheduled event model; the reason is that we often want to defer publishing a failure in case it is caused by a temporary processing problem.
One thing to be aware of is that we need to record, in Kodexa, whether we have already published the failure to the external system.
This can be done by adding a label to the document.
This means we would use the event model cookiecutter to create a model that runs every 10 minutes; you can find the cookiecutter here.
Once you have created your model, you can add it to the Schedule Assistant at the end of the data flow.
In the model's code you can then find failed documents that have not yet been published, publish them to the external system, and label them so they are not published again, as sketched below.
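Below is a hedged sketch of how that scheduled model could be structured. The `handle_event` entry point, the helper functions for querying failed documents and managing labels, the endpoint URL, and the label name are all placeholders for illustration; the real entry point and document APIs come from the event model cookiecutter and the Kodexa SDK in your project.

```python
import requests

# Placeholder endpoint and label name -- neither is defined by the starter
# template; substitute your own values.
EXTERNAL_FAILURE_URL = "https://example.com/api/failures"
PUBLISHED_LABEL = "failure-published"


def find_failed_documents():
    """Stand-in: query the Processing Document Store for documents in the
    Failed status using the Kodexa SDK available in your project."""
    raise NotImplementedError


def get_labels(document):
    """Stand-in: return the labels currently on the document."""
    raise NotImplementedError


def add_label(document, label):
    """Stand-in: add a label to the document in Kodexa."""
    raise NotImplementedError


def handle_event(event):
    """Entry point for the scheduled model; the real signature comes from the
    event model cookiecutter, so treat this as an outline of the logic only."""
    for document in find_failed_documents():
        # Skip documents whose failure we have already published; the label
        # added below is our record of that.
        if PUBLISHED_LABEL in get_labels(document):
            continue

        payload = {
            "document_uuid": getattr(document, "uuid", None),
            "status": "failed",
        }
        response = requests.post(EXTERNAL_FAILURE_URL, json=payload, timeout=30)
        response.raise_for_status()

        # Record in Kodexa that this failure has been published so the next
        # scheduled run does not send it again.
        add_label(document, PUBLISHED_LABEL)
```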