Using Data Extraction Starter
Understanding how to use the Data Extraction Starter in Kodexa
Introduction
The Data Extraction Starter is a pre-built project template that gives you a quick way to get started with Kodexa. It can be used to extract data from a wide range of documents.
What's the High-Level Approach?
The aim of the template is to provide a starting point for data extraction.
It comes as a pre-configured project with the following features:
- A Processing Document Store for holding documents
- An Extracted Document Store for holding extracted data in a structured form
- An empty Data Definition that you can use to define the data you want to capture from documents
- An empty Data Form you can use to build a form based on your data definition
- A default task configuration that can be used with a simple task status lifecycle
- A default document status lifecycle
- A dataflow configured to use OCR, LLM-based Data Extraction, Validation and Task Creation
What is the basic flow?
A newly uploaded document will be processed by the following steps:
- OCR is used to extract text from the document
- The text is then passed to an LLM-based data extraction model
- The extracted data is then validated
- If the data is invalid, a task is created to review the data
- The task is then reviewed by a human
- If the task is approved, the changed data is saved to the Extracted Document Store
- If the task is rejected, the changed data is saved to the Extracted Document Store
- The document is then marked as processed
Document Statuses
Each document can exist in one of the following statuses:
Status | Slug | Stage
---|---|---
Failed | failed | Error
Pending Review | pending-review | Review
Reviewed | reviewed | Complete
Labeled | labeled | Processing
Rejected | rejected | Error
Transformed | transformed | Processing
Completed | completed | Complete
Prepared | prepared | Processing
Status Flow in Processing Pipeline
Default Data Flow
The default data flow is pre-configured with everything needed to get started.
However, often you will want to add to the data flow. A typical example is being able to publish to external systems based on the completion (or failure) of a document.
To do this, we will walk through how to create two new models and show where to add them to the data flow.
Adding a custom model to publish to an external system
In this example we will add a custom model to publish to an external system.
Publishing Success
To publish on success, we can create a very simple model and add it to the Task Assistant.
Start by using the cookiecutter for an infer model to create the new model; you can find the cookiecutter here.
Once you have created your model, you can add it to the Task Assistant at the end of the data flow.
In the model's code you can then read the extracted information from the document and publish it to an external system, as sketched below.
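Here is a minimal sketch of what that infer model's code might look like, assuming a `requests`-based HTTP call to the external system. The `infer` entry point, the `EXTERNAL_API_URL` endpoint, and the payload fields are illustrative placeholders rather than part of the starter template; the real entry point and document API come from the infer model cookiecutter and the Kodexa SDK.

```python
import requests

# Hypothetical endpoint for the external system -- replace with your own.
EXTERNAL_API_URL = "https://example.com/api/documents"


def infer(document):
    """Entry point for the infer model.

    The exact signature comes from the infer model cookiecutter; 'document'
    here stands in for the Kodexa document passed to your model.
    """
    # Collect whatever extracted data you want to publish. How you read it
    # depends on your data definition; this flat dict is purely illustrative.
    payload = {
        "document_uuid": getattr(document, "uuid", None),
        "status": "completed",
    }

    # Publish to the external system and fail loudly if it rejects the call,
    # so the problem is visible in the model's execution logs.
    response = requests.post(EXTERNAL_API_URL, json=payload, timeout=30)
    response.raise_for_status()

    return document
```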
Publishing Failure
To publish failures, we would usually use a scheduled event model; the reason is that we often want to defer publishing a failure in case it is caused by a temporary processing problem.
One thing to be aware of is that we need to record, in Kodexa, whether we have already published the failure to the external system.
This can be done by adding a label to the document.
This means we would use the event model cookiecutter to create a model that runs every 10 minutes; you can find the cookiecutter here.
Once you have created your model, you can add it to the Schedule Assistant at the end of the data flow.
In the model's code you can then find failed documents that have not yet been published, publish them to the external system, and label them so they are not published again, as sketched below.
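Below is a hedged sketch of how that scheduled model could be structured. The `handle_event` entry point, the helper functions for querying failed documents and managing labels, the endpoint URL, and the label name are all placeholders for illustration; the real entry point and document APIs come from the event model cookiecutter and the Kodexa SDK in your project.

```python
import requests

# Placeholder endpoint and label name -- neither is defined by the starter
# template; substitute your own values.
EXTERNAL_FAILURE_URL = "https://example.com/api/failures"
PUBLISHED_LABEL = "failure-published"


def find_failed_documents():
    """Stand-in: query the Processing Document Store for documents in the
    Failed status using the Kodexa SDK available in your project."""
    raise NotImplementedError


def get_labels(document):
    """Stand-in: return the labels currently on the document."""
    raise NotImplementedError


def add_label(document, label):
    """Stand-in: add a label to the document in Kodexa."""
    raise NotImplementedError


def handle_event(event):
    """Entry point for the scheduled model; the real signature comes from the
    event model cookiecutter, so treat this as an outline of the logic only."""
    for document in find_failed_documents():
        # Skip documents whose failure we have already published; the label
        # added below is our record of that.
        if PUBLISHED_LABEL in get_labels(document):
            continue

        payload = {
            "document_uuid": getattr(document, "uuid", None),
            "status": "failed",
        }
        response = requests.post(EXTERNAL_FAILURE_URL, json=payload, timeout=30)
        response.raise_for_status()

        # Record in Kodexa that this failure has been published so the next
        # scheduled run does not send it again.
        add_label(document, PUBLISHED_LABEL)
```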