Kodexa LLM Data Labeling
Leverages the Kodexa Data Definition and builds prompts for labeling with Large Language Models
Slug: llm-taxonomy-model
Version: 1.0.0
Infer: Yes
Event Aware: No
Overview
LLM Data Labeling Model
The Kodexa LLM Data Labeling model extracts structured data from documents, guided by a data definition (taxonomy). It combines pattern recognition with Large Language Models to identify and extract the information that the data definition describes.
How It Works
- Data Definition-Based Extraction: The model uses your data definition (taxonomy) to identify what data to extract from documents
- Intelligent Chunking: Documents are broken into manageable pieces based on your selected strategy
- Content Classification: AI analyzes content to identify relevant sections for each data element
- LLM-Powered Extraction: Advanced language models process chunks to extract structured data
- Merging and Validation: Results from multiple chunks are intelligently combined and validated
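The steps above can be sketched as a small pipeline. This is a minimal illustration only, not Kodexa's implementation: `label_document_sketch`, the callables it accepts, and the toy stand-ins below are all hypothetical names, and the extraction step would be an LLM call in practice.

```python
from typing import Callable, Dict, List

Page = str
Record = Dict[str, str]


def label_document_sketch(
    pages: List[Page],
    chunk: Callable[[List[Page]], List[List[Page]]],
    is_relevant: Callable[[List[Page]], bool],
    extract: Callable[[List[Page]], Record],
) -> Record:
    """Hypothetical flow: chunk, classify, extract, then merge."""
    merged: Record = {}
    for piece in chunk(pages):
        if not is_relevant(piece):  # classification step
            continue
        # Extraction step (an LLM call in a real system).
        for key, value in extract(piece).items():
            merged.setdefault(key, value)  # naive merge: first value wins
    return merged


# Toy stand-ins for the real components, for illustration only.
def by_page(pages: List[Page]) -> List[List[Page]]:
    return [[page] for page in pages]


def looks_relevant(piece: List[Page]) -> bool:
    return any("Invoice" in page or ":" in page for page in piece)


def toy_extract(piece: List[Page]) -> Record:
    text = piece[0]
    if text.startswith("Invoice"):
        return {"invoice_number": text.split()[1]}
    if text.startswith("Total"):
        return {"total": text.split(": ")[1]}
    return {}


pages = ["Invoice 42", "Terms and conditions", "Total: 100"]
result = label_document_sketch(pages, by_page, looks_relevant, toy_extract)
```

Passing each stage in as a callable mirrors the way the chunking and classification strategies are selected independently in the options described later.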
Core Features
- Flexible Chunking Strategies: Multiple approaches for breaking documents into processable pieces
- Multi-Level Classification: Identifies document types and specific data regions
- Automatic Labeling: Tags document pages and regions based on extracted content
- Embedding Support: Optional vector-based similarity search for classification
- Structure Review: AI-based validation of extracted structure
- Intelligent Merging: Combines results from multiple document chunks
Options Configuration
Option | Description |
---|---|
taxonomy | The data definition to use for extraction, defining the structure of data to extract |
label_document | When enabled, adds labels to the document based on extracted data |
set_external_data | Sets external data properties based on extracted structure |
apply_guidance | Uses pre-defined guidance to improve extraction accuracy |
external_data_key | Key to use when setting external data |
Feature Options
Chunking Strategies
Controls how documents are divided for processing:
- Whole Document: Process the entire document in one pass (limited by the model's context window)
- Page: Process one page at a time
- First n Pages: Process only the first n pages
- Records: Detect and process individual records, even across pages
- Classified Content: Process all content of a specific type together
- Page Classified Content: Process classified content by page
- Consecutive Classified Content: Group consecutive classified content
- Group Classified Content: Group classified content across pages
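To make the difference between some of these strategies concrete, here is a minimal sketch. It assumes a hypothetical page shape of `(page_number, classification)` and uses the display names from the list above as keys; the function `chunk_pages` is illustrative and is not part of the Kodexa API. Only four of the strategies are shown.

```python
from itertools import groupby
from typing import List, Optional, Tuple

# Hypothetical page shape: (page_number, classification or None if unclassified).
ClassifiedPage = Tuple[int, Optional[str]]


def chunk_pages(
    pages: List[ClassifiedPage], strategy: str, n_pages: int = 5
) -> List[List[ClassifiedPage]]:
    """Split pages into chunks according to the named strategy."""
    if strategy == "Whole Document":
        return [pages]
    if strategy == "Page":
        return [[page] for page in pages]
    if strategy == "First n Pages":
        return [pages[:n_pages]]
    if strategy == "Consecutive Classified Content":
        # Each run of adjacent pages with the same classification is one chunk;
        # unclassified pages are dropped.
        return [
            list(run)
            for label, run in groupby(pages, key=lambda page: page[1])
            if label is not None
        ]
    raise ValueError(f"Strategy not covered by this sketch: {strategy}")


pages = [(1, "invoice"), (2, "invoice"), (3, None), (4, "invoice"), (5, "receipt")]
consecutive = chunk_pages(pages, "Consecutive Classified Content")
first_two = chunk_pages(pages, "First n Pages", n_pages=2)
```

In this example, pages 1 and 2 form one chunk, while page 4 becomes its own chunk because the unclassified page 3 breaks the run.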
Classification Strategies
Controls how document sections are classified:
- None: No classification
- Data Element: Classify based on data element definition
- Data Element and Children: Include child elements in classification
- Feature: Use custom feature for classification
Classification Content
Determines what content is used for classification:
- Text: Use text content (default)
- Bounding Boxes: Use spatial layout information
- Images & Bounding Boxes: Use visual and spatial information
- Image: Use only image data
- Embeddings: Use vector embeddings for similarity matching
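For the Embeddings option, similarity matching generally means ranking candidate content by how close its vector is to a query vector and keeping the best few. The sketch below shows that idea with cosine similarity in plain Python. It is an assumption about the general technique, not a description of how Kodexa computes or stores embeddings; `cosine` and `top_hits` are hypothetical helpers, and `max_hits` simply echoes the `maxHits` option (default 5) listed under Feature Configuration Options.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_hits(
    query: Sequence[float], candidates: List[Sequence[float]], max_hits: int = 5
) -> List[int]:
    """Return indexes of the max_hits candidates most similar to the query."""
    scored = [(cosine(query, vector), index) for index, vector in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:max_hits]]


# Toy two-dimensional vectors; real embeddings have many more dimensions.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
hits = top_hits([1.0, 0.0], candidates, max_hits=2)
```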
Example Usage
The LLM Data Labeling model is particularly useful for:
- Extracting structured data from semi-structured documents
- Building automated document processing pipelines
- Creating training datasets for machine learning
- Validating document content against expected structure
- Integrating document information with downstream systems
Advanced Features
- Structure Review: Validates extracted structure against expected schema
- AI-Assisted Merging: Uses AI to resolve conflicts when merging data from multiple chunks
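- Line Fallback: When enabled, falls back to line-level labeling if a value spans multiple lines and its content cannot be found
- Thinking Mode: Enables LLM reasoning traces, when the selected extraction model supports them, with the aim of improving extraction quality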
- Custom Model Selection: Override default models for classification and extraction
Inference Options
The following options can be configured when using this model for inference:
Name | Label | Type | Description | Default | Required |
---|---|---|---|---|---|
taxonomy | Data Definition | taxonomy | The data definition to use for the model | - | No |
label_document | Label Document | boolean | Label the document | True | No |
set_external_data | Set External Data | boolean | Set the external data to the structure from the data classes | False | No |
apply_guidance | Apply Guidance | boolean | Apply guidance, if found | False | No |
external_data_key | External Data Key | string | Key to use when setting external data | - | No |
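As a rough illustration of how these defaults combine with caller-supplied values, the sketch below overlays overrides on the documented defaults. The option names and defaults come from the table above; the helper `resolve_options`, the choice to model "-" as `None`, and the example key value `"invoice"` are assumptions made for illustration and are not part of the Kodexa API.

```python
from typing import Any, Dict

# Defaults copied from the Inference Options table; "-" is modeled as None.
INFERENCE_DEFAULTS: Dict[str, Any] = {
    "taxonomy": None,
    "label_document": True,
    "set_external_data": False,
    "apply_guidance": False,
    "external_data_key": None,
}


def resolve_options(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay caller-supplied values on the documented defaults."""
    unknown = sorted(set(overrides) - set(INFERENCE_DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown inference option(s): {unknown}")
    return {**INFERENCE_DEFAULTS, **overrides}


# "invoice" is a made-up key value used only for this example.
options = resolve_options({"set_external_data": True, "external_data_key": "invoice"})
```

Rejecting unknown names early catches typos such as `labelDocument`, which is a feature-level option rather than an inference option.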
Model Details
- Provider: Kodexa
Feature Configuration Options
These options allow for fine-tuning the model’s features during configuration:
Option Group 1
Name | Label | Type | Description | Default |
---|---|---|---|---|
enable_line_fallback | Enable Line Fallback | boolean | Fall back to line-level labeling if the value spans multiple lines and the content cannot be found | False |
raise_exception_on_fallback | Raise Exception on Fallback | boolean | Raise an exception if the model falls back to line-level labeling | False |
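One way to read the interaction between these two options is sketched below. This is an interpretation of the descriptions in the table, not the model's actual matching logic: `locate_lines`, `LineFallbackError`, and the word-overlap heuristic are all hypothetical.

```python
from typing import List


class LineFallbackError(RuntimeError):
    """Raised when a fallback would occur and the caller asked for an exception."""


def locate_lines(
    value: str,
    lines: List[str],
    enable_line_fallback: bool = False,
    raise_exception_on_fallback: bool = False,
) -> List[int]:
    """Return the indexes of the lines that should carry the label for value."""
    # Preferred path: the value is found intact within a single line.
    for index, line in enumerate(lines):
        if value in line:
            return [index]
    if not enable_line_fallback:
        return []  # content not found and fallback is disabled
    if raise_exception_on_fallback:
        raise LineFallbackError(f"Falling back to line-level labeling for {value!r}")
    # Fallback (assumed heuristic): label every line sharing a word with the value.
    words = set(value.split())
    return [index for index, line in enumerate(lines) if words & set(line.split())]


lines = ["Acme Corp", "123 Main St", "Springfield"]
address = "123 Main St Springfield"  # spans two lines, so no single line contains it
labeled = locate_lines(address, lines, enable_line_fallback=True)
```

Setting `raise_exception_on_fallback` can be useful while tuning a data definition, since it surfaces every case where precise matching failed instead of silently labeling whole lines.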
Option Group 2
Name | Label | Type | Description | Default |
---|---|---|---|---|
embedded | Embedded | boolean | Treat as embedded | False |
cardinality | Cardinality | string | The cardinality of the data element in the chunk | single |
classificationStrategy | Classification Strategy | string | Should we chunk using the data element for classification | dataElement |
classificationContent | Classification Content | string | What content should we use for classification | text |
maxHits | Max Embedding Hits | number | The maximum number of hits to return from embeddings | 5 |
includeExplanation | Include Explanation | boolean | Try to include an explanation of the classification in the response, useful for debugging | True |
ignoreNonWords | Ignore Non-Words | boolean | Ignore non-word tokens when classifying | True |
restrictClassification | Restrict Classification | boolean | Restrict to only this classification (no mixed classes allowed) | True |
rerank | Rerank Classification Matches | boolean | Should we rerank the classification results | False |
maxPagesFromRerank | Max Pages From Rerank | number | The maximum number of pages to return from the rerank | 5 |
chunkingStrategy | Chunking Strategy | string | How should we chunk the document for the LLM | classifiedContent |
nPages | Number of Pages | number | The number of pages to use for the chunking strategy | 5 |
tagPage | Label Page | boolean | Label the page, if classified | True |
labelDocument | Tag Document | boolean | Add a tag to the document if classified | True |
promptStrategy | Prompt Strategy | string | Which prompt strategy should we use | layout |
image_width | Image Width | number | The width of the image | 350 |
skipExtraction | Skip Extraction | boolean | Should we skip extraction for this data element | False |
includeImages | Include Images | boolean | Should we include images in the prompt even if the strategy doesn’t normally include them | False |
enableThinkingMode | Enable Thinking Mode | boolean | Should the LLM use thinking mode (if it is available for the selected extraction model)? | False |
overrideExtractionModel | Override Extraction Model | boolean | Should we override the extraction model | False |
extractionModel | Extraction Model | cloudModel | Choose a model if you wish to override the extraction model | anthropic.claude-3-haiku-20240307-v1:0 |
enableStructureReview | Enable Structure Review | boolean | Should we enable the structure review | True |
structureReview | Structure Review Model | cloudModel | Choose a model if you wish to review the structure | anthropic.claude-3-haiku-20240307-v1:0 |
merge | Merge | boolean | Merge the objects identified in the chunks | True |
mergeWithAI | Merge with AI | boolean | Use AI to review chunks that are grouped and merge them into a single representation | False |
mergeInstructions | Merge Instructions | string | Additional instructions for merging the results to be included in the merge prompt | - |
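The `merge` and `mergeWithAI` options concern how objects found in separate chunks become one result. The sketch below shows a simple non-AI merge that fills in missing fields and records disagreements; those recorded conflicts are the kind of case an AI-assisted merge might be asked to resolve. The function `merge_chunk_objects` and its conflict format are assumptions for illustration, not Kodexa's merge algorithm.

```python
from typing import Dict, List, Optional, Tuple

ChunkObject = Dict[str, Optional[str]]


def merge_chunk_objects(
    objects: List[ChunkObject],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Combine per-chunk objects; return (merged fields, conflicting values)."""
    merged: Dict[str, str] = {}
    conflicts: Dict[str, List[str]] = {}
    for obj in objects:
        for field, value in obj.items():
            if value is None or value == "":
                continue  # nothing extracted for this field in this chunk
            if field not in merged:
                merged[field] = value  # first non-empty value wins
            elif merged[field] != value:
                # Keep the winning value first, then each disagreeing value.
                conflicts.setdefault(field, [merged[field]]).append(value)
    return merged, conflicts


chunk_results = [
    {"invoice_number": "42", "total": None},
    {"invoice_number": "42", "total": "100"},
    {"total": "110"},
]
merged, conflicts = merge_chunk_objects(chunk_results)
```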