Slug: llm-taxonomy-model
Version: 1.0.0
Infer: Yes
Event Aware: No

Overview

LLM Data Labeling Model

The Kodexa LLM Data Labeling model extracts structured data from documents, guided by a data definition (taxonomy). It combines pattern recognition with large language models (LLMs) to identify and extract the information that the data definition describes.

How It Works

  1. Data Definition-Based Extraction: The model uses your data definition (taxonomy) to identify what data to extract from documents
  2. Intelligent Chunking: Documents are broken into manageable pieces based on your selected strategy
  3. Content Classification: AI analyzes content to identify relevant sections for each data element
  4. LLM-Powered Extraction: Advanced language models process chunks to extract structured data
  5. Merging and Validation: Results from multiple chunks are intelligently combined and validated
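
The five steps above can be sketched as a small pipeline. This is an illustrative outline only, not the Kodexa implementation: the names `Chunk`, `chunk_by_page`, `is_relevant`, `merge`, `run_pipeline`, and `fake_extract` are all hypothetical, keyword matching stands in for AI classification, and a toy "field: value" parser stands in for the LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Chunk:
    page: int  # zero-based page index
    text: str


def chunk_by_page(pages: list[str]) -> list[Chunk]:
    # Step 2: one chunk per page (a stand-in for the "Page" strategy).
    return [Chunk(page=i, text=t) for i, t in enumerate(pages)]


def is_relevant(chunk: Chunk, fields: list[str]) -> bool:
    # Step 3: keyword match as a placeholder for AI classification.
    lowered = chunk.text.lower()
    return any(f.lower() in lowered for f in fields)


def merge(results: list[dict], fields: list[str]) -> dict:
    # Step 5: keep the first non-empty value per field and drop unknown keys.
    merged: dict[str, Optional[str]] = {f: None for f in fields}
    for result in results:
        for key, value in result.items():
            if key in merged and merged[key] is None and value:
                merged[key] = value
    return merged


def run_pipeline(pages: list[str], fields: list[str],
                 extract: Callable[[Chunk, list[str]], dict]) -> dict:
    # Step 1 is the `fields` argument: the data definition says what to look for.
    chunks = [c for c in chunk_by_page(pages) if is_relevant(c, fields)]
    return merge([extract(c, fields) for c in chunks], fields)  # steps 4 and 5


def fake_extract(chunk: Chunk, fields: list[str]) -> dict:
    # Toy extractor (placeholder for the LLM): reads "field: value" lines.
    known = [f.lower() for f in fields]
    out = {}
    for line in chunk.text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in known:
            out[name.strip().lower()] = value.strip()
    return out


pages = ["Invoice_Number: INV-1\nThank you", "Terms and conditions", "Total: 42.00"]
print(run_pipeline(pages, ["invoice_number", "total"], fake_extract))
# {'invoice_number': 'INV-1', 'total': '42.00'}
```

The second page is never sent to the extractor because it mentions no field, which is the point of classifying before extracting.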

Core Features

  • Flexible Chunking Strategies: Multiple approaches for breaking documents into processable pieces
  • Multi-Level Classification: Identifies document types and specific data regions
  • Automatic Labeling: Tags document pages and regions based on extracted content
  • Embedding Support: Optional vector-based similarity search for classification
  • Structure Review: AI-based validation of extracted structure
  • Intelligent Merging: Combines results from multiple document chunks

Options Configuration

Option | Description
taxonomy | The data definition to use for extraction, defining the structure of data to extract
label_document | When enabled, adds labels to the document based on extracted data
set_external_data | Sets external data properties based on extracted structure
apply_guidance | Uses pre-defined guidance to improve extraction accuracy
external_data_key | Key to use when setting external data

Feature Options

Chunking Strategies

Controls how documents are divided for processing:

  • Whole Document: Process entire document (limited by context window)
  • Page: Process one page at a time
  • First n Pages: Process only the first n pages
  • Records: Detect and process individual records, even across pages
  • Classified Content: Process all content of a specific type together
  • Page Classified Content: Process classified content by page
  • Consecutive Classified Content: Group consecutive classified content
  • Group Classified Content: Group classified content across pages
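
To make the differences between some of these strategies concrete, here is a minimal sketch that works at page granularity and returns each chunk as a list of page indexes. The function names are hypothetical, and the behaviour shown is one reading of the descriptions above rather than the model's actual logic (which may also split content within a page).

```python
from itertools import groupby


def whole_document(page_count: int) -> list[list[int]]:
    # Whole Document: a single chunk containing every page.
    return [list(range(page_count))]


def by_page(page_count: int) -> list[list[int]]:
    # Page: one chunk per page.
    return [[i] for i in range(page_count)]


def first_n_pages(page_count: int, n: int) -> list[list[int]]:
    # First n Pages: a single chunk with at most the first n pages.
    return [list(range(min(n, page_count)))]


def classified_content(labels: list[str], target: str) -> list[list[int]]:
    # Classified Content: every page with the target label, in one chunk.
    hits = [i for i, label in enumerate(labels) if label == target]
    return [hits] if hits else []


def consecutive_classified_content(labels: list[str], target: str) -> list[list[int]]:
    # Consecutive Classified Content: one chunk per run of adjacent matching pages.
    chunks = []
    for label, run in groupby(enumerate(labels), key=lambda item: item[1]):
        if label == target:
            chunks.append([index for index, _ in run])
    return chunks


# Hypothetical per-page classification results for a four-page document.
labels = ["invoice", "invoice", "other", "invoice"]
print(classified_content(labels, "invoice"))              # [[0, 1, 3]]
print(consecutive_classified_content(labels, "invoice"))  # [[0, 1], [3]]
```

The choice mostly trades context for cost: larger chunks give the LLM more surrounding content but are limited by the context window, while smaller chunks rely more heavily on the merge step.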

Classification Strategies

Controls how document sections are classified:

  • None: No classification
  • Data Element: Classify based on data element definition
  • Data Element and Children: Include child elements in classification
  • Feature: Use custom feature for classification

Classification Content

Determines what content is used for classification:

  • Text: Use text content (default)
  • Bounding Boxes: Use spatial layout information
  • Images & Bounding Boxes: Use visual and spatial information
  • Image: Use only image data
  • Embeddings: Use vector embeddings for similarity matching
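
As an illustration of the Embeddings option, the sketch below ranks pages by cosine similarity to a query vector and keeps the top matches. It is a generic similarity search under stated assumptions, not the model's own code: `cosine` and `top_hits` are hypothetical names, the vectors are toy two-dimensional values, and the `max_hits` default of 5 simply mirrors the Max Embedding Hits default listed in the feature configuration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_hits(query: list[float], page_vectors: list[list[float]],
             max_hits: int = 5) -> list[int]:
    # Score every page, then return the indexes of the best-scoring pages.
    # Ties are broken by page order so the result is deterministic.
    scored = [(cosine(query, vector), index)
              for index, vector in enumerate(page_vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:max_hits]]


# Toy data: page 0 matches the query exactly, page 2 partially, page 1 not at all.
page_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_hits([1.0, 0.0], page_vectors, max_hits=2))  # [0, 2]
```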

Example Usage

The LLM Data Labeling model is particularly useful for:

  • Extracting structured data from semi-structured documents
  • Building automated document processing pipelines
  • Creating training datasets for machine learning
  • Validating document content against expected structure
  • Integrating document information with downstream systems

Advanced Features

  • Structure Review: Validates extracted structure against expected schema
  • AI-Assisted Merging: Uses AI to resolve conflicts when merging data from multiple chunks
  • Line Fallback: Automatically falls back to line-level processing for complex cases
  • Thinking Mode: Enables LLM reasoning traces for better extraction quality
  • Custom Model Selection: Override default models for classification and extraction

Inference Options

The following options can be configured when using this model for inference:

Name | Label | Type | Description | Default | Required
taxonomy | Data Definition | taxonomy | The data definition to use for the model | - | No
label_document | Label Document | boolean | Label the document | True | No
set_external_data | Set External Data | boolean | Set the external data to the structure from the data classes | False | No
apply_guidance | Apply Guidance | boolean | Apply guidance, if found | False | No
external_data_key | External Data Key | string | N/A | - | No
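
For reference, the inference options and their listed defaults can be written as a plain dictionary. This sketch only mirrors the names and defaults in the table above; how the options are actually supplied to the model is not shown here, and `resolve_options` is a hypothetical helper, not part of any Kodexa API.

```python
# Defaults copied from the table above; None marks options with no listed default.
INFERENCE_DEFAULTS = {
    "taxonomy": None,
    "label_document": True,
    "set_external_data": False,
    "apply_guidance": False,
    "external_data_key": None,
}


def resolve_options(overrides: dict) -> dict:
    # Hypothetical helper: reject misspelled names, then layer overrides on defaults.
    unknown = set(overrides) - set(INFERENCE_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    return {**INFERENCE_DEFAULTS, **overrides}


# Example: write the extracted structure to external data under a chosen key.
# The key name "invoice" is made up for illustration.
options = resolve_options({"set_external_data": True, "external_data_key": "invoice"})
print(options["label_document"], options["set_external_data"])  # True True
```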

Model Details

  • Provider: Kodexa

Feature Configuration Options

These options allow for fine-tuning the model’s features during configuration:

Option Group 1

Name | Label | Type | Description | Default
enable_line_fallback | Enable Line Fallback | boolean | Fall back to line-level labeling if the content spans multiple lines and cannot be found | False
raise_exception_on_fallback | Raise Exception on Fallback | boolean | Raise an exception if we fall back to line-level labeling | False
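
The interaction between these two flags can be sketched as follows. This is a guess at the intended behaviour based only on the descriptions above: `label_value` and `LineFallbackError` are hypothetical names, the word-overlap rule for picking lines is invented for illustration, and the sketch assumes the exception flag only matters when the fallback itself is enabled.

```python
from typing import Optional


class LineFallbackError(RuntimeError):
    """Raised when the fallback would be used but the caller asked to fail instead."""


def label_value(lines: list[str], value: str,
                enable_line_fallback: bool = False,
                raise_exception_on_fallback: bool = False) -> Optional[dict]:
    # Preferred path: the value appears verbatim inside a single line.
    for index, line in enumerate(lines):
        if value in line:
            return {"level": "content", "lines": [index]}

    # The exact content was not found. Without the fallback, nothing is labeled.
    if not enable_line_fallback:
        return None
    if raise_exception_on_fallback:
        raise LineFallbackError(f"Line-level fallback needed for {value!r}")

    # Fallback: label every line that shares at least one word with the value.
    words = set(value.lower().split())
    hits = [i for i, line in enumerate(lines) if words & set(line.lower().split())]
    return {"level": "line", "lines": hits} if hits else None


lines = ["Ship to: ACME Corp", "12 Main Street", "Springfield"]
print(label_value(lines, "ACME Corp"))
# {'level': 'content', 'lines': [0]}
print(label_value(lines, "12 Main Street Springfield", enable_line_fallback=True))
# {'level': 'line', 'lines': [1, 2]}
```

Enabling the exception flag is useful when testing a data definition: it surfaces every case where extraction could not be tied back to exact content instead of silently labeling whole lines.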

Option Group 2

Name | Label | Type | Description | Default
- | N/A | article | N/A | -
embedded | Embedded | boolean | Treat as embedded | False
cardinality | Cardinality | string | The cardinality of the data element in the chunk | single
classificationStrategy | Classification Strategy | string | Should we chunk using the data element for classification | dataElement
classificationContent | Classification Content | string | What content should we use for classification | text
maxHits | Max Embedding Hits | number | The maximum number of hits to return from embeddings | 5
includeExplanation | Include Explanation | boolean | Try to include an explanation of the classification in the response, useful for debugging | True
ignoreNonWords | Ignore Non-Words | boolean | Ignore non-word tokens when classifying | True
restrictClassification | Restrict Classification | boolean | Restrict to only this classification (no mixed classes allowed) | True
rerank | Rerank Classification Matches | boolean | Should we rerank the classification results | False
maxPagesFromRerank | Max Pages From Rerank | number | The maximum number of pages to return from the rerank | 5
chunkingStrategy | Chunking Strategy | string | How should we chunk the document for the LLM | classifiedContent
nPages | Number of Pages | number | The number of pages to use for the chunking strategy | 5
tagPage | Label Page | boolean | Label the page, if classified | True
labelDocument | Tag Document | boolean | Add a tag to the document if classified | True
promptStrategy | Prompt Strategy | string | Which prompt strategy should we use | layout
image_width | Image Width | number | The width of the image | 350
skipExtraction | Skip Extraction | boolean | Should we skip extraction for this data element | False
includeImages | Include Images | boolean | Should we include images in the prompt even if the strategy doesn’t normally include them | False
enableThinkingMode | Enable Thinking Mode | boolean | Should the LLM use thinking mode (if it is available for the selected extraction model)? | False
overrideExtractionModel | Override Extraction Model | boolean | Should we override the extraction model | False
extractionModel | Extraction Model | cloudModel | Choose a model if you wish to override the extraction model | anthropic.claude-3-haiku-20240307-v1:0
enableStructureReview | Enable Structure Review | boolean | Should we enable the structure review | True
structureReview | Structure Review Model | cloudModel | Choose a model if you wish to review the structure | anthropic.claude-3-haiku-20240307-v1:0
merge | Merge | boolean | Merge the objects identified in the chunks | True
mergeWithAI | Merge with AI | boolean | Use AI to review chunks that are grouped and merge them into a single representation | False
mergeInstructions | Merge Instructions | string | Additional instructions for merging the results to be included in the merge prompt | -
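
Several of these options come in pairs, where a boolean switch decides whether a companion value is used at all. The sketch below shows that pattern for the extraction model, using a subset of the defaults listed above. It is an illustration under assumptions: `with_overrides` and `effective_extraction_model` are hypothetical helpers, and `platform_default` stands for whatever model is used when no override is set, which this page does not name.

```python
# A subset of the defaults from the table above.
FEATURE_DEFAULTS = {
    "chunkingStrategy": "classifiedContent",
    "classificationStrategy": "dataElement",
    "classificationContent": "text",
    "maxHits": 5,
    "nPages": 5,
    "promptStrategy": "layout",
    "enableThinkingMode": False,
    "overrideExtractionModel": False,
    "extractionModel": "anthropic.claude-3-haiku-20240307-v1:0",
    "enableStructureReview": True,
    "merge": True,
    "mergeWithAI": False,
}


def with_overrides(**overrides) -> dict:
    # Hypothetical helper: reject unknown names, then apply overrides to defaults.
    unknown = set(overrides) - set(FEATURE_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    return {**FEATURE_DEFAULTS, **overrides}


def effective_extraction_model(config: dict, platform_default: str) -> str:
    # extractionModel is only consulted when overrideExtractionModel is switched on.
    if config["overrideExtractionModel"]:
        return config["extractionModel"]
    return platform_default


config = with_overrides(overrideExtractionModel=True, enableThinkingMode=True)
print(effective_extraction_model(config, platform_default="some-default-model"))
# anthropic.claude-3-haiku-20240307-v1:0
```

Setting `extractionModel` without also enabling `overrideExtractionModel` would, under this reading, have no effect.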