
Overview

Tagging in Kodexa lets you mark and annotate specific portions of content within your document nodes. A tag can apply to an entire node or to a span of its text, and it can carry additional metadata and links to related tags.

Tag Structure

A tag in Kodexa consists of the following components:
  • Name: The identifier for the tag (e.g., ‘name’, ‘address’, ‘phone’)
  • Value: The actual content being tagged
  • Start/End Positions: Optional positions within the node’s content (if tagging specific text)
  • UUID: Unique identifier for the tag instance
  • Confidence: A score between 0 and 1 indicating tagging certainty
  • Group UUID: Links related tags together
  • Data: Additional JSON-serializable metadata
  • Owner URI: Identifies the source that created the tag (e.g., a model reference)
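To make the component list concrete, here is a minimal sketch that models a tag as a plain Python dataclass. `TagRecord` and its field names are hypothetical stand-ins chosen to mirror the list above; this is not Kodexa's own tag class, and the range check on confidence is an assumption based on the "between 0 and 1" description.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
from uuid import uuid4


@dataclass
class TagRecord:
    """Hypothetical stand-in that mirrors the tag components listed above."""
    name: str
    value: Optional[str] = None
    start: Optional[int] = None  # only set when tagging part of the text
    end: Optional[int] = None
    tag_uuid: str = field(default_factory=lambda: str(uuid4()))
    confidence: Optional[float] = None
    group_uuid: Optional[str] = None
    data: Optional[Dict[str, Any]] = None
    owner_uri: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed rule: a confidence, when present, must lie in [0, 1]
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


tag = TagRecord(name="phone", value="555-0100", start=6, end=14, confidence=0.9)
print(tag.name, tag.value, tag.confidence)
```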

Tagging Methods

1. Basic Node Tagging

The simplest form of tagging applies a tag to an entire node:
# Basic node tagging
node.tag('category')

2. Fixed Position Tagging

Tag specific portions of text using start and end positions:
# Tag text from position 6 to 12
node.tag('name', fixed_position=[6, 12])
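Positions are usually computed rather than hard-coded. The sketch below derives a `[start, end]` pair for a substring with plain string operations. `find_span` is a hypothetical helper, and it assumes the end position is exclusive, as in Python slicing; the source does not state which convention Kodexa uses, so verify this against your own documents.

```python
from typing import List, Optional


def find_span(content: str, needle: str) -> Optional[List[int]]:
    """Return [start, end] for the first occurrence of needle, or None."""
    start = content.find(needle)
    if start == -1:
        return None
    # Assumption: end is exclusive, so content[start:end] == needle
    return [start, start + len(needle)]


content = "Hello Kodexa, welcome"
span = find_span(content, "Kodexa")
print(span)  # [6, 12]
# With a real node, the span could then be passed on, for example:
# node.tag('name', fixed_position=span)
```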

3. Regular Expression Tagging

Tag content that matches a specific pattern (Python):
# Tag all email addresses in the content
node.tag('email', content_re=r'[\w\.-]+@[\w\.-]+')
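Before tagging with a pattern, it can help to preview what the pattern matches. The following sketch uses Python's standard `re` module on a sample string; `preview_matches` is a hypothetical helper, and it assumes `content_re` follows the same matching rules as `re`, which the source does not confirm.

```python
import re

EMAIL_RE = r'[\w\.-]+@[\w\.-]+'


def preview_matches(pattern: str, content: str):
    """Return (start, end, text) for every match of pattern in content."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, content)]


sample = "Contact alice@example.com or bob@test.org."
for start, end, text in preview_matches(EMAIL_RE, sample):
    print(start, end, repr(text))
```

Note that the second match includes the sentence's trailing period, because `.` is part of the character class. A preview like this makes such edge cases visible before any tags are written.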

Advanced Tagging Features

Tag Groups

Tags can be grouped together using group UUIDs to show they are related:
import uuid

group_id = str(uuid.uuid4())
# Tag multiple related elements with the same group UUID
node.tag('person_name', fixed_position=[0, 10], group_uuid=group_id)
node.tag('person_age', fixed_position=[15, 17], group_uuid=group_id)
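Once tags share a group UUID, they can be reassembled into one record per group. This is a minimal sketch over plain dictionaries; the record shape and the `group_tags` helper are hypothetical stand-ins for tags read back from nodes, not part of the Kodexa API.

```python
import uuid
from collections import defaultdict

group_id = str(uuid.uuid4())

# Hypothetical flat records standing in for tags read back from nodes
tags = [
    {"name": "person_name", "value": "Jane Smith", "group_uuid": group_id},
    {"name": "person_age", "value": "42", "group_uuid": group_id},
    {"name": "category", "value": None, "group_uuid": None},
]


def group_tags(records):
    """Collect name/value pairs per group UUID, skipping ungrouped tags."""
    groups = defaultdict(dict)
    for record in records:
        if record["group_uuid"] is not None:
            groups[record["group_uuid"]][record["name"]] = record["value"]
    return dict(groups)


people = group_tags(tags)
print(people[group_id])  # {'person_name': 'Jane Smith', 'person_age': '42'}
```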

Tag Metadata

Additional data can be associated with tags (Python):
node.tag('address', data={
    'type': 'residential',
    'verified': True
})
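Because tag data is described as JSON-serializable, a quick check before tagging can catch values such as sets or custom objects. The sketch below uses only the standard `json` module; `is_json_serializable` is a hypothetical helper, and how Kodexa itself reacts to non-serializable data is not covered by the source.

```python
import json


def is_json_serializable(data) -> bool:
    """Return True if json.dumps can encode data, False otherwise."""
    try:
        json.dumps(data)
        return True
    except (TypeError, ValueError):
        return False


print(is_json_serializable({'type': 'residential', 'verified': True}))  # True
print(is_json_serializable({'seen_on_pages': {1, 2, 3}}))  # False, sets are not JSON
```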

Tag Confidence

You can specify confidence levels for tags:
node.tag('product_code', confidence=0.95)

Tag Owner URI

Identify the source that created a tag:
node.tag('invoice_number', owner_uri='model://kodexa/invoice-extractor:1.0.0')
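When auditing tags, it can be useful to split an owner URI into its parts. The sketch below does so with `urllib.parse.urlsplit`. The `scheme://organization/slug:version` layout is an assumption inferred from the single example above, and `parse_owner_uri` is a hypothetical helper; other owner URI shapes may exist.

```python
from urllib.parse import urlsplit


def parse_owner_uri(owner_uri: str) -> dict:
    """Split an owner URI, assuming scheme://organization/slug:version."""
    parts = urlsplit(owner_uri)
    slug, _, version = parts.path.lstrip('/').partition(':')
    return {
        'scheme': parts.scheme,
        'organization': parts.netloc,
        'slug': slug,
        'version': version or None,  # None when no ':version' suffix is present
    }


owner = parse_owner_uri('model://kodexa/invoice-extractor:1.0.0')
print(owner)
```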

Working with Tagged Content

Retrieving Tags

# Get all tags on a node
tags = node.get_tags()

# Iterate over tags
for tag in tags:
    print(f"UUID: {tag.uuid}, Value: {tag.value}, Confidence: {tag.confidence}")

# Check if a node has a specific tag
if node.has_tag('address'):
    print("Node has address tag")

# Get tag names
tag_names = node.get_tag_names()
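A common follow-up to retrieving tags is keeping only those above a confidence threshold. This sketch runs on `SimpleNamespace` stand-ins that expose the same `uuid`, `value`, and `confidence` attributes used above. `confident_tags`, the 0.8 default, and the choice to keep tags that have no score are all assumptions for illustration.

```python
from types import SimpleNamespace

# Stand-ins for tag objects; real tags would come from node.get_tags()
tags = [
    SimpleNamespace(uuid='a1', value='INV-001', confidence=0.95),
    SimpleNamespace(uuid='b2', value='INV-0O1', confidence=0.40),
    SimpleNamespace(uuid='c3', value='manual entry', confidence=None),
]


def confident_tags(tags, threshold=0.8):
    """Keep tags at or above the threshold; tags without a score are kept."""
    return [t for t in tags if t.confidence is None or t.confidence >= threshold]


print([t.value for t in confident_tags(tags)])  # ['INV-001', 'manual entry']
```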

Removing Tags

# Remove a specific tag by name
node.remove_tag('category')

Tag Instances

Tag instances allow you to group multiple nodes under a single tag. This is useful when a piece of information spans multiple nodes:
# Create a tag instance spanning multiple nodes
nodes = document.select('//line')
document.add_tag_instance('address_block', nodes)
This tags all the selected nodes with the same tag name and the same UUID, linking them together as a group.
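The linking described above can be pictured with a small simulation: one UUID is generated once and attached, with the same tag name, to every node. This is only an illustrative model using dictionaries as hypothetical nodes; it does not reproduce Kodexa's internal implementation.

```python
import uuid


def simulate_tag_instance(tag_name, nodes):
    """Attach one shared tag record (same name, same UUID) to each node."""
    shared_uuid = str(uuid.uuid4())
    for node in nodes:
        node.setdefault('tags', []).append({'name': tag_name, 'uuid': shared_uuid})
    return shared_uuid


# Hypothetical line nodes that together form one address
lines = [{'content': '12 High Street'}, {'content': 'Springfield'}]
instance_uuid = simulate_tag_instance('address_block', lines)

# Every node now carries the same UUID, so the lines can be found as one unit
print(all(n['tags'][0]['uuid'] == instance_uuid for n in lines))  # True
```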

Finding Tagged Nodes

You can use selectors to find nodes with specific tags:
# Find all nodes with a specific tag
tagged_nodes = document.select("//*[hasTag('company_name')]")

# Find all nodes that have any tag
all_tagged = document.select("//*[hasTag()]")
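When tag names come from configuration, a small helper can build these selector strings consistently. `has_tag_selector` is a hypothetical helper that only formats the two selector shapes shown above. Since the source does not describe quoting rules inside selectors, the sketch rejects names containing a single quote rather than guessing at an escape syntax.

```python
def has_tag_selector(tag_name=None):
    """Build a hasTag selector for one tag name, or for any tag if None."""
    if tag_name is None:
        return "//*[hasTag()]"
    if "'" in tag_name:
        # Escaping rules are unknown, so refuse instead of guessing
        raise ValueError("tag name must not contain a single quote")
    return f"//*[hasTag('{tag_name}')]"


print(has_tag_selector('company_name'))  # //*[hasTag('company_name')]
print(has_tag_selector())                # //*[hasTag()]
```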

Diagrams

[Diagram: Basic Tag Structure]

[Diagram: Tag Relationships]

Best Practices

  1. Use Meaningful Tag Names: Choose descriptive names that reflect the content being tagged.
  2. Group Related Tags: Use group_uuid (Python) or groupId (TypeScript) to group related pieces of information.
  3. Include Confidence: When using automated tagging, include confidence scores.
  4. Add Metadata: Use the data parameter to store additional context about the tag.
  5. Set Owner URI: When tagging from models or automated processes, set the owner_uri to track the tag source.

Common Patterns

Document Classification

# Tag document type based on content
node.tag('document_type', value='invoice', confidence=0.98, data={
    'classifier': 'invoice_classifier_v1'
})

Entity Extraction

# Tag named entities using regex
node.tag('organization', content_re=r'Microsoft|Google|Apple')
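A bare alternation also matches inside longer words, so it can be worth generating the pattern from a list with escaping and word boundaries. The sketch below uses Python's `re` module; `alternation_pattern` is a hypothetical helper, and it assumes `content_re` accepts the same syntax as `re`.

```python
import re


def alternation_pattern(names):
    """Build a word-bounded alternation, longest names first."""
    escaped = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return r'\b(?:' + '|'.join(escaped) + r')\b'


pattern = alternation_pattern(['Microsoft', 'Google', 'Apple'])
text = "Google and Applebee's are unrelated; Apple Inc. is listed."
print(re.findall(pattern, text))  # ['Google', 'Apple']
# The resulting string could then be passed as content_re, for example:
# node.tag('organization', content_re=pattern)
```

Without the `\b` anchors, the plain pattern would also match the "Apple" inside "Applebee's".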

Form Field Extraction

# Tag form fields with metadata
node.tag('field', fixed_position=[100, 150], data={
    'field_name': 'total_amount',
    'field_type': 'currency',
    'required': True
})

Tag Options Reference

The tag() method in Python accepts these keyword arguments:
Option             Type   Description
content_re         str    Regular expression to match content
fixed_position     list   [start, end] positions in content
tag_uuid           str    UUID for the tag instance
group_uuid         str    UUID to group related tags
parent_group_uuid  str    Parent group UUID for hierarchical grouping
confidence         float  Confidence score (0-1)
value              str    Tagged value
data               dict   Additional metadata
cell_index         int    Cell index for table structures
owner_uri          str    Source identifier for the tag
In TypeScript, use tagWithOptions(name, options) with the TagOptions interface:
Option         Type    Description
start          number  Start position in content
end            number  End position in content
confidence     number  Confidence score (0-1)
groupId        number  Group ID for related tags
parentGroupId  number  Parent group ID
cellIndex      number  Cell index for table structures

Error Handling

When working with tags, consider these common issues:
  1. Position Errors: Ensure fixed positions are within content bounds
  2. Regular Expression Matching: Test patterns thoroughly
  3. Node Selection: Verify node existence before tagging
  4. Content Accessibility: Check content availability before tagging
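# Example of safe tagging with bounds checks and error handling
start_position, end_position = 0, 10  # example span; adjust to your content
try:
    content = node.content
    # Tag only when content exists and the span lies within it
    if content and 0 <= start_position < end_position <= len(content):
        node.tag('field', fixed_position=[start_position, end_position])
except Exception as e:
    print(f"Tagging error: {e}")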
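For position errors in particular, a dedicated check that reports why a span is unusable makes failures easier to diagnose than a silent skip. `validate_span` below is a hypothetical helper working on plain strings, and it assumes the end position is exclusive and may equal the content length.

```python
def validate_span(content, start, end):
    """Raise ValueError with a specific message if the span is unusable."""
    if not content:
        raise ValueError("node has no content to tag")
    if not (isinstance(start, int) and isinstance(end, int)):
        raise ValueError("positions must be integers")
    if start < 0 or end > len(content):
        raise ValueError(
            f"span [{start}, {end}] is outside content of length {len(content)}"
        )
    if start >= end:
        raise ValueError(f"start ({start}) must be less than end ({end})")


validate_span("Total: 120.00", 7, 13)  # passes silently
try:
    validate_span("Total: 120.00", 7, 50)
except ValueError as err:
    print(err)  # span [7, 50] is outside content of length 13
```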

Performance Considerations

  1. Batch related tags together using group_uuid
  2. Use specific selectors to limit the scope of tagging operations
  3. Consider using tag instances for large groups of related nodes
  4. Use transactions when performing many tag operations together
Tags are stored in the document’s KDDB persistence layer, so efficient tagging practices improve overall document processing performance.