Document Tagging

Overview

Tagging in Kodexa lets you mark and annotate specific portions of content within your document nodes. Tags can be applied to entire nodes or to specific spans of text, and they can carry additional metadata as well as relationships to other tagged elements.

Tag Structure

A tag in Kodexa consists of the following components:

Name: The identifier for the tag (e.g., ‘name’, ‘address’, ‘phone’)
Value: The actual content being tagged
Start/End Positions: Optional positions within the node’s content (if tagging specific text)
UUID: Unique identifier that can be used to relate multiple tags
Metadata: Additional data associated with the tag
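
As a rough mental model, the components above can be pictured as a small record. The sketch below is plain Python and is not the Kodexa implementation; the class name TagSketch and its field names are hypothetical, chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import uuid


@dataclass
class TagSketch:
    """Hypothetical stand-in mirroring the listed components (not a Kodexa class)."""
    name: str                      # identifier, e.g. 'name', 'address', 'phone'
    value: str                     # the content being tagged
    start: Optional[int] = None    # optional start position within the node content
    end: Optional[int] = None      # optional end position within the node content
    tag_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    data: Dict[str, Any] = field(default_factory=dict)  # additional metadata


content = "Hello Philip, welcome back"
tag = TagSketch(name="name", value=content[6:12], start=6, end=12)
print(tag.name, repr(tag.value), tag.start, tag.end)
```

Leaving start and end as None corresponds to a tag that covers the whole node rather than a span of text.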

Tagging Methods

1. Basic Node Tagging

The simplest form of tagging applies a tag to an entire node.

# Basic node tagging
document.content_node.tag('category')

2. Fixed Position Tagging

Tag specific portions of text using start and end positions.

# Tag text from position 6 to 12
document.content_node.tag('name', fixed_position=[6, 12])
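
Hard-coded offsets are easy to get wrong, so it can help to derive them from the text itself. The sketch below is plain Python with a hypothetical helper, find_span; it assumes Python slice semantics, where the end offset is exclusive. Whether the end value of fixed_position is treated as inclusive or exclusive is not stated here, so confirm that before passing computed offsets along.

```python
from typing import Optional, Tuple


def find_span(content: str, needle: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first occurrence of needle, or None if absent."""
    start = content.find(needle)
    if start == -1:
        return None
    return start, start + len(needle)


content = "Hello Philip, welcome back"
span = find_span(content, "Philip")
print(span)  # (6, 12)
```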

3. Regular Expression Tagging

Tag content that matches a specific pattern.

# Tag all email addresses in the content
document.content_node.tag('email', content_re=r'[\w\.-]+@[\w\.-]+')
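
Before tagging a document, it may be worth previewing what a pattern matches on sample text. The sketch below uses only Python's re module, not Kodexa, and the sample string is made up. Note that this simple pattern would also swallow a trailing period if an address ended a sentence.

```python
import re

# Simple email pattern: word characters, dots, or hyphens on both sides of '@'
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')

sample = "Contact alice@example.com or bob.smith@example.org for details."

# Collect each match together with its start and end offsets
matches = [(m.group(), m.start(), m.end()) for m in EMAIL_RE.finditer(sample)]
for text, start, end in matches:
    print(text, start, end)
```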

4. Node-Only Tagging with Regular Expression

Tag entire nodes that match a pattern.

# Tag nodes that contain a date
document.content_node.tag('date_node', content_re=r'.*\d{2}/\d{2}/\d{4}.*', node_only=True)
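
To see which pieces of content a node-level pattern would select, the same check can be run over plain strings. This is a sketch with Python's re module and invented sample texts; it does not show how Kodexa applies the pattern internally. The pattern only checks the shape NN/NN/NNNN and does not validate that the digits form a real date.

```python
import re

# Any text containing something shaped like NN/NN/NNNN
DATE_NODE_RE = re.compile(r'.*\d{2}/\d{2}/\d{4}.*')

# Stand-ins for the content of three separate nodes
node_texts = [
    "Invoice date: 03/15/2024",
    "Total due: 1,250.00",
    "Shipped 2024-03-15",
]

matching = [text for text in node_texts if DATE_NODE_RE.match(text)]
print(matching)
```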

Advanced Tagging Features

Tag Groups

Tags can be grouped together using UUIDs to show they are related:

import uuid

tag_uuid = str(uuid.uuid4())
# Tag multiple related elements with the same UUID
document.content_node.tag('person_name', fixed_position=[0, 10], tag_uuid=tag_uuid)
document.content_node.tag('person_age', fixed_position=[15, 17], tag_uuid=tag_uuid)
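
The value of a shared UUID shows up when tags are read back: related values can be reassembled into one record per group. The sketch below illustrates the idea with plain Python tuples; the record layout (name, value, tag_uuid) is a hypothetical simplification, not the structure Kodexa returns.

```python
import uuid
from collections import defaultdict

group_a = str(uuid.uuid4())
group_b = str(uuid.uuid4())

# Hypothetical flat tag records: (tag name, value, tag_uuid)
records = [
    ("person_name", "Alice", group_a),
    ("person_age", "34", group_a),
    ("person_name", "Bob", group_b),
    ("person_age", "41", group_b),
]

# Collect the tags that share a UUID into one dictionary per group
grouped = defaultdict(dict)
for name, value, tag_uuid in records:
    grouped[tag_uuid][name] = value

people = list(grouped.values())
print(people)
```

Without the UUID, the two names and two ages would be indistinguishable as to which age belongs to which person.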

Tag Metadata

Additional data can be associated with tags:

document.content_node.tag('address', data={
    'type': 'residential',
    'verified': True
})

Tag Confidence

You can specify confidence levels for tags:

document.content_node.tag('product_code', confidence=0.95)

Working with Tagged Content

Retrieving Tags

# Get all tags on a node
tags = node.get_tags()

# Get specific tag values
values = node.get_tag_values('address')

# Get related tag values
related_values = node.get_related_tag_values('person')

Removing Tags

# Remove a specific tag
node.remove_tag('category')

# Remove all tags
node.remove_feature('tag', '*')

Tag Instances

Tag instances allow you to group multiple nodes under a single tag:

# Create a tag instance for multiple nodes
nodes = document.select('//address/*')
document.add_tag_instance('address_block', nodes)

Diagrams

[Diagram: Basic Tag Structure]

[Diagram: Tag Relationships]

Best Practices

Use Meaningful Tag Names: Choose descriptive names that reflect the content being tagged.
Group Related Tags: Use tag_uuid to group related pieces of information.
Include Confidence: When using automated tagging, include confidence scores.
Add Metadata: Use the data parameter to store additional context about the tag.
Consider Scope: Use node_only=True when you want to tag entire nodes rather than specific content.

Common Patterns

Document Classification

# Tag document type based on content
document.content_node.tag('document_type', value='invoice', data={
    'confidence': 0.98,
    'classifier': 'invoice_classifier_v1'
})

Entity Extraction

# Tag named entities
document.content_node.tag('organization', content_re=r'Microsoft|Google|Apple',
                          node_only=False)
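
When the list of entity names grows or contains characters that are special in regular expressions, hand-writing the alternation becomes fragile. One option, sketched below with Python's re module only, is to build the pattern from a list using re.escape. The organization list and sample text are illustrative.

```python
import re

organizations = ["Microsoft", "Google", "Apple", "Amazon.com"]

# Escape each name so characters like '.' match literally, and put longer
# names first so a longer name is preferred over a shorter prefix of it.
pattern = "|".join(
    re.escape(name) for name in sorted(organizations, key=len, reverse=True)
)
org_re = re.compile(pattern)

text = "Apple and Amazon.com both cited Google."
found = org_re.findall(text)
print(found)
```

The resulting pattern string could then be supplied wherever a hand-written alternation would otherwise go.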

Form Field Extraction

# Tag form fields with metadata
document.content_node.tag('field', fixed_position=[100, 150], data={
    'field_name': 'total_amount',
    'field_type': 'currency',
    'required': True
})

Error Handling

When working with tags, consider these common issues:

Position Errors: Ensure fixed positions are within content bounds
Regular Expression Matching: Test patterns thoroughly
Node Selection: Verify node existence before tagging
Content Accessibility: Check content availability before tagging

# Example of safe tagging with error handling.
# start_position and end_position are assumed to be defined by the caller.
try:
    content = node.content
    if content and 0 <= start_position < end_position <= len(content):
        node.tag('field', fixed_position=[start_position, end_position])
    else:
        print("Skipped tagging: no content or position out of bounds")
except Exception as e:
    print(f"Tagging error: {e}")
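
If the same bounds check is needed in several places, it could be factored into a small helper. The function below is a hypothetical plain-Python sketch (validate_span is not part of Kodexa) and assumes the end offset is exclusive, as in Python slicing.

```python
def validate_span(content, start, end):
    """Raise ValueError if (start, end) is not a usable span within content."""
    if not content:
        raise ValueError("content is empty or missing")
    if not (0 <= start < end <= len(content)):
        raise ValueError(
            f"span [{start}, {end}] is outside content of length {len(content)}"
        )
    return start, end


content = "Total: 125.00"
print(validate_span(content, 7, 13))

try:
    validate_span(content, 7, 40)
except ValueError as err:
    print(f"Skipped: {err}")
```

Raising a specific exception keeps the reason for a skipped tag visible instead of silently doing nothing.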

Performance Considerations

Use node_only=True when possible to reduce processing overhead
Batch related tags together using tag_uuid
Use specific selectors to limit the scope of tagging operations
Consider using tag instances for large groups of related nodes

Remember that tags are stored as features in the document’s persistence layer, so efficient tagging can improve overall document processing performance.
