Document Tagging

Overview

Tagging in Kodexa lets you mark and annotate content within your document nodes. A tag can cover an entire node or a specific span of its text, and it can carry additional metadata and be related to other tags.

Tag Structure

A tag in Kodexa consists of the following components:

  • Name: The identifier for the tag (e.g., 'name', 'address', 'phone')
  • Value: The actual content being tagged
  • Start/End Positions: Optional positions within the node's content (if tagging specific text)
  • UUID: Unique identifier that can be used to relate multiple tags
  • Metadata: Additional data associated with the tag
  • Confidence: Optional score indicating how certain the tag is
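
The tag components in this guide can be pictured as a small record. The sketch below is only an illustration of that shape: `TagSketch` and `covers_whole_node` are hypothetical names, not Kodexa classes or methods, and the real tag object may store these fields differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
from uuid import uuid4


@dataclass
class TagSketch:
    """Illustrative stand-in for a tag's components (hypothetical, not a Kodexa class)."""
    name: str                            # identifier such as 'name' or 'address'
    value: Optional[str] = None          # the content being tagged
    start: Optional[int] = None          # None when the whole node is tagged
    end: Optional[int] = None
    uuid: str = field(default_factory=lambda: str(uuid4()))  # shared by related tags
    confidence: Optional[float] = None   # optional certainty score
    data: Dict[str, Any] = field(default_factory=dict)       # free-form metadata

    def covers_whole_node(self) -> bool:
        # A tag without positions applies to the entire node.
        return self.start is None and self.end is None


phone = TagSketch(name='phone', value='555-0100', start=6, end=14)
category = TagSketch(name='category')
print(phone.covers_whole_node(), category.covers_whole_node())
```

Keeping the positions optional mirrors the distinction between tagging a whole node and tagging a span of its text.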

Tagging Methods

1. Basic Node Tagging

The simplest form of tagging applies a tag to an entire node.

# Basic node tagging
document.content_node.tag('category')

2. Fixed Position Tagging

Tag specific portions of text using start and end positions.

# Tag text from position 6 to 12
document.content_node.tag('name', fixed_position=[6, 12])

3. Regular Expression Tagging

Tag content that matches a specific pattern.

# Tag all email addresses in the content
document.content_node.tag('email', content_re=r'[\w\.-]+@[\w\.-]+')
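
Before passing a pattern to content_re, it can help to preview what it matches with Python's standard `re` module. The sketch below runs outside Kodexa; `preview_matches` and the sample text are hypothetical, and the spans follow Python's half-open convention, which may differ from how Kodexa records positions.

```python
import re

# An email pattern written as a raw string with single backslashes.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')


def preview_matches(pattern, content):
    """Return (start, end, text) for every non-overlapping match in content."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(content)]


sample = "Contact alice@example.com or bob.smith@mail.example.org today"
matches = preview_matches(EMAIL_RE, sample)
for start, end, text in matches:
    print(start, end, text)
```

Note that this pattern is permissive: because `.` is allowed in both character classes, an address followed directly by a full stop would include that full stop in the match.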

4. Node-Only Tagging with Regular Expression

Tag entire nodes that match a pattern.

# Tag nodes that contain a date
document.content_node.tag('date_node', content_re=r'.*\d{2}/\d{2}/\d{4}.*', node_only=True)
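
A node-only pattern is written to cover the whole content, which is why the date expression is wrapped in `.*`. The sketch below checks a few sample strings against it using plain `re`; the helper `matches_whole_content` and the sample contents are hypothetical, and it assumes the pattern must match the full content, which may not be exactly how Kodexa applies it.

```python
import re

# The date expression wrapped in .* so it can span the entire content.
DATE_NODE_RE = re.compile(r'.*\d{2}/\d{2}/\d{4}.*')


def matches_whole_content(text):
    """True when the pattern matches the entire string (assumed semantics)."""
    return DATE_NODE_RE.fullmatch(text) is not None


node_contents = [
    "Invoice date: 03/15/2024",
    "Total due: $120.00",
    "Paid 2024-03-15",
]
date_like = [text for text in node_contents if matches_whole_content(text)]
print(date_like)
```

One thing to watch: `.` does not match a newline by default, so multi-line content fails a whole-content match unless the pattern is compiled with `re.DOTALL`.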

Advanced Tagging Features

Tag Groups

Tags can be grouped together using UUIDs to show they are related:

import uuid

tag_uuid = str(uuid.uuid4())
# Tag multiple related elements with the same UUID
document.content_node.tag('person_name', fixed_position=[0, 10], tag_uuid=tag_uuid)
document.content_node.tag('person_age', fixed_position=[15, 17], tag_uuid=tag_uuid)
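
To see why a shared UUID is useful, the sketch below regroups a flat list of tag records into one dictionary per UUID. The records are plain dictionaries invented for illustration and `group_by_uuid` is a hypothetical helper; in Kodexa you would read the values from the document's tags instead.

```python
from collections import defaultdict
from uuid import uuid4

# Two people, each identified by the UUID shared across their tags.
person_1, person_2 = str(uuid4()), str(uuid4())

tag_records = [
    {"name": "person_name", "value": "Ada Lovelace", "uuid": person_1},
    {"name": "person_age", "value": "36", "uuid": person_1},
    {"name": "person_name", "value": "Alan Turing", "uuid": person_2},
    {"name": "person_age", "value": "41", "uuid": person_2},
]


def group_by_uuid(records):
    """Collect tag values into one {tag name: value} dict per UUID."""
    groups = defaultdict(dict)
    for record in records:
        groups[record["uuid"]][record["name"]] = record["value"]
    return dict(groups)


people = group_by_uuid(tag_records)
for fields in people.values():
    print(fields)
```

Without the shared UUID there would be no reliable way to tell which age belongs to which name.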

Tag Metadata

Additional data can be associated with tags:

document.content_node.tag('address', data={
    'type': 'residential',
    'verified': True
})

Tag Confidence

You can specify confidence levels for tags:

document.content_node.tag('product_code', confidence=0.95)

Working with Tagged Content

Retrieving Tags

# Get all tags on a node
tags = node.get_tags()

# Get specific tag values
values = node.get_tag_values('address')

# Get related tag values
related_values = node.get_related_tag_values('person')

Removing Tags

# Remove a specific tag
node.remove_tag('category')

# Remove all tags
node.remove_feature('tag', '*')

Tag Instances

Tag instances allow you to group multiple nodes under a single tag:

# Create a tag instance for multiple nodes
nodes = document.select('//address/*')
document.add_tag_instance('address_block', nodes)

Diagrams

Basic Tag Structure

classDiagram
    class Tag {
        +String name
        +String value
        +Integer start
        +Integer end
        +String uuid
        +Float confidence
        +Dict data
    }

    class ContentNode {
        +String content
        +List features
        +add_feature()
        +tag()
        +get_tags()
    }

    ContentNode "1" --> "*" Tag

Tag Relationships

graph LR
    A[Node 1] -- tag_uuid_1 --> B((Tag: Name))
    C[Node 2] -- tag_uuid_1 --> D((Tag: Age))
    E[Node 3] -- tag_uuid_2 --> F((Tag: Address))

Best Practices

  1. Use Meaningful Tag Names: Choose descriptive names that reflect the content being tagged.
  2. Group Related Tags: Use tag_uuid to group related pieces of information.
  3. Include Confidence: When using automated tagging, include confidence scores.
  4. Add Metadata: Use the data parameter to store additional context about the tag.
  5. Consider Scope: Use node_only=True when you want to tag entire nodes rather than specific content.

Common Patterns

Document Classification

# Tag document type based on content
document.content_node.tag('document_type', value='invoice', confidence=0.98, data={
    'classifier': 'invoice_classifier_v1'
})

Entity Extraction

# Tag named entities
document.content_node.tag('organization', content_re=r'Microsoft|Google|Apple',
                         node_only=False)

Form Field Extraction

# Tag form fields with metadata
document.content_node.tag('field', fixed_position=[100, 150], data={
    'field_name': 'total_amount',
    'field_type': 'currency',
    'required': True
})

Error Handling

When working with tags, consider these common issues:

  1. Position Errors: Ensure fixed positions are within content bounds
  2. Regular Expression Matching: Test patterns thoroughly
  3. Node Selection: Verify node existence before tagging
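  4. Content Accessibility: Check content availability before tagging

# Example of safe tagging with error handling
# start_position and end_position are the character offsets you intend to tag
try:
    if not node.content:
        print("Tagging skipped: node has no content")
    elif not 0 <= start_position < end_position <= len(node.content):
        print(f"Tagging skipped: [{start_position}, {end_position}] is outside the content")
    else:
        node.tag('field', fixed_position=[start_position, end_position])
except Exception as e:
    print(f"Tagging error: {e}")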
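
The position check can also be pulled into a small helper that reports why a span is unusable, which makes it easy to test without a document. This is a sketch in plain Python: `span_problem` is a hypothetical name, and it assumes the end position may be at most the content length, in line with the check in the example above.

```python
from typing import Optional


def span_problem(content: Optional[str], start: int, end: int) -> Optional[str]:
    """Return a description of what is wrong with the span, or None if it is usable."""
    if not content:
        return "node has no content"
    if start < 0 or end < 0:
        return "positions must not be negative"
    if start >= end:
        return "start must be before end"
    if end > len(content):
        return f"end {end} exceeds content length {len(content)}"
    return None


for candidate in [("Hello Kodexa", 6, 12), ("short", 2, 10), ("", 0, 1)]:
    print(candidate, "->", span_problem(*candidate))
```

Returning a message rather than raising keeps the decision about logging, skipping, or failing with the caller.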

Performance Considerations

  1. Use node_only=True when possible to reduce processing overhead
  2. Batch related tags together using tag_uuid
  3. Use specific selectors to limit the scope of tagging operations
  4. Consider using tag instances for large groups of related nodes

Tags are stored as features in the document's persistence layer, so every tag adds to what must be written and read back. Tagging only what you need helps overall document processing performance.