Overview
Tagging in Kodexa lets you mark and annotate content within your document nodes. A tag can apply to an entire node or to a specific span of text, and it can carry additional metadata and relationships to other tagged elements.
Tag Structure
A tag in Kodexa consists of the following components:
- Name: The identifier for the tag (e.g., 'name', 'address', 'phone')
- Value: The actual content being tagged
- Start/End Positions: Optional positions within the node's content (if tagging specific text)
- UUID: Unique identifier that can be used to relate multiple tags
- Metadata: Additional data associated with the tag
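The components listed above can be pictured as a simple record. The following is a minimal sketch in plain Python, not Kodexa's own class: `TagSketch` and its field names are hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class TagSketch:
    """Hypothetical stand-in for a tag; not part of the Kodexa API."""
    name: str                        # identifier, e.g. 'address'
    value: Optional[str] = None      # the content being tagged
    start: Optional[int] = None      # optional start offset in the node content
    end: Optional[int] = None        # optional end offset in the node content
    uuid: Optional[str] = None       # shared by tags that belong together
    data: Dict[str, Any] = field(default_factory=dict)  # free-form metadata

    def covers_whole_node(self) -> bool:
        # With no positions set, the tag applies to the entire node.
        return self.start is None and self.end is None


whole = TagSketch(name="category")
partial = TagSketch(name="name", value="Alice", start=6, end=11)
```

Keeping the positions optional reflects the distinction between tagging a whole node and tagging a span of its text.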
Tagging Methods
1. Basic Node Tagging
The simplest form of tagging applies a tag to an entire node.
# Basic node tagging
document.content_node.tag('category')
2. Fixed Position Tagging
Tag specific portions of text using start and end positions.
# Tag text from position 6 to 12
document.content_node.tag('name', fixed_position=[6, 12])
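Hard-coded offsets are easy to get wrong, so it can help to derive them from the text itself. The sketch below uses only plain string operations; `span_of` is a hypothetical helper, and it assumes the end offset is exclusive, as in Python slices. Whether Kodexa treats the end position as inclusive is worth confirming for your version.

```python
from typing import List, Optional


def span_of(content: str, needle: str) -> Optional[List[int]]:
    """Return [start, end] of the first occurrence of needle, or None.

    The end offset is exclusive by assumption, matching Python slicing.
    """
    start = content.find(needle)
    if start == -1:
        return None
    return [start, start + len(needle)]


text = "Name: Alice Smith"
position = span_of(text, "Alice")
covered = text[position[0]:position[1]]
```

The resulting list could then be passed as `fixed_position`, after checking that `covered` is the text you expect.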
3. Regular Expression Tagging
Tag content that matches a specific pattern.
# Tag all email addresses in the content
document.content_node.tag('email', content_re=r'[\w\.-]+@[\w\.-]+')
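Before applying a pattern to a document, it can be useful to see what it matches on sample text. This sketch uses only Python's standard `re` module; `preview_matches` is a hypothetical helper, and Kodexa's own matching may differ in details such as flags.

```python
import re

EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')


def preview_matches(content: str, pattern: "re.Pattern") -> list:
    """List (start, end, text) for every match of pattern in content."""
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(content)]


sample = "Contact alice@example.com or bob@example.org today"
matches = preview_matches(sample, EMAIL_RE)
matched_text = [text for _, _, text in matches]
```

Note that this simple pattern also accepts trailing dots and hyphens, so an address followed directly by a full stop would include the full stop in the match.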
4. Node-Only Tagging with Regular Expression
Tag entire nodes that match a pattern.
# Tag nodes that contain a date
document.content_node.tag('date_node', content_re=r'.*\d{2}/\d{2}/\d{4}.*', node_only=True)
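The leading and trailing `.*` in the pattern suggest that it is applied to the node's content as a whole rather than searched for inside it. Under that assumption, which is not confirmed here, a pattern can be tried out with `re.fullmatch` before use. `would_tag_node` is a hypothetical helper.

```python
import re

DATE_NODE_RE = re.compile(r'.*\d{2}/\d{2}/\d{4}.*')


def would_tag_node(content: str) -> bool:
    """True if the entire content matches the pattern (assumed semantics)."""
    return DATE_NODE_RE.fullmatch(content) is not None


contents = ["Invoice date: 12/03/2024", "No date here", "Due 01/01/2025 at noon"]
flagged = [c for c in contents if would_tag_node(c)]
```

Because `.` does not match newlines by default, multi-line content would need the `re.DOTALL` flag for this check to succeed.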
Advanced Tagging Features
Tag Groups
Tags can be grouped together using UUIDs to show they are related:
import uuid
tag_uuid = str(uuid.uuid4())
# Tag multiple related elements with the same UUID
document.content_node.tag('person_name', fixed_position=[0, 10], tag_uuid=tag_uuid)
document.content_node.tag('person_age', fixed_position=[15, 17], tag_uuid=tag_uuid)
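Once tags share a UUID, related values can be collected back into one record per group. The sketch below illustrates the idea with plain dictionaries rather than Kodexa objects; the record layout and `group_by_uuid` are hypothetical.

```python
import uuid
from collections import defaultdict

first_person = str(uuid.uuid4())
second_person = str(uuid.uuid4())

# Hypothetical tag records (plain dicts), not Kodexa objects.
tags = [
    {"name": "person_name", "value": "Alice", "uuid": first_person},
    {"name": "person_age", "value": "34", "uuid": first_person},
    {"name": "person_name", "value": "Bob", "uuid": second_person},
]


def group_by_uuid(records):
    """Collect tag values into one dict per shared UUID."""
    groups = defaultdict(dict)
    for record in records:
        groups[record["uuid"]][record["name"]] = record["value"]
    return dict(groups)


people = group_by_uuid(tags)
```

Each distinct UUID yields one grouped record, so a name and an age tagged with the same UUID end up together.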
Tag Metadata
Additional data can be associated with tags:
document.content_node.tag('address', data={
    'type': 'residential',
    'verified': True
})
Tag Confidence
You can specify confidence levels for tags:
document.content_node.tag('product_code', confidence=0.95)
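Confidence values are most useful when something downstream acts on them, for example by filtering out low-confidence tags. The following is a minimal sketch with plain dictionaries; the record layout, the `accepted` helper, and the 0.8 threshold are all assumptions made for illustration.

```python
# Hypothetical tag records (plain dicts), not Kodexa objects.
candidates = [
    {"name": "product_code", "value": "AB-1001", "confidence": 0.95},
    {"name": "product_code", "value": "A8-1OO1", "confidence": 0.41},
    {"name": "product_code", "value": "CD-2002", "confidence": None},  # no score recorded
]


def accepted(records, threshold=0.8):
    """Keep records with no confidence score or a score at or above threshold."""
    return [
        r for r in records
        if r["confidence"] is None or r["confidence"] >= threshold
    ]


kept = [r["value"] for r in accepted(candidates)]
```

Treating a missing score as acceptable is a design choice in this sketch; a stricter pipeline might reject unscored tags instead.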
Working with Tagged Content
Retrieving Tags
# Get all tags on a node
tags = node.get_tags()
# Get specific tag values
values = node.get_tag_values('address')
# Get related tag values
related_values = node.get_related_tag_values('person')
Removing Tags
# Remove a specific tag
node.remove_tag('category')
# Remove all tags
node.remove_feature('tag', '*')
Tag Instances
Tag instances allow you to group multiple nodes under a single tag:
# Create a tag instance for multiple nodes
nodes = document.select('//address/*')
document.add_tag_instance('address_block', nodes)
Diagrams
Basic Tag Structure
classDiagram
class Tag {
+String name
+String value
+Integer start
+Integer end
+String uuid
+Float confidence
+Dict data
}
class ContentNode {
+String content
+List features
+add_feature()
+tag()
+get_tags()
}
ContentNode "1" --> "*" Tag
Tag Relationships
graph LR
A[Node 1] -- tag_uuid_1 --> B((Tag: Name))
C[Node 2] -- tag_uuid_1 --> D((Tag: Age))
E[Node 3] -- tag_uuid_2 --> F((Tag: Address))
Best Practices
- Use Meaningful Tag Names: Choose descriptive names that reflect the content being tagged.
- Group Related Tags: Use tag_uuid to group related pieces of information.
- Include Confidence: When using automated tagging, include confidence scores.
- Add Metadata: Use the data parameter to store additional context about the tag.
- Consider Scope: Use node_only=True when you want to tag entire nodes rather than specific content.
Common Patterns
Document Classification
# Tag document type based on content
document.content_node.tag('document_type', value='invoice', data={
    'confidence': 0.98,
    'classifier': 'invoice_classifier_v1'
})
Entity Extraction
# Tag named entities
document.content_node.tag('organization', content_re=r'Microsoft|Google|Apple',
                          node_only=False)
Form Field Extraction
# Tag form fields with metadata
document.content_node.tag('field', fixed_position=[100, 150], data={
    'field_name': 'total_amount',
    'field_type': 'currency',
    'required': True
})
Error Handling
When working with tags, consider these common issues:
- Position Errors: Ensure fixed positions are within content bounds
- Regular Expression Matching: Test patterns thoroughly
- Node Selection: Verify node existence before tagging
- Content Accessibility: Check content availability before tagging
# Example of safe tagging with error handling
# start_position and end_position are the offsets you intend to tag
try:
    if node.content:  # Check that content exists
        if 0 <= start_position < end_position <= len(node.content):  # Verify bounds
            node.tag('field', fixed_position=[start_position, end_position])
except Exception as e:
    print(f"Tagging error: {e}")
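The checks above can be wrapped in a helper that reports whether a tag was applied, so callers can log or count skipped nodes. This is a sketch under stated assumptions: `FakeNode` is a hypothetical stand-in used only to make the example runnable, and `safe_tag` is not a Kodexa function. It relies only on the node exposing `content` and a `tag` method that accepts `fixed_position`, as in the examples in this guide.

```python
from typing import List, Optional


class FakeNode:
    """Hypothetical stand-in for a content node; records tag calls."""

    def __init__(self, content: Optional[str]):
        self.content = content
        self.applied: List[tuple] = []

    def tag(self, name, fixed_position=None):
        self.applied.append((name, fixed_position))


def safe_tag(node, name: str, start: int, end: int) -> bool:
    """Tag only when the node has content and the span fits inside it."""
    if not node.content:
        return False
    if not 0 <= start < end <= len(node.content):
        return False
    node.tag(name, fixed_position=[start, end])
    return True


good_node = FakeNode("Total: 150.00")
ok = safe_tag(good_node, "field", 7, 13)
too_long = safe_tag(FakeNode("short"), "field", 0, 50)
empty = safe_tag(FakeNode(None), "field", 0, 1)
```

Returning a boolean instead of raising keeps batch tagging loops simple, at the cost of hiding why a particular node was skipped.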
Performance Considerations
- Use node_only=True when possible to reduce processing overhead
- Batch related tags together using tag_uuid
- Use specific selectors to limit the scope of tagging operations
- Consider using tag instances for large groups of related nodes
Remember that tags are stored as features in the document's persistence layer, so efficient tagging can improve overall document processing performance.