Document Tagging and Structure
Overview
Tagging in Kodexa lets you mark and annotate content within your document nodes. A tag can apply to an entire node or to a specific span of text, and it can carry additional metadata and relationships to other tagged elements.
Tag Structure
A tag in Kodexa consists of the following components:
- Name: The identifier for the tag (e.g., ‘name’, ‘address’, ‘phone’)
- Value: The actual content being tagged
- Start/End Positions: Optional positions within the node’s content (if tagging specific text)
- UUID: Unique identifier that can be used to relate multiple tags
- Metadata: Additional data associated with the tag
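The components listed above can be pictured as a small record. This is a minimal stand-alone sketch, not the Kodexa class itself: the `Tag` dataclass and its field names (`start`, `end`, `data`) are assumptions chosen to mirror the list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
from uuid import uuid4


@dataclass
class Tag:
    """Illustrative model of a tag (hypothetical, not the Kodexa class)."""
    name: str                     # identifier such as "phone"
    value: str                    # the content being tagged
    start: Optional[int] = None   # start offset in the node content, if any
    end: Optional[int] = None     # end offset (exclusive), if any
    uuid: str = field(default_factory=lambda: str(uuid4()))  # used to relate tags
    data: Dict[str, Any] = field(default_factory=dict)       # extra metadata


content = "Call 555-0100 today"

# A tag on a specific span of text carries start/end positions.
phone = Tag(name="phone", value=content[5:13], start=5, end=13)

# A tag on the whole node leaves the positions unset.
whole = Tag(name="contact_line", value=content)
```

Leaving `start` and `end` as `None` is one way to distinguish a whole-node tag from a span tag without a separate flag.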
Tagging Methods
1. Basic Node Tagging
The simplest form of tagging applies a tag to an entire node.
2. Fixed Position Tagging
Tag specific portions of text using start and end positions.
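To make the difference between whole-node and fixed-position tagging concrete, here is a minimal sketch. The `Node` class, its `tag` method, and the `fixed_position` parameter are hypothetical stand-ins written for this illustration; they are not the Kodexa API.

```python
from typing import Dict, List, Optional, Tuple


class Node:
    """Hypothetical stand-in for a content node."""

    def __init__(self, content: str):
        self.content = content
        self.tags: Dict[str, List[dict]] = {}

    def tag(self, name: str,
            fixed_position: Optional[Tuple[int, int]] = None) -> None:
        if fixed_position is None:
            # Basic node tagging: the tag covers the entire node.
            record = {"value": self.content, "start": None, "end": None}
        else:
            # Fixed position tagging: the tag covers content[start:end].
            start, end = fixed_position
            if not (0 <= start < end <= len(self.content)):
                raise ValueError("fixed_position is outside the node content")
            record = {"value": self.content[start:end],
                      "start": start, "end": end}
        self.tags.setdefault(name, []).append(record)


node = Node("Invoice 10482 due 2024-05-01")
node.tag("invoice_line")                            # whole node
node.tag("invoice_number", fixed_position=(8, 13))  # characters 8..12
```

In this sketch the end position is exclusive, matching Python slicing; the convention used by a real tagging API may differ.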
3. Regular Expression Tagging
Tag content that matches a specific pattern.
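The idea of pattern-based tagging can be sketched with Python's standard `re` module. The helper `tag_matches` and the dictionary layout are assumptions made for this example, not Kodexa functions; each match becomes one tag with its own start and end position.

```python
import re
from typing import List


def tag_matches(content: str, name: str, pattern: str) -> List[dict]:
    """Return one tag record per regular-expression match in the content."""
    return [
        {"name": name, "value": m.group(0), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, content)
    ]


content = "Email a@example.com or b@example.org"

# A deliberately simple email pattern, good enough for the illustration.
tags = tag_matches(content, "email", r"[\w.]+@[\w.]+\.\w+")
```

Because the positions come straight from the match object, slicing the content with a tag's `start` and `end` always reproduces its `value`.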
4. Node-Only Tagging with Regular Expression
Tag entire nodes that match a pattern.
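In contrast to tagging only the matched text, node-only tagging uses the pattern as a test and tags the whole node when it passes. A minimal sketch, assuming nodes are plain dictionaries and using a hypothetical helper `tag_nodes_matching`:

```python
import re
from typing import List


def tag_nodes_matching(nodes: List[dict], name: str, pattern: str) -> int:
    """Tag every node whose content matches; return how many were tagged."""
    regex = re.compile(pattern)
    tagged = 0
    for node in nodes:
        if regex.search(node["content"]):
            # The tag value is the full node content, not just the match.
            node["tags"].append({"name": name, "value": node["content"]})
            tagged += 1
    return tagged


nodes = [
    {"content": "Total: 42.00", "tags": []},
    {"content": "Thank you for your order", "tags": []},
    {"content": "Subtotal: 40.00", "tags": []},
]

count = tag_nodes_matching(nodes, "amount_line", r"\d+\.\d{2}")
```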
Advanced Tagging Features
Tag Groups
Tags can be grouped together using UUIDs to show they are related.
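A minimal sketch of grouping, assuming tags are plain dictionaries that carry a `tag_uuid` key (the key name follows the `tag_uuid` parameter mentioned in this guide; the rest is illustrative and not the Kodexa API). Tags that share a UUID are treated as parts of one record.

```python
import uuid
from collections import defaultdict

# One UUID shared by every tag that belongs to the same person.
person_id = str(uuid.uuid4())

tags = [
    {"name": "first_name", "value": "Ada", "tag_uuid": person_id},
    {"name": "last_name", "value": "Lovelace", "tag_uuid": person_id},
    # A tag with a different UUID belongs to a different group.
    {"name": "first_name", "value": "Alan", "tag_uuid": str(uuid.uuid4())},
]

# Collect tags by their shared UUID.
groups = defaultdict(list)
for tag in tags:
    groups[tag["tag_uuid"]].append(tag)

# Rebuild the grouped record from its tags.
person = {tag["name"]: tag["value"] for tag in groups[person_id]}
```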
Tag Metadata
Additional data can be associated with tags.
Tag Confidence
You can specify confidence levels for tags.
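As a sketch of how confidence and metadata might travel with a tag, the example below stores both on a plain dictionary and filters by a threshold. The helper names and the 0 to 1 confidence range are assumptions for illustration; check the Kodexa reference for the scale it actually uses.

```python
from typing import Any, Dict, List, Optional


def make_tag(name: str, value: str, confidence: float,
             data: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a tag record with a confidence score and optional metadata."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return {"name": name, "value": value,
            "confidence": confidence, "data": data or {}}


def confident(tags: List[Dict[str, Any]], threshold: float) -> List[Dict[str, Any]]:
    """Keep only the tags whose confidence meets the threshold."""
    return [t for t in tags if t["confidence"] >= threshold]


tags = [
    make_tag("total", "42.00", 0.97, {"source": "ocr-model"}),
    make_tag("total", "4Z.00", 0.41, {"source": "ocr-model"}),
    make_tag("currency", "USD", 0.88),
]

accepted = confident(tags, 0.8)
```

Keeping low-confidence tags in the list and filtering at read time, rather than discarding them when they are created, leaves room for later human review.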
Working with Tagged Content
Retrieving Tags
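Retrieval typically means asking a node which tags it has and reading the values stored under a given name. The sketch below uses a hypothetical `Node` class; the method names `has_tag`, `get_tags`, and `get_tag_values` are assumptions for this illustration and may not match the Kodexa API.

```python
from typing import Dict, List, Optional


class Node:
    """Hypothetical stand-in for a tagged content node."""

    def __init__(self, content: str):
        self.content = content
        self._tags: Dict[str, List[str]] = {}

    def tag(self, name: str, value: Optional[str] = None) -> None:
        # Default to the whole node content when no value is given.
        self._tags.setdefault(name, []).append(
            self.content if value is None else value)

    def has_tag(self, name: str) -> bool:
        return name in self._tags

    def get_tags(self) -> List[str]:
        """Names of all tags on this node, sorted for stable output."""
        return sorted(self._tags)

    def get_tag_values(self, name: str) -> List[str]:
        """Values stored under a tag name; empty list if the tag is absent."""
        return list(self._tags.get(name, []))


node = Node("Ada Lovelace, London")
node.tag("name", "Ada Lovelace")
node.tag("city", "London")
```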
Removing Tags
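Removal can be sketched as deleting everything stored under a tag name. The function `remove_tag` and the dictionary-of-lists layout are assumptions made for this example, not Kodexa calls.

```python
from typing import Dict, List


def remove_tag(tags: Dict[str, List[str]], name: str) -> int:
    """Remove every value stored under `name`; return how many were removed."""
    return len(tags.pop(name, []))


tags = {"name": ["Ada Lovelace"], "phone": ["555-0100", "555-0101"]}

removed = remove_tag(tags, "phone")  # tag exists: both values are removed
missing = remove_tag(tags, "fax")    # tag absent: nothing happens, no error
```

Treating removal of an absent tag as a no-op keeps cleanup code free of existence checks.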
Tag Instances
Tag instances allow you to group multiple nodes under a single tag.
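One way to picture a tag instance is as a named list of nodes whose contents are read together. The `Node` and `TagInstance` classes below are hypothetical models for illustration only, not the Kodexa classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    content: str


@dataclass
class TagInstance:
    """A single tag that spans several nodes (illustrative model)."""
    name: str
    nodes: List[Node]

    def value(self, separator: str = " ") -> str:
        # The instance value is the joined content of its nodes, in order.
        return separator.join(node.content for node in self.nodes)


# An address spread over three line nodes, tagged once as a whole.
lines = [Node("221B Baker Street"), Node("London"), Node("NW1 6XE")]
address = TagInstance("address", lines)
```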
Diagrams
Basic Tag Structure
Tag Relationships
Best Practices
- Use Meaningful Tag Names: Choose descriptive names that reflect the content being tagged.
- Group Related Tags: Use tag_uuid to group related pieces of information.
- Include Confidence: When using automated tagging, include confidence scores.
- Add Metadata: Use the data parameter to store additional context about the tag.
- Consider Scope: Use node_only=True when you want to tag entire nodes rather than specific content.
Common Patterns
Document Classification
Entity Extraction
Form Field Extraction
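A minimal sketch of the pattern, assuming form text arrives as "Label: value" lines. The `extract_fields` helper, the regular expression, and the dictionary layout are illustrative assumptions, not Kodexa functions; all fields from one form share a single `tag_uuid` so they can be read back as one record.

```python
import re
import uuid
from typing import List

# One "Label: value" pair per line.
FIELD = re.compile(r"^(?P<label>[^:\n]+):[ \t]*(?P<value>.+)$", re.MULTILINE)


def extract_fields(text: str) -> List[dict]:
    """Turn each labelled line into a tag; group them under one UUID."""
    group = str(uuid.uuid4())
    return [
        {
            # Normalise the label into a tag name, e.g. "First Name" -> "first_name".
            "name": m["label"].strip().lower().replace(" ", "_"),
            "value": m["value"].strip(),
            "tag_uuid": group,
        }
        for m in FIELD.finditer(text)
    ]


form = "First Name: Ada\nLast Name: Lovelace\nPhone: 555-0100"
fields = extract_fields(form)
```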
Error Handling
When working with tags, consider these common issues:
- Position Errors: Ensure fixed positions are within content bounds
- Regular Expression Matching: Test patterns thoroughly
- Node Selection: Verify node existence before tagging
- Content Accessibility: Check content availability before tagging
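The checks listed above can be run before any tag is applied. This is a stand-alone sketch with a hypothetical helper, `check_before_tagging`; it treats `None` content as a missing node or unavailable content and reports problems instead of raising.

```python
import re
from typing import List, Optional


def check_before_tagging(content: Optional[str], start: int, end: int,
                         pattern: str) -> List[str]:
    """Return a list of problems found; an empty list means safe to tag."""
    if content is None:
        # Node missing or content not available: nothing else can be checked.
        return ["no content available"]

    problems = []
    if not (0 <= start < end <= len(content)):
        problems.append("position out of bounds")
    try:
        re.compile(pattern)
    except re.error:
        problems.append("invalid regular expression")
    return problems


ok = check_before_tagging("abc", 0, 3, r"\w+")
bad = check_before_tagging("abc", 2, 9, r"(")
absent = check_before_tagging(None, 0, 1, r"a")
```

Collecting every problem in one pass, rather than stopping at the first, makes it easier to log why a tagging step was skipped.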
Performance Considerations
- Use node_only=True when possible to reduce processing overhead
- Batch related tags together using tag_uuid
- Use specific selectors to limit the scope of tagging operations
- Consider using tag instances for large groups of related nodes
Remember that tags are stored as features in the document’s persistence layer, so efficient tagging can improve overall document processing performance.