The kdx document print and kdx document select commands help you understand and navigate the hierarchical structure of KDDB documents.
Print Command
Pretty-print the document structure as an ASCII tree.
kdx document print <file.kddb> [flags]
Flags
| Flag | Description | Default |
|---|---|---|
| --page N | Print only page N (1-indexed) | All pages |
| --depth N | Maximum tree depth | 0 (unlimited) |
| --features | Show node features | false |
Basic Usage
kdx document print invoice.kddb
root
└── page
    └── content-area
        ├── line
        │   └── word: Invoice
        ├── line
        │   ├── word: Date:
        │   └── word: 2024-01-15
        ├── paragraph
        │   ├── line
        │   │   └── word: Item
        │   └── line
        │       └── word: Description
        └── line
            ├── word: Total:
            └── word: $1,234.56
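For long documents, a per-type tally can be easier to scan than the tree itself. The sketch below derives one from the printed tree. It assumes every line has the `type` or `type: content` shape shown in the example output and that --features is not set; `tally_node_types` is a hypothetical helper name, not part of kdx.

```shell
# tally_node_types: hypothetical helper, not a kdx command.
# Reads a tree printed by `kdx document print` on stdin and prints
# "type<TAB>count" lines sorted by type name.
# Assumes the "type" / "type: content" line layout shown in the example
# output and that --features was not passed.
tally_node_types() {
  # 1. Drop indentation and tree-drawing characters before the type name.
  # 2. Drop the ": content" suffix that word nodes carry.
  # 3. Count how often each remaining type name occurs.
  LC_ALL=C sed -e 's/^[^[:alnum:]]*//' -e 's/:.*$//' |
    awk '$0 != "" { count[$0]++ } END { for (t in count) printf "%s\t%d\n", t, count[t] }' |
    sort
}
```

A typical call would be `kdx document print invoice.kddb | tally_node_types`. Parsing the text tree is fragile by nature; if the layout differs from the example, counting via select with JSON output is the safer route.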
Limit Depth
Control how deep into the tree to display:
# Show the root plus two levels below it
kdx document print invoice.kddb --depth 2
root
└── page
    └── content-area
Print Specific Page
Focus on a single page:
# Print page 1 only
kdx document print invoice.kddb --page 1
# Print page 3 with limited depth
kdx document print invoice.kddb --page 3 --depth 4
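To walk a document page by page, the --page and --depth flags can be combined with a page count taken from select. The following is a minimal sketch, assuming the number of `//page` nodes matches the range of valid --page values and that jq is installed; `print_pages` is a hypothetical helper name.

```shell
# print_pages: hypothetical helper, not a kdx command.
# Usage: print_pages <file.kddb> [depth]
# Prints each page separately with a limited depth (default 3).
# Assumes the count of //page nodes equals the highest valid --page value.
print_pages() {
  local file="$1" depth="${2:-3}" pages n
  pages=$(kdx document select "$file" "//page" -o json | jq 'length')
  # Stop early if the count is missing or not a plain number.
  case "$pages" in
    ''|*[!0-9]*)
      echo "error: could not determine page count for $file" >&2
      return 1 ;;
  esac
  n=1
  while [ "$n" -le "$pages" ]; do
    echo "== page $n of $pages =="
    kdx document print "$file" --page "$n" --depth "$depth"
    n=$((n + 1))
  done
}
```

For example, `print_pages invoice.kddb 2` would print a shallow tree for every page in turn.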
Show Features
Display node features like bounding boxes and tags:
kdx document print invoice.kddb --features --depth 2
root
└── page
    [spatial:bbox=0,0,612,792]
    └── content-area
        [spatial:bbox=50,50,562,742]
Select Command
Query nodes by type using selector syntax.
kdx document select <file.kddb> <selector>
Selector Syntax
Only type-based selectors are currently supported:
| Selector | Description |
|---|---|
| //type | Select all nodes of the given type |
Examples
# Find all paragraphs
kdx document select doc.kddb "//paragraph"
┌────┬───────────┬──────────────────────────────────────┬───────┐
│ ID │ TYPE      │ CONTENT                              │ INDEX │
├────┼───────────┼──────────────────────────────────────┼───────┤
│ 15 │ paragraph │ This is the first paragraph of text. │ 0     │
│ 28 │ paragraph │ Another important paragraph here.    │ 1     │
│ 45 │ paragraph │ The final paragraph with conclusion. │ 2     │
└────┴───────────┴──────────────────────────────────────┴───────┘
# Find all pages
kdx document select doc.kddb "//page"
# Find all lines
kdx document select doc.kddb "//line"
# Find all words
kdx document select doc.kddb "//word"
# Find all tables
kdx document select doc.kddb "//table"
Output Columns
| Column | Description |
|---|---|
| ID | Internal node ID |
| TYPE | Node type name |
| CONTENT | Node text content (truncated to 80 chars) |
| INDEX | Position among siblings |
Output Formats
Both commands support the global -o flag:
# JSON output
kdx document select invoice.kddb "//paragraph" -o json
# YAML output
kdx document print invoice.kddb -o yaml
JSON select output:
[
  {
    "id": 15,
    "type": "paragraph",
    "content": "This is the first paragraph of text.",
    "index": 0
  },
  {
    "id": 28,
    "type": "paragraph",
    "content": "Another important paragraph here.",
    "index": 1
  }
]
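The JSON form is convenient for feeding other tools. As one possible approach, the sketch below flattens it into tab-separated rows with jq. It assumes the field names `id`, `type`, `content`, and `index` exactly as in the sample output; `select_to_tsv` is a hypothetical helper name.

```shell
# select_to_tsv: hypothetical helper, not a kdx command.
# Reads the JSON array produced by `kdx document select ... -o json` on stdin
# and prints one "id<TAB>type<TAB>index<TAB>content" row per node.
# Assumes the field names shown in the sample JSON output.
select_to_tsv() {
  # @tsv joins the array with tabs and escapes tabs/newlines inside values.
  jq -r '.[] | [.id, .type, .index, .content] | @tsv'
}
```

It could be used as `kdx document select invoice.kddb "//paragraph" -o json | select_to_tsv`, with the result piped on to cut, sort, or a spreadsheet import.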
Common Node Types
Documents typically contain these node types:
| Type | Description |
|---|---|
| root | Document root node |
| page | Page container |
| content-area | Content region on a page |
| paragraph | Paragraph of text |
| line | Line of text |
| word | Individual word |
| table | Table structure |
| table-row | Table row |
| table-cell | Table cell |
| image | Image placeholder |
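To see which of these types a given document actually contains, each type name can be turned into a `//type` selector and counted. This is a rough sketch that assumes jq is installed; whether select prints an empty array when nothing matches is also an assumption, so a missing count is shown as `?`. `node_type_report` is a hypothetical helper name.

```shell
# node_type_report: hypothetical helper, not a kdx command.
# Usage: node_type_report <file.kddb>
# Prints "type<TAB>count" for a handful of common node types.
node_type_report() {
  local file="$1" type count
  for type in page paragraph line word table image; do
    # Build the selector as "//<type>", per the selector syntax table.
    count=$(kdx document select "$file" "//$type" -o json | jq 'length')
    # Show "?" if no count came back (assumed possible on errors or no matches).
    printf '%s\t%s\n' "$type" "${count:-?}"
  done
}
```

Running `node_type_report invoice.kddb` gives a quick inventory before deciding which selectors are worth querying in detail.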
Use Cases
Document Analysis
Understand document structure before processing:
# Quick overview
kdx document print invoice.kddb --depth 3
# Count paragraphs
kdx document select invoice.kddb "//paragraph" -o json | jq 'length'
# List page indexes (position among siblings, not the 1-indexed --page number)
kdx document select invoice.kddb "//page" -o json | jq '.[].index'
Debugging
Investigate why extraction isn’t finding expected content:
# Check if lines exist
kdx document select problematic.kddb "//line"
# View full structure with features
kdx document print problematic.kddb --features
# Check specific page
kdx document print problematic.kddb --page 2 --features
Content Location
Find where specific content appears:
# Find words whose content contains "Invoice" (case-sensitive)
kdx document select doc.kddb "//word" -o json | jq '.[] | select(.content | contains("Invoice"))'
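The jq `contains` filter is case-sensitive, so a search for "invoice" would miss "Invoice". A case-insensitive variant might look like the sketch below, which assumes the `id` and `content` fields from the JSON output and treats a missing content value as an empty string; `find_words` is a hypothetical helper name.

```shell
# find_words: hypothetical helper, not a kdx command.
# Usage: find_words <file.kddb> <term>
# Prints "id<TAB>content" for every word whose content contains <term>,
# ignoring ASCII case.
find_words() {
  local file="$1" term="$2"
  kdx document select "$file" "//word" -o json |
    jq -r --arg term "$term" '
      .[]
      | select((.content // "") | ascii_downcase | contains($term | ascii_downcase))
      | "\(.id)\t\(.content)"'
}
```

For example, `find_words doc.kddb invoice` would list matching word nodes along with the IDs needed to locate them in the tree.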
Structure Validation
Verify document structure is as expected:
# Check page count
PAGE_COUNT=$(kdx document select doc.kddb "//page" -o json | jq 'length')
echo "Document has $PAGE_COUNT pages"
# Verify table structure exists
TABLES=$(kdx document select doc.kddb "//table" -o json | jq 'length')
if [ "$TABLES" -gt 0 ]; then
  echo "Found $TABLES tables"
fi
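In a script or CI job, the same check is more useful when it fails with a distinct exit code. The sketch below wraps the count in a reusable function, assuming jq is installed; `require_nodes` and its exit codes (1 for too few nodes, 2 for a count that could not be read) are illustrative choices, not kdx behavior.

```shell
# require_nodes: hypothetical helper, not a kdx command.
# Usage: require_nodes <file.kddb> <type> <minimum>
# Returns 0 if the document has at least <minimum> nodes of <type>,
# 1 if it has fewer, and 2 if the count could not be determined.
require_nodes() {
  local file="$1" type="$2" min="$3" count
  count=$(kdx document select "$file" "//$type" -o json 2>/dev/null | jq 'length' 2>/dev/null)
  # An empty or non-numeric count means kdx or jq did not produce usable output.
  case "$count" in
    ''|*[!0-9]*)
      echo "error: could not count $type nodes in $file" >&2
      return 2 ;;
  esac
  if [ "$count" -lt "$min" ]; then
    echo "FAIL: $file has $count $type node(s), expected at least $min" >&2
    return 1
  fi
  echo "OK: $file has $count $type node(s)"
}
```

A call such as `require_nodes doc.kddb table 1 || exit 1` then stops a pipeline when the expected structure is missing.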
Batch Analysis
Analyze structure across multiple documents:
# Report structure summary for all documents
for kddb in *.kddb; do
  PAGES=$(kdx document select "$kddb" "//page" -o json 2>/dev/null | jq 'length')
  PARAGRAPHS=$(kdx document select "$kddb" "//paragraph" -o json 2>/dev/null | jq 'length')
  echo "$kddb: $PAGES pages, $PARAGRAPHS paragraphs"
done
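Because errors are sent to /dev/null, a file that kdx cannot read leaves the counts empty and the summary line silently incomplete. One way to make that visible is sketched below: a tab-separated report that prints `error` in place of a missing count. It assumes jq is installed, and `batch_summary` is a hypothetical helper name.

```shell
# batch_summary: hypothetical helper, not a kdx command.
# Usage: batch_summary <file.kddb>...
# Prints a tab-separated table of page and paragraph counts, with "error"
# wherever a count could not be obtained.
batch_summary() {
  local file pages paragraphs
  printf 'file\tpages\tparagraphs\n'
  for file in "$@"; do
    pages=$(kdx document select "$file" "//page" -o json 2>/dev/null | jq 'length' 2>/dev/null)
    paragraphs=$(kdx document select "$file" "//paragraph" -o json 2>/dev/null | jq 'length' 2>/dev/null)
    # An empty count means kdx or jq produced no usable output for this file.
    printf '%s\t%s\t%s\n' "$file" "${pages:-error}" "${paragraphs:-error}"
  done
}
```

Calling `batch_summary *.kddb > summary.tsv` produces a file that opens directly in a spreadsheet or can be filtered with grep for `error` rows.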
Tips
Use --depth 2 or --depth 3 for a quick overview without overwhelming output on large documents.
Pipe JSON output to jq for powerful filtering and transformation of results.
Content shown in the tree view is truncated to 60 characters. Use select with JSON output for full content.