Kodexa streams a document’s post-processing state into an S3-backed data lake for analytics. Each document’s content is written as a content-object envelope under the content-objects/ prefix. An envelope’s After body holds the projected state:
  • after.dataObjects — the extracted data objects projected from the document’s KDDB (keyed by taxonomy reference), i.e. the post-Apply view of the data a reviewer sees.
  • after.audit (2026.6+): the document’s full per-revision audit trail (see below).
The analytics datasets and the query API read from these envelopes.

Document audit trail (after.audit)

(2026.6, lake schema 1.0 → 1.1) Each content-object envelope now carries a co-located after.audit section inside its After body that projects the document’s full per-revision KDDB audit trail. There is no separate audit envelope or prefix: after.audit lives inside the existing envelope under content-objects/, and it is omitted when a document has no trail (schema-stable, the same way after.dataObjects is).

What it contains

  • after.audit.revisions[] — every audit revision, each with auditRevisionId, revisionTimestamp (UTC), and actor attribution: actorEmail, actorUserId, batchId, taskId, taskTemplateRef, and userWorkSessionStartedAt.
  • after.audit.dataAttributeAudits[] — the per-attribute add / edit / delete changes.
  • after.audit.dataExceptionAudits[] — the data-exception lifecycle (raise / modify / resolve).
  • dataObjectAudits, tagAudits, and metadataAudits are also projected for completeness.
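As a rough illustration of the sections listed above, the sketch below counts the rows in each after.audit section of one envelope. It assumes the envelope has already been parsed into a Python dict and that the After body sits under an "after" key; the helper name audit_section_counts is hypothetical and not part of any Kodexa client.

```python
from typing import Any

# Section names come from the list above; the dict layout is an assumption.
AUDIT_SECTIONS = (
    "revisions",
    "dataAttributeAudits",
    "dataExceptionAudits",
    "dataObjectAudits",
    "tagAudits",
    "metadataAudits",
)


def audit_section_counts(envelope: dict[str, Any]) -> dict[str, int]:
    """Return the number of rows in each after.audit section.

    A missing after.audit (no trail, or an envelope written before 2026.6)
    is treated as empty, so every count is zero in that case.
    """
    audit = (envelope.get("after") or {}).get("audit") or {}
    return {name: len(audit.get(name) or []) for name in AUDIT_SECTIONS}
```

Treating an absent after.audit as empty keeps downstream readers from failing on envelopes that never had a trail.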

Before and after values

Every attribute-change row in dataAttributeAudits[] carries a transactionType of add, edit, or delete, and presents its values uniformly: the current value in value / stringValue / decimalValue / etc., and the prior value in the matching previousValue / previousStringValue / previousOwnerUri fields. Deletes are normalized so the removed value always appears in the previous* fields (never as a current value) — so you can read before/after the same way for any change type. Revision attribution (actor, task, session) is denormalized onto each row, so no join back to revisions[] is needed.
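The uniform layout means one small helper can read before and after values for any change type. The following is a minimal sketch, assuming each row is a parsed dict and that unset value fields are either absent or null; only field names mentioned above are used, and the helper names are hypothetical.

```python
from typing import Any, Optional

# Field names as described above; the order of preference is an assumption.
CURRENT_FIELDS = ("value", "stringValue", "decimalValue")
PREVIOUS_FIELDS = ("previousValue", "previousStringValue")


def first_set(row: dict[str, Any], fields: tuple[str, ...]) -> Optional[Any]:
    """Return the first field in `fields` that is present and not None."""
    for name in fields:
        if row.get(name) is not None:
            return row[name]
    return None


def describe_change(row: dict[str, Any]) -> tuple[str, Optional[Any], Optional[Any]]:
    """Return (transactionType, before, after) for one dataAttributeAudits row."""
    return (
        row["transactionType"],
        first_set(row, PREVIOUS_FIELDS),
        first_set(row, CURRENT_FIELDS),
    )
```

Because deletes are normalized, a delete row yields its removed value as "before" and None as "after", with no special-casing on transactionType.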

Leaf grain

Rows are unique at (document_family_id, auditRevisionId, id). The audit trail is a property of the document family: the platform mints new content-object ids over a document’s life but copies the trail forward with stable revision ids, so every content-object envelope for a family carries the cumulative trail as of its version. Dedup on (document_family_id, auditRevisionId, id) to assemble the complete trail across a family’s content objects without double-counting.
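The dedup described above can be sketched as follows. This assumes envelopes are parsed dicts and that the family id is available on each envelope as a top-level document_family_id key; that location, like the function name, is an assumption made for illustration.

```python
from typing import Any, Iterable


def assemble_family_trail(envelopes: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge dataAttributeAudits rows across envelopes without double-counting.

    Rows are keyed on (document_family_id, auditRevisionId, id). The first
    occurrence of each key is kept; later copies carried forward by newer
    content objects are skipped.
    """
    seen: dict[tuple[Any, Any, Any], dict[str, Any]] = {}
    for envelope in envelopes:
        family_id = envelope["document_family_id"]  # assumed location of the family id
        audit = (envelope.get("after") or {}).get("audit") or {}
        for row in audit.get("dataAttributeAudits") or []:
            key = (family_id, row["auditRevisionId"], row["id"])
            seen.setdefault(key, row)
    return list(seen.values())
```

The same keying would apply to the other audit sections; only dataAttributeAudits is shown to keep the sketch short.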

Schema bump and backfill

This is a purely additive 1.0 → 1.1 schema change: existing envelopes and queries are unaffected, and consumers that don’t read after.audit see no change. Envelopes written before 2026.6 do not have after.audit and are not rewritten in place; they can be retrofitted with the lake-backfill-audit utility, which re-projects each document’s trail from KDDB and splices it into the existing envelope. Backfill is idempotent: a re-run skips envelopes that already carry after.audit (unless forced), and it leaves the rest of the envelope semantically unchanged (row state, dataObjects, and knowledge items are all preserved; only after.audit is spliced in).
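To make the idempotent splice concrete, here is a minimal sketch of the skip-unless-forced behaviour. It is not the lake-backfill-audit utility's code: the function name, the force parameter, and the dict layout (After body under an "after" key) are all assumptions for illustration.

```python
import copy
from typing import Any


def splice_audit(
    envelope: dict[str, Any],
    projected_audit: dict[str, Any],
    force: bool = False,
) -> tuple[dict[str, Any], bool]:
    """Return (envelope, changed) after splicing a re-projected trail.

    Envelopes that already carry after.audit are returned untouched unless
    `force` is set. An empty projection is never written, on the assumption
    that after.audit stays omitted when a document has no trail.
    """
    after = envelope.get("after") or {}
    if not projected_audit or ("audit" in after and not force):
        return envelope, False

    updated = copy.deepcopy(envelope)  # leave the caller's envelope unmodified
    if updated.get("after") is None:
        updated["after"] = {}
    updated["after"]["audit"] = projected_audit  # only this key is added or replaced
    return updated, True
```

Running the function twice on the same envelope reports a change only the first time, which is the property that makes a re-run safe.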