# Data Lake

> The Kodexa data lake mirrors each document's post-processing state into S3-backed content-object envelopes for analytics — including, from 2026.6, a per-revision audit trail under after.audit.

Kodexa streams a document's post-processing state into an S3-backed **data lake** for analytics. Each document's content is written as a **content-object envelope** under the `content-objects/` prefix. An envelope's `After` body holds the projected state:

* **`after.dataObjects`** — the extracted data objects projected from the document's KDDB (keyed by taxonomy reference), i.e. the post-Apply view of the data a reviewer sees.
* **`after.audit`** — *(2026.6+)* the document's full per-revision audit trail (see below).

The analytics [datasets](/api-reference/analytics/get-analytics-datasets) and the [query API](/api-reference/analytics/post-analytics-query) read from these envelopes.

## Document audit trail (`after.audit`)

*(2026.6, lake schema `1.0` → `1.1`)*

Each content-object envelope now carries a co-located **`after.audit`** section inside its `After` body that projects the document's full per-revision KDDB audit trail. There is no separate audit envelope or prefix: `after.audit` lives inside the existing envelope under `content-objects/`. Like `after.dataObjects`, it is omitted when a document has no trail, so the schema stays stable.

### What it contains

* **`after.audit.revisions[]`** — every audit revision, each with `auditRevisionId`, `revisionTimestamp` (UTC), and actor attribution: `actorEmail`, `actorUserId`, `batchId`, `taskId`, `taskTemplateRef`, and `userWorkSessionStartedAt`.
* **`after.audit.dataAttributeAudits[]`** — the per-attribute add / edit / delete changes.
* **`after.audit.dataExceptionAudits[]`** — the data-exception lifecycle (raise / modify / resolve).
* **`dataObjectAudits`**, **`tagAudits`**, and **`metadataAudits`** are also projected for completeness.
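
As a rough illustration of reading these sections, the sketch below counts the rows in each one while tolerating a missing `after.audit`. It assumes the envelope has already been decoded from JSON and that the `After` body is passed in as a plain dict with an `audit` key. The helper name `audit_section_counts` and the exact key nesting are assumptions for illustration, not part of the Kodexa API.

```python
from typing import Any

# Section names come from the list above; nesting them all directly
# under "audit" is an assumption made for this sketch.
AUDIT_SECTIONS = (
    "revisions",
    "dataAttributeAudits",
    "dataExceptionAudits",
    "dataObjectAudits",
    "tagAudits",
    "metadataAudits",
)


def audit_section_counts(after_body: dict[str, Any]) -> dict[str, int]:
    """Count rows per audit section (hypothetical helper).

    Returns zero for every section when `audit` is absent, which is the
    documented shape for a document that has no trail.
    """
    audit = after_body.get("audit") or {}
    return {name: len(audit.get(name) or []) for name in AUDIT_SECTIONS}
```

Treating an absent section as an empty list lets the same consumer code handle envelopes with and without a trail.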

### Before and after values

Every attribute-change row in `dataAttributeAudits[]` carries a `transactionType` of `add`, `edit`, or `delete` and presents its values uniformly. The current value appears in `value`, `stringValue`, `decimalValue`, and so on; the prior value appears in the matching `previous*` fields, such as `previousValue`, `previousStringValue`, and `previousOwnerUri`. Deletes are normalized so that the removed value always appears in the `previous*` fields and never as a current value, so before and after read the same way for every change type. Revision attribution (actor, task, session) is denormalized onto each row, so no join back to `revisions[]` is needed.
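
Because the rows are uniform, a consumer can read before and after values without branching on the change type. The following is a minimal sketch that uses only the `stringValue` and `previousStringValue` fields named above. The function name `before_after` is hypothetical, and the sample rows in the usage note are made up for illustration.

```python
from typing import Any, Optional

VALID_TRANSACTION_TYPES = ("add", "edit", "delete")


def before_after(row: dict[str, Any]) -> tuple[Optional[str], Optional[str]]:
    """Return (before, after) string values for one dataAttributeAudits row.

    The same two lookups serve every transaction type, because deletes
    carry the removed value in the previous* fields.
    """
    tx = row["transactionType"]
    if tx not in VALID_TRANSACTION_TYPES:
        raise ValueError(f"unexpected transactionType: {tx!r}")
    return row.get("previousStringValue"), row.get("stringValue")
```

For example, a made-up delete row such as `{"transactionType": "delete", "previousStringValue": "B"}` yields `("B", None)` under this sketch.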

### Leaf grain

Rows are unique at `(document_family_id, auditRevisionId, id)`. The audit trail is a property of the document **family**: the platform mints new content-object ids over a document's life but copies the trail forward with stable revision ids, so every content-object envelope for a family carries the cumulative trail as of its version. To assemble the complete trail across a family's content objects without double-counting, deduplicate on `(document_family_id, auditRevisionId, id)`.
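
The deduplication can be sketched as a single pass that keeps the first row seen for each key. This assumes the rows from several envelopes have already been collected into one iterable and that each row carries `document_family_id` alongside `auditRevisionId` and `id`. Where the family id physically lives in the envelope is not stated here, so attaching it to each row is an assumption, as is the helper name `dedupe_audit_rows`.

```python
from typing import Any, Iterable


def dedupe_audit_rows(rows: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first row for each (document_family_id, auditRevisionId, id)."""
    seen: set[tuple[Any, Any, Any]] = set()
    unique: list[dict[str, Any]] = []
    for row in rows:
        key = (row["document_family_id"], row["auditRevisionId"], row["id"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Keeping the first occurrence preserves input order, so feeding envelopes oldest first yields the trail in the order it was recorded.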

### Schema bump and backfill

This is a purely **additive** `1.0` → `1.1` schema change: existing envelopes and queries are unaffected, and consumers that don't read `after.audit` see no change. Envelopes written before 2026.6 do not have `after.audit` and are not rewritten in place. They can be retrofitted with the **`lake-backfill-audit`** utility, which re-projects each document's trail from KDDB and splices it into the existing envelope. Backfill is idempotent: a re-run skips envelopes that already carry `after.audit` unless forced. It leaves the rest of the envelope semantically unchanged: row state, `dataObjects`, and knowledge items are all preserved, and only `after.audit` is spliced in.
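
To make the idempotence rule concrete, here is a conceptual sketch of a skip-unless-forced splice over a decoded `After` body. It is not the `lake-backfill-audit` implementation, whose internals and options are not documented here. The function `splice_audit` and its `force` parameter are hypothetical names chosen for illustration.

```python
import copy
from typing import Any


def splice_audit(
    after_body: dict[str, Any],
    audit: dict[str, Any],
    force: bool = False,
) -> tuple[dict[str, Any], bool]:
    """Return (body, changed) after an idempotent splice (illustrative only).

    A body that already has an "audit" key is returned untouched unless
    force is set. Every other key is carried over as-is.
    """
    if "audit" in after_body and not force:
        return after_body, False
    updated = copy.copy(after_body)  # shallow copy: other sections are reused, not rebuilt
    updated["audit"] = audit
    return updated, True
```

Running the splice twice with the same input changes the body only on the first pass, which is the property that makes a re-run safe.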
