Document Stores

Document stores hold the original documents from which data is extracted. They act as a repository of documents that can be used both to train models and to extract data.

A document store holds what we call Document Families. These are logical containers that relate both the original file and any of the derived documents that are created from it.

Store Purposes

There are two main purposes for a document store:

  • To hold documents that we will be using for training models
  • To hold documents that we will be using to extract data

On the store object we have a storePurpose property that can be set to either TRAINING or OPERATIONAL. This is used to determine which documents are available for use in the store. The actual functionality of the store itself is the same regardless of the purpose.

Anatomy of a Document Family

A document family consists of a document and any derived documents created from it. Since a document family can contain both a native PDF and the Kodexa Documents derived from it, we use a stereotype called a content object. A content object points to something that contains content, which can be a file or a document. The content type on the content object is therefore either 'Document' or 'Native', where 'Native' means the original file, which can be of any file type.

The document family holds the list of content objects and also a concept called "Document Transitions". A document transition is a link between two content objects that shows how a content object was derived from another content object, and which assistant (or user) was responsible for the derivation.
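To make the relationship between content objects and document transitions concrete, here is a minimal sketch. The field names (contentObjects, transitions, sourceId, destinationId, actor) and the lineage helper are assumptions made for illustration and are not the platform's actual schema.

```javascript
// Hypothetical shape of a document family; all field names are illustrative only.
const family = {
  path: 'invoice.pdf',
  contentObjects: [
    { id: 'co-1', contentType: 'Native' },   // the original uploaded file
    { id: 'co-2', contentType: 'Document' }, // a Kodexa Document derived from it
    { id: 'co-3', contentType: 'Document' }, // a further derived document
  ],
  // Each transition records which content object was derived from which, and by whom.
  transitions: [
    { sourceId: 'co-1', destinationId: 'co-2', actor: 'pdf-parser-assistant' },
    { sourceId: 'co-2', destinationId: 'co-3', actor: 'labeling-user' },
  ],
};

// Walk the transitions backwards from a content object to the one it started from.
function lineage(family, contentObjectId) {
  const chain = [contentObjectId];
  const seen = new Set(chain); // guards against accidental cycles
  let current = contentObjectId;
  while (true) {
    const step = family.transitions.find((t) => t.destinationId === current);
    if (!step || seen.has(step.sourceId)) break;
    chain.unshift(step.sourceId);
    seen.add(step.sourceId);
    current = step.sourceId;
  }
  return chain;
}

console.log(lineage(family, 'co-3'));
```

Calling lineage on the last derived document lists every content object on the path back to the native file, which is the information the transitions are there to preserve.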

Store Options

The document store has a number of options that can be set to control how it behaves. These are set on the store object and are:

  • highQualityPreview - If set to true then the store will generate high quality previews of the documents. This will increase the time it takes to generate the previews but will result in better quality previews. The default value is false. This setting is used in the UI.
  • searchable - If set to true then the store will be searchable. This means that the platform will pass the content of documents in the store on for indexing.
  • deleteProtection - If set to true then the store will be protected from deletion. This means that you can't delete the store or delete all its contents. However, you can still delete documents from the store.
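The effect of deleteProtection can be sketched as a simple guard. Only the three option names come from the list above; the operation names (deleteStore, deleteAllContents, deleteDocument) and the isAllowed helper are hypothetical and exist only to illustrate the rule.

```javascript
// Option names are taken from the list above; the values are example settings.
const storeOptions = {
  highQualityPreview: false, // documented default
  searchable: true,
  deleteProtection: true,
};

// Hypothetical operation names, used only to illustrate the rule.
const BLOCKED_WHEN_PROTECTED = ['deleteStore', 'deleteAllContents'];

// Returns false only for operations that delete protection is described as blocking.
function isAllowed(options, operation) {
  return !(options.deleteProtection && BLOCKED_WHEN_PROTECTED.includes(operation));
}

console.log(isAllowed(storeOptions, 'deleteStore'));
console.log(isAllowed(storeOptions, 'deleteDocument'));
```

Deleting an individual document stays allowed even with protection on, which matches the description of the option.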

Document Properties

You can specify document properties on the store. These are presented to the user as fields to fill in when they upload a file to the document store.

This is a good way to capture information in the document family metadata that you can use later.

documentProperties:
  - type: string
    label: Customer ID
    name: CustomerID
    required: true

You can combine these with the label expressions described in the next section to tag documents automatically.

labelExpressions:
  - expression: "['CustomerID']"
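
An application that uploads files programmatically may want to check its metadata against the documentProperties definition before sending it. The sketch below is a client-side check only; the missingRequired helper is hypothetical, and how the platform itself enforces required properties is not described here.

```javascript
// Mirrors the documentProperties example above as a plain JavaScript array.
const documentProperties = [
  { type: 'string', label: 'Customer ID', name: 'CustomerID', required: true },
];

// Returns the labels of required properties that are absent or empty in the metadata.
function missingRequired(properties, metadata) {
  return properties
    .filter((p) => p.required)
    .filter((p) => metadata[p.name] === undefined || metadata[p.name] === '')
    .map((p) => p.label);
}

console.log(missingRequired(documentProperties, {}));
console.log(missingRequired(documentProperties, { CustomerID: 'C-1001' }));
```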

Expression Labels

When a document (either a native file or a Kodexa document) is added to a document store, we may want to decide whether to add a label to it. This can be achieved with Label Expressions.

A label expression is defined on a document store and adds a specific label to a new document based on the result of an expression. The expression itself is a Spring Expression Language (https://docs.spring.io/spring-framework/docs/3.2.x/spring-framework-reference/html/expressions.html) expression.

This supports a use case where the application uploading the document to the platform includes metadata with the upload. This metadata (as well as the document and document family) is then available for the expression to use.

Let's say we have an application that uploads documents to an instance of Kodexa. With each upload it associates a metadata value called 'ShouldProcessXML', which can be True or False. As we load the document into the document store, we want to check whether this metadata flag is present and, if it is present and not set to True, add the label dont_publish to the document. To do this, we create a label expression at the document store level with the following properties:

label: dont_publish

expression:

containsKey('ShouldProcessXML') && ['ShouldProcessXML'].toLowerCase() != 'true'

The expression is then evaluated. If it returns True (not case-sensitive), we add the configured label. If it returns any other string value, that value is used as the name of the label. For example, to add a label whose name is the value of a metadata field called 'CustomerName' supplied on upload, we would use the expression:

containsKey('CustomerName') ? ['CustomerName'] : null
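
The outcome rules can be emulated in a few lines. This is a sketch in JavaScript, not SpEL, and the platform does not run this code. It reads the first expression as "add the configured label when the result is true". The resolveLabel helper and its handling of a literal 'false' string are assumptions.

```javascript
// JavaScript stand-ins for the two SpEL expressions shown above.
const dontPublish = (metadata) =>
  'ShouldProcessXML' in metadata &&
  String(metadata.ShouldProcessXML).toLowerCase() !== 'true';

const customerName = (metadata) =>
  'CustomerName' in metadata ? metadata.CustomerName : null;

// Hypothetical helper: maps an expression result to the label to add, or null for none.
function resolveLabel(configuredLabel, result) {
  if (result === true) return configuredLabel;
  if (typeof result === 'string') {
    const lowered = result.toLowerCase();
    if (lowered === 'true') return configuredLabel; // "True" is not case-sensitive
    if (lowered === 'false' || result === '') return null; // assumption: no label
    return result; // any other string becomes the label name
  }
  return null; // false, null, or anything else adds no label
}

console.log(resolveLabel('dont_publish', dontPublish({ ShouldProcessXML: 'False' })));
console.log(resolveLabel(null, customerName({ CustomerName: 'Acme Corp' })));
```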

Label expressions are part of the store metadata, which is available at:

/api/stores/{organizationSlug}/{storeSlug}/metadata

File Upload API Documentation

This documentation demonstrates how to upload files to the document store using different programming languages and tools.

API Endpoint

POST /api/stores/{org-slug}/{store-slug}/{store-version}/fs

The endpoint accepts multipart form data with the following parameters:

  • path: The target path for the uploaded file (query parameter)
  • file: The file content
  • document: (Optional) Document metadata in KDDB format
  • Additional metadata can be included as form fields

Store Reference Format

The store reference follows the format: <org-slug>/<store-slug>/<store-version>

For example: demo-org/my-store/1.0.0
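
A small helper can split a store reference into its three parts and build the upload URL from them. This is a sketch: parseStoreRef and uploadUrl are hypothetical helper names, and api.example.com is the placeholder host used throughout this page.

```javascript
// Splits "<org-slug>/<store-slug>/<store-version>" into named parts.
function parseStoreRef(ref) {
  const parts = ref.split('/');
  if (parts.length !== 3 || parts.some((p) => p.length === 0)) {
    throw new Error(`Invalid store reference: ${ref}`);
  }
  const [orgSlug, storeSlug, storeVersion] = parts;
  return { orgSlug, storeSlug, storeVersion };
}

// Builds the upload URL for the fs endpoint, URL-encoding the target path.
function uploadUrl(baseUrl, ref, path) {
  const { orgSlug, storeSlug, storeVersion } = parseStoreRef(ref);
  const query = `path=${encodeURIComponent(path)}`;
  return `${baseUrl}/api/stores/${orgSlug}/${storeSlug}/${storeVersion}/fs?${query}`;
}

console.log(parseStoreRef('demo-org/my-store/1.0.0'));
console.log(uploadUrl('https://api.example.com', 'demo-org/my-store/1.0.0', 'invoice.pdf'));
```

Validating the reference up front gives a clearer error than a failed request against a malformed URL.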

Authentication

All requests must include the X-ACCESS-TOKEN header with a valid access token:

X-ACCESS-TOKEN: YOUR_ACCESS_TOKEN

Examples

cURL/Bash

Basic file upload:

curl -X POST "https://api.example.com/api/stores/demo-org/my-store/1.0.0/fs?path=invoice.pdf" \
  -H "X-ACCESS-TOKEN: YOUR_ACCESS_TOKEN" \
  -F "file=@/path/to/local/invoice.pdf"

Upload with additional metadata:

curl -X POST "https://api.example.com/api/stores/demo-org/my-store/1.0.0/fs?path=invoice.pdf" \
  -H "X-ACCESS-TOKEN: YOUR_ACCESS_TOKEN" \
  -F "file=@/path/to/local/invoice.pdf" \
  -F "customerName=Acme Corp" \
  -F "invoiceNumber=INV-2024-001"

JavaScript

Using the Fetch API:

async function uploadFile(storeRef, filePath, fileContent) {
  const formData = new FormData();
  formData.append('file', fileContent);

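  // storeRef format: "demo-org/my-store/1.0.0"
  // The target path is URL-encoded so spaces and special characters survive the query string
  const url = `https://api.example.com/api/stores/${storeRef}/fs?path=${encodeURIComponent(filePath)}`;
  const response = await fetch(url, {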
    method: 'POST',
    headers: {
      'X-ACCESS-TOKEN': 'YOUR_ACCESS_TOKEN'
    },
    body: formData
  });

  if (!response.ok) {
    throw new Error(`Upload failed: ${response.statusText}`);
  }

  return response.json();
}

// Example usage with file input
document.querySelector('input[type="file"]').addEventListener('change', async (e) => {
  const file = e.target.files[0];
  try {
    const result = await uploadFile('demo-org/my-store/1.0.0', file.name, file);
    console.log('Upload successful:', result);
  } catch (error) {
    console.error('Upload failed:', error);
  }
});

C#

Using HttpClient:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public class DocumentStoreClient
{
    private readonly HttpClient _client;
    private readonly string _baseUrl;

    public DocumentStoreClient(string baseUrl, string accessToken)
    {
        _client = new HttpClient();
        _baseUrl = baseUrl;
        _client.DefaultRequestHeaders.Add("X-ACCESS-TOKEN", accessToken);
    }

    public async Task<string> UploadFileAsync(
        string storeRef,
        string filePath,
        string targetPath,
        Dictionary<string, string> metadata = null)
    {
        using var form = new MultipartFormDataContent();
        using var fileStream = File.OpenRead(filePath);
        var fileContent = new StreamContent(fileStream);
        form.Add(fileContent, "file", Path.GetFileName(filePath));

        // Add optional metadata
        if (metadata != null)
        {
            foreach (var item in metadata)
            {
                form.Add(new StringContent(item.Value), item.Key);
            }
        }

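        // URL-encode the target path so spaces and special characters are preserved
        var url = $"{_baseUrl}/api/stores/{storeRef}/fs?path={System.Uri.EscapeDataString(targetPath)}";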
        var response = await _client.PostAsync(url, form);
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStringAsync();
    }
}

// Example usage
async Task UploadExample()
{
    var client = new DocumentStoreClient(
        "https://api.example.com",
        "YOUR_ACCESS_TOKEN"
    );

    var metadata = new Dictionary<string, string>
    {
        { "customerName", "Acme Corp" },
        { "invoiceNumber", "INV-2024-001" }
    };

    try
    {
        var result = await client.UploadFileAsync(
            "demo-org/my-store/1.0.0",
            @"C:\invoices\invoice.pdf",
            "invoice.pdf",
            metadata
        );
        Console.WriteLine($"Upload successful: {result}");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Upload failed: {ex.Message}");
    }
}

Response

A successful upload returns a JSON response containing the document family details:

{
  "id": "doc123",
  "path": "invoice.pdf",
  "created": "2024-03-21T10:30:00Z",
  "metadata": {
    "customerName": "Acme Corp",
    "invoiceNumber": "INV-2024-001"
  }
}

Error Handling

The API uses standard HTTP status codes:

  • 200: Success
  • 400: Bad Request (invalid parameters)
  • 401: Unauthorized (invalid access token)
  • 409: Conflict (file already exists when replace=false)
  • 500: Server Error

Each error response includes a JSON body with error details:

{
  "error": "File already exists",
  "code": "DUPLICATE_FILE",
  "path": "invoice.pdf"
}
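
A client can combine the status code with the JSON error body to produce a readable message. The sketch below assumes the error body has the shape shown above. The describeUploadError helper is hypothetical, and it falls back to the status meaning alone when the body is not valid JSON.

```javascript
// Status codes and meanings as listed above.
const STATUS_MEANINGS = {
  200: 'Success',
  400: 'Bad Request',
  401: 'Unauthorized',
  409: 'Conflict',
  500: 'Server Error',
};

// Builds a one-line description from an HTTP status and the raw response body text.
function describeUploadError(status, bodyText) {
  const meaning = STATUS_MEANINGS[status] || 'Unexpected status';
  let detail = '';
  try {
    const parsed = JSON.parse(bodyText);
    if (parsed && parsed.error) {
      detail = parsed.code ? `${parsed.error} [${parsed.code}]` : parsed.error;
    }
  } catch {
    detail = ''; // body was not JSON; report the status meaning only
  }
  return detail ? `${status} ${meaning}: ${detail}` : `${status} ${meaning}`;
}

const body = '{"error":"File already exists","code":"DUPLICATE_FILE","path":"invoice.pdf"}';
console.log(describeUploadError(409, body));
```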
