A document store is a repository for the original documents from which data is extracted. The same documents can also be used to train models.
A document store holds what we call Document Families. These are logical containers that relate both the original file and any of the derived documents that are created from it.
Store Purposes
There are two main purposes for a document store:
- To hold documents that we will be using for training models
- To hold documents that we will be using to extract data
On the store object we have a storePurpose property that can be set to either TRAINING or OPERATIONAL. This is used to determine which documents are available for use in the store. The actual functionality of the store itself is the same regardless of the purpose.
Anatomy of a Document Family
A document family consists of a document and any derived documents created from it. Since a document family can contain both a native PDF and the Kodexa Documents derived from it, we use a stereotype called a content object. A content object points to something that contains content, which can be a file or a document. The content type on the content object is therefore either 'Document' or 'Native'. Here 'Native' means the original file, which could be of any file type.
The document family holds the list of content objects and also a concept called "Document Transitions". A document transition is a link between two content objects that shows how a content object was derived from another content object, and which assistant (or user) was responsible for the derivation.
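The relationships described above can be sketched as a small data model. This is only an illustration: the class and field names (ContentObject, DocumentTransition, DocumentFamily, derived_from) are hypothetical and are not taken from the Kodexa API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentObject:
    """Points at something that holds content (hypothetical model)."""
    id: str
    content_type: str  # 'Native' (the original file) or 'Document' (a Kodexa Document)


@dataclass
class DocumentTransition:
    """Records that one content object was derived from another."""
    source_id: str
    destination_id: str
    actor: str  # the assistant (or user) responsible for the derivation


@dataclass
class DocumentFamily:
    path: str
    content_objects: List[ContentObject] = field(default_factory=list)
    transitions: List[DocumentTransition] = field(default_factory=list)

    def derived_from(self, content_id: str) -> Optional[str]:
        """Return the id of the content object this one was derived from, if any."""
        for transition in self.transitions:
            if transition.destination_id == content_id:
                return transition.source_id
        return None


# A family holding a native PDF and one Kodexa Document derived from it.
family = DocumentFamily(path="invoice.pdf")
family.content_objects.append(ContentObject("c1", "Native"))
family.content_objects.append(ContentObject("c2", "Document"))
family.transitions.append(DocumentTransition("c1", "c2", actor="parser-assistant"))
```

Following the transitions backwards from any content object leads to the native file, which is what makes the family a single logical container.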
Store Options
The document store has a number of options that can be set to control how it behaves. These are set on the store object and are:
- highQualityPreview: If set to true, the store will generate high quality previews of the documents. This will increase the time it takes to generate the previews but will result in better quality previews. The default value is false. This setting is used in the UI.
- searchable: If set to true, the store will be searchable. This means that the platform will pass content from documents to indexing.
- deleteProtection: If set to true, the store will be protected from deletion. This means that you can't delete the store or delete all its contents. However, you can still delete documents from the store.
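As a rough illustration of how deleteProtection behaves, the sketch below models the three options and the deletion rule in plain Python. The StoreOptions class and can_delete function are hypothetical, and the false defaults for searchable and deleteProtection are assumptions; only the highQualityPreview default is stated above.

```python
from dataclasses import dataclass


@dataclass
class StoreOptions:
    high_quality_preview: bool = False  # default stated in the option list
    searchable: bool = False            # default is an assumption
    delete_protection: bool = False     # default is an assumption


def can_delete(options: StoreOptions, target: str) -> bool:
    """Return whether a delete is allowed.

    target is one of 'store', 'all_contents', or 'document' (hypothetical names).
    """
    if options.delete_protection and target in ("store", "all_contents"):
        return False
    # Individual documents can always be deleted, even from a protected store.
    return True


protected = StoreOptions(delete_protection=True)
```

With these assumptions, can_delete(protected, "store") is refused while can_delete(protected, "document") is allowed.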
Document Properties
You can specify document properties. These are shown to the user as options when they upload a file to the document store.
This is a good way to capture information in the document family metadata that you can use later.
documentProperties:
  - type: string
    label: Customer ID
    name: CustomerID
    required: true
You can combine these with the label expressions described in the next section to tag documents automatically.
labelExpressions:
  - expression: "['CustomerID']"
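To show how a required document property relates to the metadata captured on upload, here is a minimal sketch that checks a metadata dictionary against the property definitions. The missing_required helper is hypothetical and is not part of the platform; it only mirrors the fields of the documentProperties entry shown above.

```python
from typing import Dict, List

# Mirrors the documentProperties entry shown above.
document_properties = [
    {"type": "string", "label": "Customer ID", "name": "CustomerID", "required": True},
]


def missing_required(properties: List[dict], metadata: Dict[str, str]) -> List[str]:
    """Return the names of required properties that are absent or empty in the metadata."""
    return [
        prop["name"]
        for prop in properties
        if prop.get("required") and not metadata.get(prop["name"])
    ]
```

An upload that supplies CustomerID yields an empty list, and the same value is then available to a label expression such as ['CustomerID'].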
Expression Labels
When a document (either a native file or a Kodexa document) is added to a document store, we may want to decide whether to add a label to it. This can be achieved with label expressions.
A label expression lets you add a specific label to a new document in a document store, based on the result of an expression. The expression itself is a Spring Expression Language (https://docs.spring.io/spring-framework/docs/3.2.x/spring-framework-reference/html/expressions.html) expression.
This supports a use case where the application uploading the document to the platform includes metadata with the upload. This metadata (as well as the document and document family) is then available for the expression to use.
Let’s say we have an application that is uploading documents to an instance of Kodexa. The upload associates a value in metadata called “ShouldPublishXml”, which can be True or False. As we load the document into the document store, we want to determine whether this metadata flag is present and, if it is present and not set to True, add the label dont_publish to the document. To do this, we create a label expression at the document store level with the following properties:
label: dont_publish
expression: "containsKey('ShouldPublishXml') && ['ShouldPublishXml'].toLowerCase() != 'true'"
This expression is then evaluated. If it returns true (not case-sensitive), the label is added. If the expression returns a string value, that value is used as the name of the label. For example, to add a label whose name is the value of the 'CustomerName' metadata field supplied on upload, we would use the expression:
containsKey('CustomerName') ? ['CustomerName'] : null
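The two outcomes described above (a true result applies the configured label, a string result names the label) can be modelled in a few lines of Python. This is a sketch of the described behaviour only, not the platform's evaluator: resolve_label is a hypothetical helper, the lambdas are hand-written Python equivalents of the two expressions, and treating a null or false result as "no label" is an assumption.

```python
from typing import Callable, Dict, Optional

Metadata = Dict[str, str]


def resolve_label(
    configured_label: Optional[str],
    expression: Callable[[Metadata], object],
    metadata: Metadata,
) -> Optional[str]:
    """Map an expression result to the label to apply, or None for no label."""
    result = expression(metadata)
    if isinstance(result, bool):
        return configured_label if result else None
    if isinstance(result, str) and result:
        # A string 'true' (any case) is treated like a boolean true.
        if result.lower() == "true":
            return configured_label
        # Any other string is used as the label name itself.
        return result
    return None  # assumed: null or empty results add no label


# Python equivalents of the two expressions discussed above.
dont_publish = lambda m: "ShouldPublishXml" in m and m["ShouldPublishXml"].lower() != "true"
customer_name = lambda m: m["CustomerName"] if "CustomerName" in m else None
```

For example, resolve_label("dont_publish", dont_publish, {"ShouldPublishXml": "False"}) applies the configured label, while the same call with the flag set to "True", or with no flag at all, applies nothing.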
Expression labels are part of the store metadata, which is available at:
/api/stores/{organizationSlug}/{storeSlug}/metadata
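As a small convenience, the path can be assembled from the two slugs. The store_metadata_path helper below is hypothetical; it only fills in the template above and percent-encodes each slug so that unexpected characters cannot alter the path.

```python
from urllib.parse import quote


def store_metadata_path(organization_slug: str, store_slug: str) -> str:
    """Build the store metadata path from the organization and store slugs."""
    org = quote(organization_slug, safe="")
    store = quote(store_slug, safe="")
    return f"/api/stores/{org}/{store}/metadata"
```

The host name and any authentication are deployment-specific and are deliberately left out of this sketch.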