Data types are used by the Extraction Engine, which is part of the core platform. The engine runs when you designate one or more data structures and a data store in a pipeline.

The final document is passed to the Extraction Engine, which then builds the data objects and data attributes linked back to the labeled document. Data types affect only the data attribute, which is designed to hold multiple representations of a piece of data. The following data types are currently available:

| Type | Description |
| --- | --- |
| String | The most basic data type; it can hold any type of information as a string of characters |
| Date | Supports capturing a date without a time element; the date is defined as local to UTC |
| Date/Time | Supports capturing a date with a time element; the date is defined as local to UTC |
| Phone Number | Tries to normalize a phone number |
| Email Address | Tries to convert the labeled content to a valid email address |
| Selection | Tries to match the labeled value to a list of available options |
| Number | Tries to convert the labeled content to a number |
| Currency | Tries to convert the labeled content to a valid currency (decimal) |
| Boolean | Tries to convert the labeled content to a boolean value |

Understanding Normalization (Coalescing)

When we label text in a document, it is always a “string”. This means we capture the text as it is, without trying to standardize (or normalize) it.

However, most systems that consume data from Kodexa need to know that the data is of a specific type, such as a number or a date. This conversion happens when a data type is set on a data attribute. The Extraction Engine takes the text that is labeled in the document and tries to coalesce it into a specific form; for example, it might take the string “1.0” and turn it into a number.

This matters because the system using the data from Kodexa can rely on the data being “valid” for that data type. For example, if the data type is a number, “abc” will not be allowed for that data attribute.

Algorithms for Coalescing

The following table breaks down how labeled data is coalesced into each data type.

| Data Type | Description | Algorithm |
| --- | --- | --- |
| Date or Date/Time | The Extraction Engine uses an NLP framework to try to convert the labeled text to a date/time | NLP conversion |
| Boolean | True if the text (in lowercase) is “true”; otherwise false | Simple string match |
| Currency | Attempts to convert the text to a decimal | Decimal conversion |
| Email | Extracts the email address | Regex validation |
| Number | Parses the text as a decimal number | Numeric parsing |
| Phone Number | Parses the phone number using Google’s LibPhoneNumber | Library parsing |
| Selectable Option | No coalescing is currently applied | N/A |
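
As a rough illustration of the simpler rows in the table above, the following Python sketch mimics the boolean, number, and email rules. It is not the Extraction Engine's code: the function names, the email pattern, and the choice to return `None` when coalescing fails are all assumptions made for this example.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

# Hypothetical pattern for illustration; the engine's actual regex is not documented here.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def coalesce_boolean(text: str) -> bool:
    """Simple string match: true only when the lowercased text is 'true'."""
    return text.lower() == "true"


def coalesce_number(text: str) -> Optional[Decimal]:
    """Parse the text as a decimal number.

    Returning None on failure is an assumption of this sketch.
    """
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        return None
    # Reject NaN and Infinity, which Decimal would otherwise accept.
    return value if value.is_finite() else None


def coalesce_email(text: str) -> Optional[str]:
    """Extract the first substring that looks like an email address."""
    match = EMAIL_PATTERN.search(text)
    return match.group(0) if match else None
```

With these helpers, `coalesce_number("1.0")` yields a `Decimal`, while `coalesce_number("abc")` yields `None`, which mirrors the rule that “abc” is not a valid number.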

How is Typed Data Stored?

A data attribute can store multiple representations of a piece of extracted data. Depending on the data type defined in the data structure, one or more of the properties of the data attribute will be updated.

| Property | Description | Applies to |
| --- | --- | --- |
| value | This is the raw value that was captured from the label | All |
| stringValue | This is the raw value as a string | Selectable Options, String |
| dateValue | This is the date/time in ISO format (YYYY-MM-DD and YYYY-MM-DDThh:mm) | Date, Date/Time |
| booleanValue | This is the boolean value | Boolean |
| decimalValue | This is the number or currency value | Currency, Number |
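
To make the mapping in the table above concrete, here is a minimal Python sketch of a data attribute that keeps the raw value and fills in one typed property. The `DataAttribute` class and `build_attribute` function are hypothetical names for this example, the data type names are passed as plain strings, and leaving the typed property unset when conversion fails is an assumption. Date handling is omitted because it relies on an NLP framework.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class DataAttribute:
    """Hypothetical stand-in using the property names from the table."""
    value: str                               # raw value captured from the label
    stringValue: Optional[str] = None
    dateValue: Optional[str] = None          # not populated in this sketch
    booleanValue: Optional[bool] = None
    decimalValue: Optional[Decimal] = None


def build_attribute(raw: str, data_type: str) -> DataAttribute:
    """Populate the typed property that applies to the given data type."""
    attribute = DataAttribute(value=raw)
    if data_type in ("String", "Selectable Option"):
        attribute.stringValue = raw
    elif data_type == "Boolean":
        attribute.booleanValue = raw.lower() == "true"
    elif data_type in ("Number", "Currency"):
        try:
            attribute.decimalValue = Decimal(raw.strip())
        except InvalidOperation:
            pass  # assumption: leave decimalValue unset when parsing fails
    return attribute
```

Whatever the data type, the raw text remains available in `value`, so a consumer can always fall back to what was originally labeled.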

Content Source

When we extract data from a document label, we capture the text that is labeled. This “raw” value is stored in the “value” property of the data attribute. However, we also need to know where that raw value comes from in the document. This is controlled by the ‘Content Source’ property of the taxon.

We support the following types of content source:

| Content Source | Description |
| --- | --- |
| Value or All Content | We look at the label and check whether it has been given a value. If so, we use that value. If the label does not have a specified value, we take all the text that the label has been applied to and use that as the value |
| Value Only | We look at the label and use its value if one has been given. If the label does not specify a value, we return null |
| All Content | We look at the label and take all the text that it has been applied to as the value |
| Expression | Allows the user to define an expression that will be used to capture the value; see Expressions below |
| Script | Allows the user to define a script that will be used to capture the value; see Scripts below |
| Metadata | Allows the user to choose a metadata object that will be used as the value; see Metadata below |
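
The first three content sources differ only in how they fall back between the label's value and its content. The Python sketch below shows that decision under stated assumptions: `Label`, `ContentSource`, and `resolve_raw_value` are hypothetical names for this example, not part of the Kodexa API, and the expression, script, and metadata sources are left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ContentSource(Enum):
    VALUE_OR_ALL_CONTENT = "Value or All Content"
    VALUE_ONLY = "Value Only"
    ALL_CONTENT = "All Content"


@dataclass
class Label:
    """Hypothetical label: the text it covers plus an optional explicit value."""
    content: str
    value: Optional[str] = None


def resolve_raw_value(label: Label, source: ContentSource) -> Optional[str]:
    """Pick the raw value for a data attribute based on the content source."""
    if source is ContentSource.VALUE_ONLY:
        return label.value  # None when the label has no explicit value
    if source is ContentSource.ALL_CONTENT:
        return label.content
    # Value or All Content: prefer the explicit value, else fall back to the text.
    return label.value if label.value is not None else label.content
```

For a label covering the text "Invoice total: 100" with no explicit value, "Value Only" gives `None`, while the other two sources give the full labeled text.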

Expressions

Expressions let you compute the value of the data attribute. They are written in the Spring Expression Language (SpEL). When you write an expression, the context is the data object that you are working with, and the result of the expression is the value that is assigned to the attribute.

Since the data object is the context, you can use methods from the data object in the expression. For example, to get the value of another attribute, you can use:

getAttribute('attributeName').getValue()

Other objects are also available as variables in the expression. For example, to get a piece of information from the metadata of the source document, you can use the metadata variable:

metadata['CorrelationId']

The objects that are available to the expression are:

| Object Name | Description |
| --- | --- |
| document | The document that the data object is associated with |
| dataObject | The data object that the expression is being evaluated against |
| metadata | The metadata of the document that the data object is associated with |
| family | The document family that the document is associated with |

Scripts

Scripts are another way to compute the value of the data attribute. They are written in the Groovy language. A script works slightly differently from an expression: the attribute is available to the script as a variable, and you assign the value to the attribute directly in the script.

The objects that are available to the script are:

| Object Name | Description |
| --- | --- |
| attribute | The attribute that the script is being evaluated for |
| document | The document that the data object is associated with |
| dataObject | The data object that the script is being evaluated against |
| metadata | The metadata of the document that the data object is associated with |
| family | The document family that the document is associated with |