Key Concepts

To address the problem of reading and understanding unstructured data, many technology companies have developed purpose-built solutions to immediate problems. These solutions share a common weakness, however: they are not built on a shared content framework, so organizations cannot implement a single platform that meets all of their needs.

The founders of Kodexa recognized this gap in the industry and have spent many years engaged with some of the world's largest financial institutions helping them expand their data ecosystem to solve their unstructured data problems. Those experiences led to the creation of Kodexa, a flexible platform focused on:

  • Saving time by creating reusable components to solve problems that are common to all organizations;
  • Allowing teams to create custom components to solve issues specific to their needs;
  • Providing a standardized data format which allows both common and custom components to operate on a unified platform against the same data; and
  • Ensuring accuracy of results by creating a well-tested, fully documented, repeatable process.

Documents and Content Nodes

Kodexa Document

Working with unstructured data is a challenge for many reasons - the most obvious obstacle is the data's lack of formal structure. All frameworks that attempt to process unstructured data try to apply some structure; however, this can be challenging because unstructured data varies by type and by content. Imposing a structure requires these various types to be normalized in some way, and it must be done without losing fidelity.

Difficulties in structuring the data are further complicated when the data will be provided to third-party models/functions for processing. Different providers are likely to require the data to be structured in different ways to meet their needs, not yours. Data structures end up being dependent on the original data source type, the normalized structure imposed by the processing framework, the needs of third-party tools, and any use-case specific requirements.

Trying to fill these needs has traditionally led to the creation of (1) overly simplistic normalized structures that have lost important details, or (2) overly rigid structures that are constructed to work with specific models/functions but cannot be used more widely.

At Kodexa, our content model is called the Kodexa Document. It is a generalized data structure flexible enough to work with multiple sources of data (PDF, image, etc.) while also being rich enough to support the management of features and the application of tags.

Content Nodes

Content nodes are the structures that give the Kodexa Document its flexibility. Documents are represented in a generalized structure consisting of a collection of metadata and a set of content nodes. This structure may be thought of as a rich tree model, with a root content node at the top and one or more child content nodes branching off beneath it. Each child content node contains some portion of the document's content. The tree structure makes it possible to navigate within the document and to maintain lineage between parent and child nodes.
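The tree model described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the Kodexa implementation: the Node class, its fields, and the add_child and lineage helpers are all hypothetical names chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for a content node (not the Kodexa class)."""

    node_type: str
    content: str = ""
    parent: Optional[Node] = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list)

    def add_child(self, child: Node) -> Node:
        child.parent = self  # record lineage back to the parent
        self.children.append(child)
        return child

    def lineage(self) -> list[str]:
        """Return the node types from the root down to this node."""
        path, node = [], self
        while node is not None:
            path.append(node.node_type)
            node = node.parent
        return list(reversed(path))


# Build a small tree: a root node, one page, and one line of text.
root = Node("document")
page = root.add_child(Node("page"))
line = page.add_child(Node("line", "Hello world"))

print(line.lineage())
```

Because every child keeps a reference to its parent, a node found deep in the tree can always be traced back to the root.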

Labels, Features and Selectors

The Kodexa Document introduces three powerful concepts that, when combined, provide the building blocks of our processing.

Labels

Adding labels (also referred to as tags) to a Document is a way to mark data within the Document. It allows you to easily mark parts of the structure, or the text within a node, with a specific meaning. For example, when processing an HTML file, you may want to add a label named "introduction" to nodes of type 'p' (paragraph) that contain the phrase "Hello". In a later processing step, you may select the nodes with the "introduction" tag and perform another action on them. Since processing steps can refer to the presence of previously applied tags, tags provide a powerful and flexible way to build up an incremental understanding of the Document.

document.select('//*[contentRegex(".*Hello")]')[0].tag('introduction')
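The tag-then-reuse pattern can be illustrated without the Kodexa library. The sketch below is a toy model under stated assumptions: Node and tag_matching are hypothetical names, and tags are held in a plain Python set.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node with a type, text content, and a set of tags."""

    node_type: str
    content: str
    tags: set = field(default_factory=set)


def tag_matching(nodes, node_type, pattern, tag_name):
    """Tag every node of the given type whose content matches the pattern."""
    for node in nodes:
        if node.node_type == node_type and re.search(pattern, node.content):
            node.tags.add(tag_name)


nodes = [Node("p", "Hello there"), Node("p", "Goodbye"), Node("h1", "Hello")]

# First step: mark paragraphs that contain "Hello".
tag_matching(nodes, "p", r"Hello", "introduction")

# Later step: rely only on the tag, not on the original pattern.
introductions = [n for n in nodes if "introduction" in n.tags]
```

The later step does not need to know how the tag was produced, which is what lets processing steps build on one another.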

Features

Features are similar to tags in that they can be added to nodes to provide additional information about the node or its contents. Features record more granular information than tags, such as spatial coordinates identified during parsing or the entity type for each word in a node's content when performing Named-Entity Recognition (NER) processing.

When you start solving problems with Kodexa, you will learn that the flexibility of the Document is your friend. It gives you a consistent way to work across use cases, and since the model and API are consistent, you can write reusable code that can be leveraged in multiple use cases.

document.select('//*[contentRegex(".*Hello")]')[0].get_feature('spatial','bbox')
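One way to picture features is as values stored on a node under a (type, name) pair. The sketch below assumes exactly that and nothing more: the Node class is hypothetical, its get_feature returns the raw value rather than whatever object the Kodexa API returns, and the bounding box numbers are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node that stores features keyed by (type, name)."""

    content: str
    features: dict = field(default_factory=dict)

    def add_feature(self, feature_type, name, value):
        self.features[(feature_type, name)] = value

    def get_feature(self, feature_type, name, default=None):
        return self.features.get((feature_type, name), default)


word = Node("Hello")

# Illustrative values only: a bounding box and an NER entity type.
word.add_feature("spatial", "bbox", [72.0, 700.0, 110.5, 712.0])
word.add_feature("ner", "entity_type", "GREETING")

bbox = word.get_feature("spatial", "bbox")
missing = word.get_feature("spatial", "rotation")  # falls back to None
```

Keying by both type and name keeps features from different processing steps (spatial parsing, NER, and so on) from colliding on the same node.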

Selectors

Selectors allow you to search the Document based on its content, structure, or both. These are similar to XPath queries, but have been tailored to work with our content model. You will use selectors to identify areas of the Document to which you want to apply additional processing/extraction.

For example, you can quickly, and consistently, query different documents:

document.select('//*[contentRegex(".*Hello")]')

or even combine queries together, mixing content and tag conditions:

document.select('//*[contentRegex(".*Hello")] stream .[hasTag("intro")]')
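Conceptually, a selector walks the tree and keeps the nodes that satisfy every condition. The toy version below shows that idea with plain predicates; it is not the Kodexa selector engine. The names select, content_regex, and has_tag are hypothetical, and matching from the start of the content with re.match is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node with content, tags, and children."""

    node_type: str
    content: str = ""
    tags: set = field(default_factory=set)
    children: list = field(default_factory=list)


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def select(root, *predicates):
    """Return every node under root that satisfies all predicates."""
    return [n for n in walk(root) if all(p(n) for p in predicates)]


def content_regex(pattern):
    compiled = re.compile(pattern)
    return lambda n: compiled.match(n.content) is not None


def has_tag(name):
    return lambda n: name in n.tags


root = Node("document", children=[
    Node("p", "Hello world", {"intro"}),
    Node("p", "Hello again"),
    Node("p", "Bye", {"intro"}),
])

hello = select(root, content_regex(".*Hello"))
hello_intro = select(root, content_regex(".*Hello"), has_tag("intro"))
```

Here the second query narrows the first by also requiring the "intro" tag, which mirrors how a combined selector filters on content and tags together.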

Actions, Pipelines, and Steps

Actions

Actions provide the processing capabilities in Kodexa. All actions accept a Document, perform some evaluation or process on the Document, and then return a Document. To use an action in a pipeline, you wrap it in one of the pipeline's steps.

Kodexa actions all implement the same interface and work against the structure of the Kodexa Document. By supporting this universal interface, you can bring together multiple action implementations to solve an almost limitless set of problems.

We classify actions into one of the following types:

Parsers

Parsers take the metadata from a Kodexa Document and work against a "source" to build the content structure for a document.

A parser will always remove all the content in a document and replace it.

Taggers

Taggers add tags or features to parts of a Document's content to enrich the content in some way. The key distinction that classifies an action as a "Tagger" is that it adds information to the Document without changing its structure or content.

Transformers

A transformer is an action that changes the structure of a Document. For example, it may remove a certain type of node or collapse the nodes in a structure. Transformers may also add new nodes (such as columns or sentences) to the Document.

Extractors

An extractor is used to pull tagged data from a document and put it into a structured form. These structured forms may be tabular, such as CSV, or more document-like, such as JSON.
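The four action types can be pictured as functions that share one signature: each accepts a document and returns a document. The sketch below is a simplified model, not Kodexa code. Document, Node, the four functions, the "source_text" metadata key, and the extracted field used to hold extractor output are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_type: str
    content: str
    tags: set = field(default_factory=set)


@dataclass
class Document:
    """Hypothetical document: metadata, flat content nodes, extracted rows."""

    metadata: dict = field(default_factory=dict)
    nodes: list = field(default_factory=list)
    extracted: list = field(default_factory=list)  # assumed home for extractor output


def parse_lines(document):
    """Parser: replace all content using the source named in the metadata."""
    text = document.metadata["source_text"]
    document.nodes = [Node("line", line) for line in text.splitlines()]
    return document


def drop_blank_lines(document):
    """Transformer: change the structure by removing empty nodes."""
    document.nodes = [n for n in document.nodes if n.content.strip()]
    return document


def tag_totals(document):
    """Tagger: add information without changing structure or content."""
    for node in document.nodes:
        if node.content.startswith("Total"):
            node.tags.add("total")
    return document


def extract_tagged(document):
    """Extractor: collect tagged content as rows, a tabular form."""
    document.extracted = [
        {"tag": tag, "value": node.content}
        for node in document.nodes
        for tag in sorted(node.tags)
    ]
    return document


doc = Document(metadata={"source_text": "Invoice 42\n\nTotal: 100"})
for action in (parse_lines, drop_blank_lines, tag_totals, extract_tagged):
    doc = action(doc)
```

Because every function has the same shape, the actions can be reordered or swapped without changing the code that calls them.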

Pipelines

A pipeline is a linear set of steps that can be applied to a Kodexa Document. Each step calls an action which will either parse, enrich, transform, or replace the document. This approach allows you to assemble a set of steps that structure, tag, and normalize a document, file, or textual content.

To promote re-use and composability, these pipelines can be defined in code or metadata.

Let's take a look at how a pipeline logically works:

The pipeline is a collection of steps, each calling an action. A Kodexa Document is passed from the start to the end, and each action accepts the document and returns a document. This doesn't mean that the action is returning the same document that it received, since each action can change the content of the document along the way.
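The flow described above can be sketched as a small loop that hands each step's output to the next step. This is a minimal model under stated assumptions, not the Kodexa Pipeline class: Pipeline, add_step, run, and the two example actions are hypothetical. The document is made immutable here so that it is visible when an action returns a different object from the one it received.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Document:
    """Hypothetical immutable document: lines of text plus document-level tags."""

    lines: tuple = ()
    tags: frozenset = frozenset()


@dataclass
class Pipeline:
    """Hypothetical pipeline: an ordered list of document-to-document actions."""

    steps: list = field(default_factory=list)

    def add_step(self, action):
        self.steps.append(action)
        return self  # returning self allows calls to be chained

    def run(self, document):
        for action in self.steps:
            document = action(document)  # each step receives the previous result
        return document


def strip_blank(document):
    """Return a new document without blank lines."""
    kept = tuple(line for line in document.lines if line.strip())
    return replace(document, lines=kept)


def tag_greeting(document):
    """Return a new document tagged 'greeting' if any line contains 'Hello'."""
    if any("Hello" in line for line in document.lines):
        return replace(document, tags=document.tags | {"greeting"})
    return document


pipeline = Pipeline().add_step(strip_blank).add_step(tag_greeting)
source = Document(lines=("Hello", "", "World"))
result = pipeline.run(source)
```

After the run, source is unchanged and result is a separate object, which matches the point that an action need not return the same document it was given.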

