At the heart of Kodexa is the concept of a Document. In our world, a document represents the content, metadata, and features associated with any type of information. Rather than storing the information as text, we store it as a hierarchical structure made up of content nodes.
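To make the idea of a hierarchical structure concrete, here is a minimal sketch of a tree of content nodes. The Node class, its field names, and the sample text are hypothetical illustrations and are not the Kodexa SDK's own classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a content node; not the Kodexa class."""
    node_type: str
    content: Optional[str] = None
    features: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        """Yield this node, then every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# A tiny tree: one page holding a heading line and a body line.
root = Node("page", children=[
    Node("line", "Invoice 1234", features={"tag": "heading"}),
    Node("line", "Total due: 50.00"),
])

# Flattening the tree recovers the plain text, while the structure and
# the per-node features stay available for later processing steps.
text = "\n".join(node.content for node in root.walk() if node.content)
print(text)
```

Keeping the tree rather than the flattened text is what lets later steps attach features to individual nodes.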
What is a Connector?
A connector in Kodexa links a Kodexa Document instance back to its original source content. It provides a mechanism to retrieve the raw data from which the document was created.
Connectors play a crucial role in scenarios where you need to interact with the original file format. For instance, if you're working with a PDF document that has been parsed into a Kodexa Document, you can use a connector (such as the local filesystem connector) to fetch the original PDF file. This capability is particularly useful for tasks like OCR processing or extracting text directly from the PDF.
One of the key advantages of connectors is their flexibility in the document processing pipeline. You can utilize them at any stage of processing a Kodexa document. For example, you might have a pipeline that first parses a PDF, then performs layout analysis, and finally identifies tables. At any point in this process, you could use a connector to access the original PDF file, perhaps to apply computer vision techniques for more advanced table detection.
Here's a simple representation of such a pipeline:
graph LR
A[PDF] --> B(Parse PDF)
B --> C(Layout Analysis)
C --> D(Table Identification)
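The same flow can be sketched in code. This is only an illustration under stated assumptions: StubDocument, the step functions, and run_pipeline are hypothetical names rather than Kodexa APIs, and each step records a placeholder result instead of doing real parsing. It shows the last step reaching back to the original bytes through a connector.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StubDocument:
    """Hypothetical stand-in for a Kodexa Document; only `source` matters here."""
    source: str
    metadata: Dict[str, object] = field(default_factory=dict)


class LocalFilesystemConnector:
    @staticmethod
    def get_source(document: StubDocument) -> bytes:
        with open(document.source, "rb") as source_file:
            return source_file.read()


def parse_pdf(doc: StubDocument) -> StubDocument:
    doc.metadata["parsed"] = True  # placeholder for real parsing
    return doc


def layout_analysis(doc: StubDocument) -> StubDocument:
    doc.metadata["layout"] = "done"  # placeholder for real analysis
    return doc


def table_identification(doc: StubDocument) -> StubDocument:
    # Late in the pipeline, go back to the original file via the connector.
    raw = LocalFilesystemConnector.get_source(doc)
    doc.metadata["source_size"] = len(raw)
    doc.metadata["looks_like_pdf"] = raw.startswith(b"%PDF")
    return doc


def run_pipeline(doc: StubDocument,
                 steps: List[Callable[[StubDocument], StubDocument]]) -> StubDocument:
    for step in steps:
        doc = step(doc)
    return doc


# Write a throwaway file that plays the role of the original PDF.
fd, path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"%PDF-1.4 placeholder bytes")

try:
    result = run_pipeline(
        StubDocument(source=path),
        [parse_pdf, layout_analysis, table_identification],
    )
finally:
    os.remove(path)

print(result.metadata)
```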
How is a Connector implemented?
Implementing a connector in Kodexa is straightforward. A connector class must implement a single static method called get_source. This method takes a document as input and returns the bytes of the original content.
Let's examine a simple example of a connector implementation:
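from kodexa import Document


class LocalFilesystemConnector:
    @staticmethod
    def get_source(document: Document) -> bytes:
        # Read the original file as raw bytes; the with-block closes the handle.
        with open(document.source, 'rb') as source_file:
            return source_file.read()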
In this example, the LocalFilesystemConnector class defines a get_source method that opens the file specified by the document's source attribute and reads its contents as bytes. This implementation allows easy access to files stored in the local filesystem.
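A short usage sketch may help show how the method is called. To keep it runnable without the Kodexa package, it assumes a hypothetical StubDocument whose source attribute is a plain file path, and it uses a temporary file in place of real original content.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class StubDocument:
    """Hypothetical stand-in for a Kodexa Document with a path-like source."""
    source: str


class LocalFilesystemConnector:
    @staticmethod
    def get_source(document: StubDocument) -> bytes:
        with open(document.source, "rb") as source_file:
            return source_file.read()


# Create a throwaway file to play the role of the original content.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"original content")

try:
    # No connector instance is needed because get_source is static.
    data = LocalFilesystemConnector.get_source(StubDocument(source=path))
finally:
    os.remove(path)

print(type(data).__name__, len(data))
```

Because the method is static, callers only need the connector class and a document, which keeps the call site the same for every connector.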
Connectors can be implemented for various source types, such as cloud storage services, databases, or remote file systems, providing a uniform interface for accessing original document content regardless of its location or storage mechanism.
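As a rough sketch of that uniform interface, the example below pairs the local filesystem connector with a second, hypothetical InMemoryConnector whose dictionary stands in for a remote store such as a cloud bucket. StubDocument, InMemoryConnector, fetch_original, and the sample key are all assumptions made for illustration, not part of Kodexa.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Dict


@dataclass
class StubDocument:
    """Hypothetical stand-in for a Kodexa Document."""
    source: str


class LocalFilesystemConnector:
    @staticmethod
    def get_source(document: StubDocument) -> bytes:
        with open(document.source, "rb") as source_file:
            return source_file.read()


class InMemoryConnector:
    """Hypothetical connector; the dict stands in for a remote store."""
    store: Dict[str, bytes] = {}

    @staticmethod
    def get_source(document: StubDocument) -> bytes:
        return InMemoryConnector.store[document.source]


def fetch_original(connector, document: StubDocument) -> bytes:
    """Callers depend only on get_source, not on where the bytes live."""
    return connector.get_source(document)


# Seed the fake remote store and a real temporary file.
InMemoryConnector.store["reports/q1.pdf"] = b"%PDF remote copy"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as handle:
    handle.write(b"%PDF local copy")

try:
    local_bytes = fetch_original(LocalFilesystemConnector, StubDocument(path))
finally:
    os.remove(path)
remote_bytes = fetch_original(InMemoryConnector, StubDocument("reports/q1.pdf"))

print(local_bytes, remote_bytes)
```

The calling code is identical for both backends; only the connector class passed in changes.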
By leveraging connectors, Kodexa offers a flexible and powerful way to maintain a link between processed documents and their original sources, enabling more sophisticated document analysis and processing workflows.