Architecture

The information extraction pipeline passes through three stages of abstraction:

  1. File format

  2. Content

  3. Predicate-value pairs

For example, an image can be stored in various file formats (JPEG, TIFF, PNG). In turn, a file format can store different kinds of information such as the image data (pixels) and additional metadata (image dimensions, EXIF tags). Finally, we translate the information read from the file into predicate-value pairs that can be attached to a file node in BSFS, e.g., (bse:filesize, 8150000), (bse:width, 6000), (bse:height, 4000), (bse:iso, 100), etc.
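As a minimal sketch of the last stage, the predicate-value pairs from the example above can be represented as plain tuples. The `PredicateValue` alias and the `image_pairs` helper are hypothetical names used only for illustration; they are not part of BSIE or BSFS.

```python
from typing import Any, List, Tuple

# Hypothetical representation: a predicate name paired with a plain value.
PredicateValue = Tuple[str, Any]


def image_pairs(filesize: int, width: int, height: int, iso: int) -> List[PredicateValue]:
    """Translate information already read from a file into predicate-value pairs."""
    return [
        ("bse:filesize", filesize),
        ("bse:width", width),
        ("bse:height", height),
        ("bse:iso", iso),
    ]


# The values are the ones used in the example above.
pairs = image_pairs(filesize=8150000, width=6000, height=4000, iso=100)
```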

The extraction pipeline is thus divided into Readers that abstract from file formats and content types, and Extractors which produce predicate-value pairs from content artifacts.

Readers

Readers read the actual file (considering different file formats) and isolate specific content artifacts therein. The content artifact (in an internal representation) is then passed to an Extractor for further processing.

For example, the Image reader aims at reading the content (pixels) of an image file. It automatically detects which Python package (e.g., rawpy, pillow) to use when faced with the various existing image file formats. The image data is then converted into a PIL.Image instance (irrespective of which package was used to read the data), and passed on to the extractor.
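The idea of selecting a backend per file format and returning one common representation can be sketched as follows. This is not the BSIE Image reader: to stay free of third-party packages it handles only plain-text PPM files (without comment lines), and `Pixels`, `ImageReader`, and `_read_ppm` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List, Tuple


@dataclass
class Pixels:
    """Hypothetical common representation handed on to extractors."""
    width: int
    height: int
    data: List[Tuple[int, ...]]


def _read_ppm(path: Path) -> Pixels:
    """Backend for plain-text PPM (P3) files; assumes no comment lines."""
    tokens = path.read_text().split()
    if tokens[0] != "P3":
        raise ValueError("not a plain PPM file")
    width, height = int(tokens[1]), int(tokens[2])
    # tokens[3] is the maximum channel value; the samples follow it.
    values = [int(tok) for tok in tokens[4:]]
    data = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return Pixels(width, height, data)


class ImageReader:
    """Pick a backend by file extension and return a Pixels instance."""

    def __init__(self) -> None:
        # Further formats would register their own backend here.
        self._backends: Dict[str, Callable[[Path], Pixels]] = {".ppm": _read_ppm}

    def __call__(self, path) -> Pixels:
        path = Path(path)
        backend = self._backends.get(path.suffix.lower())
        if backend is None:
            raise ValueError(f"unsupported file format: {path.suffix}")
        return backend(path)
```

Whichever backend reads the file, the caller always receives the same `Pixels` type, so an extractor does not need to know which file format was involved.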

Extractors

Extractors turn content artifacts into predicate-value pairs that can be inserted into a BSFS storage. The predicate is defined by each extractor, as prescribed by BSFS’ schema handling.

For example, the ColorsSpatial class (bsie.extractor.image.colors_spatial.ColorsSpatial) determines regionally dominant colors from given pixel data. It then produces a feature vector and attaches it to the image file via the appropriate predicate.
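To illustrate the shape of such an extractor, the sketch below computes the mean color of each image quadrant and returns it as one feature vector. It is a deliberately simplified stand-in, not the ColorsSpatial algorithm: it uses mean colors instead of dominant colors, and the class name `RegionalMeanColor` and the predicate `bse:colors_regional_mean` are made up for this example.

```python
from typing import List, Sequence, Tuple

RGB = Tuple[int, int, int]


def mean_color(pixels: Sequence[RGB]) -> Tuple[float, ...]:
    """Average each channel over the given pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


class RegionalMeanColor:
    """Hypothetical extractor; not the ColorsSpatial implementation."""

    predicate = "bse:colors_regional_mean"  # made-up predicate name

    def extract(self, rows: List[List[RGB]]) -> List[Tuple[str, List[float]]]:
        # Assumes a rectangular image of at least 2x2 pixels.
        height, width = len(rows), len(rows[0])
        mid_y, mid_x = height // 2, width // 2
        feature: List[float] = []
        # Visit the quadrants top-left, top-right, bottom-left, bottom-right.
        for y0, y1 in ((0, mid_y), (mid_y, height)):
            for x0, x1 in ((0, mid_x), (mid_x, width)):
                region = [rows[y][x] for y in range(y0, y1) for x in range(x0, x1)]
                feature.extend(mean_color(region))
        return [(self.predicate, feature)]
```

The result is a single predicate-value pair whose value is a 12-element vector (four regions with three channels each).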

BSIE lib and apps

The advantage of separating the reading and extraction steps is that multiple extractors can consume the same content, so the same data does not have to be read repeatedly. This close interaction between readers and extractors is encapsulated within the Pipeline class.
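The benefit of sharing content can be sketched with a small stand-in for the Pipeline class. `MiniPipeline`, its `register` method, the placeholder reader, and the predicates `bse:word_count` and `bse:char_count` are all assumptions made for illustration and do not reflect the actual BSIE interface.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterator, List, Tuple

Reader = Callable[[str], Any]
Extractor = Callable[[Any], List[Tuple[str, Any]]]


class MiniPipeline:
    """Hypothetical stand-in: groups extractors by the reader they depend on."""

    def __init__(self) -> None:
        self._by_reader: Dict[Reader, List[Extractor]] = defaultdict(list)

    def register(self, reader: Reader, extractor: Extractor) -> None:
        self._by_reader[reader].append(extractor)

    def __call__(self, path: str) -> Iterator[Tuple[str, Any]]:
        for reader, extractors in self._by_reader.items():
            content = reader(path)  # each reader runs once per file
            for extractor in extractors:
                yield from extractor(content)


read_calls: List[str] = []


def text_reader(path: str) -> str:
    read_calls.append(path)  # record each read to make the sharing visible
    return "hello world"     # placeholder content instead of real file access


def word_count(content: str) -> List[Tuple[str, Any]]:
    return [("bse:word_count", len(content.split()))]


def char_count(content: str) -> List[Tuple[str, Any]]:
    return [("bse:char_count", len(content))]


pipeline = MiniPipeline()
pipeline.register(text_reader, word_count)
pipeline.register(text_reader, char_count)
pairs = list(pipeline("example.txt"))
```

Both extractors contribute a pair, while `read_calls` ends up with a single entry because the reader was invoked only once.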

In addition, dealing with various file formats and content artifacts potentially pulls in a large number of dependencies. To make matters worse, many of those might not be needed in a specific scenario, e.g., if a user only works with a limited set of file formats. BSIE therefore implements a best-effort approach: modules that cannot be imported due to missing dependencies are ignored.
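A best-effort import of this kind could look roughly like the sketch below, which relies only on the standard `importlib` module. The helper name `import_available` and the module names passed to it are illustrative; how BSIE actually discovers and loads its modules is not shown here.

```python
import importlib
from types import ModuleType
from typing import Dict, Iterable


def import_available(module_names: Iterable[str]) -> Dict[str, ModuleType]:
    """Import every module that can be imported and skip the rest."""
    loaded: Dict[str, ModuleType] = {}
    for name in module_names:
        try:
            loaded[name] = importlib.import_module(name)
        except ImportError:
            # Missing dependency: ignore this module instead of failing.
            continue
    return loaded


# The second name is a placeholder for a package that is not installed.
modules = import_available(["json", "surely_not_installed_package_xyz"])
```

Only the importable modules end up in the returned dictionary, so the rest of the program can work with whatever subset is available.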

With these two concerns taken care of, BSIE offers a few end-user applications that reduce the complexity of the task to a relatively simple command.