Creating a Custom Extractor Using Python

This guide describes how to create and add your own custom Python Extractor. Voyager uses extractors to detect file content types (MIME types) and read (ingest) content, including metadata where present. Different file formats may require specific extractors, although some extractors have the ability to read more than one file format.

The information gathered by extractors passes through the Indexing Pipeline and then becomes available to Voyager’s search capabilities. You can create a custom extractor to open a particular file type or input stream and gather the initial information that is passed through the indexing pipeline. Using Python, you can create a wide variety of extractor types tailored to the specific data types in your system.

To create a custom extractor using Python, you need to:

  • Create the Python file

  • Configure a format to use the new extractor

  • Test the new extractor

Creating a Python file

To create the Python file for your custom extractor:

  1. Create a Python file in the vgextractors folder located in Voyager’s install location (e.g. C:\Voyager\server_1.9.7.3360\app\py\extractors\vgextractors)

  2. Copy and paste the sample code below

import os
import json
from _vgdexfield import VgDexField


class JSONExtractor(object):

    @staticmethod
    def extractor():
        return "json"

    def get_info(self):
        return {'name': JSONExtractor.extractor(),
                'description': 'Extract a JSON file using Python',
                'formats': [{'name': 'text',
                             'mime': 'application/json',
                             'priority': 10}]}

    def extract(self, infile, job):
        """
        Main function to generate properties of a JSON file (.json).
        The JSON is loaded as a dictionary and each key is set as a field.
        :param infile: The JSON file
        :param job: A job object (use to set fields)
        """
        # Set some of the most basic properties.
        json_name = os.path.splitext(os.path.basename(infile))[0]
        job.set_field(VgDexField.NAME, json_name)
        job.set_field(VgDexField.PATH, infile)
        job.set_field(VgDexField.FILE_EXTENSION, 'json')
        job.set_field(VgDexField.FORMAT, 'application/json')
        job.set_field(VgDexField.FORMAT_CATEGORY, 'JSON')

        # Load file into dictionary.
        with open(infile, 'rb') as json_file:
            json_dict = json.load(json_file)

        for key, value in json_dict.iteritems():
            if isinstance(value, str) or isinstance(value, unicode):
                job.set_field('fs_{0}'.format(key), value)
            elif isinstance(value, int):
                job.set_field('fl_{0}'.format(key), value)
            elif isinstance(value, float):
                job.set_field('fu_{0}'.format(key), value)
            else:
                job.set_field('meta_{0}'.format(key), value)

        # Finally, read the file and set the text field.
        job.set_field(VgDexField.TEXT, json.dumps(json_dict))

  3. Save the file with a name ending with Extractor (e.g. JSONExtractor.py)

Notes

  • The name of each Python file must end with Extractor (e.g. JSONExtractor.py).

  • Each Python file must include a class with the same name as the file.

  • The class must have a static function named extractor which returns the extractor name.

  • Each class must have a function named get_info which will register the name, description and formats for which this extractor will be available. In the example above, this extractor will be available for JSON files (.json).

  • When Voyager starts up, the extractor is registered and should be listed at: http://localhost:8888/manage/#/discovery/extractor/

  • A MIME type must be provided. To find the MIME type for a format, click the format on the formats page: http://localhost:8888/manage/#/discovery/formats/

  • The class must also include an extract function. This function contains the code for generating and setting the properties. The job object has a function to set fields and their values. The _vgdexfield module contains a class named VgDexField with a large set of Voyager fields that can be set.
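
The conventions in these notes can be checked before restarting Voyager. The following is a minimal sketch, not part of Voyager: check_extractor_class and DemoExtractor are hypothetical names introduced here for illustration, and the check only covers the rules listed above.

```python
import inspect


def check_extractor_class(cls, filename):
    """Return a list of problems; an empty list means the conventions hold.

    Hypothetical helper, not a Voyager API.
    """
    problems = []
    module_name = filename.rsplit(".", 1)[0]

    if not module_name.endswith("Extractor"):
        problems.append("file name must end with 'Extractor'")
    if cls.__name__ != module_name:
        problems.append("class name must match the file name")

    # 'extractor' must be declared with @staticmethod.
    raw = inspect.getattr_static(cls, "extractor", None)
    if not isinstance(raw, staticmethod):
        problems.append("missing static function 'extractor'")

    if not callable(getattr(cls, "extract", None)):
        problems.append("missing function 'extract'")

    if not callable(getattr(cls, "get_info", None)):
        problems.append("missing function 'get_info'")
        return problems

    # get_info must describe the name, description and formats.
    info = cls().get_info()
    for key in ("name", "description", "formats"):
        if key not in info:
            problems.append("get_info() is missing '%s'" % key)
    for fmt in info.get("formats", []):
        if not fmt.get("mime"):
            problems.append("format '%s' has no mime type" % fmt.get("name"))
    return problems


class DemoExtractor(object):
    """Hypothetical extractor used only to exercise the check."""

    @staticmethod
    def extractor():
        return "demo"

    def get_info(self):
        return {"name": DemoExtractor.extractor(),
                "description": "Demo extractor for plain text",
                "formats": [{"name": "text", "mime": "text/plain",
                             "priority": 10}]}

    def extract(self, infile, job):
        pass


print(check_extractor_class(DemoExtractor, "DemoExtractor.py"))
```

An empty list means the class follows the naming and structure rules. It does not confirm that Voyager has registered the extractor, so the extractor page remains the place to verify that.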

Configuring a Format to Use a Python Extractor

To configure a format to use your Python extractor:

  1. Open the formats page by going to Manage Voyager > Discovery > Formats. Search for json. You should now see the Extractor listed as: json tika

  2. Click the format name, JSON, and ensure the py/json extractor is checked.

Testing the Python Extractor

To test your new custom Python extractor:

  1. Add a location containing one or more JSON files

  2. Build/Rebuild the index for that location. This will use the Python extractor and set the fields defined in the script

  3. Search for the indexed files and check their fields. For example, the following JSON file is indexed with one field for each of its top-level keys:

{
  "id": "home",
  "description": "Home address",
  "city": "Redlands",
  "street_number": 1,
  "street_name": "Main St.",
  "zip": "92373",
  "cross_streets": ["North", "South"]
}
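
To preview which field names the sample extractor would set for a file like this, its prefix rules can be run outside Voyager. This is a rough sketch under stated assumptions: RecordingJob is a hypothetical stand-in for Voyager's job object that only records set_field calls, set_json_fields is a hypothetical helper that mirrors the loop in the sample extractor, and the code is written for Python 3 (so str replaces the str/unicode check). The VgDexField fields (name, path, format, text) are left out because their values are defined by Voyager.

```python
import json


class RecordingJob(object):
    """Hypothetical stand-in for the job object; it only records fields."""

    def __init__(self):
        self.fields = {}

    def set_field(self, name, value):
        self.fields[name] = value


def set_json_fields(json_dict, job):
    """Apply the same prefix rules as the sample extractor's loop."""
    for key, value in json_dict.items():
        if isinstance(value, str):
            prefix = "fs_"      # string values
        elif isinstance(value, int):
            prefix = "fl_"      # integer values
        elif isinstance(value, float):
            prefix = "fu_"      # floating-point values
        else:
            prefix = "meta_"    # lists, nested objects, null
        job.set_field(prefix + key, value)


sample = json.loads("""
{"id": "home", "description": "Home address", "city": "Redlands",
 "street_number": 1, "street_name": "Main St.", "zip": "92373",
 "cross_streets": ["North", "South"]}
""")

job = RecordingJob()
set_json_fields(sample, job)
for name in sorted(job.fields):
    print(name, "=", job.fields[name])
```

Under these rules the string keys become fs_id, fs_description, fs_city, fs_street_name and fs_zip, the integer becomes fl_street_number, and the list becomes meta_cross_streets. How Voyager stores and displays those fields after indexing is not covered by this sketch.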