Creating a Custom Extractor Using Python
This guide describes how to create and add your own custom Python Extractor. Voyager uses extractors to detect file content types (MIME types) and read (ingest) content, including metadata where present. Different file formats may require specific extractors, although some extractors have the ability to read more than one file format.
The information gathered by extractors is passed through the Indexing Pipeline and then made available to Voyager’s search capabilities. You can create custom extractors to open a particular file type or input stream and gather the initial information that is passed through the indexing pipeline. Using Python, you can create a wide variety of extractor types tailored to the specific data types in your system.
To create a custom extractor using Python, you need to:
Create the Python file
Configure a format to use the new extractor
Test the new extractor
Creating a Python file
To create the Python file for your custom extractor:
Create a Python file in the vgextractors folder located in Voyager’s install location (e.g. C:\Voyager\server_1.9.7.3360\app\py\extractors\vgextractors)
Copy and paste the sample code below
import os
import json
from _vgdexfield import VgDexField


class JSONExtractor(object):

    @staticmethod
    def extractor():
        return "json"

    def get_info(self):
        return {'name': JSONExtractor.extractor(),
                'description': 'Extract a JSON file using Python',
                'formats': [{'name': 'text', 'mime': 'application/json',
                             'priority': 10}]}

    def extract(self, infile, job):
        """
        Main function to generate properties of a JSON file (.json).
        The JSON is loaded as a dictionary and each key is set as a field.
        :param infile: The JSON file
        :param job: A job object (use to set fields)
        """
        # Set some of the most basic properties.
        json_name = os.path.splitext(os.path.basename(infile))[0]
        job.set_field(VgDexField.NAME, json_name)
        job.set_field(VgDexField.PATH, infile)
        job.set_field(VgDexField.FILE_EXTENSION, 'json')
        job.set_field(VgDexField.FORMAT, 'application/json')
        job.set_field(VgDexField.FORMAT_CATEGORY, 'JSON')

        # Load file into dictionary.
        with open(infile, 'rb') as json_file:
            json_dict = json.load(json_file)

        for key, value in json_dict.iteritems():
            if isinstance(value, str) or isinstance(value, unicode):
                job.set_field('fs_{0}'.format(key), value)
            elif isinstance(value, int):
                job.set_field('fl_{0}'.format(key), value)
            elif isinstance(value, float):
                job.set_field('fu_{0}'.format(key), value)
            else:
                job.set_field('meta_{0}'.format(key), value)

        # Finally, read the file and set the text field.
        job.set_field(VgDexField.TEXT, json.dumps(json_dict))
Save the file with a name ending in Extractor (e.g. JSONExtractor.py)
Notes
The name of each Python file must end with Extractor (e.g. JSONExtractor.py).
Each Python file must include a class with the same name as the file.
The class must have a static function named extractor which returns the extractor name.
Each class must have a function named get_info which will register the name, description and formats for which this extractor will be available. In the example above, this extractor will be available for JSON files (.json).
When Voyager starts up, the extractor is registered and should be listed at: http://localhost:8888/manage/#/discovery/extractor/
A MIME type must be provided. The MIME type for a format can be found by clicking a format on the formats page: http://localhost:8888/manage/#/discovery/formats/
The class must also include an extract function. This function contains the code for generating and setting the properties. The job object has a function to set fields and their values. The _vgdexfield module contains a class named VgDexField with a large set of Voyager fields that can be set.
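The rules in these notes can be checked before restarting Voyager. The sketch below is a hypothetical helper, not part of Voyager: check_extractor_class and its messages are invented for illustration, and it assumes that get_info returns a dictionary shaped like the one in the sample and that the class can be created without arguments. The JSONExtractor shown here is a trimmed copy of the sample class, so the block runs without the _vgdexfield module.

```python
def check_extractor_class(cls, module_name):
    """Return a list of rule violations; an empty list means every check passed.

    Hypothetical helper, not part of Voyager. module_name is the file name
    without the .py suffix.
    """
    problems = []
    if not module_name.endswith('Extractor'):
        problems.append('file name must end with Extractor')
    if cls.__name__ != module_name:
        problems.append('class name must match the file name')

    # extractor() has to be callable on the class itself, without an instance.
    try:
        if not cls.extractor():
            problems.append('extractor() must return the extractor name')
    except (AttributeError, TypeError):
        problems.append('missing static function extractor()')

    if not callable(getattr(cls, 'extract', None)):
        problems.append('missing extract function')

    # Assumption: get_info returns a dictionary, as it does in the sample.
    get_info = getattr(cls, 'get_info', None)
    info = cls().get_info() if callable(get_info) else None
    if not isinstance(info, dict):
        problems.append('get_info() must return a dictionary')
        return problems
    for key in ('name', 'description', 'formats'):
        if key not in info:
            problems.append('get_info() is missing {0}'.format(key))
    for fmt in info.get('formats', []):
        if not fmt.get('mime'):
            problems.append('every format needs a mime entry')
    return problems


class JSONExtractor(object):
    """Trimmed copy of the sample class; extract is left empty here."""

    @staticmethod
    def extractor():
        return 'json'

    def get_info(self):
        return {'name': JSONExtractor.extractor(),
                'description': 'Extract a JSON file using Python',
                'formats': [{'name': 'text', 'mime': 'application/json',
                             'priority': 10}]}

    def extract(self, infile, job):
        pass


print(check_extractor_class(JSONExtractor, 'JSONExtractor'))
print(check_extractor_class(JSONExtractor, 'json_reader'))
```

The first call returns an empty list because the class and file name follow every rule. The second reports the two naming problems, since json_reader neither ends with Extractor nor matches the class name.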
Configuring a Format to Use a Python Extractor
To configure a format to use your Python extractor:
Open the formats page by going to Manage Voyager > Discovery > Formats. Search for json. You should now see the Extractor listed as: json tika
Click the name JSON and ensure the py/json extractor is checked.
Testing the Python Extractor
To test your new custom Python extractor:
Add a location containing one or more JSON files.
Build/Rebuild the index for that location. This will use the Python extractor and set the fields defined in the script.
For example, indexing the following JSON file sets a field for each of its keys:
{
    "id": "home",
    "description": "Home address",
    "city": "Redlands",
    "street_number": 1,
    "street_name": "Main St.",
    "zip": "92373",
    "cross_streets": ["North", "South"]
}
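To preview which field names the extractor would set for a file like this without rebuilding an index, the prefix rule from the sample's extract function can be run against a stand-in job object. This is a minimal sketch under stated assumptions: RecordingJob and set_dynamic_fields are hypothetical names, RecordingJob only mimics the set_field call used in the sample and is not Voyager's job class, and the input is a few keys taken from the JSON above.

```python
import json

# The sample extractor checks str and unicode; unicode only exists on Python 2.
try:
    string_types = (str, unicode)  # Python 2
except NameError:
    string_types = (str,)          # Python 3


class RecordingJob(object):
    """Hypothetical stand-in for the job object; it only records fields."""

    def __init__(self):
        self.fields = {}

    def set_field(self, name, value):
        self.fields[name] = value


def set_dynamic_fields(json_dict, job):
    """Apply the sample extractor's prefix rule to every top-level key."""
    for key, value in json_dict.items():
        if isinstance(value, string_types):
            job.set_field('fs_{0}'.format(key), value)
        elif isinstance(value, int):
            job.set_field('fl_{0}'.format(key), value)
        elif isinstance(value, float):
            job.set_field('fu_{0}'.format(key), value)
        else:
            job.set_field('meta_{0}'.format(key), value)


# A few keys from the example file, enough to hit three of the four branches.
sample = json.loads('{"id": "home", "city": "Redlands", "street_number": 1, '
                    '"zip": "92373", "cross_streets": ["North", "South"]}')

job = RecordingJob()
set_dynamic_fields(sample, job)
for name in sorted(job.fields):
    print('{0} = {1!r}'.format(name, job.fields[name]))
```

With this input the recorded field names are fl_street_number, fs_city, fs_id, fs_zip and meta_cross_streets. The zip value is quoted in the JSON, so it is treated as a string. Because bool is a subclass of int in Python, a JSON true or false value would take the fl_ branch under this rule.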