Creating a Custom Python Pipeline Step in HQ

This tutorial demonstrates how to create a new pipeline step in HQ using Python. Pipelines provide functions that transform and enhance data as it is indexed. The sample step built here adds a new field to an entry and can also remove an existing field from it.

Before getting started, it is important to know how to set up and apply a pipeline to a repository in HQ. For more information, see Creating Pipelines.

Follow the steps below to create the Python pipeline step.

Before You Begin

Make sure the following are installed:

  • HQ

  • Python 2.7 or higher

  • A Python IDE (recommended)

Creating a New Pipeline Folder

Go to your HQ home directory:

  • On Windows systems, the default home location is under the AppData directory, typically C:\Users\<user>\AppData\Roaming\Voyager\hq

  • On Linux systems, the default home location is ~/.voyager/hq in the user's home directory

Then create the folder:

  • Open the py directory. The voyager directory there holds a Python module that includes the base class for Voyager pipeline steps

  • Create a folder named pipeline
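The folder can also be created from a script. The following is a minimal sketch that assumes the default home locations listed above. The helper names default_hq_home and ensure_pipeline_dir are hypothetical and are not part of HQ. If HQ was installed with a non-default home location, pass that path instead.

```python
import os
import sys


def default_hq_home(platform=sys.platform, env=os.environ):
    """Return the default HQ home location described in this tutorial.

    Assumption: HQ uses its default home directory on this machine.
    """
    if platform.startswith("win"):
        # Typically C:\Users\<user>\AppData\Roaming
        appdata = env.get(
            "APPDATA",
            os.path.join(os.path.expanduser("~"), "AppData", "Roaming"))
        return os.path.join(appdata, "Voyager", "hq")
    return os.path.join(os.path.expanduser("~"), ".voyager", "hq")


def ensure_pipeline_dir(hq_home):
    """Create <hq_home>/py/pipeline if it is missing and return its path."""
    path = os.path.join(hq_home, "py", "pipeline")
    if not os.path.isdir(path):
        os.makedirs(path)
    return path


# Example use (creates the folder under the default home location):
# print(ensure_pipeline_dir(default_hq_home()))
```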

Creating a New Python Script

  • Inside the pipeline directory, create a new Python script file named sample_pipeline.py.

  • Edit the Python script file and add the following code:

 

from voyager import PipelineStep


class SampleStep(PipelineStep):

    def __init__(self):
        super(SampleStep, self).__init__()

    def info(self):
        """
        Provides information about the pipeline step including name, title,
        description and parameters. Parameters are optional.

        :return: A JSON object/dictionary
        """
        return {
            "name": "sample",
            "title": "Sample Step",
            "description": "Sample pipeline step",
            "params": [{
                "type": "string",
                "name": "add",
                "title": "Add",
                "description": "Field to add"
            }, {
                "type": "string",
                "name": "remove",
                "title": "Remove",
                "description": "Field to remove"
            }]
        }

    def run(self, entry, config):
        """
        Runs the pipeline step. This method works by modifying the entry
        in place (adding fields, removing fields, etc.).

        :param entry: The entry being indexed.
        :param config: The pipeline step configuration.
        """
        # Both parameters are optional, so read them with get().
        add = config.get("add")
        if add:
            print("INFO adding")
            entry["fields"][add] = "foo"

        remove = config.get("remove")
        if remove:
            print("INFO removing")
            # A default of None avoids a KeyError when the field is absent.
            entry["fields"].pop(remove, None)


if __name__ == "__main__":
    PipelineStep.main(SampleStep())
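To see what the step does to an entry without starting HQ, the add and remove logic can be tried as a plain function. This is only a sketch: apply_step is a hypothetical stand-in that mirrors the run method and does not import the voyager module. It reads both parameters with get so that either one may be left out.

```python
def apply_step(entry, config):
    """Mimic the sample step: add one field, then remove another."""
    add = config.get("add")
    if add:
        # The sample step always writes the placeholder value "foo".
        entry["fields"][add] = "foo"

    remove = config.get("remove")
    if remove:
        # Ignore the request if the field is not present.
        entry["fields"].pop(remove, None)
    return entry


# A cut-down entry with two fields.
entry = {"fields": {"name": "Vanuatu", "fs_STATUS": "UNMemberState"}}
apply_step(entry, {"add": "TestField", "remove": "fs_STATUS"})
print(sorted(entry["fields"]))
```

After the call, the entry holds the new TestField and no longer holds fs_STATUS.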

Creating the Pipeline

To create the pipeline:

  • In HQ, select Pipelines in the Indexing section

  • Click Create Pipeline

  • For the purposes of this tutorial, select Post Scan

  • Click Add Steps

  • Select Sample Step

  • Enter a Name and Description for the pipeline. In the step's Add parameter, enter the name of the field to add, for example TestField

  • Click Save and then return to the pipelines page to confirm that the pipeline was created

Testing the Pipeline

  • To test the pipeline, add it to a repository either by editing an existing repository or creating a new one

  • On the repository's pipeline tab, specify the sample pipeline you created

  • Index the repository

  • After indexing is complete, view the results in Navigo and check that the field TestField exists. The sample step sets its value to foo

Learning More

Take some time to examine the Python code and read its docstrings and comments. The entry passed to the run function is a Python dictionary whose fields key holds the entry's fields. An entry looks like this:

{
    'fields': {
        'meta_table_name': 'world_countries.csv',
        'name': 'Vanuatu',
        'repository': 'r16524da57d1',
        'format': 'text/csv-record',
        'format_category': 'Office',
        'fs_SQMI': '3265.07',
        'fs_FIPS_CNTRY': 'NH',
        'fs_STATUS': 'UNMemberState',
        'fs_POP2005': '205754',
        'format_type': 'Record'
    }
}
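Because the entry is an ordinary dictionary, a step can read existing fields and derive new ones from them. The sketch below computes a population density from the fs_POP2005 and fs_SQMI fields of the sample entry. The function add_density and the field name fs_POP_DENSITY are hypothetical, and storing the result as a string is an assumption based on the string values in the sample entry.

```python
def add_density(entry):
    """Add a derived density field when the source fields are usable."""
    fields = entry["fields"]
    try:
        # Field values in the sample entry are strings, so convert them.
        population = float(fields["fs_POP2005"])
        area = float(fields["fs_SQMI"])
    except (KeyError, ValueError):
        # Leave the entry untouched if a field is missing or not numeric.
        return entry
    if area > 0:
        # Hypothetical field name, stored as text with two decimals.
        fields["fs_POP_DENSITY"] = "%.2f" % (population / area)
    return entry


entry = {
    "fields": {
        "name": "Vanuatu",
        "fs_SQMI": "3265.07",
        "fs_POP2005": "205754",
    }
}
add_density(entry)
```

The same body could be placed in a step's run method, using the entry argument that HQ passes in.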

The config argument passed to the run function is the pipeline step configuration; it includes the step's parameters and their values. After creating and saving a pipeline, you can view the configuration by opening the pipelines.json file located in the config directory in your HQ home location.
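One way to inspect the saved configuration is to load pipelines.json and pretty-print it. This sketch assumes only that the file is valid JSON in the config directory of the HQ home location. It makes no assumption about the layout of the file, and the helper names load_pipelines_config and pretty are hypothetical.

```python
import json
import os


def load_pipelines_config(hq_home):
    """Parse <hq_home>/config/pipelines.json and return the result."""
    path = os.path.join(hq_home, "config", "pipelines.json")
    with open(path) as handle:
        return json.load(handle)


def pretty(config):
    """Return the parsed configuration as indented, key-sorted JSON text."""
    return json.dumps(config, indent=2, sort_keys=True)


# Example use, with hq_home set to your HQ home location:
# print(pretty(load_pipelines_config(hq_home)))
```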