HQ - Extractors

HQ can index thousands of different file formats. Each Format has specific Connectors and Extractors, corresponding to the two phases of Indexing: Scanning and Extraction. An Extractor can often operate on multiple related file formats, and a particular type of file may have more than one associated Extractor.

About Extractors, Formats and Indexing

To put Formats and Extractors in context, it is helpful to describe where they fit in the two phases of Indexing - Scanning and Extraction.

  • Scanning is done by Connectors, which crawl the repository and gather easily accessible information about each document, such as file name, size, and last modification date. Scanning does not open any files. All readable data types have Connectors, and the Connector framework can easily be extended to support new formats, for example the new Web Connector.

  • Extraction is done by Extractors, which open data files and extract information to add to the data gathered during the scanning phase. This might involve reading text from a Word document or reading metadata tags from an image file. HQ chooses the Extractor based on the format (MIME type) of each file or document.
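
The two phases described above can be sketched in a few lines of Python. This is only an illustration of the idea, not HQ's implementation: the registry contents, the function names, and the extractor names other than tika are assumptions made for the example.

```python
import mimetypes
import os
from pathlib import Path

# Hypothetical registry mapping a format (MIME type) to extractor names.
# A format may list more than one extractor, and one extractor may serve
# several formats. These entries are illustrative, not HQ's configuration.
EXTRACTORS_BY_FORMAT = {
    "text/plain": ["text"],
    "application/msword": ["tika"],
    "image/jpeg": ["image_metadata", "tika"],
}


def scan(path):
    """Scanning phase: gather file-system metadata without opening the file."""
    info = os.stat(path)
    mime, _ = mimetypes.guess_type(str(path))  # guessed from the file name only
    return {
        "name": Path(path).name,
        "size": info.st_size,
        "modified": info.st_mtime,
        "format": mime,
    }


def choose_extractors(record):
    """Pick the extractors registered for the record's format."""
    return EXTRACTORS_BY_FORMAT.get(record["format"], [])


def extract(path, record):
    """Extraction phase: open the file and add to the scanned record."""
    enriched = dict(record)
    if "text" in choose_extractors(record):
        enriched["text"] = Path(path).read_text(encoding="utf-8")
    return enriched
```

Here scan() never reads file contents, while extract() does, which mirrors the split between Connectors and Extractors. Keeping the registry keyed by MIME type lets one extractor name appear under several formats.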

Viewing the Extractor List

To view the available Extractors:

  1. Open HQ

  2. Select Indexing

  3. Click Extractors

 

 

The window displays the list of Extractors and shows, for each one, whether it is enabled and the number of File Formats that use it.

 

Viewing Extractor Information

  • Click an Extractor in the list, for example tika, to view more information about it.

 

Searching Extractors

  • To search the Extractor list, enter a keyword in the search box.