Using the new Web Connector

Voyager HQ now includes a Web Connector that allows crawling of specific websites. This connector uses Scrapy, an open-source Python framework for gathering information from websites. 

Creating a Web Connector Repository

To create a Repository using the Web Connector:

  1. Go to the Repositories list in HQ

  2. Click Add New

  3. Click Web

  4. Click Next

  5. In the Web Type drop-down menu, select Web Connector

Entering Connection Information

On the Enter Connection Information page:

  1. Enter a Name for the Repository

  2. Enter the Start URL for the connection - this is where the Web Connector begins crawling (you can add multiple URLs)

  3. Choose Crawl All Links to have the connector follow every link it finds (to follow only some links, leave this blank and specify the links to follow in the Advanced configuration section)

  4. Click Next to continue, or click Advanced to see the advanced options
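
To illustrate what a Start URL and Crawl All Links mean in practice, the following minimal Python sketch walks a small in-memory "site" starting from the start URLs and follows every link it finds, visiting each page once. It is not the connector's actual code: the TOY_SITE data, the crawl function, and the example.com URLs are hypothetical and exist only for this illustration.

```python
from collections import deque

# Hypothetical in-memory "site": each page URL maps to the links found on it.
TOY_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


def crawl(start_urls, links_by_page):
    """Return pages in the order visited, following all links breadth-first."""
    seen = set()
    queue = deque()
    for url in start_urls:
        # Several start URLs are allowed; skip duplicates.
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        # "Crawl All Links": every link on the page is a candidate.
        for link in links_by_page.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl(["https://example.com/"], TOY_SITE)
print(visited)
```

Starting from the single start URL, the sketch reaches all four pages, and the link from page c back to the start page does not cause a second visit.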

 

Advanced Configuration Options

The Advanced options allow fine-tuning of the Web Connector's behavior and settings.

  • Links To Follow
    XPath selectors for the links to follow when Crawl All Links is false

  • Concurrent Requests
    The number of requests to make concurrently (default is 10) - setting to 0 will disable any request throttling

  • Depth Limit
    Depth of links to crawl from the start URLs (default is 5) - set to 0 for unlimited depth

  • Files To Index
    Specify the file extensions that will be indexed, if present

  • Index Files As Linked Data
    If set to true, any files matching Files To Index will be added as links to the indexed page (viewable in the relationships tab of a record’s Detail View in Navigo) - if set to false, these files will be added as individual documents

  • Allowed Domains
    Only links within the specified domains will be followed and indexed

  • Crawler Settings
    You can find information about additional settings here

  • Field Mapping
    Specify fields and their corresponding XPath selectors - any content found with the selector will be added to the specified field
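
As a rough illustration of how Depth Limit, Allowed Domains, and Files To Index interact, the sketch below expresses each option as a small Python check. It is not the connector's implementation. The function names are hypothetical, and several behaviors are assumptions made for the example: subdomains of an allowed domain are accepted, an empty Allowed Domains list means no restriction, and the start URLs are at depth 0.

```python
import posixpath
from urllib.parse import urlparse


def in_allowed_domains(url, allowed_domains):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    if not allowed_domains:
        return True  # assumption: an empty list means no restriction
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


def within_depth(depth, depth_limit):
    """True if a link at this depth may be crawled; 0 means unlimited depth."""
    return depth_limit == 0 or depth <= depth_limit


def is_file_to_index(url, extensions):
    """True if the URL path ends in one of the configured file extensions."""
    # Look only at the path so query strings such as "?dl=1" are ignored.
    ext = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
    wanted = {e.lower().lstrip(".") for e in extensions}
    return bool(ext) and ext in wanted


# Example: a link found three levels below a start URL.
link = "https://docs.example.com/files/report.PDF?dl=1"
follow = in_allowed_domains(link, ["example.com"]) and within_depth(3, 5)
index_as_file = is_file_to_index(link, ["pdf", ".docx"])
print(follow, index_as_file)
```

In this example the link passes both the domain and depth checks and also matches Files To Index. Whether it is then attached to the indexed page as a link or added as its own document depends on the Index Files As Linked Data setting.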