Using the new Web Connector
Voyager HQ now includes a Web Connector that allows crawling of specific websites. This connector uses Scrapy, an open-source Python framework for gathering information from websites.
Creating a Web Connector Repository
To create a Repository using the Web Connector:
Go to the Repositories list in HQ
Click Add New
Click Web
Click Next
In the Web Type drop-down menu, select Web Connector
Entering Connection Information
On the Enter Connection Information page:
Enter a Name for the Repository
Enter the Start URL for the Connection - this is where the Web Connector will begin crawling (you can add multiple URLs)
Choose Crawl All Links to have the connector follow every link it finds. If you do not want the connector to follow all links, leave this blank and specify which links to follow in the Advanced configuration section.
Click Next to continue, or click Advanced to see the advanced options
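Conceptually, the Start URL and Crawl All Links options describe a simple breadth-first crawl. The sketch below illustrates that behavior with a tiny in-memory "site" (the real connector uses Scrapy and fetches pages over HTTP; the URLs and page contents here are stand-ins, not the connector's actual code):

```python
# Conceptual sketch of "start URL + Crawl All Links" using only the
# standard library. Real crawls fetch pages over HTTP via Scrapy.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href"]

# Fake site: URL -> HTML (illustrative stand-in for HTTP responses)
PAGES = {
    "https://example.com/": "<a href='https://example.com/a'>A</a>",
    "https://example.com/a": "<a href='https://example.com/'>home</a>",
}

def crawl(start_urls):
    seen, queue = set(), list(start_urls)
    while queue:
        url = queue.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)                  # "index" the page
        parser = LinkExtractor()
        parser.feed(PAGES[url])
        queue.extend(parser.links)     # Crawl All Links: follow everything
    return seen
```

Starting from `https://example.com/`, the crawl discovers and visits both pages, then stops because every link leads to an already-seen URL.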
Â
Advanced Configuration Options
The Advanced options allow fine-tuning of the web connector's behavior and settings.
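Because the connector is built on Scrapy, several of the options that follow correspond to standard Scrapy settings. The snippet below shows one plausible correspondence for two of them; `CONCURRENT_REQUESTS` and `DEPTH_LIMIT` are real Scrapy setting names, but how the connector wires the UI options through to them is an assumption:

```python
# Hypothetical mapping of two Advanced options onto Scrapy's own
# settings. CONCURRENT_REQUESTS and DEPTH_LIMIT are genuine Scrapy
# setting names; the mapping itself is an assumption for illustration.
connector_options = {
    "Concurrent Requests": 10,  # connector default
    "Depth Limit": 5,           # connector default; 0 means unlimited
}

scrapy_settings = {
    "CONCURRENT_REQUESTS": connector_options["Concurrent Requests"],
    # In Scrapy, DEPTH_LIMIT = 0 also means "no limit", matching the
    # connector's convention for Depth Limit.
    "DEPTH_LIMIT": connector_options["Depth Limit"],
}
```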
Links To Follow
Add XPath selectors for links to follow when Crawl All Links is false.

Concurrent Requests
The number of requests to be made concurrently (default is 10). Setting this to 0 disables request throttling.

Depth Limit
The depth of links to crawl from the start URLs (default is 5). Set this to 0 for unlimited depth.

Files To Index
Specify the file extensions that will be indexed, if present.

Index Files As Linked Data
If set to true, any files matching Files To Index will be added as links to the indexed page (viewable in the Relationships tab of a record’s Detail View in Navigo). If set to false, these files will be added as individual documents.

Allowed Domains
Only links within the specified domains will be followed and indexed.

Crawler Settings
You can find information about additional settings here.

Field Mapping
Specify fields and their corresponding XPath selectors. Any content found with the selector will be added to the specified field.
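To make the Field Mapping idea concrete, the sketch below pairs field names with XPath selectors and extracts matching content into a record. The field names, selectors, and HTML are invented for illustration, and it uses Python's standard-library `xml.etree.ElementTree` (which supports a limited XPath subset) rather than the full XPath engine Scrapy uses:

```python
# Illustrative Field Mapping: field name -> XPath selector.
# Field names, selectors, and the sample page are hypothetical.
import xml.etree.ElementTree as ET

field_mapping = {
    "page_heading": ".//h1",
    "author": ".//*[@id='byline']",
}

html = "<html><body><h1>Hello</h1><p id='byline'>Jane</p></body></html>"
root = ET.fromstring(html)

record = {}
for field, xpath in field_mapping.items():
    el = root.find(xpath)
    if el is not None:
        # Content found with the selector goes into the named field
        record[field] = el.text

# record == {"page_heading": "Hello", "author": "Jane"}
```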