Improvements to Voyager's Spatial Indexing

Voyager Search version 1.9.6 brings some notable improvements with regard to support for spatial types. Prior to this release, Voyager supported points and bounding boxes. It is now possible to store spatial objects with any geometry, including polygons and geometry collections.

These improvements allow more accurate represention of spatial objects in the search index. For example, consider the United States of America:

 

Using only a bounding box, the best we can do to represent this geometry is the following:

 

Clearly, this is not an accurate representation as the original shape is less than 25% the area of the entire bounding box. With Voyager 1.9.6, it’s possible to achieve a much more accurate representation that leads to the ability to do much more accurate spatial search.

Consider a spatial search for finding all cities within a particular state using California as an example. The following shows California and neighbouring state Nevada.  

 

 

With only a bounding box, the best spatial query for California is shown below

 

As in the previous example, this spatial search isn’t very accurate as it matches cities outside of California. However with the ability to query with the actual geometry of California, an exact spatial match can be achieved:

 

Enabling the New Spatial Support

In Voyager 1.9.6, enabling the new spatial support depends on whether it is a new install or an upgrade. New installs have the new spatial support enabled by default, so users installing a fresh version of Voyager need not do anything to start taking advantage of accurate spatial search.

Users upgrading from a previous version of Voyager must explicitly enable the feature and also rebuild the search index. The spatial settings can be configured from the Voyager management interface at Manage Voyager > Index > Spatial Settings.

 

From this page the new spatial support can be configured with the following settings:

  • Store bounding box only disables the new spatial support, causing Voyager to behave as it did pre-1.9.6 with respect to spatial indexing. This is the default setting for upgrades.

  • Store both geometry and bounding box enables the storage of both bbox and full geometry as described above. This is the default setting for new 1.9.6 installations.

Indexing Geometries

Once the spatial indexing settings have been configured to store full geometry data, the next step is to index data from a location that is capable of reading full geometries. At this time, the following location types support reading geometry data:

  • Database tables (the Tables tab on the Locations page)

  • Vector data sources (the Data tab on the Locations page)

Considerations

Users hoping to take advantage of the new spatial features of Voyager 1.9.6 should be aware of the following factors:

Index Size

An important aspect of storing full geometry data is the impact on the size of the index when storing full geometries. Depending on the complexity of a geometry (overall size, coordinate density, etc…), storing it can often take up considerably more space than storing just the four coordinates it takes to represent a bounding box.

To help mitigate this concern, the new spatial indexing support comes with the option to generalize (simplify) geometries before storing them in the index. The process of generalization takes an initial geometry and removes coordinates from it to create a slightly less accurate, but much smaller version. The following diagram illustrates the process.

 

The object on the left shows the original version of the polygon, which takes up about 1.6 kilobytes when stored in the index. The object on the right is the same shape generalized down to 24 points and 0.40 kilobytes of storage.

The following diagram shows the two geometries overlaid on top of one another:

 

As is shown by the diagram, we are able to maintain the overall shape of the object with a representation that takes about 25% of the size to store. This illustrates the importance of generalization in achieving accurate spatial search while keeping total index size manageable.

Generalizing is a tradeoff between accuracy and space. The above example can be taken even further with a higher level of generalization:

 

The storage cost is lower still but at the same time the shape is starting to distort significantly from the original. Taking generalization to the extreme eventually degrades into the bounding box of the original shape:

 

In the end, the right level of generalization is a trial-and-error process to determine the sweet spot between accuracy and storage cost. It also depends on the use case. If accuracy is paramount and you are willing to accept a larger index to achieve higher spatial accuracy, then a lower level of generalization can be utilized. Conversely, if accuracy is a secondary concern to index size, then a higher level of generalization can be utilized.

The generalization setting is set on a location-by-location basis. On the Settings page for a location, there is a new tab called Geometry.

 

The generalization value ranges from 0 to 1. A value of 0 specifies no generalization and results in the original geometry with full detail. A value of 1 specifies full generalization and results in the bounding box of the original geometry. Values in between are scaled into that spectrum moving from gradually less detail as values move from 0 to 1.

Spatial Query Performance

In addition to increased storage requirements, full geometries also come with an increased cost for computing spatial calculations.Spatial operations such as intersection are much more computationally intensive for arbitrary geometries than for bounding boxes. To mitigate this, Voyager 1.9.6 comes with a setting to enable the use of geometry during spatial queries.

 

The Use geometry for spatial queries option provides a tradeoff between query accuracy and query performance. Enabling the setting will result in more accurate spatial search at the cost of increased computational resources.

Understanding how this option works requires understanding some of the theory behind spatial indexing in Voyager.

Geohash Indexing

Voyager 1.9.6 introduces a geohash based index for spatially indexing geometries. The geohash index is made up of multiple levels. At level 0 the index is composed of a single grid cell. Each subsequent level is created by dividing into 32 sub grid cells. Therefore at level n the index represents a grid of 32n cells.

The following diagram illustrates the structure of the geohash index:

 

With this scheme geometries are indexed by computing which grid cells intersect the geometry. The following diagram illustrates the indexing of the California state polygon at level 3 of the index:

 

As we move to higher levels indexing resolution increases. The following diagram shows the same polygon indexed at level 4:

 

With this indexing spatial queries work as follows:

  1. Compute the grid cells for the search polygon

  2. Calculate spatial objects that intersect the grid cells

  3. Optionally refine the results with a full geometry intersection

Consider again the example of finding all cities in the state of California. Following the steps above the resulting query looks like the following:

 

As can be seen the query matches cities outside of the search area. There are two methods of dealing with this. The first is to increase the grid resolution as shown in the following diagram:

 

In increasing the grid resolution by using a higher grid level for the query shape the point has been filtered out of the result set. Unfortunately this approach has a couple of downfalls. The first is that it is not always guaranteed to work. Since the grid is an approximation for the query shape there will always be cases where false positives arrives. The second is that increased resolution comes with a cost as it requires is to store more information in the index per shape, and it requires more computational resources to compute intersecting grid cells.

An alternative to increasing the grid resolution is to post filter results with a full spatial intersection. In this case the geohash grid serves as a proxy whose job is to quickly filter the candidate result set to things that are close. Then the results are refined with a full spatial intersection. The downside of this approach is that computing full intersection is an expensive operation so we really only want to use it in cases where we have a small number of results to test or else it can hinder query performance. For this reason it is a configurable option in the Voyager spatial settings as described in the previous section.