The Flex Index provides a highly scalable solution that makes it possible to create Indexes with virtually no size limits. It replaces the existing, physical Index with a distributed logical Index. Before going into detail about the Flex Index, it’s useful to understand what Voyager's Solr index is and how it functions.
Overview of Voyager's Index
The Index is a catalog of all of the data records found by Voyager after it scans the specified locations. Much like a card in a card catalog, each record is only a reference to the actual data. This means data doesn’t need to be moved or consolidated anywhere – it stays where it is. Voyager can read over 1,800 different data types and more can be added as necessary.
What is Indexing?
Indexing is the process of creating the searchable index. It consists of two phases, Scanning and Extraction:
Scanning is done by Connectors, which crawl the repository and gather easily accessible information about each document, such as file name, size, and last modification date. This phase does not involve opening any files.
Extraction is done by Extractors, which open the data files found during the scanning phase and extract additional information from them. This might involve reading text from a Word document or reading metadata tags from an image file. Voyager chooses the Extractor based on the format (MIME type) of the data. An Extractor can often operate on multiple related file types, and a particular type of file may have more than one associated Extractor.
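The two phases can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the suffix-to-MIME table, the extractor names, and the functions scan and choose_extractors are hypothetical and do not reflect Voyager's actual Connectors or Extractors.

```python
from pathlib import PurePath

# Hypothetical lookup tables; a real deployment recognizes far more formats.
MIME_BY_SUFFIX = {
    ".doc": "application/msword",
    ".jpg": "image/jpeg",
    ".png": "image/png",
}
EXTRACTORS_BY_MIME = {
    "application/msword": ["text_extractor"],
    # One file type may have more than one associated extractor.
    "image/jpeg": ["image_metadata_extractor", "exif_extractor"],
    # One extractor may serve several related file types.
    "image/png": ["image_metadata_extractor"],
}


def scan(path, size, modified):
    """Scanning phase: record cheap facts about a file without opening it."""
    return {"path": path, "name": PurePath(path).name,
            "size": size, "modified": modified}


def choose_extractors(record):
    """Extraction phase, first step: pick extractors from the file's MIME type."""
    mime = MIME_BY_SUFFIX.get(PurePath(record["name"]).suffix.lower())
    return EXTRACTORS_BY_MIME.get(mime, [])


record = scan("/data/photo.jpg", 204800, "2020-01-01")
print(choose_extractors(record))
```

A file whose type is not in the table simply gets an empty list of extractors, so only the information gathered during scanning would be indexed for it.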
Where is the Index Stored?
Typically, the standard Voyager Index resides on the same server as the Voyager software for the fastest response times. Each standard Index is associated with a single Voyager instance. With the Flex Index, described below, the Index is instead stored in pieces on different servers.
Scaling the Index
The major limitation of the standard Index is its inability to scale due to its dependence on a single machine. The more-or-less fixed hardware and software resources on that machine dictate the upper limits of an Index. A partial solution to this problem involves Federation. Federated searches link multiple, separate Voyager indexes that can be viewed within another Voyager instance. The end result is a single point of search on a top-level federated index which draws information from indexes of multiple satellite Voyager instances.
Although Federation does allow for larger indexes, it requires more configuration and deployment overhead that may make it inappropriate for some deployments. Each of the index pieces is still tied to a single machine with its own specific limitations, which may not provide a sufficient level of fault-tolerance.
A Fully Distributed Index – the Flex Index
How it Works
The Flex Index is no longer tied to a single server, so the restrictions imposed by that dependence are removed. Larger Indexes can be handled by adding more Shards (the logical pieces of the Index, described below), making the Flex Index capable of virtually unlimited scope. Unlike a single Index, the Flex Index can be tuned to the specifics of a particular deployment and data set. When the Index is distributed among multiple servers, overall performance for indexing and search may be greatly increased, because more machines working in parallel can translate to much faster indexing operations. More importantly, Solr always looks for the most efficient way to distribute a query to a Shard and its Replicas.
The Flex Index (based on SolrCloud) distributes the index into multiple logical pieces called Shards.
Each Shard has one or more Replicas located on different servers, so each of the logical partitions of the Index always has one or more copies. The initial numbers of Shards and Replicas are determined by the size of the Index.
If there is more than one Replica, one of them is designated as the Leader. Leaders play a role in the indexing process and are automatically elected. If a Leader goes down, one of the other Replicas is automatically elected as the new Leader.
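The relationship between a Shard, its Replicas, and the Leader can be modeled in a few lines of Python. This is a minimal sketch, not SolrCloud's election protocol: the Replica and Shard classes and the "first live Replica wins" rule are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    server: str
    live: bool = True


@dataclass
class Shard:
    name: str
    replicas: List[Replica]
    leader: Optional[str] = None

    def elect_leader(self):
        """Keep the current Leader while it is live; otherwise pick a new one."""
        current = next((r for r in self.replicas if r.name == self.leader), None)
        if current is not None and current.live:
            return self.leader
        # Stand-in rule for illustration: the first live Replica becomes Leader.
        live = [r for r in self.replicas if r.live]
        self.leader = live[0].name if live else None
        return self.leader


shard = Shard("shard1", [Replica("1a", "server-a"),
                         Replica("1b", "server-b"),
                         Replica("1c", "server-c")])
shard.elect_leader()              # 1a becomes the Leader
shard.replicas[0].live = False    # the server hosting 1a goes down
print(shard.elect_leader())       # a surviving Replica takes over
```

The point of the sketch is that leadership is a role held by one Replica at a time, not a separate kind of server, which is why a replacement can be chosen automatically.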
Fault Tolerance
Replicas are critical to some of the most important aspects of the distributed Flex Index: redundancy and fault tolerance. For example, when one Replica is unavailable because a server goes down, the other Replicas can continue to participate in indexing and search. In the example below, Replica 1a is offline but Replicas 1b and 1c are still available.
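One way to reason about fault tolerance is to ask whether every Shard still has at least one reachable Replica after some servers fail. The sketch below assumes a hypothetical two-Shard layout with made-up server names; it is an illustration of the idea, not a Voyager or Solr API.

```python
# Hypothetical layout: Shard name -> {Replica name: server hosting it}.
LAYOUT = {
    "shard1": {"1a": "server-a", "1b": "server-b", "1c": "server-c"},
    "shard2": {"2a": "server-b", "2b": "server-c", "2c": "server-a"},
}


def live_replicas(layout, down_servers):
    """Return, per Shard, the Replicas whose servers are still running."""
    return {
        shard: sorted(name for name, server in replicas.items()
                      if server not in down_servers)
        for shard, replicas in layout.items()
    }


def fully_available(layout, down_servers):
    """The Index stays complete only if every Shard keeps a live Replica."""
    return all(live_replicas(layout, down_servers).values())


print(live_replicas(LAYOUT, {"server-a"}))
print(fully_available(LAYOUT, {"server-a", "server-b"}))
```

Because the Replicas of each Shard sit on different servers in this layout, the Index remains complete until all three servers are down at the same time.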
Keeping Track of the Pieces - ZooKeeper
SolrCloud and the Flex Index require a service called ZooKeeper to monitor and keep track of all of the pieces of an Index (Shards and their Replicas). Part of what ZooKeeper does is determine which servers are up and running at any given time. It also takes care of Leader election, usually when an individual server is down or unreachable. When Solr needs information about the distribution and states of Shards and Replicas, it consults ZooKeeper. That way, Solr can manage indexing and search without having to keep track of all of the different parts of the Index itself, because that is ZooKeeper’s job.
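The kind of question Solr asks of this shared state can be sketched as a lookup over a small dictionary. The snapshot below is made up and only loosely modeled on the idea of a per-Shard record of Replicas, their servers, and their states; the field names and the functions leader_of and active_nodes are assumptions, not the actual format or API.

```python
# Made-up snapshot of cluster state; a real state record carries more fields.
CLUSTER_STATE = {
    "shard1": {
        "replica_1a": {"node": "server-a", "state": "down", "leader": False},
        "replica_1b": {"node": "server-b", "state": "active", "leader": True},
        "replica_1c": {"node": "server-c", "state": "active", "leader": False},
    },
    "shard2": {
        "replica_2a": {"node": "server-b", "state": "active", "leader": True},
        "replica_2b": {"node": "server-c", "state": "active", "leader": False},
    },
}


def leader_of(state, shard):
    """Return the name of the active Leader for a Shard, or None."""
    for name, info in state[shard].items():
        if info["leader"] and info["state"] == "active":
            return name
    return None


def active_nodes(state):
    """Return the servers that host at least one active Replica."""
    return sorted({info["node"]
                   for replicas in state.values()
                   for info in replicas.values()
                   if info["state"] == "active"})


print(leader_of(CLUSTER_STATE, "shard1"))
print(active_nodes(CLUSTER_STATE))
```

Keeping this record in one place means every node answers such questions from the same view of the cluster instead of tracking its peers individually.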
Indexing the Flex Index
During indexing, Solr sends a document to the current Leader of a Shard for indexing, and the update is then distributed to all of that Shard's Replicas. The Leader’s job is to make sure that all Replicas are up to date with the same information stored in the Leader. Index information is viewable in Voyager Navigo, and there is no difference between the way a Flex Index and the standard, machine-dependent Index are displayed.
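The indexing flow described above can be sketched in Python as follows. This is an in-memory illustration under stated assumptions: the ShardStore class is hypothetical, and choosing a Shard with a CRC32 hash of the document id is a stand-in for whatever routing the real system applies.

```python
import zlib


class ShardStore:
    """One Shard: a Leader plus other Replicas, each an in-memory dict."""

    def __init__(self, replica_names):
        self.leader = replica_names[0]
        self.replicas = {name: {} for name in replica_names}

    def index(self, doc):
        # The Leader applies the update first...
        self.replicas[self.leader][doc["id"]] = doc
        # ...then the update is copied to every other Replica of the Shard.
        for name, store in self.replicas.items():
            if name != self.leader:
                store[doc["id"]] = doc


def route(doc_id, shard_count):
    """Pick a Shard with a stable hash of the id (illustrative routing only)."""
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


def index_document(shards, doc):
    """Send the document to the Shard that owns it and return that Shard."""
    shard = shards[route(doc["id"], len(shards))]
    shard.index(doc)
    return shard


shards = [ShardStore(["1a", "1b"]), ShardStore(["2a", "2b"])]
owner = index_document(shards, {"id": "doc-1", "title": "County roads"})
print(sorted(owner.replicas))  # the Replicas that now hold doc-1
```

Each document lands in exactly one Shard, and every Replica of that Shard ends up with the same copy, which is what allows any of them to answer a later search.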
Searching the Flex Index
To the user, the Flex Index is invisible, and the search experience is virtually unchanged from previous versions. When a user initiates a search query, Solr uses the information stored in ZooKeeper to decide which servers and Replicas should handle that query. When a Solr node receives a search request, the request is routed to a Replica of a Shard that is part of the collection being searched. The chosen Replica acts as an aggregator: it creates internal requests to randomly chosen Replicas of every Shard in the collection, coordinates the responses, issues any subsequent internal requests as needed (for example, to refine facet values or request additional stored fields), and constructs the final response for the client, which in this case is Voyager Navigo.
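The aggregator's fan-out and merge can be sketched as below. This is a simplified illustration: the data layout, the search function, and ranking by term count are all assumptions, and the follow-up requests mentioned above (facet refinement, stored fields) are left out.

```python
import random


def search(shards, term, rows=10, rng=random):
    """Query one randomly chosen Replica per Shard and merge the hits."""
    term = term.lower()
    hits = []
    for replicas in shards.values():
        # Any Replica of a Shard holds the same documents, so pick one at random.
        replica = rng.choice(list(replicas.values()))
        hits.extend(doc for doc in replica if term in doc["text"].lower())
    # Merge step: rank hits from all Shards together (toy score = term count).
    hits.sort(key=lambda doc: doc["text"].lower().count(term), reverse=True)
    return [doc["id"] for doc in hits[:rows]]


# Hypothetical collection: Shard -> Replica -> list of documents.
shard1_docs = [{"id": "a", "text": "roads and more roads"},
               {"id": "b", "text": "rivers"}]
shard2_docs = [{"id": "c", "text": "Roads"}]
collection = {
    "shard1": {"1a": shard1_docs, "1b": shard1_docs},
    "shard2": {"2a": shard2_docs, "2b": shard2_docs},
}
print(search(collection, "roads"))
```

Because every Shard contributes to the merged result, the client sees one ranked list and does not need to know how many Shards or Replicas were involved.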
The Future of Indexing
The capabilities that the Flex Index offers uncouple the Index from a single, dedicated server. The logical distribution of subsets of the Index, independent of individual machines and servers, will make possible larger and larger data sets and the resulting Indexes. The cloud-based nature of the Flex Index opens up even more opportunities for increases in performance, efficiency, and reliability.