Spellcheck Pipeline Step
The Spellcheck pipeline step automatically corrects misspelled words in a field that contains either a single word or a longer text. This helps improve Natural Language Processing (NLP) and Optical Character Recognition (OCR) results.
The Spellcheck step can be added to a pipeline like any other pipeline step. Its main input is the Source Field parameter, which names the document field to be spellchecked. By default, the Source Field is overwritten with the corrected text. Alternatively, set the Destination Field parameter so that the corrected text is written to the Destination Field and the Source Field remains unchanged.
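The overwrite-versus-destination behaviour described above can be sketched in a few lines of Python. This is only an illustration of the field semantics, not the step's actual implementation: the function apply_spellcheck, the toy corrector, and the field names body and body_corrected are all hypothetical.

```python
from typing import Callable, Dict, Optional


def apply_spellcheck(
    doc: Dict[str, str],
    source_field: str,
    correct: Callable[[str], str],
    destination_field: Optional[str] = None,
) -> Dict[str, str]:
    """Return a copy of doc with the corrected text written to the target field."""
    result = dict(doc)
    # A blank Destination Field means the Source Field is overwritten.
    target = destination_field or source_field
    result[target] = correct(doc[source_field])
    return result


# Toy corrector used only for this illustration (hypothetical, not Solr).
FIXES = {"recieve": "receive"}


def toy_correct(text: str) -> str:
    return " ".join(FIXES.get(word, word) for word in text.split())


doc = {"body": "please recieve this"}
overwritten = apply_spellcheck(doc, "body", toy_correct)
kept = apply_spellcheck(doc, "body", toy_correct, destination_field="body_corrected")
print(overwritten)
print(kept)
```

In the first call the body field holds the corrected text; in the second call body is left as it was and the corrected text appears in body_corrected.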
Advanced users can fully control what Spellcheck checks and how it operates. The step relies on the Solr spellcheck functionality, which can derive a dictionary from an arbitrary index and one of its fields (see the "Spell Checking" section of the Apache Solr Reference Guide 6.6).
The step exposes the following advanced parameters, most of which are Solr-related:
Destination Field — Field to store the spellchecked result in; if left blank, the source field will be overwritten
Solr Index — Solr index containing the dictionary to spellcheck against
Request Handler — Solr request handler that handles the spellcheck query
Dictionary List — Solr-specific parameter naming the dictionaries to use, since more than one dictionary may be defined for an index
Accuracy — Accuracy threshold (value between 0 and 1) to be used by the spellchecking engine
External URL — External URL of Solr spellchecker to use
Spellcheck Parameters — Any additional parameters accepted by the Solr API, as described in the "Spell Checking" section of the Apache Solr Reference Guide 6.6
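As a rough illustration of how the parameters above could translate into a Solr request, the following Python sketch assembles a spellcheck query URL. The query parameters spellcheck, spellcheck.q, spellcheck.dictionary, spellcheck.accuracy and spellcheck.collate are standard Solr spellcheck parameters; how the step itself maps its settings onto them is an assumption, as are the base URL, the /spell handler name, the dictionary name default, and the helper build_spellcheck_url.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import urlencode


def build_spellcheck_url(
    base_url: str,
    index: str,
    handler: str,
    text: str,
    dictionaries: Iterable[str] = (),
    accuracy: Optional[float] = None,
    extra_params: Optional[Dict[str, str]] = None,
) -> str:
    """Assemble a Solr spellcheck request URL from step-like settings."""
    params = [
        ("q", text),
        ("spellcheck", "true"),
        ("spellcheck.q", text),
        ("wt", "json"),
    ]
    # Dictionary List: Solr accepts the parameter once per dictionary.
    for name in dictionaries:
        params.append(("spellcheck.dictionary", name))
    # Accuracy: threshold between 0 and 1.
    if accuracy is not None:
        params.append(("spellcheck.accuracy", str(accuracy)))
    # Spellcheck Parameters: any additional Solr parameters, passed through.
    params.extend((extra_params or {}).items())
    return f"{base_url.rstrip('/')}/{index}/{handler.strip('/')}?{urlencode(params)}"


url = build_spellcheck_url(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "Dictionary",                  # Solr Index
    "/spell",                      # Request Handler (assumed name)
    "helo wrld",
    dictionaries=["default"],      # assumed dictionary name
    accuracy=0.7,
    extra_params={"spellcheck.collate": "true"},
)
print(url)
```

Building the parameters as a list of pairs rather than a dict keeps repeated keys such as spellcheck.dictionary intact when several dictionaries are listed.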
To test and fine-tune the spellchecker, a Solr query can also be executed directly. There is a new Solr index named ‘Dictionary’ which contains English vocabulary. An example spellcheck query executed via the Solr admin UI using the Dictionary index is depicted below.
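When fine-tuning outside the admin UI, it can help to inspect the suggestions programmatically. The Python sketch below extracts them from a JSON response, assuming the default flat layout shown in the Solr 6.6 Reference Guide, where the suggestions list alternates between a misspelled token and its details. The sample response is hand-written for illustration and is not real output from the Dictionary index; parse_suggestions is a hypothetical helper.

```python
from typing import Any, Dict, List


def parse_suggestions(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each misspelled token to its suggested replacements."""
    flat = response.get("spellcheck", {}).get("suggestions", [])
    result: Dict[str, List[str]] = {}
    # The flat list alternates: token, details, token, details, ...
    for token, details in zip(flat[0::2], flat[1::2]):
        words = []
        for entry in details.get("suggestion", []):
            # With spellcheck.extendedResults=true each entry is an object
            # carrying "word" and "freq"; otherwise it is a plain string.
            words.append(entry["word"] if isinstance(entry, dict) else entry)
        result[token] = words
    return result


# Hand-written sample in the assumed response layout (not real Solr output).
sample = {
    "spellcheck": {
        "suggestions": [
            "helo",
            {"numFound": 1, "startOffset": 0, "endOffset": 4, "suggestion": ["hello"]},
            "wrld",
            {"numFound": 1, "startOffset": 5, "endOffset": 9, "suggestion": ["world"]},
        ],
        "correctlySpelled": False,
    }
}

suggestions = parse_suggestions(sample)
print(suggestions)
```

Comparing the parsed suggestions for different Accuracy values or dictionaries is one way to see how a parameter change affects the corrections before applying it in a pipeline.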