Support for alternative forms
All OCR formats supported by this plugin can encode alternative readings for a given word. These can either come from the OCR engine itself, in the form of other high-confidence readings for a given sequence of characters, or from a manual or semi-automatic OCR correction system.
Note
- For hOCR, use <span class="alternatives"><ins class="alt">...</ins><del class="alt">...</del></span> (see the hOCR specification)
- For ALTO, use <String …><ALTERNATIVE>...</ALTERNATIVE></String> (see AlternativeType in the ALTO schema)
- For MiniOCR, delimit alternative forms with ⇿ (U+21FF) (see the MiniOCR documentation)
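To make this more concrete, a word with a single alternative reading might look roughly like the following in each format; the word content and coordinates are made up for illustration:
<!-- hOCR: alternatives wrapped in <ins>/<del> elements inside the word -->
<span class="ocrx_word" title="bbox 108 151 305 188">
  <span class="alternatives">
    <ins class="alt">clustering</ins>
    <del class="alt">ehistering</del>
  </span>
</span>
<!-- ALTO: alternatives as ALTERNATIVE children of the String element -->
<String CONTENT="clustering" HPOS="108" VPOS="151" WIDTH="197" HEIGHT="37">
  <ALTERNATIVE>ehistering</ALTERNATIVE>
</String>
<!-- MiniOCR: alternatives delimited with ⇿ (U+21FF) inside the word element -->
<w x="0.1087 0.2441 0.0503 0.0138">clustering⇿ehistering</w>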
In any case, these alternative readings can improve your users' search experience by allowing the plugin to index multiple forms for a given text position. This lets users find more matching passages for a given query than if only a single form were indexed for every word. It is a form of index-time term expansion, similar in concept to e.g. the Synonym Graph Filter that ships with Solr.
To enable the indexing of alternative readings, you have to make some modifications to your OCR field's index analysis chain.
First, you need to enable alternative expansion in the OcrCharFilterFactory by setting the expandAlternatives attribute to true:
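<charFilter
  class="solrocr.OcrCharFilterFactory"
  expandAlternatives="true"
/>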
Next, you need to add a new OcrAlternativesFilterFactory token filter component to your analysis chain. This component must be placed after the tokenizer:
<fieldType name="text_ocr" class="solr.TextField">
<!-- .... -->
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solrocr.OcrAlternativesFilterFactory"/>
<!-- .... -->
</fieldType>
A full field definition for an OCR field with alternative expansion could look like this:
<fieldType name="text_ocr" class="solr.TextField">
<analyzer type="index">
<charFilter class="solrocr.ExternalUtf8ContentFilterFactory"/>
<charFilter
class="solrocr.OcrCharFilterFactory"
expandAlternatives="true"
/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solrocr.OcrAlternativesFilterFactory"/<
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Highlighting matches on alternative forms
During highlighting, you will only see the matching alternative form in the snippet if the match is on a single word, or if it is at the beginning or the end of a phrase match. This is because Lucene's highlighting machinery does not expose the offsets of matching terms inside a phrase match.
Unsupported tokenizers
The OcrAlternativesFilterFactory works with almost all tokenizers that ship with Solr, except for the ClassicTokenizer. This is because we use the WORD JOINER character (U+2060) to denote alternative forms in the character stream, and the classic tokenizer splits tokens on this character (contrary to Unicode rules). This also means that if you use a custom tokenizer, you need to make sure that it does not split tokens on U+2060.
Non-alphabetic characters in alternatives
Some of Solr's built-in tokenizers split tokens on special characters like - that occur inside of words. When such characters occur within tokens that have alternatives, the alternatives are severed from the original token and the plugin will not index them. To avoid this, either use a tokenizer that doesn't split on these characters (like the WhitespaceTokenizerFactory, see the sketch below) or consider customizing your tokenizer of choice so that it does not split on these characters when a token includes alternative readings. Note that this can lead to less precise results: e.g. when alpha-numeric is not split, only a query like alphanumeric or alpha-numeric will match (depending on the analysis chains), but not alpha or numeric alone, nor an "alpha numeric" phrase query.
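As a sketch of the first option, the index analyzer from the full field definition above could swap in the whitespace tokenizer, with all other components left unchanged:
<analyzer type="index">
  <charFilter class="solrocr.ExternalUtf8ContentFilterFactory"/>
  <charFilter class="solrocr.OcrCharFilterFactory" expandAlternatives="true"/>
  <!-- splits on whitespace only, so tokens with alternatives stay intact -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solrocr.OcrAlternativesFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Keep in mind that the whitespace tokenizer also leaves punctuation attached to tokens, so this trades severed alternatives for a somewhat coarser tokenization.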
Consider increasing the standard maxTokenLength of 255
When your OCR contains a large number of alternatives for tokens, or the alternatives themselves are long, consider increasing the maximum token length in your tokenizer's configuration. For most of Solr's tokenizers this can be done with the maxTokenLength parameter, which defaults to 255. When the plugin encounters a case where this limit leads to truncated alternatives, it will print a warning to the Solr log; in that case, consider increasing the value to 512 or 1024. This comes at the expense of increased memory usage during indexing, but preserves as many of your alternative readings as possible.
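For instance, with the standard tokenizer used in the examples above, the limit could be raised like this:
<tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1024"/>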