Requirements
- Some familiarity with configuring Solr
- Solr ≥ 7.5
- OCR documents need to be in hOCR, ALTO or MiniOCR formats, with at least page-, and word-level segmentation
Manually installing the plugin JAR
To use the latest release version, refer to the GitHub Releases list. From there, download the corresponding JAR file.
To make the plugin available to Solr, create a new directory $SOLR_HOME/contrib/ocrhighlighting/lib
and place the JAR you just downloaded there.
For Solrcloud users: Installation as a Solr Package
Since version 8.4, Solrcloud ships with a package management subsystem that can be used to conveniently install plugins from the command-line. To install the OCR highlighting plugin in this way, follow these steps on one of the nodes in your Solrcloud cluster. All paths are relative to the Solr installation directory:
- Add repository to the local package registry:
$ ./bin/solr package add-repo dbmdz.github.io https://dbmdz.github.io/solr
- Install package in the latest version:
$ ./bin/solr package install ocrhighlighting
if you're on Solr 9, otherwise:$ ./bin/solr package install ocrhighlighting:0.8.1-solr78
Be sure to use the ocrhighlighting:
prefix when specifying classes in your configuration.
When using the Package Manager, classes from plugins have to be prefixed (separated by a colon) by
their plugin's identifier, for this plugin this identifier is ocrhighlighting
. So whenever
you see an attribute like class="solrocr.SomeClass"
, you have to write
class="ocrhighlighting:solrocr.SomeClass"
in your config instead.
Core Configuration
To enable the use of the plugin for your Solr core, you will have to edit
both the solrconfig.xml
and the schema.xml
file in your core's conf
directory.
Additionally, if you have installed the plugin via Solr's Package Management, you will have to deploy the plugin to your collection/core using Solr's CLI:
SolrConfig
In your core's solrconfig.xml
, you need to:
- Enable the plugin for your collection/core by instructing the collection from where to load the plugin classes (Skip when using Solrcloud with Package Manager )
- Define a search component that will perform the OCR highlighting at query time
- Add the search component to your request handlers that will trigger the highlighting.
<config>
<!-- ...other configuration options... -->
<!-- Only needed when not using the Package Management:
Tell Solr to load all JAR files from the directory installed the plugin to.
This assumes a directory structure where the cores are in
`$SOLR_HOME/server/solr/$CORE` and the plugin JAR was installed in
`$SOLR_HOME/contrib/ocrhighlighting/lib`. Adjust the path if you setup differs.
-->
<lib dir="../../../contrib/ocrhighlighting/lib" regex=".*\.jar" />
<!-- Add a new named search component that takes care of highlighting OCR field
values.
NOTE: Add the `ocrhighlighting:` prefix if using Package Management.
-->
<searchComponent
class="solrocr.OcrHighlightComponent"
name="ocrHighlight" />
<!-- ...other search components... -->
<!--
Instruct the request handlers you want to enable OCR highlighting for to
include the search component you defined above. This example uses the
standard /select handler.
CAUTION: Make sure that the OCR highlight component is listed **before** the
standard highlighting component, but **after** the query component.
-->
<requestHandler name="/select" class="solr.SearchHandler">
<arr name="components">
<str>query</str>
<str>ocrHighlight</str>
<str>highlight</str>
</arr>
</requestHandler>
</config>
If you run into problems, a look into these sections of the Solr user's guide might be helpful:
Schema
In the core's schema.xml
, you need to:
- Define a new field type that will hold your indexed OCR text
- Define which fields are going to hold the indexed OCR text.
The field type for OCR text is very similar to your regular text field, with the difference that there are one or two extra character filters at the beginning of your index analysis chain:
-
ExternalUtf8ContentFilterFactory
will (optionally) allow you to index and highlight OCR from external sources on the file system. More on this in the Indexing chapter. -
OcrCharFilterFactory
will retrieve the raw OCR data and extract the plain text that is going to pass through the rest of the analysis chain. It will auto-detect the used OCR formats, which means that you can use different OCR formats alongside each other. After this filter, Solr will treat the field just like a regular text field for purposes of analysis.
Additionally, you need to enable the storeOffsetsWithPositions
option. The plugin uses these
offsets to locate the matching terms in the OCR documents.
<schema>
<types>
<fieldtype
name="text_ocr"
class="solr.TextField"
storeOffsetsWithPositions="true">
<!-- NOTE: When not using the Package Manager, add the `ocrhighlighting:`
prefix to the `class` attributes -->
<analyzer type="index">
<!-- For loading external files as field values during indexing -->
<charFilter
class="solrocr.ExternalUtf8ContentFilterFactory" />
<!-- For converting OCR to plaintext -->
<charFilter
class="solrocr.OcrCharFilterFactory" />
<!-- ...rest of your index analysis chain... -->
</analyzer>
<analyzer type="query">
<!-- query analysis chain, should not include the character filters -->
</analyzer>
</fieldtype>
</types>
<fields>
<!-- ...your other fields ... -->
<!-- A field that uses the OCR field type. Has to be `stored`. -->
<field name="ocr_text" type="text_ocr" multiValued="false" indexed="true"
stored="true" />
</fields>
</schema>
If you struggle with setting up your schema, a look into the Schema Design chapter of the Solr user's guide might be helpful.
No support for multi-valued fields
Due to certain limitations in Lucene/Solr, it is currently not possible to use multi-valued fields for OCR highlighting. You can work around this by leveraging some of the advanced features of source pointers, though.