The repository includes a full-fledged example setup based on the Google Books 1000 and the BNL L’Union Newspaper datasets. The Google Books dataset consists of 1000 volumes along with their OCRed text in the hOCR format and all book pages as full-resolution JPEG images. The BNL dataset consists of 2712 newspaper issues in the ALTO format and all pages as high-resolution TIFF images.

The example ships with a search interface that lets you query the OCRed texts and displays the matching passages as highlighted image and text snippets. We also include a small IIIF viewer for browsing the documents and searching for text within them.

Online version

A public instance of this example is available at

The Solr server can be queried at, e.g. for q="mason dixon"~10
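
As a rough sketch of what such a request looks like, the following Python snippet builds the query string for the proximity query above. The base URL is a placeholder and not the address of the public instance. The field name `ocr_text` and the `hl.ocr.fl` parameter are assumptions for illustration and should be checked against your own schema and the plugin documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL: substitute the address of your own Solr core.
SOLR_SELECT_URL = "http://solr.example.org/solr/ocr/select"


def build_query_url(query, ocr_field="ocr_text", base_url=SOLR_SELECT_URL):
    """Return a /select URL for `query` with OCR highlighting requested."""
    params = {
        "q": query,
        "df": ocr_field,         # default search field (assumed field name)
        "hl": "true",            # switch highlighting on
        "hl.ocr.fl": ocr_field,  # assumed parameter naming the OCR fields to highlight
    }
    # urlencode takes care of escaping the quotes and the space in the phrase.
    return base_url + "?" + urlencode(params)


url = build_query_url('"mason dixon"~10')
print(url)
# Decoding the URL again yields the original query string.
print(parse_qs(urlsplit(url).query)["q"][0])
```

Building the URL with `urlencode` rather than by string concatenation avoids mistakes with the quotes in the phrase query.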


To run the example setup yourself, you will need:

  • Docker and docker-compose
  • Python 3
  • ~15 GiB of free storage

Running the example

  1. cd example
  2. docker-compose up -d
  3. ./
  4. Access http://localhost:8181 in your browser
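
Because `docker-compose up -d` returns as soon as the containers have been started, the frontend may not answer right away. The following optional helper is not part of the repository. It is a minimal sketch that polls a URL until it responds, and the function names, retry count and delay are arbitrary choices made for illustration.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=5):
    """Return True if `url` answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, ...
        return False


def wait_until_ready(probe, attempts=30, delay=2.0, sleep=time.sleep):
    """Call `probe()` up to `attempts` times, sleeping `delay` seconds after each failed try."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready(lambda: is_up("http://localhost:8181"))` after step 2 blocks until the frontend responds or the attempts are used up.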

Search Frontend


IIIF Viewer with Content Search

Solr Configuration Walkthrough


  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>
  <schemaFactory class="ClassicIndexSchemaFactory"/>

  <!-- Load the plugin JAR from the contrib directory -->
  <lib dir="../../../contrib/ocrsearch/lib" regex=".*\.jar" />

  <!-- Define a search component that takes care of OCR highlighting -->
  <searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent"
                   name="ocrHighlight" />

  <!-- Add the OCR Highlighting component to the request handler -->
  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="components">
      <str>query</str>
      <!--
        Note that the OCR highlighting component comes **before**
        the default highlighting component!
      -->
      <str>ocrHighlight</str>
      <str>highlight</str>
    </arr>
  </requestHandler>
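
The ordering requirement can be checked mechanically. The sketch below parses a `solrconfig.xml`-style snippet with Python's standard library and verifies that `ocrHighlight` precedes `highlight` in the component list of the `/select` handler. The `<str>` entries in the sample are an assumption about how the component list is filled in (`query` and `highlight` are the names of Solr's default query and highlighting components), and the function name is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Sample configuration; the <str> entries are assumed for illustration.
SAMPLE_CONFIG = """
<config>
  <searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent"
                   name="ocrHighlight" />
  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="components">
      <str>query</str>
      <str>ocrHighlight</str>
      <str>highlight</str>
    </arr>
  </requestHandler>
</config>
"""


def ocr_highlight_comes_first(config_xml, handler="/select"):
    """Return True if ocrHighlight is listed and precedes the default highlight component."""
    root = ET.fromstring(config_xml)
    xpath = f"./requestHandler[@name='{handler}']/arr[@name='components']/str"
    names = [(el.text or "").strip() for el in root.findall(xpath)]
    if "ocrHighlight" not in names:
        return False
    if "highlight" not in names:
        # No default highlighter configured, so there is nothing to come before.
        return True
    return names.index("ocrHighlight") < names.index("highlight")


print(ocr_highlight_comes_first(SAMPLE_CONFIG))
```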


<fieldtype name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
  <analyzer type="index">
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.ExternalUtf8ContentFilterFactory" />
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
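
To give an intuition for why the field type stores offsets (`storeOffsetsWithPositions="true"`), here is a toy Python sketch. It is not the plugin's implementation and leaves out stop word removal and stemming. It only illustrates the general idea, stated here as an assumption about the approach, that each indexed term keeps the character offsets of its occurrence in the original OCR markup, so a match can be traced back to the markup around it. The hOCR-like sample and the helper name are made up for this example.

```python
import re

# Made-up hOCR-like sample: two words, each with a bounding box in its title attribute.
SAMPLE_HOCR = (
    '<span class="ocrx_word" title="bbox 10 10 60 30">Mason</span> '
    '<span class="ocrx_word" title="bbox 70 10 120 30">Dixon</span>'
)


def tokens_with_offsets(markup):
    """Return (term, start, end) for every word outside of tags.

    The offsets refer to positions in the original markup string,
    not to positions in the text with the tags stripped.
    """
    tokens = []
    # Split the input into tags and text runs, skipping the tags.
    for part in re.finditer(r"<[^>]*>|[^<]+", markup):
        if part.group().startswith("<"):
            continue
        for word in re.finditer(r"\w+", part.group()):
            start = part.start() + word.start()
            end = part.start() + word.end()
            tokens.append((word.group().lower(), start, end))
    return tokens


for term, start, end in tokens_with_offsets(SAMPLE_HOCR):
    # Slicing the markup with the stored offsets recovers the original word.
    print(term, start, end, SAMPLE_HOCR[start:end])
```

With offsets like these, a highlighter can jump from a matching term straight to its position in the OCR file and read the surrounding markup, such as the `bbox` values in this sample.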