Example Setup

The repository includes a complete example setup based on the Google Books 1000 and the BNL L'Union Newspaper datasets. The Google Books dataset consists of 1000 volumes, with their OCRed text in the hOCR format and all book pages as full-resolution JPEG images. The BNL dataset consists of 2712 newspaper issues, with their OCRed text in the ALTO format and all pages as high-resolution TIFF images.

The example ships with a search interface for querying the OCRed texts. It displays the matching passages as highlighted image and text snippets. We also include a small IIIF viewer for browsing the documents and searching for text within them.

Online version

A public instance of this example is available at https://ocrhl.jbaiter.de.

The Solr server can be queried at https://ocrhl.jbaiter.de/solr/ocr/select, e.g. with the query q="mason dixon"~10
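
As a minimal sketch of how such a request could be assembled, the following Python snippet builds the full query URL for the proximity search above. The field name `ocr_text` and the helper `build_query_url` are assumptions made for illustration; check the example's schema for the actual OCR field name. The snippet only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Public Solr endpoint mentioned above
SOLR_SELECT_URL = "https://ocrhl.jbaiter.de/solr/ocr/select"


def build_query_url(query, ocr_field="ocr_text", snippets=10):
    """Build a select URL with OCR highlighting enabled.

    `ocr_field` is an assumed field name, adjust it to match the schema.
    """
    params = {
        "q": query,
        "df": ocr_field,          # default field to search in (assumed)
        "hl": "true",             # switch highlighting on
        "hl.ocr.fl": ocr_field,   # field to run the OCR highlighter on (assumed)
        "hl.snippets": snippets,  # maximum number of snippets per document
    }
    # urlencode takes care of escaping the quotes and the space
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_query_url('"mason dixon"~10')
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client.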

Prerequisites

To run the example setup yourself, you will need:

Docker and docker-compose
Python 3
~15 GiB of free storage

Running the example

cd example
docker-compose up -d
./ingest.py
Access http://localhost:8181 in your browser

Search Frontend

IIIF Content Search

Solr Configuration Walkthrough

solrconfig.xml

<config>
  <luceneMatchVersion>9.0</luceneMatchVersion>

  <!-- Load the plugin JAR from the contrib directory.
       NOTE: Not needed when running with SolrCloud and the Package Manager.
  -->
  <lib dir="../../../contrib/ocrsearch/lib" regex=".*\.jar" />

  <!-- Define a search component that takes care of OCR highlighting -->
  <searchComponent class="solrocr.OcrHighlightComponent"
                   name="ocrHighlight" />

  <!-- Add the OCR Highlighting component to the request handler -->
  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="components">
      <str>query</str>
      <!--
        Note that the OCR highlighting component comes **before**
        the default highlighting component!
      -->
      <str>ocrHighlight</str>
      <str>highlight</str>
    </arr>
  </requestHandler>
</config>
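
Because the order of the components matters, it can be useful to check it automatically, for instance in a deployment script. The following is a sketch only: the helper names `handler_components` and `ocr_before_default` are hypothetical, and the embedded XML is a shortened copy of the configuration above, not a complete solrconfig.xml.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration shown above
SOLRCONFIG = """
<config>
  <searchComponent class="solrocr.OcrHighlightComponent" name="ocrHighlight" />
  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="components">
      <str>query</str>
      <str>ocrHighlight</str>
      <str>highlight</str>
    </arr>
  </requestHandler>
</config>
"""


def handler_components(config_xml, handler_name="/select"):
    """Return the component names of a request handler, in document order."""
    root = ET.fromstring(config_xml.strip())
    for handler in root.iter("requestHandler"):
        if handler.get("name") == handler_name:
            arr = handler.find("arr[@name='components']")
            if arr is None:
                return []
            return [(s.text or "").strip() for s in arr.findall("str")]
    return []


def ocr_before_default(components):
    """True if 'ocrHighlight' is listed before the default 'highlight'."""
    return (
        "ocrHighlight" in components
        and "highlight" in components
        and components.index("ocrHighlight") < components.index("highlight")
    )


components = handler_components(SOLRCONFIG)
print(components, ocr_before_default(components))
```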

schema.xml

<fieldtype
    name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true">
  <analyzer type="index">
    <charFilter
      class="solrocr.ExternalUtf8ContentFilterFactory" />
    <charFilter
      class="solrocr.OcrCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
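
In the field type above, the two char filters appear only in the index analyzer, while the tokenizer and token filters are identical for indexing and querying. The following sketch compares the two chains programmatically. The helper name `classes_for` is hypothetical and the embedded XML is a copy of the definition above.

```python
import xml.etree.ElementTree as ET

# Copy of the field type definition shown above
FIELDTYPE = """
<fieldtype name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true">
  <analyzer type="index">
    <charFilter class="solrocr.ExternalUtf8ContentFilterFactory" />
    <charFilter class="solrocr.OcrCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
"""


def classes_for(fieldtype_xml, analyzer_type, tags):
    """List the class names of the given element tags within one analyzer."""
    root = ET.fromstring(fieldtype_xml.strip())
    analyzer = root.find(f"analyzer[@type='{analyzer_type}']")
    if analyzer is None:
        return []
    return [el.get("class") for el in analyzer if el.tag in tags]


token_tags = ("tokenizer", "filter")
index_chain = classes_for(FIELDTYPE, "index", token_tags)
query_chain = classes_for(FIELDTYPE, "query", token_tags)
index_char_filters = classes_for(FIELDTYPE, "index", ("charFilter",))
query_char_filters = classes_for(FIELDTYPE, "query", ("charFilter",))

print("same token chain:", index_chain == query_chain)
print("index-only char filters:", index_char_filters)
```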