Example Setup
The repository includes a full-fledged example setup based on the Google Books 1000 and the BNL L'Union Newspaper datasets. The Google Books dataset consists of 1000 volumes along with their OCRed text in the hOCR format and all book pages as full-resolution JPEG images. The BNL dataset consists of 2712 newspaper issues in the ALTO format and all pages as high-resolution TIFF images.
The example ships with a search interface that allows querying the OCRed texts and displays the matching passages as highlighted image and text snippets. We also include a small IIIF-Viewer that allows viewing the documents and searching for text within them.
Online version
A public instance of this example is available at https://ocrhl.jbaiter.de.
The Solr server can be queried at https://ocrhl.jbaiter.de/solr/ocr/select, e.g. for q="mason dixon"~10
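The same kind of query can be put together programmatically. The following is a minimal sketch that only builds the request URL and sends nothing. The `hl` and `hl.ocr.fl` parameters and the field name `ocr_text` are assumptions about how the example core is configured, and `build_query_url` is a hypothetical helper, not part of the repository.

```python
from urllib.parse import urlencode

# Public Solr endpoint of the example instance (taken from the text above).
SOLR_SELECT_URL = "https://ocrhl.jbaiter.de/solr/ocr/select"


def build_query_url(query, ocr_field="ocr_text", rows=10):
    """Return a select URL for `query` with OCR highlighting requested.

    `ocr_field` is an assumed field name; adjust it to match the schema.
    """
    params = {
        "q": query,
        "rows": rows,
        # Assumed parameters for switching on highlighting of the OCR field.
        "hl": "true",
        "hl.ocr.fl": ocr_field,
    }
    # urlencode takes care of escaping the quotes and the space in the query.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url('"mason dixon"~10'))
```

The resulting URL can be opened in a browser or passed to any HTTP client.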
Prerequisites
To run the example setup yourself, you will need:
- Docker and docker-compose
- Python 3
- ~15 GiB of free storage
Running the example
- cd example
- docker-compose up -d
- ./ingest.py
- Access http://localhost:8181 in your browser
Search Frontend
IIIF Content Search
Solr Configuration Walkthrough
<config>
  <luceneMatchVersion>9.0</luceneMatchVersion>
  <!-- Load the plugin JAR from the contrib directory.
       NOTE: Not needed when running with SolrCloud and Package Manager.
  -->
  <lib dir="../../../contrib/ocrsearch/lib" regex=".*\.jar" />
  <!-- Define a search component that takes care of OCR highlighting -->
  <searchComponent class="solrocr.OcrHighlightComponent"
                   name="ocrHighlight" />
  <!-- Add the OCR Highlighting component to the request handler -->
  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="components">
      <str>query</str>
      <!--
        Note that the OCR highlighting component comes **before**
        the default highlighting component!
      -->
      <str>ocrHighlight</str>
      <str>highlight</str>
    </arr>
  </requestHandler>
</config>
<fieldtype
  name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true">
  <analyzer type="index">
    <charFilter
      class="solrocr.ExternalUtf8ContentFilterFactory" />
    <charFilter
      class="solrocr.OcrCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
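The index analyzer above starts with an external-content filter, which suggests that a `text_ocr` field is fed a pointer to the OCR file rather than the OCR markup itself. The sketch below prepares such a document for Solr's JSON update handler under that assumption. The field name `ocr_text`, the document ID, the file path and both helper functions are hypothetical and only serve as illustration.

```python
import json


def build_solr_doc(doc_id, ocr_path, ocr_field="ocr_text"):
    """Build one Solr document whose OCR field holds a file path.

    Assumption: the external-content filter resolves `ocr_path` on the
    Solr host at indexing time, so the path must be readable by Solr.
    """
    return {"id": doc_id, ocr_field: ocr_path}


def build_update_payload(docs):
    """Serialize a list of documents as a JSON array for an update request."""
    return json.dumps(docs)


if __name__ == "__main__":
    # Hypothetical ID and path, shown only to illustrate the payload shape.
    doc = build_solr_doc("vol0000", "/data/google1000/Volume_0000.hocr")
    print(build_update_payload([doc]))
```

Such a payload would be sent with content type `application/json` to the core's update handler. How `./ingest.py` actually does this is not shown here.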