Supported Formats

In general, the plugin assumes that all OCR formats encode their documents as a hierarchy of blocks. For every supported format, we map its block types to the following general types:

Page: optional if there is only a single page in a document
Block: optional if hl.ocr.limitBlock is set to a different value at query time
Section: optional
Paragraph: optional
Line: optional if hl.ocr.contextBlock is set to a different value at query time
Word: required

These block types can be used in the hl.ocr.limitBlock and hl.ocr.contextBlock query parameters to control how the plugin generates snippets.
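
As a rough illustration, the sketch below builds a query string that sets both parameters. The field name `ocr_text`, the host and core in the printed URL, and the lowercase spelling of the block type values are assumptions made for this example; they are not taken from this section, so adjust them to your setup.

```python
from urllib.parse import urlencode

def build_highlight_query(term, limit_block="page", context_block="line"):
    """Return a URL-encoded query string that sets both block parameters."""
    params = {
        "q": f"ocr_text:{term}",               # hypothetical field name
        "hl": "true",
        "hl.ocr.limitBlock": limit_block,      # one of the block types listed above
        "hl.ocr.contextBlock": context_block,  # one of the block types listed above
    }
    return urlencode(params)

# Hypothetical host and core name, shown only to illustrate the final URL.
print("http://localhost:8983/solr/ocr/select?" + build_highlight_query("christmas"))
```

Keeping the two parameters as function arguments makes it easy to experiment with different block types per query.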

hOCR

Block type mapping:

Block	hOCR class	notes
Word	`ocrx_word`	needs to have a `bbox` entry in the `title` attribute with the coordinates on the page
Page	`ocr_page`	needs to have a page identifier, either in the `id` attribute or in the `ppageno` or `x_source` entry of the `title` attribute; the identifier must be unique within a document
Block	`ocr_carea`/`ocrx_block`
Section	`ocr_chapter`/`ocr_section`/`ocr_subsection`/`ocr_subsubsection`
Paragraph	`ocr_par`
Line	`ocr_line` or `ocrx_line`
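
To check hOCR input against these requirements before indexing, something like the following sketch may help. It assumes the usual hOCR convention that properties such as `bbox` and `ppageno` are stored as semicolon-separated entries in the `title` attribute. The helper names and the lookup order (`id`, then `ppageno`, then `x_source`) are choices made for this example, not necessarily what the plugin does internally.

```python
def parse_hocr_title(title):
    """Split an hOCR `title` attribute into a {property: value-string} dict."""
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props

def word_bbox(title):
    """Return (x0, y0, x1, y1) from the `bbox` property, or None if missing."""
    value = parse_hocr_title(title).get("bbox")
    if value is None:
        return None
    coords = tuple(int(v) for v in value.split())
    return coords if len(coords) == 4 else None

def page_identifier(attrs):
    """Pick a page identifier from a dict of element attributes.

    The order id -> ppageno -> x_source is an assumption of this sketch.
    """
    if attrs.get("id"):
        return attrs["id"]
    props = parse_hocr_title(attrs.get("title", ""))
    return props.get("ppageno") or props.get("x_source")
```

A word without a usable `bbox`, or a page for which `page_identifier` returns nothing, would not satisfy the notes in the table above.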

ALTO

For ALTO OCR, there are some requirements for the XML structure:

Must have ALTO as the default namespace and no namespace prefixes for the tags.
If using multiple ALTO files per document, the ID attribute of the <Page> tag must be unique within the document.

If those requirements are not met, you will encounter severe bugs and misbehavior in the highlighting phase. Please make sure your inputs meet these requirements before you start indexing.
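
A pre-indexing check for these two requirements could look roughly like the sketch below. It is a heuristic, not part of the plugin: the function name `check_alto` is hypothetical, the prefix test is a simple regular expression over the raw text, and the namespace URI itself is not validated.

```python
import re
import xml.etree.ElementTree as ET

# Matches an opening tag that carries a namespace prefix, e.g. "<alto:Page".
PREFIXED_TAG = re.compile(r"<[A-Za-z_][\w.-]*:[A-Za-z_]")

def check_alto(xml_texts):
    """Return a list of problems found in the ALTO files of one document."""
    problems = []
    seen_ids = set()
    for index, text in enumerate(xml_texts):
        if PREFIXED_TAG.search(text):
            problems.append(f"file {index}: tags use a namespace prefix")
        root = ET.fromstring(text)
        # ElementTree renders namespaced tags as "{namespace-uri}localname".
        if not root.tag.startswith("{") or not root.tag.endswith("}alto"):
            problems.append(f"file {index}: root is not a namespaced <alto> element")
            continue
        namespace = root.tag[: root.tag.index("}") + 1]
        for page in root.iter(namespace + "Page"):
            page_id = page.get("ID")
            if page_id is None or page_id in seen_ids:
                problems.append(
                    f"file {index}: missing or duplicate Page ID {page_id!r}"
                )
            seen_ids.add(page_id)
    return problems
```

Passing the text of all ALTO files that belong to one document lets the check catch `ID` collisions across files, which is the case the second requirement is about.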

Caution

The coordinates returned by the plugin are not always pixel values, since ALTO supports a variety of reference units for the coordinates. Check the <MeasurementUnit> value in your ALTO files. If its value is anything other than pixel, you will have to do some additional calculations on the client side to convert to pixel coordinates.
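
The client-side conversion might be sketched as follows. It covers the `mm10` (tenths of a millimetre) and `inch1200` (1/1200 of an inch) units defined by the ALTO schema and assumes you know the resolution (`dpi`) of the image you are displaying; the function name and the `dpi` parameter are introduced here for illustration only.

```python
# Number of units per inch for the non-pixel <MeasurementUnit> values.
UNITS_PER_INCH = {
    "mm10": 254.0,       # tenths of a millimetre (25.4 mm per inch)
    "inch1200": 1200.0,  # 1/1200 of an inch
}

def to_pixels(value, unit, dpi):
    """Convert one ALTO coordinate to pixels for an image scanned at `dpi`."""
    if unit == "pixel":
        return float(value)
    try:
        return value / UNITS_PER_INCH[unit] * dpi
    except KeyError:
        raise ValueError(f"unsupported MeasurementUnit: {unit!r}") from None
```

The same function applies to `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`, since all four are expressed in the same unit.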

Block type mapping:

Block	ALTO tag	notes
Word	`<String />`	needs to have `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes
Line	`<TextLine />`
Block	`<TextBlock />`
Page	`<Page />`	needs to have an `ID` attribute with a page identifier, must be unique within a document
Section	not mapped
Paragraph	not mapped

MiniOCR

This plugin also includes support for a custom non-standard OCR format that we dubbed MiniOCR, designed to be very simple (and thus performant) to parse and to occupy the least space possible.

You should use this format when:

you want to store the OCR in the index (to keep the index size as low as possible)
reusing the existing OCR files is not possible or practical (to keep occupied disk space low)
you want the best possible performance: highlighting MiniOCR is ~25% faster than ALTO and ~50% faster than hOCR (in an artificial benchmark that is purely CPU-bound)

A basic example looks like this:

<ocr>
  <p xml:id="page_identifier">
    <b>
      <l><w x="50 50 100 100">A</w> <w x="150 50 100 100">Line</w></l>
    </b>
  </p>
</ocr>
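
If you generate MiniOCR yourself, a minimal serializer might look like the sketch below. It is not the bundled conversion tool; the function name and the nested-list input layout are assumptions made for this example, and the output follows the structure of the basic example above.

```python
from xml.sax.saxutils import escape, quoteattr

def build_miniocr(pages):
    """Serialize pages to MiniOCR.

    `pages` is a list of (page_id, blocks); each block is a list of lines and
    each line is a list of (text, (x, y, width, height)) tuples.
    """
    out = ["<ocr>"]
    for page_id, blocks in pages:
        out.append(f"  <p xml:id={quoteattr(page_id)}>")
        for block in blocks:
            out.append("    <b>")
            for line in block:
                # Words on a line are separated by a single space.
                words = " ".join(
                    f'<w x="{x} {y} {w} {h}">{escape(text)}</w>'
                    for text, (x, y, w, h) in line
                )
                out.append(f"      <l>{words}</l>")
            out.append("    </b>")
        out.append("  </p>")
    out.append("</ocr>")
    return "\n".join(out)

line = [("A", (50, 50, 100, 100)), ("Line", (150, 50, 100, 100))]
print(build_miniocr([("page_identifier", [[line]])]))
```

The word text is XML-escaped so that characters such as `&` or `<` in the OCR output do not break the document.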

Alternatives for words can be encoded with the ⇿ (U+21FF) marker. For example, this is how you would encode a word with the default form clistrias and two alternatives christmas and christrias:

<w x="50 50 100 100">clistrias⇿christmas⇿christrias</w>
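
On the producing or consuming side, the marker can be handled with plain string operations. The helper names in this sketch are hypothetical.

```python
ALTERNATIVE_MARKER = "\u21ff"  # the U+21FF marker character

def split_alternatives(word_text):
    """Return (default_form, [alternatives]) for the text of a <w> element."""
    default, *alternatives = word_text.split(ALTERNATIVE_MARKER)
    return default, alternatives

def join_alternatives(default, alternatives):
    """Build the text of a <w> element from a default form and alternatives."""
    return ALTERNATIVE_MARKER.join([default, *alternatives])
```

A word without any marker simply yields an empty list of alternatives.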

A command-line tool to convert from ALTO and hOCR to MiniOCR is provided in util/miniocr.py.

Block type mapping:

Block	MiniOCR tag	notes
Word	`<w/>`	needs to have an `x` attribute with `{x} {y} {width} {height}`. Values can be integers or floats between 0 and 1, with the leading `0` omitted
Line	`<l/>`
Block	`<b/>`
Page	`<p/>`	needs to have an `xml:id` attribute with a page identifier. Optionally can have a `wh` attribute with the `{width} {height}` values for the page
Section	not mapped
Paragraph	not mapped
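
The float notation could be produced with a small formatter such as the one sketched below. It assumes that values between 0 and 1 are fractions of the page width and height, which this section does not state explicitly; the function names and the default precision of four digits are likewise choices made for this example.

```python
def format_fraction(value, digits=4):
    """Format a number in [0, 1] without its leading zero, e.g. 0.25 -> '.25'."""
    if not 0 <= value <= 1:
        raise ValueError("expected a value between 0 and 1")
    text = f"{value:.{digits}f}".rstrip("0").rstrip(".")
    return text[1:] if text.startswith("0.") else text

def relative_box(x, y, width, height, page_width, page_height):
    """Turn a pixel box into a coordinate string of page-relative fractions."""
    parts = (
        x / page_width,
        y / page_height,
        width / page_width,
        height / page_height,
    )
    return " ".join(format_fraction(p) for p in parts)
```

The returned string has the `{x} {y} {width} {height}` layout described for the word coordinates in the table above.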