Skip to content

Supported Formats

In general the plugin assumes that all OCR formats encode their documents in a hierarchy of blocks. For all supported formats, we map their block types to these general types:

  • Page: optional if there is only a single page in a document
  • Block: optional if hl.ocr.limitBlock is set to a different value at query time
  • Section: optional
  • Paragraph: optional
  • Line: (optional if hl.ocr.contextBlock is set to a different value at query time)
  • Word: required

These block types can be used in the hl.ocr.limitBlock and hl.ocr.contextBlock query parameters to control how the plugin generates snippets.

hOCR

Block type mapping:

Block hOCR class notes
Word ocrx_word needs to have a bbox attribute with the coordinates on the page
Page ocr_page needs to have a page identifier, either in id attribute or in the ppageno or x_source entry in the title attribute
Block ocr_carea/ocrx_block
Section ocr_chapter/ocr_section/
ocr_subsection/ocr_subsubsection
Paragraph ocr_par
Line ocr_line or ocrx_line

ALTO

Caution

The coordinates returned by the plugin are not always pixel values, since ALTO supports a variety of different reference units for the coordinates. Check the <MeasurementUnit> value in your ALTO files, if its value is anything other than pixel, you will have to do some additional calculations on the client side to convert to pixel coordinates.

Block type mapping:

Block ALTO tag notes
Word <String /> needs to have CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes
Line <TextLine />
Block <TextBlock />
Page <Page /> needs to have an ID attribute with a page identifier
Section not mapped
Paragraph not mapped

MiniOCR

This plugin also includes support for a custom non-standard OCR format that we dubbed MiniOCR, designed to be very simple (and thus performant) to parse and to occupy the least space possible.

You should use this format when:

  • you want to store the OCR in the index (to keep the index size as low)
  • reusing the existing OCR files is not possible or practical (to keep occupied disk space low)
  • you want the best possible performance, highlighting MiniOCR is ~25% faster than ALTO and ~50% faster than hOCR (in an artificial benchmark that is purely CPU-bound)

A basic example looks like this:

<ocr>
  <p xml:id="page_identifier">
    <b>
      <l><w x="50 50 100 100">A</w> <w x="150 50 100 100">Line</w></l>
    </b>
  </p>
</ocr>

Alternatives for words can be encoded with the (U+21FF) marker. For example, this is how you would encode a word with the default form clistrias and two alternatives christmas and christrias:

<w x="50 50 100 100">clistrias⇿christmas⇿christrias</w>

Block type mapping:

Block MiniOCR tag notes
Word <w/> needs to have box attribute with {x} {y} {width} {height}.
Values can be integers or floats between 0 and 1, with the leading 0 omitted
Line <l/>
Block <b/>
Page <p/> needs to have an xml:id attribute with a page identifier. Optionally can have a wh attribute with the {width} {height} values for the page
Section not mapped
Paragraph not mapped