Supported Formats
In general the plugin assumes that all OCR formats encode their documents in a hierarchy of blocks. For all supported formats, we map their block types to these general types:
- Page: optional if there is only a single page in a document
- Block: optional if
hl.ocr.limitBlock
is set to a different value at query time - Section: optional
- Paragraph: optional
- Line: (optional if
hl.ocr.contextBlock
is set to a different value at query time) - Word: required
These block types can be used in the hl.ocr.limitBlock
and hl.ocr.contextBlock
query parameters to control how the plugin generates snippets.
hOCR
Block type mapping:
Block | hOCR class | notes |
---|---|---|
Word | ocrx_word |
needs to have a bbox attribute with the coordinates on the page |
Page | ocr_page |
needs to have a page identifier, either in id attribute or in the ppageno or x_source entry in the title attribute |
Block | ocr_carea /ocrx_block |
|
Section | ocr_chapter /ocr_section /ocr_subsection /ocr_subsubsection |
|
Paragraph | ocr_par |
|
Line | ocr_line or ocrx_line |
ALTO
Caution
The coordinates returned by the plugin are not always pixel values, since ALTO supports a variety
of different reference units for the coordinates. Check the <MeasurementUnit>
value in your ALTO
files, if its value is anything other than pixel
, you will have to do some additional calculations
on the client side to convert to pixel coordinates.
Block type mapping:
Block | ALTO tag | notes |
---|---|---|
Word | <String /> |
needs to have CONTENT , HPOS , VPOS , WIDTH and HEIGHT attributes |
Line | <TextLine /> |
|
Block | <TextBlock /> |
|
Page | <Page /> |
needs to have an ID attribute with a page identifier |
Section | not mapped | |
Paragraph | not mapped |
MiniOCR
This plugin also includes support for a custom non-standard OCR format that we dubbed MiniOCR, designed to be very simple (and thus performant) to parse and to occupy the least space possible.
You should use this format when:
- you want to store the OCR in the index (to keep the index size as low)
- reusing the existing OCR files is not possible or practical (to keep occupied disk space low)
- you want the best possible performance, highlighting MiniOCR is ~25% faster than ALTO and ~50% faster than hOCR (in an artificial benchmark that is purely CPU-bound)
A basic example looks like this:
<ocr>
<p xml:id="page_identifier">
<b>
<l><w x="50 50 100 100">A</w> <w x="150 50 100 100">Line</w></l>
</b>
</p>
</ocr>
Alternatives for words can be encoded with the ⇿
(U+21FF
) marker. For example, this is how you would
encode a word with the default form clistrias
and two alternatives christmas
and christrias
:
Block type mapping:
Block | MiniOCR tag | notes |
---|---|---|
Word | <w/> |
needs to have box attribute with {x} {y} {width} {height} . Values can be integers or floats between 0 and 1, with the leading 0 omitted |
Line | <l/> |
|
Block | <b/> |
|
Page | <p/> |
needs to have an xml:id attribute with a page identifier. Optionally can have a wh attribute with the {width} {height} values for the page |
Section | not mapped | |
Paragraph | not mapped |