hocr2djvused(1) hOCR to djvused script converter


hocr2djvused [option...] [hocr-file...]


hocr2djvused reads one or more hOCR[1] files (as produced by OCRopus[2], Cuneiform[3] or Tesseract[4]) and converts them to a djvused script.

Unless a filename is explicitly provided on the command line, hOCR is read from the standard input.
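As a rough illustration of the two input modes, the following Python sketch builds a command line and, when no filename is given, feeds hOCR through the standard input. The helper names build_command and convert are hypothetical. The sketch assumes that hocr2djvused is on the PATH and that the resulting djvused script is written to the standard output.

```python
import subprocess
from typing import List, Optional

DETAIL_LEVELS = ("lines", "words", "chars")


def build_command(hocr_files: List[str], details: Optional[str] = None) -> List[str]:
    """Build an argument list for hocr2djvused (hypothetical helper)."""
    if details is not None and details not in DETAIL_LEVELS:
        raise ValueError(f"unsupported details level: {details!r}")
    cmd = ["hocr2djvused"]
    if details is not None:
        cmd.append(f"--details={details}")
    # With no filenames, hocr2djvused reads hOCR from the standard input.
    cmd.extend(hocr_files)
    return cmd


def convert(hocr_bytes: bytes, details: Optional[str] = None) -> bytes:
    """Pipe hOCR through hocr2djvused and return whatever it prints.

    Assumption: the djvused script is emitted on the standard output.
    """
    result = subprocess.run(
        build_command([], details),
        input=hocr_bytes,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

For example, build_command(["page1.html", "page2.html"], "lines") yields an argument list that reads both files and records line locations only.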


Text segmentation options

-t lines, --details=lines

Record the location of every line. Don't record locations of individual words or characters.

-t words, --details=words

Record the location of every line and every word. Don't record locations of individual characters.

This is the default.

-t chars, --details=chars

Record location of every line, every word and every character.


Consider each non-empty sequence of non-whitespace characters a single word.

This is the default, despite being linguistically incorrect.
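The whitespace-based rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the tool's actual code, and the function name simple_words is hypothetical.

```python
import re
from typing import List, Tuple


def simple_words(line: str) -> List[Tuple[int, int, str]]:
    """Split a line into words: maximal runs of non-whitespace characters.

    Returns (start, end, text) triples with character offsets into the line.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", line)]
```

Note that punctuation stays attached to the neighbouring word under this rule: "Hello, world!" yields the two words "Hello," and "world!".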


Use the Unicode Text Segmentation[5] algorithm to break lines into words.

This option breaks the assumption made by some DjVu tools that words are separated by spaces, and is therefore not recommended.

Other options


Assume that DjVu pages are rotated by n degrees.


Specifies that page size is width pixels × height pixels.

This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.


Use an HTML5 parser[6], which is more robust but slower than the default parser.


Attempt to fix UTF-8 encoding issues and eliminate unwanted control characters.

This option might be needed for hOCR generated by Cuneiform[7] or Tesseract[8].
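To give an idea of what this kind of clean-up can involve, here is a minimal Python sketch. It is an assumption about the general technique, not a description of what hocr2djvused does internally, and the function name clean_text is hypothetical.

```python
import unicodedata


def clean_text(raw: bytes) -> str:
    """Decode bytes as UTF-8 and drop unwanted control characters.

    Invalid byte sequences become U+FFFD (REPLACEMENT CHARACTER).
    Control characters (Unicode category Cc) are removed, except for
    tab, line feed and carriage return.
    """
    text = raw.decode("utf-8", errors="replace")
    return "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) != "Cc"
    )
```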


Output version information and exit.

-h, --help

Display help and exit.


Please report bugs at: https://bitbucket.org/jwilk/ocrodjvu/issues