SYNOPSIS
- hocr2djvused [option...] [hocr-file...]
DESCRIPTION
Unless a filename is explicitly provided on the command line, hOCR is read from the standard input.
OPTIONS
Text segmentation options
-t lines, --details=lines
- Record location of every line. Don't record locations of particular words or characters.
-t words, --details=words
- Record location of every line and every word. Don't record locations of particular characters. This is the default.
-t chars, --details=chars
- Record location of every line, every word and every character.
--word-segmentation=simple
- Consider each non-empty sequence of non-whitespace characters a single word. This is the default, despite being linguistically incorrect.
--word-segmentation=uax29
- Use the Unicode Text Segmentation[5] algorithm to break lines into words. This option breaks the assumption of some DjVu tools that words are separated by spaces, and therefore it is not recommended.
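The rule behind --word-segmentation=simple can be sketched in a few lines of Python. This is only an illustration of the description above, not the tool's own code: the function name simple_words is hypothetical, and Python's notion of whitespace (the \S class of the re module) is assumed to be close enough to the tool's. Note that punctuation stays attached to the neighbouring word, which is why this rule is called linguistically incorrect.

```python
import re

def simple_words(line):
    """Return (start, end, text) for each maximal run of non-whitespace characters.

    Hypothetical helper: it mirrors the "simple" rule described above,
    not the actual implementation used by hocr2djvused.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", line)]

# Punctuation is not split off: "Hello," and "world!" are each one word.
for start, end, text in simple_words("Hello, world!  It's 2024."):
    print(start, end, text)
```

The uax29 mode would instead split at the boundaries defined by the Unicode Text Segmentation algorithm, so words need not be separated by spaces.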
Other options
--rotation=n
- Assume that DjVu pages are rotated by n degrees.
--page-size=widthxheight
- Specifies that page size is width pixels × height pixels. This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.
--html5
- Use an HTML5 parser[6], which is more robust but slower than the default parser.
--fix-utf8
- Attempt to fix UTF-8 encoding issues and eliminate unwanted control characters. This option might be needed for hOCR generated by Cuneiform[7] or Tesseract[8].
--version
- Output version information and exit.
-h, --help
- Display help and exit.
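As a rough sketch of how the options above combine on one command line, the following Python helper assembles an argument list from them. The helper name build_command and the file name page.html are hypothetical. Only option spellings listed above are used, and the list is only built, not executed.

```python
def build_command(files=(), details=None, word_segmentation=None,
                  rotation=None, page_size=None, html5=False, fix_utf8=False):
    """Assemble an argv list for hocr2djvused (hypothetical helper).

    Every option is optional; anything left out falls back to the
    tool's documented default.
    """
    argv = ["hocr2djvused"]
    if details is not None:
        if details not in ("lines", "words", "chars"):
            raise ValueError("details must be lines, words or chars")
        argv.append("--details=" + details)
    if word_segmentation is not None:
        if word_segmentation not in ("simple", "uax29"):
            raise ValueError("word_segmentation must be simple or uax29")
        argv.append("--word-segmentation=" + word_segmentation)
    if rotation is not None:
        argv.append("--rotation=%d" % rotation)
    if page_size is not None:
        width, height = page_size  # both in pixels
        argv.append("--page-size=%dx%d" % (width, height))
    if html5:
        argv.append("--html5")
    if fix_utf8:
        argv.append("--fix-utf8")
    argv.extend(files)  # with no files, hOCR is read from standard input
    return argv

# page.html is a placeholder input file name.
print(build_command(["page.html"], details="lines", html5=True))
```

The resulting list can be passed to subprocess.run() from the Python standard library; with an empty file list, the hOCR data has to be supplied on standard input.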
BUGS
Please report bugs at: https://bitbucket.org/jwilk/ocrodjvu/issues
NOTES
- 1. hOCR
- 2. OCRopus
  https://code.google.com/p/ocropus/
- 3. Cuneiform
  https://launchpad.net/cuneiform-linux
- 4. Tesseract
  https://code.google.com/p/tesseract-ocr/
- 5. Unicode Text Segmentation
  http://unicode.org/reports/tr29/
- 6. HTML5 parser
- 7. https://bugs.launchpad.net/cuneiform-linux/+bug/585418
- 8. https://code.google.com/p/tesseract-ocr/issues/detail?id=690