SYNOPSIS
djvu2hocr [option...] djvu-file
djvu2hocr {--version | --help | -h}
DESCRIPTION
djvu2hocr extracts the hidden text layer from a DjVu file, in the hOCR[1] format.
OPTIONS
Input selection options
-p, --pages=page-range
Specifies pages to convert.
page-range is a comma-separated list of sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous range of pages (e.g. 37-42). Pages are numbered from 1.
The default is to convert all pages.
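The page-range syntax described above can be illustrated with a minimal Python sketch. It is not taken from djvu2hocr itself; the function name parse_page_range and its error handling are assumptions made for illustration only.

```python
def parse_page_range(spec):
    """Expand a page-range string such as "17,37-42" into a list of page numbers.

    Hypothetical helper: it mirrors the documented syntax, not djvu2hocr's code.
    """
    pages = []
    for sub_range in spec.split(","):
        # A sub-range is either "N" or "N-M".
        first, sep, last = sub_range.strip().partition("-")
        start = int(first)
        end = int(last) if sep else start
        # Pages are numbered from 1, and a range must not run backwards.
        if start < 1 or end < start:
            raise ValueError(f"invalid sub-range: {sub_range!r}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_range("17,37-42"))  # [17, 37, 38, 39, 40, 41, 42]
```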
Text segmentation options
--word-segmentation=simple
Use the same word segmentation as found in the DjVu file.
This is the default.
--word-segmentation=uax29
Use the Unicode Text Segmentation[2] algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file.
HTML output options
--title=title
Specifies the document title.
The default is "DjVu hidden text layer".
--css=style
Add the specified CSS style to the document.
For example, --css='.ocrx_line { display: block; }' can be used to visually preserve line breaks.
Other options
--version
Output version information and exit.
-h, --help
Display help and exit.
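As a rough illustration of how the options above combine, the following Python sketch assembles a djvu2hocr command line. The helper build_command and the file name book.djvu are hypothetical; only the option names come from this page.

```python
def build_command(djvu_file, pages=None, word_segmentation=None, title=None, css=None):
    """Assemble an argument list for djvu2hocr (hypothetical helper)."""
    argv = ["djvu2hocr"]
    if pages is not None:
        argv.append(f"--pages={pages}")
    if word_segmentation is not None:
        argv.append(f"--word-segmentation={word_segmentation}")
    if title is not None:
        argv.append(f"--title={title}")
    if css is not None:
        argv.append(f"--css={css}")
    # The DjVu file is the final positional argument.
    argv.append(djvu_file)
    return argv


command = build_command(
    "book.djvu",  # hypothetical input file
    pages="1-3,17",
    word_segmentation="uax29",
    css=".ocrx_line { display: block; }",
)
print(command)
```

Passing the list to subprocess.run(command, check=True) would run the tool, provided djvu2hocr is installed and on PATH. Because the arguments are passed as a list, the CSS string needs no shell quoting.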
PORTABILITY
djvu2hocr uses a custom extension to hOCR to retain characters that cannot be directly represented in an HTML/XML document. For example, the control character BEL (^G, U+0007) is converted into the following HTML chunk:
<span class="djvu_char" title="#x07"> </span>
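The conversion can be sketched in Python as follows. This is an approximation, not djvu2hocr's implementation: the function name is hypothetical, and the set of characters treated as unrepresentable (C0 controls other than TAB, LF and CR, which XML 1.0 forbids) is an assumption. Only the span format is taken from the example above.

```python
import re

# Assumption: the affected characters are the C0 controls that XML 1.0
# does not allow (everything below U+0020 except TAB, LF and CR).
_UNREPRESENTABLE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_control_chars(text):
    """Replace unrepresentable characters with djvu_char spans (hypothetical helper)."""
    def replace(match):
        code = ord(match.group())
        return '<span class="djvu_char" title="#x{:02x}"> </span>'.format(code)
    return _UNREPRESENTABLE.sub(replace, text)


print(escape_control_chars("ding\x07dong"))
# ding<span class="djvu_char" title="#x07"> </span>dong
```

The sketch handles only the control characters; ordinary HTML escaping of &, < and > is left out.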
BUGS
Please report bugs at: https://bitbucket.org/jwilk/ocrodjvu/issues
NOTES
1. hOCR
2. Unicode Text Segmentation
http://unicode.org/reports/tr29/