ocrmypdf(1) - add an OCR text layer to PDF files


usage: ocrmypdf [-h] [--verbose [VERBOSE]] [--version] [-n] [--flowchart FILE]
[-l LANGUAGE] [-j N] [--title TITLE] [--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS] [-r] [-d] [-c] [-i] [--oversample DPI] [-f] [-s] [--skip-big MPixels] [--tesseract-config CFG] [--tesseract-pagesegmode PSM] [--pdf-renderer {auto,tesseract,hocr}] [--tesseract-timeout SECONDS] [--rotate-pages-threshold CONFIDENCE] [-k] [-g] input_file output_file

Generate searchable PDF file from an image-only PDF file.

positional arguments:

input_file
PDF file containing the images to be OCRed
output_file
output searchable PDF file

optional arguments:

-h, --help
show this help message and exit
-l LANGUAGE, --language LANGUAGE
languages of the file to be OCRed
-j N, --jobs N
Use up to N CPU cores simultaneously (default: use all)
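
The usage line above can be sketched as a small Python helper that assembles an argument list. This is only an illustration, assuming ocrmypdf is installed and on PATH; the helper name build_basic_command, the file names, and the language code "eng" are hypothetical, and only flags listed above are used.

```python
import shlex

def build_basic_command(input_file, output_file, language=None, jobs=None):
    """Assemble an ocrmypdf argument list (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if language is not None:
        cmd += ["-l", language]          # -l LANGUAGE
    if jobs is not None:
        if jobs < 1:
            raise ValueError("jobs must be a positive integer")
        cmd += ["-j", str(jobs)]         # -j N
    # Positional arguments come last: input_file, then output_file.
    cmd += [input_file, output_file]
    return cmd

cmd = build_basic_command("scan.pdf", "searchable.pdf", language="eng", jobs=2)
print(shlex.join(cmd))  # ocrmypdf -l eng -j 2 scan.pdf searchable.pdf
```

To actually run the command, the list could be passed to subprocess.run(cmd, check=True).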

Common options:

--verbose [VERBOSE], -v [VERBOSE]
Print more verbose messages for each additional verbose level.
--version
show program's version number and exit

pipeline arguments:

-n, --just_print
Don't actually run any commands; just print the pipeline.
--flowchart FILE
Don't run any commands; just print the pipeline as a flowchart.

Metadata options:

Set output PDF/A metadata (default: use input document's metadata)
--title TITLE
set document title (place multiple words in quotes)
--author AUTHOR
set document author
--subject SUBJECT
set document subject description
--keywords KEYWORDS
set document keywords
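
A minimal sketch of passing the metadata options from Python. The helper name metadata_args and all values are hypothetical. When arguments are passed as a list, multi-word values need no extra quoting; shlex.join is used here only to show how the equivalent shell command would be quoted.

```python
import shlex

def metadata_args(title=None, author=None, subject=None, keywords=None):
    """Return the metadata flags for the values that were supplied."""
    args = []
    pairs = (
        ("--title", title),
        ("--author", author),
        ("--subject", subject),
        ("--keywords", keywords),
    )
    for flag, value in pairs:
        if value is not None:
            args += [flag, value]
    return args

# Hypothetical values; omitted options fall back to the input document's metadata.
args = metadata_args(title="Scanned Invoice", author="J. Doe")
print(shlex.join(["ocrmypdf", *args, "scan.pdf", "searchable.pdf"]))
```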

Preprocessing options:

Improve OCR quality and final image
-r, --rotate-pages
automatically rotate pages based on detected text orientation
-d, --deskew
deskew each page before performing OCR
-c, --clean
clean scanning artifacts from pages before performing OCR
-i, --clean-final
incorporate the cleaned image in the final PDF file
--oversample DPI
oversample images to at least the specified DPI, to improve OCR results slightly
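
The preprocessing flags can be combined freely on one command line. The sketch below maps keyword arguments to the flags above and illustrates one reading of "at least the specified DPI": an image already at or above the target keeps its resolution. That reading, and both function names, are assumptions made for illustration.

```python
def preprocessing_args(rotate=False, deskew=False, clean=False,
                       clean_final=False, oversample_dpi=None):
    """Translate keyword arguments into preprocessing flags (hypothetical helper)."""
    args = []
    if rotate:
        args.append("-r")
    if deskew:
        args.append("-d")
    if clean:
        args.append("-c")
    if clean_final:
        args.append("-i")
    if oversample_dpi is not None:
        if oversample_dpi <= 0:
            raise ValueError("DPI must be positive")
        args += ["--oversample", str(oversample_dpi)]
    return args

def resolution_after_oversample(image_dpi, oversample_dpi):
    """Assumed semantics: never reduce resolution, only raise it to the target."""
    return max(image_dpi, oversample_dpi)

print(preprocessing_args(rotate=True, deskew=True, oversample_dpi=300))
print(resolution_after_oversample(150, 300))  # 300
print(resolution_after_oversample(600, 300))  # 600
```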

OCR options:

Control how OCR is applied
-f, --force-ocr
rasterize any fonts or vector images on each page and apply OCR
-s, --skip-text
skip OCR on any pages that already contain text, but include those pages in the final output
--skip-big MPixels
skip OCR on pages larger than the specified number of megapixels, but include skipped pages in the final output
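
To choose a value for --skip-big, it helps to estimate how many megapixels a page occupies once rasterized. The arithmetic below is a rough sketch, assuming one megapixel is 1,000,000 pixels and that the page size is judged from its rasterized image; the function names are hypothetical.

```python
def page_megapixels(width_in, height_in, dpi):
    """Pixel count of a page rasterized at the given DPI, in megapixels."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px / 1_000_000

def would_skip(width_in, height_in, dpi, skip_big_mpixels):
    """True if the page exceeds the --skip-big threshold (assumed strict comparison)."""
    return page_megapixels(width_in, height_in, dpi) > skip_big_mpixels

# US Letter (8.5 x 11 in) at 300 DPI: 2550 x 3300 pixels.
print(page_megapixels(8.5, 11, 300))   # 8.415
print(would_skip(8.5, 11, 300, 10))    # False
print(would_skip(8.5, 11, 300, 5))     # True
```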


Advanced:

Advanced options for power users
--tesseract-config CFG
additional Tesseract configuration files
--tesseract-pagesegmode PSM
set Tesseract page segmentation mode (see tesseract --help)
--pdf-renderer {auto,tesseract,hocr}
choose OCR PDF renderer
--tesseract-timeout SECONDS
give up on OCR after the timeout, but copy the preprocessed page into the final output
--rotate-pages-threshold CONFIDENCE
only rotate pages when confidence is above this value (arbitrary units reported by Tesseract)
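
A sketch of validating the advanced options before building a command line. The renderer choices are taken from the usage line; the helper name advanced_args and the range checks are assumptions, not checks that ocrmypdf is known to apply in this form.

```python
PDF_RENDERERS = ("auto", "tesseract", "hocr")  # choices listed in the usage line

def advanced_args(pdf_renderer=None, timeout_seconds=None, rotate_threshold=None):
    """Build advanced flags, rejecting values that are clearly invalid."""
    args = []
    if pdf_renderer is not None:
        if pdf_renderer not in PDF_RENDERERS:
            raise ValueError(f"pdf_renderer must be one of {PDF_RENDERERS}")
        args += ["--pdf-renderer", pdf_renderer]
    if timeout_seconds is not None:
        if timeout_seconds < 0:
            raise ValueError("timeout must not be negative")
        args += ["--tesseract-timeout", str(timeout_seconds)]
    if rotate_threshold is not None:
        # Only meaningful together with -r/--rotate-pages.
        args += ["--rotate-pages-threshold", str(rotate_threshold)]
    return args

print(advanced_args(pdf_renderer="hocr", timeout_seconds=30))
```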


Debugging:

Arguments to help with troubleshooting and debugging
-k, --keep-temporary-files
keep temporary files (helpful for debugging)
-g, --debug-rendering
render each page twice with debug information on second page