DESCRIPTION
usage: ocrmypdf [-h] [--verbose [VERBOSE]] [--version] [-n] [--flowchart FILE] [-l LANGUAGE] [-j N] [--title TITLE] [--author AUTHOR] [--subject SUBJECT] [--keywords KEYWORDS] [-r] [-d] [-c] [-i] [--oversample DPI] [-f] [-s] [--skip-big MPixels] [--tesseract-config CFG] [--tesseract-pagesegmode PSM] [--pdf-renderer {auto,tesseract,hocr}] [--tesseract-timeout SECONDS] [--rotate-pages-threshold CONFIDENCE] [-k] [-g] input_file output_file
Generate a searchable PDF file from an image-only PDF file.
positional arguments:
- input_file
- PDF file containing the images to be OCRed
- output_file
- output searchable PDF file
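The two positional arguments can be supplied from a script as a plain argument list. The following is a minimal sketch, assuming ocrmypdf is installed and on the PATH; the helper names basic_command and run, and the file names, are hypothetical and not part of ocrmypdf.

```python
import shutil
import subprocess


def basic_command(input_file, output_file):
    """Build the simplest invocation: input PDF first, output PDF second."""
    return ["ocrmypdf", input_file, output_file]


def run(cmd):
    """Run the command, failing early if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run(cmd, check=True)


# Hypothetical file names; call run(...) on the list to actually execute it.
print(basic_command("scan.pdf", "scan_ocr.pdf"))
```

Passing a list rather than a single string avoids shell quoting problems with file names that contain spaces.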
optional arguments:
- -h, --help
- show this help message and exit
- -l LANGUAGE, --language LANGUAGE
- languages of the file to be OCRed
- -j N, --jobs N
- Use up to N CPU cores simultaneously (default: use all)
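As an illustration of -l and -j, the sketch below adds both options only when they are requested. The helper name language_and_jobs is hypothetical, and the value "deu" assumes that LANGUAGE is a Tesseract language code; check which language packs are installed on your system.

```python
import os


def language_and_jobs(input_file, output_file, language=None, jobs=None):
    """Build a command with optional -l LANGUAGE and -j N."""
    cmd = ["ocrmypdf"]
    if language is not None:
        cmd += ["-l", language]
    if jobs is not None:
        if jobs < 1:
            raise ValueError("jobs must be at least 1")
        cmd += ["-j", str(jobs)]
    return cmd + [input_file, output_file]


# Leave one core free for other work; fall back to 1 if the count is unknown.
spare = max(1, (os.cpu_count() or 2) - 1)
# "deu" is an assumed Tesseract language code (German).
print(language_and_jobs("scan.pdf", "scan_ocr.pdf", language="deu", jobs=spare))
```

Omitting -j keeps the documented default of using all cores.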
Common options:
- --verbose [VERBOSE], -v [VERBOSE]
- Print more verbose messages for each additional verbose level.
- --version
- show program's version number and exit
pipeline arguments:
- -n, --just_print
- Don't actually run any commands; just print the pipeline.
- --flowchart FILE
- Don't run any commands; just print pipeline as a flowchart.
Metadata options:
- Set output PDF/A metadata (default: use input document's metadata)
- --title TITLE
- set document title (place multiple words in quotes)
- --author AUTHOR
- set document author
- --subject SUBJECT
- set document subject description
- --keywords KEYWORDS
- set document keywords
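The note about placing multiple words in quotes applies when typing the command in a shell. A small sketch, using the hypothetical helper metadata_command and made-up metadata values, shows how the same command looks as an argument list and as a quoted shell line (shlex.join requires Python 3.8 or later).

```python
import shlex

METADATA_KEYS = ("title", "author", "subject", "keywords")


def metadata_command(input_file, output_file, **metadata):
    """Build a command that sets any of the four metadata options."""
    unknown = set(metadata) - set(METADATA_KEYS)
    if unknown:
        raise ValueError(f"unsupported metadata keys: {sorted(unknown)}")
    cmd = ["ocrmypdf"]
    for key in METADATA_KEYS:
        value = metadata.get(key)
        if value:
            # As a list element, a multi-word value needs no extra quoting.
            cmd += [f"--{key}", value]
    return cmd + [input_file, output_file]


cmd = metadata_command("scan.pdf", "scan_ocr.pdf",
                       title="Annual Report", author="J. Doe")
# shlex.join adds the quotes a shell would need:
# ocrmypdf --title 'Annual Report' --author 'J. Doe' scan.pdf scan_ocr.pdf
print(shlex.join(cmd))
```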
Preprocessing options:
- Improve OCR quality and final image
- -r, --rotate-pages
- automatically rotate pages based on detected text orientation
- -d, --deskew
- deskew each page before performing OCR
- -c, --clean
- clean scanning artifacts from pages before performing OCR
- -i, --clean-final
- incorporate the cleaned image in the final PDF file
- --oversample DPI
- oversample images to at least the specified DPI, to improve OCR results slightly
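The preprocessing switches are independent and can be combined. The sketch below maps keyword arguments of a hypothetical helper, preprocessing_command, onto the long option names listed above; the 300 DPI value is only an example.

```python
def preprocessing_command(input_file, output_file, rotate=False, deskew=False,
                          clean=False, clean_final=False, oversample_dpi=None):
    """Build a command with any combination of preprocessing options."""
    switches = [
        (rotate, "--rotate-pages"),
        (deskew, "--deskew"),
        (clean, "--clean"),
        (clean_final, "--clean-final"),
    ]
    cmd = ["ocrmypdf"]
    # Keep only the switches that were turned on, in a fixed order.
    cmd += [flag for enabled, flag in switches if enabled]
    if oversample_dpi is not None:
        if oversample_dpi <= 0:
            raise ValueError("oversample_dpi must be positive")
        cmd += ["--oversample", str(oversample_dpi)]
    return cmd + [input_file, output_file]


# Example values only: deskew and clean a scan, oversampling to 300 DPI.
print(preprocessing_command("scan.pdf", "scan_ocr.pdf",
                            deskew=True, clean=True, oversample_dpi=300))
```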
OCR options:
- Control how OCR is applied
- -f, --force-ocr
- rasterize any fonts or vector images on each page and apply OCR
- -s, --skip-text
- skip OCR on any pages that already contain text, but include the page in final output
- --skip-big MPixels
- skip OCR on pages larger than the specified number of megapixels, but include skipped pages in final output
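To choose a value for --skip-big it helps to estimate how many megapixels a page occupies. The sketch below computes that from a page size in inches and a scan resolution, and builds the matching command. The helper names are hypothetical, and treating --force-ocr and --skip-text as alternatives is an assumption of this sketch (they describe opposite handling of pages that already contain text), not something the help text states.

```python
def megapixels(width_in, height_in, dpi):
    """Pixel count of a page scanned at the given DPI, in millions."""
    return (width_in * dpi) * (height_in * dpi) / 1_000_000


def ocr_command(input_file, output_file, mode=None, skip_big=None):
    """mode is None, "force" (--force-ocr) or "skip-text" (--skip-text)."""
    flags = {None: [], "force": ["--force-ocr"], "skip-text": ["--skip-text"]}
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["ocrmypdf"] + flags[mode]
    if skip_big is not None:
        cmd += ["--skip-big", str(skip_big)]
    return cmd + [input_file, output_file]


# A US Letter page (8.5 x 11 in) at 300 DPI is 2550 x 3300 pixels.
size = megapixels(8.5, 11, 300)
print(f"{size:.3f} MPixels")
# Example threshold of 10 MPixels, slightly above a Letter page at 300 DPI.
print(ocr_command("scan.pdf", "scan_ocr.pdf", mode="skip-text", skip_big=10))
```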
Advanced:
- Advanced options for power users
- --tesseract-config CFG
- additional Tesseract configuration files
- --tesseract-pagesegmode PSM
- set Tesseract page segmentation mode (see tesseract --help)
- --pdf-renderer {auto,tesseract,hocr}
- choose OCR PDF renderer
- --tesseract-timeout SECONDS
- give up on OCR after the timeout, but copy the preprocessed page into the final output
- --rotate-pages-threshold CONFIDENCE
- only rotate pages when confidence is above this value (arbitrary units reported by tesseract)
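The advanced options take constrained values, so a wrapper script may want to validate them before launching a long job. This is a sketch under stated assumptions: the helper advanced_command is hypothetical, the renderer names come from the usage line above, and adding --rotate-pages alongside the threshold assumes the threshold only has an effect when rotation is enabled.

```python
PDF_RENDERERS = ("auto", "tesseract", "hocr")  # choices shown in the usage line


def advanced_command(input_file, output_file, pdf_renderer=None,
                     timeout_seconds=None, rotate_threshold=None):
    """Build a command using the advanced options, validating values first."""
    cmd = ["ocrmypdf"]
    if pdf_renderer is not None:
        if pdf_renderer not in PDF_RENDERERS:
            raise ValueError(f"pdf_renderer must be one of {PDF_RENDERERS}")
        cmd += ["--pdf-renderer", pdf_renderer]
    if timeout_seconds is not None:
        if timeout_seconds < 0:
            raise ValueError("timeout_seconds must not be negative")
        cmd += ["--tesseract-timeout", str(timeout_seconds)]
    if rotate_threshold is not None:
        # Assumption: the threshold is only meaningful with --rotate-pages.
        cmd += ["--rotate-pages", "--rotate-pages-threshold",
                str(rotate_threshold)]
    return cmd + [input_file, output_file]


# Example values only; the threshold is in arbitrary units, so tune it
# against your own documents.
print(advanced_command("scan.pdf", "scan_ocr.pdf", pdf_renderer="hocr",
                       timeout_seconds=180, rotate_threshold=5))
```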
Debugging:
- Arguments to help with troubleshooting and debugging
- -k, --keep-temporary-files
- keep temporary files (helpful for debugging)
- -g, --debug-rendering
- render each page twice with debug information on second page