Tesseract is an open source optical character recognition (OCR) platform. OCR extracts text from images and from documents that lack a text layer, and writes the result to a new searchable text file, PDF, or one of several other common formats. Tesseract is highly customizable and supports many languages, including multilingual documents and vertical text.
Running Tesseract on RCC Resources
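Tesseract does not require loading an environment module, so on the HPC you can run the tesseract command directly from the terminal. The general form of the command is: tesseract imagename outputbase [options...] [configfile...]. Replace imagename with the name of your input image and outputbase with the base name for the output file.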
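As a minimal sketch of such a call, the snippet below assumes a hypothetical input image named scan.png in the current directory and English-language text; both filenames are placeholders, not names from this guide. It runs Tesseract twice, once for plain text and once for a searchable PDF, and skips the run if the command or the image is missing.

```shell
# Hypothetical filenames; replace them with your own.
imagename="scan.png"   # input image (assumed to exist in the current directory)
outputbase="scan"      # output base name; Tesseract appends the extension

if command -v tesseract >/dev/null 2>&1 && [ -f "$imagename" ]; then
    # Plain text output: writes scan.txt. -l selects the language (eng = English).
    tesseract "$imagename" "$outputbase" -l eng
    # Searchable PDF output: the trailing "pdf" is a config file name; writes scan.pdf.
    tesseract "$imagename" "$outputbase" -l eng pdf
else
    echo "tesseract or $imagename not available; nothing to do" >&2
fi
```

To check which language models are installed before choosing a value for -l, run tesseract --list-langs.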
The available options and the contents of the config files are listed on the Tesseract GitHub page.