
tesseract

An open source optical character recognition (OCR) platform.


Tesseract is an open source optical character recognition (OCR) platform. OCR extracts text from images and from documents that have no text layer, and writes the result to a new searchable file, such as plain text, PDF, or another supported output format. Tesseract is highly customizable and supports many languages, including multilingual documents and vertical text.

Running tesseract on RCC Resources

To run tesseract on the HPC, run the command directly from the terminal; it does not require loading an environment module. In the example below, replace imagename with the path to your input image and outputbase with the base name of the output file. Tesseract appends the file extension to outputbase itself.

$ tesseract imagename outputbase [options...] [configfile...]

The available options and config files are listed on the Tesseract GitHub page.
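As an illustration of the command form above, the following is a minimal sketch of running tesseract over every PNG image in a directory. The function name `ocr_dir`, the `TESSERACT` override variable, and the `.png` file pattern are assumptions made for this example and are not part of tesseract or of the RCC setup. The `-l` option selects the recognition language, and each output base is derived by dropping the `.png` extension from the image path.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of tesseract): OCR every PNG in a directory.
# Usage: ocr_dir DIRECTORY [LANGUAGE]
ocr_dir() {
    local dir="$1"
    local lang="${2:-eng}"   # language code passed to -l; defaults to English
    local img

    for img in "$dir"/*.png; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$img" ] || continue

        # imagename  = the PNG file
        # outputbase = same path without ".png"; tesseract adds the extension
        # TESSERACT is an assumed override so the binary can be swapped out.
        "${TESSERACT:-tesseract}" "$img" "${img%.png}" -l "$lang"
    done
}
```

For example, `ocr_dir ./scans eng` would call tesseract once per image, so `./scans/page1.png` produces `./scans/page1.txt` with the default text output. Setting `TESSERACT=echo` prints the commands instead of running them, which is a convenient way to check the file names before starting a long batch.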