Text Extraction Api Services - TExASe

TExASe is a Flask application for document processing, namely extracting text from documents (applying OCR where necessary).

Most of the provided services require a file as an input.

In addition, TExASe supports repositories through a set of services and functions that read document metadata, generate citation strings and, most notably, generate cover pages for documents from the supplied metadata, configuration, HTML and CSS files, and images.

List of services

Basic services

Repository services

A cover page will also be added when using the above services (except extract) if the create_first_page param is set to true in the respective repository config file (e.g. repos/default/config.json) and the nocover param is not set to false.

Use GUI

Simply, on the left, pick a file and click process. The result will appear on the right.

Optionally, for more precise OCR, enter a (supported) Tesseract language code below.

The current GUI supports only the ocr_and_extract service.

Use API

url

hostname/api/service_name

required arg

file : contains the filename of the posted file.

optional args

Optional args are defined for each service in the config.json file. For example, the ocr service accepts a lang parameter (a language code for Tesseract OCR).
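As a rough illustration of the url and args described above, the following sketch builds a request for the ocr service using only the Python standard library. It assumes the file is sent as a multipart/form-data POST under the field name file, and that optional args such as lang travel as ordinary form fields; neither detail is confirmed here. The host http://localhost:5000, the file name scan.pdf and the helper names are hypothetical.

```python
import uuid
import urllib.request


def build_service_url(hostname: str, service_name: str) -> str:
    """Compose hostname/api/service_name."""
    return f"{hostname.rstrip('/')}/api/{service_name}"


def encode_multipart(fields: dict, filename: str, content: bytes):
    """Encode form fields plus one uploaded file as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # Assumption: the upload field is called "file", matching the required arg.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + content
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_request(hostname, service_name, filename, content, **optional_args):
    """Prepare (but do not send) a POST request for one TExASe service."""
    body, content_type = encode_multipart(optional_args, filename, content)
    return urllib.request.Request(
        build_service_url(hostname, service_name),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",  # assumption: services expect POST
    )


# Hypothetical host and file; "ocr" and "lang" are taken from the text above.
request = build_request(
    "http://localhost:5000", "ocr", "scan.pdf", b"%PDF-1.4", lang="eng"
)
```

Passing the prepared request to urllib.request.urlopen(request) would send it. The shape of the response depends on the service and is not assumed here.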

Options

Some of the options are configured in the config.json file. These include:

tesseract : tesseract service options, including a list of default languages to be used in OCR, path to tesseract service, as well as a list of applicable extensions.

log : string, file location of a writable log file. If the file is specified and writable, all requests will be logged in it.

wkhtmltopdf_path : string, path to executables for wkhtmltopdf, required for creation of PDFs from HTML strings.

default_repo : string, name of the default repository (folder with required files in the repos dir of this project). This will affect creation of the cover pages for submitted PDFs.

redis : Redis server options. If on is set to 1, all requests go through a Redis queue at the specified url and port. Redis is only available for services that do not return a value.

services : a list of available services and their respective parameter definitions.
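To make the redis option concrete, here is a minimal sketch of how a config fragment might be read to decide whether requests should be queued. Only the key names log, default_repo, redis, on, url and port come from the descriptions above. All values are hypothetical, the helper redis_target is not part of TExASe, and the real config.json may be laid out differently.

```python
import json

# Hypothetical config fragment; the values are placeholders, not defaults.
SAMPLE_CONFIG = json.loads("""
{
  "log": "/tmp/texase.log",
  "default_repo": "default",
  "redis": {"on": 1, "url": "localhost", "port": 6379}
}
""")


def redis_target(config: dict):
    """Return (url, port) when the redis queue is switched on, else None."""
    redis_cfg = config.get("redis", {})
    # Queueing is enabled only when "on" is set to 1.
    if int(redis_cfg.get("on", 0)) != 1:
        return None
    return redis_cfg["url"], redis_cfg["port"]


target = redis_target(SAMPLE_CONFIG)
```

Returning None for a missing or disabled redis section lets the caller fall back to handling the request directly.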

Requirements

sudo apt-get install python3-dev libxml2-dev libxslt1-dev tesseract-ocr ghostscript

for Ubuntu, and

pip install flask redis rq pdfkit tika ocrmypdf pytesseract Pillow

from pip.