Optical Character Recognition (OCR)

Overview

Optical Character Recognition (OCR) extracts text and structured data from images, scanned PDFs, and other supported document formats. In FormKiQ, OCR is usually run as a document action, workflow step, or part of an intelligent document processing flow.

OCR output can support:

  • Full-text search
  • Text extraction from scanned documents
  • Key-value extraction from forms
  • Table extraction
  • Textract query-based extraction
  • Metadata mapping into document attributes
  • Workflow routing and downstream automation

Where OCR Fits

OCR is one part of the document processing pipeline.

Step | What happens
1. Document is added | A document is uploaded, imported, or referenced.
2. OCR action runs | FormKiQ extracts text or structured data using Tesseract or Amazon Textract.
3. OCR result is stored | OCR text, tables, key-values, status, and output files can be retrieved through the API.
4. Mappings can run | Extracted OCR output can be mapped into document attributes.
5. Search and workflows use the output | OCR text can be indexed, and mapped attributes can drive rules, workflows, reporting, or access patterns.

For mappings, see Mappings. For document processing actions, see Documents.

OCR Providers

FormKiQ supports Tesseract and Amazon Textract.

Provider | Best for | Availability
Tesseract | Basic text extraction, simple scanned documents, lower-cost OCR processing. | Available in FormKiQ Core and all editions.
Amazon Textract | Higher-accuracy extraction, forms, tables, key-values, handwriting, and queries. | Available with Explore, commercial deployments, and optional OCR/IDP modules.

Tesseract

Tesseract is an open-source OCR engine. It is useful for straightforward text extraction when you do not need advanced form, table, or key-value analysis.

Use Tesseract when:

  • You need basic OCR for scanned documents.
  • Cost sensitivity matters more than advanced extraction.
  • Documents are relatively clean and simple.
  • You only need text output for search or review.

Amazon Textract

Amazon Textract is an AWS machine learning service for OCR and document analysis. It is better suited to structured and semi-structured documents.

Use Textract when:

  • You need forms or key-value extraction.
  • You need table extraction.
  • You need query-based extraction.
  • You need better handling of complex layouts.
  • You are processing invoices, forms, records, or other structured business documents.

For advanced processing, see Enhanced Document OCR and Classification.

Supported Inputs and Outputs

Supported input formats depend on the engine, document conversion path, installed modules, and AWS service support in the target region.

Common OCR inputs include:

  • PDF
  • JPEG
  • PNG
  • TIFF
  • Scanned image files
  • Office documents when conversion support is available in the deployment

Common OCR outputs include:

Output | Description
Text | Extracted text from the document.
Key-values | Label/value pairs from forms or structured documents.
Tables | Extracted table rows and cells.
CSV | Table output exported to CSV when configured.
Content URLs | Presigned URLs for OCR output files.
OCR status | Processing state such as requested, successful, failed, or skipped.

Confirm supported formats and module behavior for your deployment before relying on OCR for production ingestion.

OCR Action Parameters

OCR is commonly configured through document action parameters.

{
  "type": "OCR",
  "parameters": {
    "ocrEngine": "TEXTRACT",
    "ocrParseTypes": "TEXT,TABLES",
    "ocrOutputType": "CSV",
    "ocrNumberOfPages": "10",
    "addPdfDetectedCharactersAsText": "true"
  }
}

Parameter | Description | Notes
ocrEngine | OCR engine to use. | TESSERACT or TEXTRACT.
ocrParseTypes | Types of data to extract. | TEXT, FORMS, TABLES, QUERIES.
ocrTextractQueries | Questions to ask Textract when using QUERIES. | Required when ocrParseTypes includes QUERIES.
ocrOutputType | Output conversion for supported results. | CSV is used for Textract table output.
ocrNumberOfPages | Number of pages to OCR from the start of the document. | -1 processes all pages.
addPdfDetectedCharactersAsText | Rewrites PDF image text into searchable text where supported. | Useful for scanned PDFs.

Use the generated API reference for exact request shape and supported values.
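
The parameters above can also be assembled in code before they are sent. The following is a minimal sketch, not FormKiQ's own client: the helper name build_ocr_action is hypothetical, and it assumes that parameter values are passed as strings, as they are in the JSON example above. Check the generated API reference for the exact request wrapper around this object.

```python
import json

# Values taken from the parameter table above.
ALLOWED_ENGINES = {"TESSERACT", "TEXTRACT"}
ALLOWED_PARSE_TYPES = {"TEXT", "FORMS", "TABLES", "QUERIES"}


def build_ocr_action(engine, parse_types, number_of_pages=-1,
                     output_type=None, add_pdf_text=False):
    """Build an OCR action object shaped like the JSON example above.

    Hypothetical helper: string-typed values are an assumption based on
    that example, not a confirmed requirement of the API.
    """
    if engine not in ALLOWED_ENGINES:
        raise ValueError(f"unsupported ocrEngine: {engine}")
    unknown = [p for p in parse_types if p not in ALLOWED_PARSE_TYPES]
    if unknown:
        raise ValueError(f"unsupported ocrParseTypes: {unknown}")

    parameters = {
        "ocrEngine": engine,
        "ocrParseTypes": ",".join(parse_types),
        "ocrNumberOfPages": str(number_of_pages),  # -1 processes all pages
        "addPdfDetectedCharactersAsText": str(add_pdf_text).lower(),
    }
    if output_type is not None:
        parameters["ocrOutputType"] = output_type
    return {"type": "OCR", "parameters": parameters}


action = build_ocr_action("TEXTRACT", ["TEXT", "TABLES"], number_of_pages=10,
                          output_type="CSV", add_pdf_text=True)
print(json.dumps(action, indent=2))
```

Validating the engine and parse types locally catches typos before a request is made, but the API remains the authority on which combinations are accepted.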

Textract Parse Types

Amazon Textract can extract different kinds of information depending on the parse type.

Parse type | Extracts | Use for
TEXT | Lines and words. | Searchable text, basic extraction, downstream full-text indexing.
FORMS | Key-value pairs. | Invoices, applications, forms, and structured records.
TABLES | Table structure and cell values. | Statements, reports, tabular forms, and CSV export.
QUERIES | Answers to configured questions. | Targeted extraction when field labels vary or documents are less structured.

When using QUERIES, provide ocrTextractQueries with the questions Textract should answer.
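
The QUERIES requirement can be checked before a request is submitted. This is a small sketch under assumptions: the function name check_queries is hypothetical, and the value shape shown for ocrTextractQueries (a list of objects with a text field) is an assumption for illustration only. The check itself only tests that some value is present.

```python
def check_queries(parameters):
    """Return a list of problems found in OCR action parameters.

    Only the rule stated above is checked: QUERIES needs ocrTextractQueries.
    """
    problems = []
    raw = parameters.get("ocrParseTypes", "")
    parse_types = [p.strip() for p in raw.split(",") if p.strip()]
    if "QUERIES" in parse_types and not parameters.get("ocrTextractQueries"):
        problems.append(
            "ocrParseTypes includes QUERIES but ocrTextractQueries is missing"
        )
    return problems


# Assumed shape for the queries value; confirm it in the API reference.
with_queries = {
    "ocrEngine": "TEXTRACT",
    "ocrParseTypes": "TEXT,QUERIES",
    "ocrTextractQueries": [{"text": "What is the invoice number?"}],
}
without_queries = {"ocrEngine": "TEXTRACT", "ocrParseTypes": "TEXT,QUERIES"}

print(check_queries(with_queries))
print(check_queries(without_queries))
```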

OCR Results

OCR results can be retrieved from the document OCR API after processing completes.

Result data can include:

  • Extracted text
  • Key-value pairs
  • Tables
  • Output content URLs
  • OCR engine used
  • OCR status
  • User ID
  • Document ID
  • Inserted date

The OCR status is useful for monitoring and retry workflows. Failed OCR actions should be surfaced in operational reports or queues so they can be retried or reviewed.

For endpoint details, see Get document OCR content.
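
A monitoring job might sort OCR results into queues by status, as described above. The sketch below is illustrative only: the field names documentId and ocrStatus and the queue names are assumptions, and the status strings follow the states listed earlier (requested, successful, failed, skipped) with case ignored, since the exact API values are not confirmed here.

```python
from collections import defaultdict

# Hypothetical mapping from OCR status to the next operational step.
NEXT_STEP = {
    "successful": "use_result",
    "requested": "wait",
    "failed": "retry_or_review",
    "skipped": "ignore",
}


def next_step(ocr_status):
    """Map an OCR status to a next step; unknown values go to manual review."""
    key = (ocr_status or "").strip().lower()
    return NEXT_STEP.get(key, "review")


def triage(results):
    """Group document IDs by next step.

    Each result is assumed to be a dict with 'documentId' and 'ocrStatus'.
    """
    queues = defaultdict(list)
    for result in results:
        step = next_step(result.get("ocrStatus"))
        queues[step].append(result.get("documentId"))
    return dict(queues)


sample = [
    {"documentId": "doc-1", "ocrStatus": "SUCCESSFUL"},
    {"documentId": "doc-2", "ocrStatus": "FAILED"},
    {"documentId": "doc-3", "ocrStatus": "REQUESTED"},
]
queues = triage(sample)
print(queues)
```

Sending unknown statuses to a review queue instead of dropping them keeps unexpected API values visible to operators.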

OCR output becomes more useful when combined with mappings and search.

Feature | How OCR output is used
Mappings | Turns OCR text or key-value output into document attributes such as invoiceNumber, vendorName, or documentDate.
Full-text search | Indexes extracted text so scanned documents can be searched.
Workflows | Routes documents based on mapped OCR values or processing status.
Rulesets | Runs conditional automation after OCR or attribute extraction.
Reporting | Tracks OCR volume, failures, extracted fields, and processing trends.

For related details, see Mappings, Search, Rulesets, and Workflows.

Common Use Cases

Searchable Scanned Documents

Run OCR on scanned PDFs or images so users can search for text that was previously locked inside image content.

Invoice and Form Processing

Use Textract forms, tables, or queries to extract invoice numbers, vendors, totals, dates, line items, and other structured fields.

Metadata Extraction

Use OCR with mappings to populate document attributes used by search, reporting, workflows, and compliance processes.

Table Extraction

Use Textract table extraction and optional CSV output for statements, reports, forms, or documents with tabular data.

Workflow Automation

Trigger workflows after OCR completes, or route documents based on attributes populated from OCR output.

Best Practices

Choose the Right Engine

Use Tesseract for basic text extraction. Use Textract when document structure matters, especially forms, tables, handwriting, or query-based extraction.

Prepare Documents for OCR

OCR quality depends heavily on source quality.

Recommended practices:

  • Use clean scans.
  • Use 300 DPI where possible.
  • Avoid skewed or rotated pages.
  • Remove unnecessary backgrounds.
  • Prefer high-contrast images.
  • Test with real production samples.

Limit Page Volume When Appropriate

Use ocrNumberOfPages when only the first few pages matter. This can reduce cost and processing time for long documents.
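
To see the effect of the limit, the number of pages actually sent to OCR can be worked out from the document length and the ocrNumberOfPages value. This is a rough sketch: the function name is hypothetical, and rejecting values below 1 other than -1 is a choice made for this example, not documented API behavior.

```python
def pages_to_ocr(total_pages, ocr_number_of_pages=-1):
    """Return how many pages would be processed from the start of a document.

    -1 means all pages, as described in the parameter table.
    """
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if ocr_number_of_pages == -1:
        return total_pages
    if ocr_number_of_pages < 1:
        # Assumption for this sketch: other non-positive values are rejected.
        raise ValueError("ocr_number_of_pages must be -1 or at least 1")
    return min(total_pages, ocr_number_of_pages)


# A 120-page document limited to its first 10 pages.
processed = pages_to_ocr(120, 10)
skipped = 120 - processed
print(processed, skipped)
```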

Use Specific Parse Types

Only request the parse types you need. For example, use TEXT for searchable text, add FORMS for key-values, and add TABLES only when table extraction is needed.
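
One way to keep parse types minimal is to derive them from what the document actually needs. The helper below is a hypothetical sketch of that rule: it always starts from TEXT and adds other types only on request, producing the comma-separated form used in the action parameters example.

```python
def plan_parse_types(key_values=False, tables=False, queries=False):
    """Build a comma-separated ocrParseTypes value from extraction needs."""
    parse_types = ["TEXT"]  # searchable text is the baseline
    if key_values:
        parse_types.append("FORMS")
    if tables:
        parse_types.append("TABLES")
    if queries:
        parse_types.append("QUERIES")  # also needs ocrTextractQueries
    return ",".join(parse_types)


search_only = plan_parse_types()
invoice = plan_parse_types(key_values=True, tables=True)
print(search_only)
print(invoice)
```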

Monitor Failures

Track OCR action status and failed OCR requests. Failed actions may indicate unsupported file types, poor image quality, missing module configuration, regional AWS service issues, or document conversion problems.

Plan for Cost

OCR cost depends on page count, engine, parsing method, retries, and document volume. Textract-based processing should be estimated separately from core storage and API usage.

For cost guidance, see Costs and AWS Usage.

API Operations

Use the generated API reference for exact request and response schemas.

Operation | Purpose | API reference
Add OCR action | Run OCR as a document action. | POST /documents/{documentId}/actions
Perform document OCR | Request OCR directly for a document. | POST /documents/{documentId}/ocr
Get OCR result | Retrieve OCR text, tables, key-values, status, or output URLs. | GET /documents/{documentId}/ocr
Set OCR result | Set OCR result data for a document. | PUT /documents/{documentId}/ocr
Delete OCR result | Delete OCR result data for a document. | DELETE /documents/{documentId}/ocr
Add workflow | Configure OCR as part of a workflow. | POST /workflows
Set workflow | Update workflow OCR action configuration. | PUT /workflows/{workflowId}
Get configuration | Review site OCR configuration limits. | GET /sites/{siteId}/configuration
Update configuration | Update site OCR configuration limits. | PATCH /sites/{siteId}/configuration
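
The /documents/{documentId}/ocr operations listed above share one path and differ only by HTTP method, so a single request builder can cover them. The sketch below uses Python's standard urllib and only constructs the request without sending it. The base URL, the document ID, the token value, and the use of an Authorization header are all placeholders and assumptions; request bodies for POST and PUT should follow the generated API reference.

```python
import json
import urllib.request
from urllib.parse import quote

# Methods listed for /documents/{documentId}/ocr in the table above.
OCR_METHODS = {"GET", "POST", "PUT", "DELETE"}


def ocr_request(base_url, document_id, method="GET", body=None, token=None):
    """Build (but do not send) a request for /documents/{documentId}/ocr."""
    if method not in OCR_METHODS:
        raise ValueError(f"unsupported method for the OCR endpoint: {method}")
    doc = quote(document_id, safe="")  # keep the ID inside one path segment
    url = f"{base_url.rstrip('/')}/documents/{doc}/ocr"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {}
    if data is not None:
        headers["Content-Type"] = "application/json"
    if token:
        # Assumption: the deployment accepts a token in this header.
        headers["Authorization"] = token
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Placeholder values; replace them with your deployment's API URL and IDs.
req = ocr_request("https://api.example.com/", "example-document-id")
print(req.get_method(), req.full_url)

# To send the request, something like the following could be used:
# with urllib.request.urlopen(req) as response:
#     result = json.load(response)
```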

Where to Go Next