Skip to main content

OCR and Full-Text Search Pipeline

Use this guide to run OCR on a document, inspect OCR text, add full-text content, and search the document with full-text search.

This workflow is useful when scanned files or image-based PDFs need to become searchable.

Before You Begin

Confirm you have:

  • A deployed FormKiQ environment. See Quick Start.
  • A JWT access token with permission to read and update documents.
  • cURL or an API client such as Postman.
  • A document uploaded to FormKiQ. See Add Documents.
  • OCR and enhanced full-text search modules enabled if your workflow uses Textract or OpenSearch-backed search.

Variables Used

PlaceholderDescription
HTTP_API_URLFormKiQ API endpoint from the CloudFormation stack output, including https://.
AUTHORIZATION_TOKENJWT access token used in the Authorization header.
SITE_IDFormKiQ site ID. Use default unless your deployment uses multiple sites.
DOCUMENT_IDDocument ID to process and search.

The examples below use shell variables. Replace the values before running the commands:

export HTTP_API_URL="https://your-formkiq-api.example.com"
export AUTHORIZATION_TOKEN="your-jwt-access-token"
export SITE_ID="default"
export DOCUMENT_ID="your-document-id"

Step 1: Run OCR

Use POST /documents/{documentId}/ocr to request OCR processing.

curl -X POST "${HTTP_API_URL}/documents/${DOCUMENT_ID}/ocr?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"ocrEngine": "TEXTRACT",
"parseTypes": ["TEXT", "FORMS", "TABLES"],
"ocrNumberOfPages": "-1"
}'

Use TEXTRACT when the AWS Textract-backed module is enabled. Use TESSERACT only when your deployment is configured for it.

Step 2: Retrieve OCR Content

Use GET /documents/{documentId}/ocr to inspect OCR output.

curl -X GET "${HTTP_API_URL}/documents/${DOCUMENT_ID}/ocr?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}"

OCR may run asynchronously. If content is not available immediately, wait and retry.

Step 3: Add Full-Text Content

Use POST /documents/{documentId}/fulltext to add searchable full-text content and optional search attributes.

curl -X POST "${HTTP_API_URL}/documents/${DOCUMENT_ID}/fulltext?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"contentType": "text/plain",
"content": "Invoice 1001 from Acme Corporation for cloud services.",
"path": "invoice-1001.pdf",
"attributes": {
"documentType": {
"stringValues": ["invoice"],
"valueType": "STRING"
}
},
"tags": [
{
"key": "source",
"value": "ocr"
}
]
}'

Step 4: Search Full Text

Use POST /searchFulltext for common full-text search patterns.

curl -X POST "${HTTP_API_URL}/searchFulltext?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"query": {
"text": "invoice cloud services",
"attributes": [
{
"key": "documentType",
"eq": {
"stringValue": "invoice"
}
}
],
"tags": [
{
"key": "source",
"eq": "ocr"
}
]
},
"responseFields": {
"attributes": ["documentType"],
"tags": ["source"]
}
}'

Step 5: Use a Raw OpenSearch Query

Use POST /queryFulltext when you need raw OpenSearch query DSL.

curl -X POST "${HTTP_API_URL}/queryFulltext?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"query": {
"match": {
"content": "invoice"
}
}
}'

Verify the Result

Confirm that:

  • OCR content is returned by GET /documents/{documentId}/ocr.
  • Full-text content is searchable with POST /searchFulltext.
  • Raw OpenSearch queries return the expected document when using POST /queryFulltext.

Troubleshooting

ProblemLikely causeWhat to check
OCR result is emptyOCR processing has not completed or the document has no extractable text.Retry after processing and test with a clear PDF or image.
Textract request failsThe Textract-backed module is not installed or configured.Check OCR module installation and AWS permissions.
Full-text search returns no resultsOpenSearch indexing has not completed.Wait, reindex the document, or check OpenSearch module health.
Raw query failsThe OpenSearch DSL body is invalid.Test a minimal match query first.

Next Steps