Skip to main content

Build an OCR Searchable Archive

What You Will Build

In this tutorial, you will build a searchable archive for scanned or image-based documents. You will upload a document, run OCR, store searchable full-text content, add structured attributes, and query the archive.

This workflow combines:

  • Document upload
  • OCR processing
  • Full-text indexing
  • Attributes
  • Full-text search
  • Raw OpenSearch querying

Before You Begin

Confirm you have:

  • A deployed FormKiQ environment.
  • OCR and enhanced full-text search capabilities enabled for your environment.
  • A JWT access token with permission to create, update, and search documents. See Get a JWT Authentication Token.
  • cURL or an API client such as Postman.
  • Optional: jq for formatting JSON responses.

Variables Used

PlaceholderDescription
HTTP_API_URLFormKiQ API endpoint from the CloudFormation stack output, including https://.
AUTHORIZATION_TOKENJWT access token used in the Authorization header.
SITE_IDFormKiQ site ID. Use default unless your deployment uses multiple sites.
DOCUMENT_IDDocument ID returned when the archive document is uploaded.

The examples below use shell variables. Replace the values before running the commands:

export HTTP_API_URL="https://your-formkiq-api.example.com"
export AUTHORIZATION_TOKEN="your-jwt-access-token"
export SITE_ID="default"

Workflow Overview

  1. Upload a document with archive metadata.
  2. Run OCR on the document.
  3. Retrieve OCR content.
  4. Add full-text content and attributes.
  5. Search the archive with searchFulltext.
  6. Use queryFulltext for an advanced OpenSearch query.

Step 1: Upload an Archive Document

Use POST /documents to create a sample archive document.

curl -X POST "${HTTP_API_URL}/documents?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"path": "archive/invoice-1001.pdf",
"contentType": "application/pdf",
"content": "Invoice 1001 from Acme Corporation for cloud infrastructure services.",
"tags": [
{
"key": "archive",
"value": "true"
}
],
"attributes": [
{
"key": "documentType",
"stringValues": ["invoice"],
"valueType": "STRING"
},
{
"key": "department",
"stringValues": ["finance"],
"valueType": "STRING"
}
]
}'

Set DOCUMENT_ID to the returned document ID.

export DOCUMENT_ID="returned-document-id"

Step 2: Run OCR

Use POST /documents/{documentId}/ocr to request OCR processing.

curl -X POST "${HTTP_API_URL}/documents/${DOCUMENT_ID}/ocr?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"ocrEngine": "TEXTRACT",
"parseTypes": ["TEXT", "FORMS", "TABLES"],
"ocrNumberOfPages": "-1"
}'

OCR processing may run asynchronously depending on the file and module configuration.

Step 3: Retrieve OCR Content

Use GET /documents/{documentId}/ocr to inspect extracted content.

curl -X GET "${HTTP_API_URL}/documents/${DOCUMENT_ID}/ocr?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}"

If the response does not include text yet, wait and retry.

Step 4: Add Full-Text Content and Attributes

Use POST /documents/{documentId}/fulltext to store content and structured fields for search.

curl -X POST "${HTTP_API_URL}/documents/${DOCUMENT_ID}/fulltext?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"contentType": "text/plain",
"content": "Invoice 1001 from Acme Corporation for cloud infrastructure services.",
"path": "archive/invoice-1001.pdf",
"tags": [
{
"key": "archive",
"value": "true"
}
],
"attributes": {
"documentType": {
"stringValues": ["invoice"],
"valueType": "STRING"
},
"department": {
"stringValues": ["finance"],
"valueType": "STRING"
}
}
}'

Step 5: Search the Archive

Use POST /searchFulltext for common search patterns across content, tags, and attributes.

curl -X POST "${HTTP_API_URL}/searchFulltext?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"query": {
"text": "cloud infrastructure",
"tags": [
{
"key": "archive",
"eq": "true"
}
],
"attributes": [
{
"key": "documentType",
"eq": {
"stringValue": "invoice"
}
}
]
},
"responseFields": {
"attributes": ["documentType", "department"],
"tags": ["archive"]
}
}'

Step 6: Run an Advanced Query

Use POST /queryFulltext when your application needs raw OpenSearch DSL.

curl -X POST "${HTTP_API_URL}/queryFulltext?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"query": {
"bool": {
"must": [
{
"match": {
"content": "cloud infrastructure"
}
}
]
}
}
}'

Use searchFulltext first for common application search. Use queryFulltext when you need lower-level OpenSearch control.

Verify the Result

Confirm that:

  • The document exists in FormKiQ.
  • OCR content is available from GET /documents/{documentId}/ocr.
  • searchFulltext returns the document by content and filters.
  • queryFulltext returns the document for the raw OpenSearch query.

Clean Up

Delete the test document if it is no longer needed.

curl -X DELETE "${HTTP_API_URL}/documents/${DOCUMENT_ID}?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}"

Troubleshooting

ProblemLikely causeWhat to check
OCR output is emptyOCR has not completed or the source file has no extractable text.Wait and retry, then test with a clearer PDF or image.
Textract OCR failsTextract support is not enabled or lacks AWS permissions.Check OCR module configuration.
Search returns no resultsFull-text indexing has not completed.Wait, retry, or reindex the document.
Raw OpenSearch query failsThe request body is invalid for OpenSearch.Start with a simple match query.

Next Steps