Build an OCR Searchable Archive
What You Will Build
In this tutorial, you will build a searchable archive for scanned or image-based documents. You will upload a document, run OCR, store searchable full-text content, add structured attributes, and query the archive.
This workflow combines:
- Document upload
- OCR processing
- Full-text indexing
- Attributes
- Full-text search
- Raw OpenSearch querying
Before You Begin
Confirm you have:
- A deployed FormKiQ environment.
- OCR and enhanced full-text search capabilities enabled for your environment.
- A JWT access token with permission to create, update, and search documents. See Get a JWT Authentication Token.
- cURL or an API client such as Postman.
- Optional: jq for formatting JSON responses.
Variables Used
| Placeholder | Description |
|---|---|
| HTTP_API_URL | FormKiQ API endpoint from the CloudFormation stack output, including https://. |
| AUTHORIZATION_TOKEN | JWT access token used in the Authorization header. |
| SITE_ID | FormKiQ site ID. Use default unless your deployment uses multiple sites. |
| DOCUMENT_ID | Document ID returned when the archive document is uploaded. |
The examples below use shell variables. Replace the values before running the commands:
export HTTP_API_URL="https://your-formkiq-api.example.com"
export AUTHORIZATION_TOKEN="your-jwt-access-token"
export SITE_ID="default"
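Before running the later commands, it can help to confirm that these variables are actually set. The following is a minimal sketch for a POSIX shell; `check_env` is a hypothetical helper name used only for illustration and is not part of FormKiQ.

```shell
# check_env: hypothetical helper (not part of FormKiQ) that reports which
# of the tutorial's required variables are unset or empty.
check_env() {
  missing=0
  for name in HTTP_API_URL AUTHORIZATION_TOKEN SITE_ID; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing required variable: $name" >&2
      missing=1
    fi
  done
  # The placeholder table asks for the URL to include https://.
  case "${HTTP_API_URL:-}" in
    https://*|"") ;;
    *) echo "HTTP_API_URL should start with https://" >&2; missing=1 ;;
  esac
  return "$missing"
}
```

Calling `check_env || echo "Fix the variables before continuing"` prints one message per problem and leaves the shell session open.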
Workflow Overview
- Upload a document with archive metadata.
- Run OCR on the document.
- Retrieve OCR content.
- Add full-text content and attributes.
- Search the archive with searchFulltext.
- Use queryFulltext for an advanced OpenSearch query.
Step 1: Upload an Archive Document
Use POST /documents to create a sample archive document.
curl -X POST "${HTTP_API_URL}/documents?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"path": "archive/invoice-1001.pdf",
"contentType": "application/pdf",
"content": "Invoice 1001 from Acme Corporation for cloud infrastructure services.",
"tags": [
{
"key": "archive",
"value": "true"
}
],
"attributes": [
{
"key": "documentType",
"stringValues": ["invoice"],
"valueType": "STRING"
},
{
"key": "department",
"stringValues": ["finance"],
"valueType": "STRING"
}
]
}'
Set DOCUMENT_ID to the returned document ID.
export DOCUMENT_ID="returned-document-id"
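Rather than copying the ID by hand, the response body can be parsed in the shell. The sketch below assumes the response is a JSON object containing a `documentId` string field; check this against the actual response from your deployment. `extract_document_id` is a hypothetical helper that uses sed so it works without jq. If jq is installed, `jq -r '.documentId'` is the sturdier choice.

```shell
# extract_document_id: hypothetical helper that reads a JSON response on
# stdin and prints the value of its "documentId" string field.
# ASSUMPTION: the response contains  "documentId": "<id>".
extract_document_id() {
  tr -d '\n' | sed -n 's/.*"documentId"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Made-up sample response, used only to demonstrate the helper.
sample_response='{"documentId":"example-document-id"}'
DOCUMENT_ID="$(printf '%s' "$sample_response" | extract_document_id)"
export DOCUMENT_ID
echo "DOCUMENT_ID=${DOCUMENT_ID}"
```

In practice the input would be the body returned by the Step 1 request, for example by adding `-s` to the curl command and piping its output into the helper.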
Step 2: Run OCR
Use POST /documents/{documentId}/ocr to request OCR processing.
curl -X POST "${HTTP_API_URL}/documents/${DOCUMENT_ID}/ocr?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"ocrEngine": "TEXTRACT",
"parseTypes": ["TEXT", "FORMS", "TABLES"],
"ocrNumberOfPages": "-1"
}'
OCR processing may run asynchronously depending on the file and module configuration, so the extracted text may not be available as soon as this request returns.
Step 3: Retrieve OCR Content
Use GET /documents/{documentId}/ocr to inspect extracted content.
curl -X GET "${HTTP_API_URL}/documents/${DOCUMENT_ID}/ocr?siteId=${SITE_ID}" \
-H "Authorization: ${AUTHORIZATION_TOKEN}"
If the response does not yet include the extracted text, wait a few seconds and repeat the request.
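The wait-and-retry step can be scripted as a bounded polling loop. This is a sketch under stated assumptions: `retry_until` and `ocr_text_ready` are hypothetical helpers, and the check for a non-empty `data` field is an assumption about the shape of the OCR response that should be adjusted to match what your deployment returns.

```shell
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or attempts run out.
retry_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# ocr_text_ready: hypothetical check that succeeds once the OCR response
# contains a non-empty "data" string (ASSUMED field name; adjust as needed).
ocr_text_ready() {
  curl -sf "${HTTP_API_URL}/documents/${DOCUMENT_ID}/ocr?siteId=${SITE_ID}" \
    -H "Authorization: ${AUTHORIZATION_TOKEN}" \
    | grep -q '"data"[[:space:]]*:[[:space:]]*"[^"]'
}
```

For example, `retry_until 10 5 ocr_text_ready && echo "OCR text is available"` checks up to ten times, five seconds apart. Capping the attempts keeps a failed OCR job from blocking the script indefinitely.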