Building Scalable Solutions Using FormKiQ

What You Will Build

In this tutorial, you will design a scalable AWS-native document processing solution using FormKiQ as the document layer. The result is a reference architecture and implementation plan for applications that need high-volume document ingestion, metadata control, asynchronous processing, search, governance, and operational visibility.

This is an architecture tutorial rather than a single endpoint walkthrough. It shows how to combine the smaller FormKiQ workflows into a customer-ready solution pattern.

[Figure: Scalable FormKiQ Reference Architecture]

Before You Begin

Confirm you have:

  • A deployed FormKiQ environment. See Quick Start.
  • Access to the FormKiQ API endpoint from the CloudFormation stack outputs.
  • A JWT access token or API key for API testing. See Get a JWT Authentication Token.
  • A basic understanding of Amazon S3, DynamoDB, Lambda, EventBridge, CloudWatch, and dead-letter queues.
  • Optional: a FormKiQ Core deployment with Typesense and Tesseract enabled if you want open-source full-text search and basic OCR.
  • Optional: commercial modules enabled if your solution requires Amazon Textract, Amazon OpenSearch Service, malware scanning, document generation, OPA, or document versioning.

Variables Used

| Placeholder | Description |
| --- | --- |
| HTTP_API_URL | FormKiQ API endpoint from the CloudFormation stack output, including https://. |
| AUTHORIZATION_TOKEN | JWT access token or API key used in the Authorization header. |
| SITE_ID | FormKiQ site ID. Use default unless your deployment uses multiple sites. |
| DOCUMENT_ID | Document ID returned when a test document is uploaded. |

The examples below use shell variables. Replace the values before running the commands:

export HTTP_API_URL="https://your-formkiq-api.example.com"
export AUTHORIZATION_TOKEN="your-jwt-access-token-or-api-key"
export SITE_ID="default"
export DOCUMENT_ID="your-document-id"

What This Tutorial Does Not Build

This tutorial does not deploy a complete application stack, benchmark your production workload, or replace environment-specific architecture review. Use it as the design baseline for a scalable implementation, then validate the design with your own document sizes, document volumes, security requirements, edition/module choices, and AWS service limits.

The examples show both Core-compatible and commercial-module patterns. FormKiQ Core can use Tesseract for OCR and Typesense for text search when those components are enabled. Commercial offerings can add Amazon Textract for advanced OCR/IDP and Amazon OpenSearch Service for enhanced full-text search and raw OpenSearch queries.

Workflow Overview

  1. Choose the document ingestion pattern.
  2. Normalize documents at the FormKiQ boundary.
  3. Model metadata for scale.
  4. Move expensive work into asynchronous processing.
  5. Add rulesets and workflows for business routing.
  6. Add search and retrieval paths.
  7. Add reliability and operating controls.
  8. Add security and governance controls.
  9. Validate the architecture before production traffic.
  10. Complete the production readiness checklist.

Step 1: Choose the Document Ingestion Pattern

Start by matching the ingestion pattern to the workload. Avoid forcing every use case through the same endpoint.

[Figure: FormKiQ Ingestion Patterns]

| Pattern | Best for | Use when |
| --- | --- | --- |
| Inline POST /documents | Small text or JSON documents | The request payload is small and the application needs a simple synchronous create. |
| Presigned upload | Browser uploads, large files, mobile apps | The file can be uploaded directly to S3 after FormKiQ creates the document record. |
| FileSync CLI or batch import | Migration, high-volume loading, repeated imports | The source is a filesystem, S3 location, CSV, or migration export. |
| Server-side proxy | Custom validation, product-specific authorization, internal APIs | You need your own application API between clients and FormKiQ. |
| Public intake or webhooks | External partners, forms, integration callbacks | You need controlled intake from outside your authenticated application. |

For scalable systems, prefer presigned uploads for large files. The application asks FormKiQ for an upload URL, then the client sends bytes directly to S3. That keeps the API request path short and avoids routing large files through your own application servers.

When FormKiQ returns a presigned upload URL, send any returned S3 headers exactly as returned. Missing or changed headers are a common cause of upload failures, especially when content type, checksum, encryption, or other S3 request headers are part of the signature.
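
The presigned flow described above can be sketched in shell as follows. This is a minimal illustration, not a definitive client: it assumes the upload request is POST /documents/upload, that the response is JSON with a `url` field and an optional `headers` object, and that `jq` is installed. The function names are hypothetical, and the payload helper does no JSON escaping, so it only suits simple paths and MIME types. Check the response shape against the API reference for your FormKiQ version.

```shell
# Sketch only. Assumptions: the upload response contains "url" and an optional
# "headers" object, jq is installed, and path/content type need no JSON escaping.

# Build the JSON body for the upload request (pure helper, no network).
build_upload_request() {
  # $1 = logical path, $2 = MIME type
  printf '{"path":"%s","contentType":"%s"}' "$1" "$2"
}

# Read the upload response on stdin and print one curl config line per
# returned header, so that every header is forwarded unchanged.
headers_to_curl_config() {
  jq -r '(.headers // {}) | to_entries[] | "header = \"\(.key): \(.value)\""'
}

# upload_large_file FILE LOGICAL_PATH CONTENT_TYPE
upload_large_file() {
  file="$1"
  # Ask FormKiQ for a presigned URL; the file bytes are not part of this call.
  response=$(curl -sS --fail -X POST \
    "${HTTP_API_URL}/documents/upload?siteId=${SITE_ID}" \
    -H "Authorization: ${AUTHORIZATION_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(build_upload_request "$2" "$3")") || return 1

  upload_url=$(printf '%s' "$response" | jq -r '.url')

  # The bytes go straight to S3; curl reads the extra headers from stdin (-K -).
  printf '%s' "$response" | headers_to_curl_config |
    curl -sS --fail -K - --upload-file "$file" "$upload_url"
}
```

A call such as `upload_large_file ./invoice-1001.pdf intake/invoice-1001.pdf application/pdf` would then upload the file without routing its bytes through an application server. Header values that contain double quotes would need extra escaping before being passed to curl this way.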

Step 2: Normalize Documents at the Boundary

Every document should enter the platform with enough structure to make downstream processing predictable.

Capture these fields consistently:

  • path: stable logical location, not just the original filename.
  • contentType: accurate MIME type used for routing and processing.
  • documentId: supplied by FormKiQ or generated by your application for idempotent imports.
  • siteId: tenant or environment boundary.
  • Tags: simple key-value routing and lightweight metadata.
  • Attributes: typed, validated, searchable business metadata.
  • Actions: explicit processing instructions such as OCR, full-text, malware scan, EventBridge, or webhook.

Do not rely on downstream processors to infer all business context from the file. Add the information you already know at ingestion time.
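
One way to apply this at the boundary is to derive the path and the document ID from information the importer already has. The sketch below is illustrative only: the tenant/type/filename path layout, the function names, and the hash-based ID scheme are assumptions, not FormKiQ conventions. In particular, the ID is UUID-shaped but is not a standard UUID version, so confirm that your FormKiQ version accepts client-supplied IDs in this form before relying on it.

```shell
# Sketch only. The path layout and hashing scheme below are illustrative
# choices, not FormKiQ requirements.

# Lower-case a value, turn spaces into hyphens, and drop unsafe characters.
normalize_segment() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-' | tr -cd 'a-z0-9._-'
}

# logical_path TENANT DOCUMENT_TYPE FILENAME
# Prints a stable logical location instead of the raw original filename.
logical_path() {
  printf '%s/%s/%s' \
    "$(normalize_segment "$1")" \
    "$(normalize_segment "$2")" \
    "$(normalize_segment "$3")"
}

# deterministic_document_id SOURCE_KEY
# Hashes a source-system key into a UUID-shaped string, so re-importing the
# same source record always produces the same ID.
deterministic_document_id() {
  if command -v sha256sum >/dev/null 2>&1; then
    digest=$(printf '%s' "$1" | sha256sum)
  else
    digest=$(printf '%s' "$1" | shasum -a 256)
  fi
  printf '%s' "$digest" | cut -c1-32 |
    sed 's/^\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)$/\1-\2-\3-\4-\5/'
}
```

For example, `logical_path "ACME Corp" "Invoice" "Invoice 1001.PDF"` prints `acme-corp/invoice/invoice-1001.pdf`, and calling `deterministic_document_id "erp:invoice:1001"` twice yields the same value, which lets a retried import target the same document.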

Step 3: Model Metadata for Scale

Use metadata intentionally. Tags, attributes, schemas, and mappings solve related but different problems.

| Metadata type | Use for | Scale guidance |
| --- | --- | --- |
| Tags | Simple labels, routing flags, legacy key-value lookup | Good for lightweight filters and ruleset conditions. |
| Attributes | Typed business metadata such as customer ID, invoice date, policy number, department | Preferred for structured metadata and validation. |
| Schemas | Required attributes and allowed values | Use when teams need consistent metadata across document classes. |
| Composite keys | DynamoDB-backed multi-field search patterns | Define for high-volume exact-match access paths. |
| Full-text fields | Content indexed by Typesense or OpenSearch, depending on deployment | Use for natural-language search, content discovery, and broader filtering. |

Design metadata around the questions users and systems will ask later:

  • Which documents belong to this customer, account, claim, case, or project?
  • Which documents are waiting for review?
  • Which documents failed processing?
  • Which documents must be retained or held?
  • Which documents need to be searchable by text?

If a query is business-critical and high-volume, model it explicitly. Avoid depending on broad scans or ad hoc manual filtering.

Example Metadata Model

For an invoice archive, a scalable metadata model might use:

| Field | Type | Purpose |
| --- | --- | --- |
| customerId | Attribute, string | Finds all documents for a customer. |
| documentType | Attribute, string | Separates invoices, contracts, correspondence, and forms. |
| documentDate | Attribute, string or date-like string | Supports timeline and retention workflows. |
| reviewStatus | Attribute or tag | Tracks whether a document is pending, approved, rejected, or failed. |
| sourceSystem | Tag | Identifies migration, portal, API, partner, or FileSync origin. |

A high-volume exact lookup could be modeled as a composite key such as:

{
  "compositeKeys": [
    {
      "key": "customerDocumentLookup",
      "attributeKeys": ["customerId", "documentType", "documentDate"]
    }
  ]
}

Use this pattern only for access paths that users or integrations actually need. Composite keys improve specific exact-match search patterns, but unused keys add schema complexity without improving retrieval.
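
Once such a key exists, an exact-match lookup could be issued through POST /search with one equality condition per attribute. The sketch below assumes a request body of the form `{"query":{"attributes":[{"key":...,"eq":...}]}}`; treat that shape, the function names, and the sample values as assumptions to verify against your API reference. The builder does no JSON escaping, so it only suits simple keys and values.

```shell
# Sketch only. The "query.attributes" body shape is an assumption; verify it
# against the FormKiQ API reference for your version.

# build_attribute_search KEY=VALUE [KEY=VALUE ...]
# Prints a search body with one equality condition per argument.
build_attribute_search() {
  body='{"query":{"attributes":['
  sep=''
  for pair in "$@"; do
    key=${pair%%=*}      # text before the first "="
    value=${pair#*=}     # text after the first "="
    body="${body}${sep}{\"key\":\"${key}\",\"eq\":\"${value}\"}"
    sep=','
  done
  printf '%s]}}' "$body"
}

# search_by_attributes KEY=VALUE [KEY=VALUE ...]
search_by_attributes() {
  curl -sS --fail -X POST "${HTTP_API_URL}/search?siteId=${SITE_ID}" \
    -H "Authorization: ${AUTHORIZATION_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(build_attribute_search "$@")"
}
```

For the invoice model above, a lookup might be `search_by_attributes customerId=C-1001 documentType=invoice documentDate=2024-01-31`, passing the attributes in the same order as the composite key definition.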

Step 4: Move Expensive Work Asynchronously

Scalable document systems should not do every task during upload. Upload should make the document durable and visible; expensive processing should happen asynchronously.

Use FormKiQ actions and workflow components for:

  • OCR and text extraction.
  • Full-text indexing.
  • Malware scanning.
  • Metadata extraction.
  • Document tagging.
  • EventBridge publication.
  • Webhook callbacks.
  • Human review queues and approvals.

Example: add an EventBridge action after upload.

curl -X POST "${HTTP_API_URL}/documents/${DOCUMENT_ID}/actions?siteId=${SITE_ID}" \
  -H "Authorization: ${AUTHORIZATION_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "actions": [
      {
        "type": "EVENTBRIDGE",
        "parameters": {
          "eventBusName": "formkiq-document-pipeline"
        }
      }
    ]
  }'

Example: add Core-compatible OCR and full-text actions during document creation. This uses Tesseract for OCR and the configured full-text engine, such as Typesense in a Core deployment.

curl -X POST "${HTTP_API_URL}/documents?siteId=${SITE_ID}" \
  -H "Authorization: ${AUTHORIZATION_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "path": "intake/invoice-1001.pdf",
    "contentType": "application/pdf",
    "content": "Sample invoice content",
    "actions": [
      {
        "type": "OCR",
        "parameters": {
          "ocrEngine": "TESSERACT",
          "ocrNumberOfPages": "-1"
        }
      },
      {
        "type": "FULLTEXT"
      }
    ],
    "attributes": [
      {
        "key": "documentType",
        "stringValues": ["invoice"],
        "valueType": "STRING"
      }
    ]
  }'

For commercial deployments with Textract enabled, use TEXTRACT when you need forms, tables, queries, handwriting, or higher-accuracy structured extraction.

{
  "type": "OCR",
  "parameters": {
    "ocrEngine": "TEXTRACT",
    "ocrParseTypes": "TEXT,FORMS,TABLES",
    "ocrNumberOfPages": "-1"
  }
}

OCR and full-text actions are asynchronous. Full-text search may lag behind upload and OCR completion, and behavior depends on installed modules and processing configuration. For user-facing applications, expose processing status separately from upload status and make search availability eventually consistent.
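
To expose processing status separately from upload status, an application can poll the document's actions and reduce them to a single state. The sketch below is one hedged way to do that: it assumes GET /documents/{documentId}/actions returns JSON with an `actions` array whose items carry a `status` field, that `COMPLETE`, `SKIPPED`, and `FAILED` are among the status values, and that `jq` is installed. The function names and the reduced states are hypothetical.

```shell
# Sketch only. The response shape (.actions[].status) and the status names
# are assumptions; adjust them to what your deployment returns.

# Read one action status per line on stdin and print a single overall state:
# FAILED, IN_PROGRESS, COMPLETE, or UNKNOWN when no status was read.
summarize_statuses() {
  overall="UNKNOWN"
  while IFS= read -r status; do
    case "$status" in
      FAILED) overall="FAILED"; break ;;
      COMPLETE|SKIPPED) [ "$overall" = "UNKNOWN" ] && overall="COMPLETE" ;;
      *) overall="IN_PROGRESS" ;;
    esac
  done
  printf '%s\n' "$overall"
}

# wait_for_actions DOCUMENT_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Returns 0 when all actions are done, 2 on failure, 1 on timeout.
wait_for_actions() {
  document_id="$1"
  max_attempts="${2:-10}"
  delay="${3:-5}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    state=$(curl -sS \
      "${HTTP_API_URL}/documents/${document_id}/actions?siteId=${SITE_ID}" \
      -H "Authorization: ${AUTHORIZATION_TOKEN}" |
      jq -r '.actions[].status' | summarize_statuses)
    case "$state" in
      COMPLETE) return 0 ;;
      FAILED) return 2 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}
```

An empty or unreadable response maps to UNKNOWN rather than COMPLETE, so a transient API error keeps the loop polling instead of reporting success.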

Step 5: Add Rulesets and Workflows for Business Routing

Use rulesets when documents should be routed based on content type, tags, attributes, barcode values, or extracted fields. Use workflows and queues when humans or downstream systems need to make decisions.

Common scalable routing patterns:

  • Route invoices to finance review.
  • Route failed OCR documents to an exception queue.
  • Route high-value claims to a specialist queue.
  • Route documents with missing attributes to a metadata cleanup queue.
  • Trigger EventBridge for documents that require external enrichment.

Rulesets and workflows make routing declarative. That helps avoid burying business routing rules inside custom Lambda code.

Step 6: Add Search and Retrieval Paths

Use the simplest search path that satisfies the product requirement.

| Search need | Recommended path |
| --- | --- |
| Find by document ID | GET /documents/{documentId} |
| Find by exact tag or attribute | POST /search |
| Find by modeled multi-field key | Composite keys in the site schema |
| Search extracted text in Core with Typesense enabled | POST /search with query.text |
| Search extracted text with enhanced OpenSearch module | POST /searchFulltext |
| Advanced OpenSearch filtering, sorting, and aggregations | POST /queryFulltext |

Typesense-backed search can be a good Core option for straightforward full-text search. OpenSearch-backed full-text search is available through commercial offerings and is a better fit when you need richer filtering, raw query DSL, advanced tuning, or OpenSearch operations. Neither search engine is the right answer for every lookup. Keep exact operational lookups modeled through attributes, tags, schemas, and composite keys.
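
Keeping the choice of search path in one place makes it harder for callers to pick the wrong endpoint. The sketch below maps a deployment mode to the endpoints listed above and sends a simple text query. The mode names and function names are hypothetical, and the `{"query":{"text":...}}` body is an assumption for both /search and /searchFulltext; the search term is not JSON-escaped.

```shell
# Sketch only. Mode names are hypothetical, and the request body shape is an
# assumption to verify against the FormKiQ API reference.

# search_endpoint MODE
# Prints the API path for a search mode, or fails for an unknown mode.
search_endpoint() {
  case "$1" in
    exact|typesense) printf '/search' ;;
    opensearch)      printf '/searchFulltext' ;;
    opensearch-raw)  printf '/queryFulltext' ;;
    *) echo "search_endpoint: unknown mode '$1'" >&2; return 1 ;;
  esac
}

# text_search MODE TERM
# Sends a plain text query to the endpoint that matches the deployment.
text_search() {
  endpoint=$(search_endpoint "$1") || return 1
  curl -sS --fail -X POST "${HTTP_API_URL}${endpoint}?siteId=${SITE_ID}" \
    -H "Authorization: ${AUTHORIZATION_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(printf '{"query":{"text":"%s"}}' "$2")"
}
```

A Core deployment with Typesense would call `text_search typesense "invoice 1001"`, while a deployment with the OpenSearch module would pass `opensearch` instead. Raw OpenSearch queries sent to /queryFulltext need their own body and are not covered by `text_search`.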

Step 7: Add Reliability Controls

High-volume systems need observable failure paths. A scalable solution should make it clear whether a document is uploaded, processed, indexed, failed, retried, or waiting for review.

[Figure: FormKiQ Processing and Reliability Pattern]

Build these controls early:

  • Inspect document action status with GET /documents/{documentId}/actions.
  • Retry failed actions with POST /documents/{documentId}/actions/retry.
  • Monitor CloudWatch logs for API, Lambda, and processing failures.
  • Monitor DLQs and configure alerts.
  • Use idempotent document IDs for imports and retries where possible.
  • Track document processing state with tags or attributes when your application needs a user-facing status.
  • Reindex documents after metadata or search configuration changes.

Operational visibility is part of the architecture, not an afterthought.
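
A retry runbook usually needs the retry request itself to tolerate throttling or transient errors. The sketch below wraps POST /documents/{documentId}/actions/retry in a capped exponential backoff. The function names, the 2-second base, and the 60-second cap are illustrative assumptions, and the sketch omits jitter, which a production client would normally add.

```shell
# Sketch only. Base delay, cap, and attempt count are illustrative values.

# backoff_delay ATTEMPT BASE_SECONDS CAP_SECONDS
# Prints BASE * 2^(ATTEMPT-1), never more than CAP.
backoff_delay() {
  delay=$2
  i=1
  while [ "$i" -lt "$1" ] && [ "$delay" -lt "$3" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$3" ]; then
    delay=$3
  fi
  printf '%s\n' "$delay"
}

# retry_failed_actions DOCUMENT_ID [MAX_ATTEMPTS]
# Asks FormKiQ to retry failed actions, backing off between attempts when the
# request itself fails (for example, because it was throttled).
retry_failed_actions() {
  document_id="$1"
  max_attempts="${2:-5}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if curl -sS --fail -X POST \
        "${HTTP_API_URL}/documents/${document_id}/actions/retry?siteId=${SITE_ID}" \
        -H "Authorization: ${AUTHORIZATION_TOKEN}" >/dev/null; then
      return 0
    fi
    sleep "$(backoff_delay "$attempt" 2 60)"
    attempt=$((attempt + 1))
  done
  return 1
}
```

With a base of 2 and a cap of 60, the waits between attempts are 2, 4, 8, 16, and 32 seconds, and any later attempt waits 60 seconds.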

Step 8: Add Security and Governance

Scale also means scaling access control and governance. Decide these boundaries before production rollout.

| Concern | FormKiQ capability |
| --- | --- |
| User authentication | JWT authentication through Cognito or SSO |
| Server-side integrations | API keys or IAM-authorized API access |
| Tenant isolation | Sites and siteId-aware requests |
| Role-based access | Users, groups, site permissions, folder permissions |
| Fine-grained policy | Open Policy Agent module |
| Retention and recovery | Soft delete, purge, backup and recovery |
| Version control | Document versioning module |
| Legal preservation | Legal hold |
| Audit review | User activity and document activity endpoints |
| Encryption | AWS-managed or customer-managed KMS keys, with full encryption options where required |
| Network boundary | Customer AWS account deployment, VPC patterns, private access, and controlled integration paths |
| Data residency | AWS Region selection and tenant/site placement aligned to residency requirements |

For multi-tenant applications, make siteId part of the application model rather than a late-stage query parameter. Tenant-aware code should set the site boundary before it calls FormKiQ.
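
One way to make the site boundary explicit is to build every request URL through a helper that refuses to run without a site ID. The sketch below is a hypothetical helper, not part of FormKiQ: `formkiq_url` and its argument order are assumptions chosen for illustration.

```shell
# Sketch only. formkiq_url is a hypothetical helper, not a FormKiQ tool.

# formkiq_url API_PATH SITE_ID
# Prints the full request URL with siteId appended, and fails when no site ID
# is given so that an unscoped request cannot be built by accident.
formkiq_url() {
  if [ -z "${2:-}" ]; then
    echo "formkiq_url: siteId is required" >&2
    return 1
  fi
  case "$1" in
    *\?*) printf '%s%s&siteId=%s' "${HTTP_API_URL}" "$1" "$2" ;;
    *)    printf '%s%s?siteId=%s' "${HTTP_API_URL}" "$1" "$2" ;;
  esac
}
```

A tenant-aware caller would resolve the tenant first and then run something like `url=$(formkiq_url "/documents/${DOCUMENT_ID}" "$TENANT_SITE_ID") || exit 1` before calling curl, where `TENANT_SITE_ID` is whatever variable the application uses for the resolved tenant. Failing early here is easier to test than auditing every curl call for a missing query parameter.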

For regulated or enterprise deployments, clarify early who controls the AWS account, keys, regions, backups, and operational access. These decisions affect security review, procurement, implementation timelines, and support responsibilities.

Step 9: Validate the Architecture

Before increasing traffic, validate the design with a realistic test batch.

Use this checklist:

  • Upload small and large documents.
  • Confirm presigned upload flows work from the client environment.
  • Confirm required metadata is present immediately after upload.
  • Confirm rulesets route documents to the expected workflows or queues.
  • Confirm OCR, full-text, malware scan, or EventBridge actions complete for the engine/module you installed.
  • Confirm failed actions can be retried.
  • Confirm DLQ alerts are configured and actionable.
  • Confirm search results appear after indexing delay.
  • Confirm users without permissions cannot access restricted documents.
  • Confirm activity records support the audit questions your customer will ask.
  • Confirm expected request rates against API Gateway, Lambda, S3, Typesense or OpenSearch, Tesseract or Textract, and any downstream services.
  • Confirm throttling, retry, and backoff behavior under representative batch sizes.

Document the expected timing for each asynchronous stage. Customers are more successful when they know which steps are immediate and which are eventually consistent.

Step 10: Production Readiness Checklist

Use this as the final review before launch.

| Area | Production question |
| --- | --- |
| Ingestion | Are large files uploaded directly to S3 instead of through app servers? |
| Metadata | Are required fields validated through schemas or application logic? |
| Search | Are exact-match and full-text search paths intentionally separated? |
| Processing | Are expensive tasks asynchronous and retryable? |
| Reliability | Are DLQs, logs, and action failures monitored? |
| Security | Are user, group, API key, and folder permissions documented? |
| Tenancy | Is every request scoped to the correct siteId? |
| Governance | Are legal hold, delete, purge, and audit policies understood? |
| Cost | Are OCR, search engine, storage, Lambda, and transfer costs expected at target volume? |
| Limits | Have AWS service quotas and FormKiQ module limits been reviewed for expected peak load? |
| Load testing | Has the architecture been tested with representative file sizes, metadata volume, and search patterns? |
| Support | Can operators answer "where is this document and what happened to it?" |

Common Architecture Decisions

| Decision | Prefer this when | Watch for |
| --- | --- | --- |
| Core vs commercial modules | Core covers the document API, storage, metadata, Tesseract OCR, and Typesense search when enabled. | Textract, OpenSearch, antivirus, OPA, versioning, document generation, and support needs may require commercial modules. |
| Single site vs multi-site | One business boundary can share users, permissions, and metadata conventions. | Use multiple sites when tenant isolation, data residency, or operational separation matters. |
| Tags vs attributes | Tags are enough for lightweight labels and routing. | Use attributes for typed, validated, and schema-driven business metadata. |
| DynamoDB-backed search vs Typesense vs OpenSearch | Exact tag/attribute lookups are the main retrieval path. | Use Typesense for Core text search; use OpenSearch for enhanced full-text, flexible filtering, and advanced query patterns. |
| JWT vs API key vs IAM | Users are interacting through an application or console. | Use API keys or IAM for trusted server-side automation; avoid exposing API keys in browsers. |
| Synchronous API vs async actions | The task must complete immediately and is fast. | Move OCR, malware scan, full-text, webhooks, and external enrichment into async actions. |

Reference Implementation Paths

| If you need to build... | Start with |
| --- | --- |
| A human approval process | Build a Document Review and Approval Workflow |
| A searchable archive for scanned documents | Build an OCR Searchable Archive |
| An external processing pipeline | Build an Event-Driven Document Processing Pipeline |
| A tenant-aware application | Multi-Tenant Users and Multi-Tenant and Multi-Instance Deployments |
| A server-owned integration layer | Using a Server-Side Proxy and Manage API Keys |

Verify the Result

You have a scalable design when:

  • Uploads remain fast as file size and volume increase.
  • Processing can continue even when OCR, search indexing, or external systems are slower than uploads.
  • Failed work is visible and retryable.
  • Metadata supports the most important user and system queries.
  • Security boundaries are explicit and testable.
  • Operators can inspect document status without reading application code.

Clean Up

This tutorial creates an architecture plan rather than a fixed stack. Clean up any resources you created while validating the pattern:

  • Test documents and uploaded files.
  • Document actions used for OCR, full-text, malware scan, EventBridge, or webhooks.
  • Test rulesets, rules, workflows, and queues.
  • EventBridge buses, rules, targets, Lambda functions, and Lambda permissions.
  • Test API keys, users, groups, folder permissions, and webhook configurations.
  • OpenSearch test indexes, snapshots, or restore jobs.
  • Temporary CloudWatch alarms, log exports, and DLQ redrive tests.

Troubleshooting

| Problem | Likely cause | What to check |
| --- | --- | --- |
| Uploads slow down under load | Files are being proxied through application servers or expensive processing is synchronous. | Use presigned uploads and move processing into actions or event-driven workers. |
| Search results are missing | Indexing is asynchronous or the wrong search path is used. | Wait for indexing, reindex if needed, and use /search for DynamoDB or Typesense lookups and /searchFulltext or /queryFulltext for OpenSearch. |
| Processing status is unclear | Actions and logs are not part of the operating model. | Add action inspection, CloudWatch log review, DLQ alerts, and retry runbooks. |
| Metadata becomes inconsistent | Different ingestion paths use different field names or value formats. | Normalize metadata at the boundary and use schemas for required attributes. |
| Tenants can see wrong data | Requests are not consistently scoped by siteId. | Treat siteId as part of the application tenancy model and test cross-tenant access. |
| Costs grow unexpectedly | OCR engine, search engine, storage, or retries are higher than expected. | Review Costs & AWS Usage and test with representative document volume. |

Next Steps