
Introduction

In financial services, accuracy is everything. A single misplaced digit in a cash flow statement or a mistyped account value can cascade into reporting discrepancies, flawed projections, and poor client decisions. Despite advances in financial technology, many operational workflows still depend on a fragile foundation: a PDF is uploaded, someone reads through it, and key values are manually typed into a system. The process works until scale exposes its weaknesses.

Manual data entry is slow, expensive, and inherently vulnerable to human error. For firms processing hundreds or thousands of documents each month, even a small error rate becomes a measurable operational risk. Worse, manual transcription introduces inconsistencies that are difficult to audit after the fact.

At Red Oak Strategic, we partnered with a financial services firm to modernize this workflow by automating structured data extraction from PDF documents using AWS Textract. The goal was not simply to extract text, but to reliably convert semi-structured financial documents into validated, structured data that downstream systems could trust.

The Challenge: PDFs Are Not Structured Data

PDFs remain the dominant format for onboarding forms, investment summaries, portfolio statements, and compliance documentation. While these documents appear digital, many are image-based scans or flattened exports. From a system perspective, they are essentially pictures.

Traditional document parsing tools rely on embedded text layers. When those layers are absent, they fail entirely. Even when text exists, it is rarely stored in a way that preserves business meaning. Layout information, spacing, and visual grouping are not inherently machine-readable.

This is why many organizations fall back on Optical Character Recognition (OCR).

The Limits of Traditional OCR

Traditional OCR systems focus on character recognition. They analyze pixel patterns and convert them into letters and numbers. That process alone is already probabilistic. Lighting conditions, scan resolution, font variation, compression artifacts, and skew all affect accuracy.

But even when OCR correctly recognizes every character, it still lacks structural awareness.

It does not inherently understand that:

  • A number aligned to the right of “Total Revenue” is the value associated with that label
  • A row in a financial table represents a single logical entity
  • A header defines the semantic meaning of values beneath it
  • Two visually separated sections should not be merged

Instead, traditional OCR often returns a flat stream of text or loosely ordered lines. Developers are then forced to reconstruct meaning using brittle heuristics such as positional coordinates, keyword proximity, or manual template matching.

In finance, this becomes especially problematic. Documents often contain:

  • Repeating field names across different sections
  • Tables with merged cells or multi-row headers
  • Conditional sections that appear only for certain clients
  • Subtle layout shifts between reporting periods

Basic OCR treats all of this as text. It does not reason about relationships.

As a result, traditional OCR pipelines frequently require extensive post-processing logic and still struggle when layouts change even slightly. What initially looks automated quietly becomes another maintenance burden.

Textract: Structure, Relationships, and Context

AWS Textract improves upon traditional OCR by incorporating document structure analysis into the extraction process. Instead of returning a simple text transcript, Textract returns a graph of “blocks.” Each block represents a structural component of the document, such as a word, line, key-value pair, table, cell, or query result.

More importantly, these blocks contain relationship metadata. A key block references its corresponding value block. A table references its cells, and each cell records its row and column index. A query references the result that answers it.

This relational model is what distinguishes Textract from basic OCR.

Rather than asking developers to infer structure from coordinates, Textract provides an explicit representation of the document’s logical organization. That dramatically reduces the amount of guesswork required to reconstruct meaning.

Figure 1: Textract extracting tables out of the document

Figure 2: Textract extracting forms out of the document (key value pairs)

Textract’s Forms feature detects key-value pairs by identifying spatial and semantic relationships. It understands that “Net Operating Income” is a label and that the number aligned with it is the associated value. This remains robust even when the layout shifts slightly between documents.
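As a rough illustration of how key-value pairs can be rebuilt from the block graph, the sketch below follows the VALUE and CHILD relationships on KEY_VALUE_SET blocks. It assumes the block dictionaries have the shape Textract returns in its "Blocks" list; the sample blocks and the helper names (related_ids, block_text, extract_key_values) are hypothetical and not taken from the project described here.

```python
def related_ids(block, rel_type):
    """Collect the ids a block points to for one relationship type."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] == rel_type:
            ids.extend(rel["Ids"])
    return ids


def block_text(block, by_id):
    """Join the WORD children of a block into a single string."""
    words = [by_id[i] for i in related_ids(block, "CHILD")]
    return " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")


def extract_key_values(blocks):
    """Return (key, value) pairs in document order.

    A list is used instead of a dict so that labels repeated in
    different sections are not silently overwritten.
    """
    by_id = {b["Id"]: b for b in blocks}
    pairs = []
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key = block_text(block, by_id)
        value = " ".join(block_text(by_id[v], by_id)
                         for v in related_ids(block, "VALUE"))
        pairs.append((key, value))
    return pairs


# Hypothetical, hand-written blocks that mimic the Textract response shape.
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Net"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Operating"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Income"},
    {"Id": "w4", "BlockType": "WORD", "Text": "1,250,000"},
]

if __name__ == "__main__":
    print(extract_key_values(SAMPLE_BLOCKS))
```

Note that the relationship runs from the key block to the value block, so the traversal starts at keys. Checkbox-style SELECTION_ELEMENT children are ignored in this sketch.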

Its Tables feature identifies row and column boundaries, preserving tabular structure even in complex financial statements. Instead of manually slicing text by coordinates, we can traverse table blocks and reconstruct structured rows programmatically.
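The table traversal can be sketched in a similar way. The example below assumes each TABLE block lists its CELL blocks as children and that each cell carries 1-based RowIndex and ColumnIndex values, as in the Textract response format. The sample data and the function names are hypothetical, and merged cells are not handled.

```python
def child_ids(block):
    """Ids referenced by a block's CHILD relationships."""
    ids = []
    for rel in block.get("Relationships") or []:
        if rel["Type"] == "CHILD":
            ids.extend(rel["Ids"])
    return ids


def table_rows(table, by_id):
    """Rebuild a table as a list of rows, each a list of cell strings."""
    grid = {}
    for cell_id in child_ids(table):
        cell = by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue
        words = [by_id[i] for i in child_ids(cell)]
        text = " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")
        grid[(cell["RowIndex"], cell["ColumnIndex"])] = text
    if not grid:
        return []
    n_rows = max(row for row, _ in grid)
    n_cols = max(col for _, col in grid)
    # Missing cells become empty strings so every row has the same width.
    return [[grid.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


# Hypothetical 2x2 table in the Textract block shape.
SAMPLE_TABLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Amount"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Total"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Revenue"},
    {"Id": "w5", "BlockType": "WORD", "Text": "4,200"},
]

if __name__ == "__main__":
    index = {b["Id"]: b for b in SAMPLE_TABLE_BLOCKS}
    print(table_rows(index["t1"], index))
```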

Perhaps most powerful is Textract Queries. Queries allow us to define specific questions that the model attempts to answer from within the document. Rather than extracting everything and filtering afterward, we can directly ask for the fields that matter most to our application. Textract returns the best-matching answer along with a confidence score, enabling more deterministic pipelines even when templates vary.
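A minimal sketch of reading query answers out of the block list follows. It assumes QUERY blocks link to QUERY_RESULT blocks through an ANSWER relationship and that each result carries Text and a 0 to 100 Confidence value, as in the Textract response format. The alias, the question, and the sample values are invented for illustration.

```python
def query_answers(blocks):
    """Map each query alias (or its text) to (answer, confidence).

    Queries with no answer map to None so callers can tell
    "not found" apart from an empty string.
    """
    by_id = {b["Id"]: b for b in blocks}
    answers = {}
    for block in blocks:
        if block["BlockType"] != "QUERY":
            continue
        query = block["Query"]
        name = query.get("Alias") or query["Text"]
        results = [by_id[i]
                   for rel in block.get("Relationships") or []
                   if rel["Type"] == "ANSWER"
                   for i in rel["Ids"]]
        if not results:
            answers[name] = None
            continue
        # Keep the highest-confidence result if more than one is linked.
        best = max(results, key=lambda r: r["Confidence"])
        answers[name] = (best["Text"], best["Confidence"])
    return answers


# Hypothetical blocks in the Textract response shape.
SAMPLE_QUERY_BLOCKS = [
    {"Id": "q1", "BlockType": "QUERY",
     "Query": {"Text": "What is the net operating income?",
               "Alias": "NET_OPERATING_INCOME"},
     "Relationships": [{"Type": "ANSWER", "Ids": ["r1"]}]},
    {"Id": "r1", "BlockType": "QUERY_RESULT",
     "Text": "1,250,000", "Confidence": 97.0},
    {"Id": "q2", "BlockType": "QUERY",
     "Query": {"Text": "What is the account number?"}},
]

if __name__ == "__main__":
    print(query_answers(SAMPLE_QUERY_BLOCKS))
```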

Figure 3: Textract using queries to extract target values

Designing the Automated Pipeline

We architected a fully serverless ingestion pipeline to convert uploaded PDFs into structured financial data.

When a user uploads a document through the web interface, it is stored in Amazon S3. That storage event triggers an AWS Lambda function, which invokes Textract with forms, tables, and queries enabled.
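The trigger step might look roughly like the sketch below, which is not the client's actual function. It assumes a standard S3 event notification payload and uses boto3's analyze_document call with the FORMS, TABLES, and QUERIES feature types. The query list is hypothetical. The synchronous call shown here is limited to single-page documents; multipage PDFs would need the asynchronous start_document_analysis API instead.

```python
from urllib.parse import unquote_plus

# Hypothetical questions; a real deployment would define its own.
QUERIES = [
    {"Text": "What is the net operating income?",
     "Alias": "NET_OPERATING_INCOME"},
]


def build_request(event):
    """Turn an S3 event notification into analyze_document arguments."""
    s3_info = event["Records"][0]["s3"]
    return {
        "Document": {
            "S3Object": {
                "Bucket": s3_info["bucket"]["name"],
                # Object keys arrive URL-encoded in S3 event payloads.
                "Name": unquote_plus(s3_info["object"]["key"]),
            }
        },
        "FeatureTypes": ["FORMS", "TABLES", "QUERIES"],
        "QueriesConfig": {"Queries": QUERIES},
    }


def handler(event, context):
    """Lambda entry point: run Textract on the uploaded object."""
    # Imported here so build_request can be tested without the AWS SDK.
    import boto3

    textract = boto3.client("textract")
    response = textract.analyze_document(**build_request(event))
    return {"block_count": len(response["Blocks"])}


# Hypothetical, trimmed-down S3 event for local experimentation.
SAMPLE_EVENT = {
    "Records": [
        {"s3": {"bucket": {"name": "example-uploads"},
                "object": {"key": "statements/q1+report.pdf"}}}
    ]
}

if __name__ == "__main__":
    print(build_request(SAMPLE_EVENT))
```

Keeping the request construction in a pure function makes it easy to unit test without AWS credentials.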

Textract returns a structured block graph. Our Lambda function then parses this graph by traversing relationships to reconstruct key-value mappings, table rows, and query answers. Rather than relying on positional heuristics alone, we use the relational metadata provided by Textract to build deterministic extraction logic.

However, even advanced OCR output requires normalization before becoming business-ready.

Financial documents introduce formatting variability such as commas in large numbers, parentheses for negative values, inconsistent spacing, and currency symbols. We implemented cleaning layers that standardize numeric formats, remove extraneous characters, and validate expected patterns using regular expressions.
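One way such a cleaning layer could work is sketched below. The function name and the exact rules are assumptions for illustration: it strips dollar signs, commas, and whitespace, treats surrounding parentheses or a leading minus sign as negative, and returns None when the remaining text is not a plain number.

```python
import re
from decimal import Decimal

_STRIP = re.compile(r"[$,\s]")           # currency symbol, commas, spacing
_PLAIN_NUMBER = re.compile(r"\d+(\.\d+)?")


def normalize_amount(raw):
    """Convert a printed amount to Decimal, or None if it is not numeric."""
    text = _STRIP.sub("", raw)
    negative = False
    # Accounting notation: (3,400) means -3400.
    if text.startswith("(") and text.endswith(")"):
        negative = True
        text = text[1:-1]
    if text.startswith("-"):
        negative = True
        text = text[1:]
    if not _PLAIN_NUMBER.fullmatch(text):
        return None
    value = Decimal(text)
    return -value if negative else value


if __name__ == "__main__":
    for sample in ["$1,250,000.50", "(3,400)", "-$45.00", " 12 345 ", "N/A"]:
        print(repr(sample), "->", normalize_amount(sample))
```

Decimal is used rather than float so that monetary values are not subject to binary rounding.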

Textract also provides a confidence score for each detected element. We incorporated these scores into our validation workflow. Fields falling below predefined thresholds are flagged for review rather than automatically persisted. This creates a safety net without sacrificing efficiency.
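The thresholding step can be illustrated with a short sketch. Textract reports confidence on a 0 to 100 scale; the threshold of 90 and the field names below are hypothetical, since the actual values used in this project are not stated.

```python
REVIEW_THRESHOLD = 90.0  # hypothetical cutoff on Textract's 0-100 scale


def partition_by_confidence(fields, threshold=REVIEW_THRESHOLD):
    """Split {name: (value, confidence)} into accepted and flagged dicts."""
    accepted, flagged = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= threshold else flagged
        target[name] = value
    return accepted, flagged


if __name__ == "__main__":
    extracted = {
        "net_operating_income": ("1,250,000", 98.4),
        "account_number": ("4471-0O2", 71.3),
    }
    ok, review = partition_by_confidence(extracted)
    print("persist:", ok)
    print("send to review:", review)
```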

The cleaned and validated output is serialized as structured JSON and written back to S3. From there, it is consumed by downstream systems and persisted in a database for further processing and reporting.
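A possible shape for the serialized output is sketched below. The record layout and field names are assumptions, not the schema used in this project. Decimal values are written as strings so that precision survives the round trip through JSON.

```python
import json
from decimal import Decimal


def build_record(document_key, fields):
    """Assemble one output record.

    `fields` maps a field name to (value, confidence, needs_review).
    """
    return {
        "source_document": document_key,
        "fields": [
            {"name": name, "value": value,
             "confidence": confidence, "needs_review": needs_review}
            for name, (value, confidence, needs_review)
            in sorted(fields.items())
        ],
    }


def to_json(record):
    """Serialize a record; Decimal values fall back to str()."""
    return json.dumps(record, default=str, sort_keys=True)


if __name__ == "__main__":
    record = build_record(
        "statements/q1 report.pdf",  # hypothetical object key
        {"net_operating_income": (Decimal("1250000.50"), 98.4, False)},
    )
    # In a Lambda function this string could be written to S3,
    # for example with boto3's put_object.
    print(to_json(record))
```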


Engineering for Reliability at Scale

While Textract provides structural awareness, production reliability still depends on engineering discipline.

Financial documents are rarely uniform. Some PDFs bundle multiple forms. Others include optional sections that appear only under certain conditions. Field names may be reused in different contexts.

Because Textract exposes structure explicitly, we can write parsing logic that navigates relationships rather than assuming fixed positions. This dramatically improves resilience when layouts shift.

We also implemented layered safeguards. Regex validation catches predictable OCR artifacts such as character substitutions or malformed numeric values. Confidence thresholds prevent uncertain data from silently propagating. A human-in-the-loop review process handles the exceptions that the automated checks flag.
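To make the regex layer concrete, here is a minimal sketch of a check for amount fields. The lookalike table, the pattern, and the issue labels are all assumptions chosen for illustration. The function proposes a repaired candidate but still reports the issue, so a reviewer or a later rule decides whether to accept it.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1",
                            "I": "1", "S": "5", "B": "8"})

# Optional parenthesis, sign, and dollar sign; digits with either
# well-formed thousands separators or none; optional 1-2 decimals.
AMOUNT = re.compile(
    r"\(?-?\$?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{1,2})?\)?"
)


def validate_amount(raw):
    """Return (candidate_text, issues) for a field expected to be an amount."""
    issues = []
    text = raw.strip()
    if re.search(r"[A-Za-z]", text):
        issues.append("letters_in_numeric_field")
        text = text.translate(LOOKALIKES)  # candidate repair only
    if not AMOUNT.fullmatch(text):
        issues.append("malformed_amount")
    return text, issues


if __name__ == "__main__":
    for sample in ["1,250,000", "1,2O0.00", "12,50", "n/a"]:
        print(repr(sample), "->", validate_amount(sample))
```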

The result is not blind automation. It is controlled automation.

The Operational Impact

By moving from manual transcription and traditional OCR toward structured extraction with Textract, the client reduced manual entry time and improved data integrity.

The key improvement was not simply higher character accuracy. It was contextual accuracy. Values are now programmatically tied to the correct fields, reducing the risk of misalignment or field drift.

Processing capacity scales elastically with document volume. Instead of hiring additional staff to keep pace with growth, the organization relies on a serverless pipeline that expands automatically under load.

Most importantly, the system is auditable. Extracted values are stored alongside confidence scores and validation results, providing traceability that manual entry workflows often lack.

Lessons Learned

Traditional OCR is a useful starting point, but it is not sufficient for complex financial documents where structure defines meaning. Textract’s relational model, query capability, and structural detection provide a far stronger foundation.

However, technology alone does not create reliability. Production-grade document automation requires thoughtful parsing, normalization, validation, and human oversight.

When implemented correctly, document extraction shifts financial teams away from transcription and toward analysis. It reduces operational risk while improving speed and scalability.

For organizations still buried under PDF-driven workflows, structured extraction is not just a convenience. It is infrastructure modernization. And in finance, infrastructure determines trust.
