In a world of nearly unlimited data possibilities, it is both funny and often incredibly frustrating that so much valuable data and information is stuck in static, hard-to-access PDF files. Extracting this data, particularly tabular and form data, remains a huge pain with traditional OCR tools, but AWS offers next-generation tools for image data extraction that are unlocking this information.
Amazon Textract is an API-based service that impressed me from the first time I used it. Textract identifies, extracts, and formats tabular data, key-value pairs, and plain text across many kinds of forms and files, including PDFs and image files. The pattern for unlocking Textract is simple too: upload your files to an S3 bucket, then call the API with a reference to each object in that bucket.
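As a rough illustration of that pattern, here is a minimal Python sketch using boto3's `analyze_document` call. The bucket and key names are hypothetical, and the helper names `build_analyze_request` and `analyze_s3_document` are mine, not part of any AWS SDK. It assumes a single-page document; multi-page PDFs generally go through the asynchronous `start_document_analysis` operation instead.

```python
from typing import Any


def build_analyze_request(bucket: str, key: str) -> dict[str, Any]:
    """Build the keyword arguments for Textract's AnalyzeDocument call."""
    return {
        # Point Textract at the object in S3 rather than uploading bytes.
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        # Ask for both table and form (key-value) analysis.
        "FeatureTypes": ["TABLES", "FORMS"],
    }


def analyze_s3_document(bucket: str, key: str) -> dict[str, Any]:
    """Run synchronous analysis on one S3 object (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    textract = boto3.client("textract")
    return textract.analyze_document(**build_analyze_request(bucket, key))


if __name__ == "__main__":
    # Hypothetical bucket and key, for illustration only.
    print(build_analyze_request("my-documents-bucket", "invoices/invoice-001.png"))
```

The response is a dictionary whose `Blocks` list holds every detected page, line, word, table cell, and key-value set, each with its own confidence score.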
For when even Textract doesn’t meet your accuracy needs, AWS offers a native integration with Amazon Augmented AI (A2I). This service passes Textract results that fall below a confidence threshold you choose to human reviewers: either users inside your own organization or an external workforce such as Amazon Mechanical Turk.
While the price point and data-formatting friction remain a challenge even with this pattern, the potential for new data sources and business models that can leverage it is very compelling!
Deep Dive:
Amazon S3:
Amazon S3 offers scalable and durable object storage, providing a secure repository for storing unstructured documents.
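To feed a whole bucket into the extraction step, you first need the list of object keys. The sketch below pages through a bucket with boto3's `list_objects_v2` paginator and keeps only files with extensions Textract can read. The extension list and the helper names are my assumptions; adjust them to your own data.

```python
from typing import Iterable, Iterator

# Assumed set of file extensions worth sending to Textract.
SUPPORTED_EXTENSIONS = (".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff")


def filter_supported(keys: Iterable[str]) -> list[str]:
    """Keep only keys whose extension is in SUPPORTED_EXTENSIONS."""
    return [k for k in keys if k.lower().endswith(SUPPORTED_EXTENSIONS)]


def iter_bucket_keys(bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under a prefix (needs AWS credentials)."""
    import boto3  # imported lazily so filter_supported works without boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

A typical call would be `filter_supported(iter_bucket_keys("my-documents-bucket", "invoices/"))`, where the bucket and prefix are placeholders.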
Amazon Textract:
Amazon Textract is a machine learning service that automatically extracts text, forms, tables, and other structured data from scanned documents, images, and PDFs.
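Textract returns forms as a flat list of blocks linked by IDs, so a little stitching is needed to get readable key-value pairs. The following sketch walks `KEY_VALUE_SET` blocks and joins their child `WORD` blocks. The function names and the sample data are illustrative, and the sample is a hand-written, trimmed-down imitation of a real response.

```python
from typing import Any

Block = dict[str, Any]


def _text_of(block: Block, by_id: dict[str, Block]) -> str:
    """Join the WORD children of a block; render checkboxes as [X] or [ ]."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                selected = child.get("SelectionStatus") == "SELECTED"
                parts.append("[X]" if selected else "[ ]")
    return " ".join(parts)


def extract_key_values(blocks: list[Block]) -> dict[str, str]:
    """Map each form key's text to the text of its linked value block(s)."""
    by_id = {b["Id"]: b for b in blocks}
    pairs: dict[str, str] = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(_text_of(by_id[v], by_id) for v in rel["Ids"])
        pairs[_text_of(block, by_id)] = value_text
    return pairs


# Hand-written sample that mimics the shape of a Textract "Blocks" list.
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
]

if __name__ == "__main__":
    print(extract_key_values(SAMPLE_BLOCKS))  # {'Invoice Number:': '12345'}
```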
Amazon Augmented AI (A2I):
Amazon A2I enables human review of machine learning predictions to ensure accuracy and reliability. It integrates seamlessly with Textract to validate and improve extracted data.
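On the API side, the integration is a `HumanLoopConfig` argument on the same `analyze_document` call. The sketch below only builds the request; it assumes you have already created a human review workflow (a flow definition) in A2I, which is where the review conditions live. The ARN shown in the usage note and the `textract-review-` naming scheme are placeholders of mine.

```python
import uuid
from typing import Any


def build_human_loop_request(bucket: str, key: str,
                             flow_definition_arn: str) -> dict[str, Any]:
    """Build AnalyzeDocument arguments that enable A2I human review."""
    return {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS"],
        "HumanLoopConfig": {
            # Loop names should be unique; a random suffix is one simple way.
            "HumanLoopName": f"textract-review-{uuid.uuid4().hex}",
            # ARN of the flow definition created beforehand in A2I.
            "FlowDefinitionArn": flow_definition_arn,
        },
    }


def analyze_with_review(bucket: str, key: str,
                        flow_definition_arn: str) -> dict[str, Any]:
    """Call Textract with human review enabled (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    textract = boto3.client("textract")
    request = build_human_loop_request(bucket, key, flow_definition_arn)
    return textract.analyze_document(**request)
```

For example, `build_human_loop_request("my-documents-bucket", "forms/form-001.png", "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/example-flow")` uses a made-up bucket, key, and ARN. When a review is triggered, the response carries a `HumanLoopActivationOutput` entry describing the human loop that was started.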
Architecture Overview:
- Document Storage: Unstructured documents, such as scanned invoices or contracts, are stored in Amazon S3 buckets.
- Text Extraction: Amazon Textract processes the documents and extracts relevant text, forms, and tables, converting unstructured data into structured information.
- Machine Learning Predictions: Textract's machine learning models automatically analyze the documents and generate data extraction predictions, each with a confidence score.
- Human Review Integration: Amazon A2I is invoked when Textract's confidence scores fall below a certain threshold or when specific document sections require validation.
- Human Review Workflow: A2I routes tasks to human reviewers who validate and correct Textract's predictions through a user-friendly interface.
- Data Enrichment: Human-reviewed results are stored in Amazon S3, where they replace the low-confidence predictions and can feed downstream systems. Textract's managed models are not retrained on these corrections automatically.
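The threshold step in the flow above can be pictured with a few lines of Python. In the real integration, A2I applies the conditions configured in your human review workflow; this client-side version is only a sketch of the idea. The 90.0 cutoff and the function name are assumptions, and Textract reports confidence on a 0 to 100 scale.

```python
from typing import Any

Block = dict[str, Any]


def split_by_confidence(
    blocks: list[Block],
    threshold: float = 90.0,
    block_types: tuple[str, ...] = ("KEY_VALUE_SET", "CELL"),
) -> tuple[list[Block], list[Block]]:
    """Split blocks of the given types into (accepted, needs_review)."""
    accepted: list[Block] = []
    needs_review: list[Block] = []
    for block in blocks:
        if block.get("BlockType") not in block_types:
            continue  # ignore pages, lines, words, and so on
        # A block with no score is treated as low confidence.
        if block.get("Confidence", 0.0) < threshold:
            needs_review.append(block)
        else:
            accepted.append(block)
    return accepted, needs_review


# Hand-written sample blocks, trimmed to the fields used above.
SAMPLE = [
    {"Id": "a", "BlockType": "KEY_VALUE_SET", "Confidence": 98.5},
    {"Id": "b", "BlockType": "KEY_VALUE_SET", "Confidence": 61.2},
    {"Id": "c", "BlockType": "CELL", "Confidence": 90.0},
    {"Id": "d", "BlockType": "WORD", "Confidence": 12.0},
]

if __name__ == "__main__":
    ok, review = split_by_confidence(SAMPLE)
    print([b["Id"] for b in ok], [b["Id"] for b in review])  # ['a', 'c'] ['b']
```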
Advantages:
- Automated Data Extraction: Textract automates the extraction of structured data from unstructured documents, saving time and reducing errors.
- Human Validation: A2I enables human reviewers to verify and correct predictions, ensuring high-quality results.
- Scalability: The combination of Textract and A2I scales to handle large volumes of documents and adapts to various use cases.
- Improved Accuracy: Human review through A2I raises the accuracy of the final output, because low-confidence results are corrected before they reach downstream systems.
Considerations:
- Document Quality: Ensure that documents are of sufficient quality (resolution, contrast, orientation) for accurate extraction by Textract.
- Human Review Workflow: Define clear guidelines and workflows for human reviewers to ensure consistent and accurate validation.
- Model Training: Textract is a managed service, so you do not retrain its base models yourself. Instead, use human-reviewed results to tune confidence thresholds and review conditions over time.
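One inexpensive way to watch document quality is to track the average confidence of the `LINE` blocks on each page and flag pages that score poorly for rescanning. This is a heuristic of my own rather than an AWS recommendation, and the 80.0 cutoff is an arbitrary starting point.

```python
from collections import defaultdict
from statistics import mean
from typing import Any

Block = dict[str, Any]


def mean_line_confidence_by_page(blocks: list[Block]) -> dict[int, float]:
    """Average the Confidence of LINE blocks, grouped by page number."""
    scores: dict[int, list[float]] = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE" and "Confidence" in block:
            # Fall back to page 1 when the block carries no Page field.
            scores[block.get("Page", 1)].append(block["Confidence"])
    return {page: mean(values) for page, values in scores.items()}


def pages_to_rescan(blocks: list[Block], min_mean: float = 80.0) -> list[int]:
    """Return page numbers whose mean line confidence is below min_mean."""
    averages = mean_line_confidence_by_page(blocks)
    return sorted(page for page, avg in averages.items() if avg < min_mean)
```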
Documentation and Pricing:
- Amazon S3: Documentation, Pricing
- Amazon Textract: Documentation, Pricing
- Amazon Augmented AI (A2I): Documentation, Pricing