Blogs Archives - Red Oak Strategic

Keep Your OLAP Data Warehouse Up To Date via API Gateway, Amazon EventBridge, and Amazon Redshift

Written by Tyler Sanders | Jun 5, 2023 8:47:00 PM

 

As your data load and analytics requirements expand, so too can the risk of inefficient compute and Extract, Transform, Load (ETL) jobs. Since most modern enterprises require custom data sourcing and pipelines to analytics storage that does not break the bank, Amazon Web Services has a pattern that uses API Gateway and EventBridge to kick off data ingestion jobs that are fit for a Redshift OLAP database storage solution. 

In order to take advantage of the full power of Redshift, careful consideration of your data schema and use cases must be considered right from the start. The art and science of creating tables and join keys is hugely important. Once you have your data schema in place you are ready to import that data.

For many companies, some of the most value analytics data must routinely be fetched from outside APIs and other sources. API Gateway is the bridge that will allow your data ingestion jobs to successfully extract and load that data into your AWS VPC. An intermediate home on S3 is a terrific pattern because Redshift has a custom S3 import tool that makes imports of formatted S3 data lightning fast. 

Using an orchestrator like EventBridge to help automate that data ingestion is a great pattern because it will only cost the compute resources it spins up as needed. Finally, because you have a custom process running for import you can also set rules for Redshift maintenance tasks such as vacuuming tables to ensure the data is properly encoded and distributed across nodes to improve performance. 

Depending on your input volume this vacuum process is important to do weekly or even daily for import intensive processes. Patterns that ingest raw source data and efficiently house that data in Redshift are an incredibly powerful workhorse pattern for high volume analytics and modeling workflows. 

Deep Dive:

 

AWS API Gateway:

AWS API Gateway provides a managed service to create and publish RESTful APIs, making it easy to expose endpoints for data ingestion and integration.

 

AWS Lambda:

AWS Lambda enables serverless compute to execute code in response to events, allowing you to process and transform incoming data.

 

AWS EventBridge:

AWS EventBridge facilitates event-driven communication between AWS services, enabling the creation of decoupled and flexible workflows.

 

AWS Redshift:

Amazon Redshift is a fully managed data warehousing solution that allows you to analyze large volumes of data using SQL queries.

 

Architecture Overview:

  • Data Ingestion Endpoint: AWS API Gateway exposes endpoints that receive incoming data events, such as data uploads, sensor readings, or user interactions.

  • Event Routing: AWS EventBridge captures and routes these incoming events to appropriate event targets, including AWS Lambda functions.

  • Data Transformation: AWS Lambda functions process and transform the incoming data. This could involve data validation, enrichment, or aggregation.

  • Data Warehousing: Processed data is stored in Amazon Redshift, which acts as a central repository for performing SQL-based analytics.

  • Query and Analysis: Analysts and business users can use SQL queries to analyze the data stored in Redshift, generating insights and reports.

 

Advantages:

  • Real-time Data Ingestion: AWS API Gateway and Lambda facilitate real-time data ingestion, enabling timely analysis.

  • Event-Driven Flexibility: AWS EventBridge enables decoupled, event-driven workflows that adapt to changing data sources.

  • Serverless Processing: AWS Lambda provides serverless compute for data transformation, reducing operational overhead.

  • Scalable Analytics: Amazon Redshift scales to accommodate large datasets and complex queries.

 

Considerations:

  • Event Schema: Design a clear event schema to ensure consistent data processing and compatibility across components.

  • Data Modeling: Design an effective data model in Redshift to support efficient query performance.

  • Cost Management: Monitor and optimize costs by right-sizing resources and managing Lambda function execution times.

 

Documentation and Pricing: