Demystifying the AWS Datalake

AWS provides a rich ecosystem of services to employ in delivering data products. In fact, the number of services is so large that it can be daunting for an organisation to navigate the landscape.

One of the guiding principles of cloud-based Datalakes is the separation of compute resources from storage resources. In general, the storage options are much easier to choose between:

  1. Object Storage (S3)
    1. objects of any size – though how these are structured can matter
    2. any type – binary, text, images, video
    3. cheap to employ, and mechanisms exist to control data life cycles, further reducing costs (see the sketch after this list)
    4. Appropriately encoded data can be queried in a structured manner
  2. Key-Value/Document Stores (DynamoDB, DocumentDB, OpenSearch)
    1. Smaller payloads
    2. Semi-structured/unstructured data
    3. Can handle large volumes and many changes to individual items
    4. Limited query capabilities
  3. Databases (RDS, Aurora)
    1. Structured data
    2. Rich query interface
    3. Active constraint and integrity enforcement
    4. Transactional
  4. Data Warehouses (Redshift)
    1. Structured data
    2. Optimised for analytical queries (aggregations on large tables)
    3. Not suited to transactional (OLTP) workloads
    4. No enforcement of integrity
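
To make the lifecycle point above concrete, the snippet below is a minimal sketch (using boto3) of a rule that transitions ageing objects to cheaper storage classes. The bucket name, prefix and day thresholds are illustrative placeholders, not recommendations.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative only: transition objects under the raw/ prefix to cheaper storage
# classes as they age. Bucket name, prefix and thresholds are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-datalake-raw",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-zone",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},   # infrequent access after 90 days
                    {"Days": 365, "StorageClass": "GLACIER"},      # archive after a year
                ],
            }
        ]
    },
)
```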

The intended purpose of a Datalake is to act as a repository of all information across an organisation: to centralise data from many heterogeneous systems, to break down data silos, and to provide a central authority on what data is available and at what quality. A Datalake therefore needs maximum flexibility in how it stores data, and object storage is the typical choice. A common alternative is to centralise an organisation's most frequently accessed data in a Data Warehouse (Redshift) and archive the rest to object storage (S3), with Redshift Spectrum providing access to the archived data from within the warehouse.
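
As a sketch of that archival pattern, the snippet below uses the Redshift Data API (boto3) to register an external schema so that Spectrum can query archived data on S3 from within the warehouse. The cluster, database, user, Glue catalogue database and IAM role are all placeholders, and the sketch assumes the role already has access to the underlying S3 locations.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Illustrative only: expose archived S3 data (catalogued in Glue) to the warehouse
# as an external schema that Redshift Spectrum can query and join against.
create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum_archive
FROM DATA CATALOG
DATABASE 'datalake_archive'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

response = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder cluster
    Database="analytics",                   # placeholder database
    DbUser="lake_admin",                    # placeholder database user
    Sql=create_external_schema,
)

# Hot and archived data can then be combined in a single query, e.g.
#   SELECT ... FROM public.recent_sales
#   UNION ALL
#   SELECT ... FROM spectrum_archive.historic_sales;
print(response["Id"])
```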

Storage options that are less commonly used, or used only indirectly, for Datalakes and data products:

  1. File Storage (EFS/FSx)
    1. Supports file system access semantics (read, write, seek, append, etc.)
  2. Block Storage (EBS)
    1. Backing for file systems and file system images.

While there are several choices for data storage on AWS data platforms, as noted, their use cases are clear and consensus around their usage is largely established. How data is acted on, however (transformed, loaded, moved, cleaned, and enriched), remains a much harder question to answer. In many cases the choice depends on the properties of the analysis the data is required for; in others, how the data is captured, and by whom, determines the tool; sometimes the volume of data is the primary decision factor, and sometimes its velocity. Regardless, there are many options for doing work with your data on the AWS cloud, and with the recent release of EMR Serverless and the continued evolution of Redshift, these decisions only get harder as the range of options keeps growing.

Datalake compute is typically separated into two tiers:

  1. Orchestration
  2. Transformation

Orchestration defines the schedules, ordering, dependencies, and dispatch of how data moves through the Datalake. Typically this is where retry and recovery are managed. Since this component is the centralised driver of activity within the Datalake, it is generally separated and isolated from the systems that implement the transforms. The orchestration machinery is also responsible for updating catalogues, maintaining data lineage definitions, updating freshness reporting tools and performing other metadata management tasks.
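
To illustrate, here is a minimal sketch of an Airflow DAG of the kind that would run on Amazon MWAA (one of the orchestration options listed below): the schedule, task dependencies and retry policy all live in the orchestration tier, while the task bodies merely dispatch work. The DAG, task and callable names are placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_orders(**context):
    # Placeholder: trigger an ingest (e.g. start a Glue job or copy from a source system).
    pass


def update_catalogue(**context):
    # Placeholder: refresh catalogue entries, lineage and freshness metadata.
    pass


# Retries and retry delays are handled here, in the orchestration tier,
# rather than inside the transformation code itself.
default_args = {
    "owner": "data-platform",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_orders_pipeline",      # placeholder pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",          # the schedule lives with the orchestrator
    default_args=default_args,
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    catalogue = PythonOperator(task_id="update_catalogue", python_callable=update_catalogue)

    # Dependencies: the catalogue update only runs once extraction succeeds.
    extract >> catalogue
```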

Transformation, on the other hand, concerns itself with the low level operations on the data itself: renaming and restructuring data, joining tables together, converting formats, filtering and aggregating.
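
A minimal PySpark sketch of such a transformation step, of the kind that would run as a Glue or EMR Spark job, is shown below; the bucket paths, column names and join key are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-transform").getOrCreate()

# Illustrative inputs: raw semi-structured data and an already-curated dimension table.
orders = spark.read.json("s3://example-raw-bucket/orders/")
customers = spark.read.parquet("s3://example-curated-bucket/customers/")

curated = (
    orders
    .withColumnRenamed("ord_ts", "order_timestamp")        # renaming / restructuring
    .join(customers, on="customer_id", how="left")          # joining tables together
    .filter(F.col("status") == "COMPLETE")                  # filtering
    .groupBy("customer_id", "region")
    .agg(F.sum("total").alias("total_spend"))               # aggregating
)

# Converting formats: write the result back as partitioned Parquet for efficient querying.
curated.write.mode("overwrite").partitionBy("region").parquet(
    "s3://example-curated-bucket/customer_spend/"
)
```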

The intent of separating orchestration from transformation is to ensure the uptime and continued operation of the orchestration, which in turn manages, monitors and attempts to recover from failures of the transforms. This separation also allows the two distinct activities to be scaled and optimised independently. The orchestration scales according to the number and interconnectedness of datasets and transformations within the Datalake. The transforms themselves, however, scale according to the data sizes, the transformations required, and the layouts and formats the data is persisted in on object storage.

Typical tool choices for orchestration on the AWS cloud include:

  1. AWS Glue Workflows/Blueprints
  2. AWS Step Functions (typically coordinating Lambda functions)
  3. Amazon Managed Workflows for Apache Airflow
  4. Custom solutions using EventBridge/SNS/S3 events & Lambda (see the sketch after this list)
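
A minimal sketch of option 4 follows: a Lambda handler that receives an S3 event notification and hands the new object to the transformation tier by starting a Glue job. The Glue job name and argument names are placeholders, and an EventBridge-delivered event would carry a different payload shape to the one parsed here.

```python
import boto3

glue = boto3.client("glue")


def handler(event, context):
    """Dispatch a Glue job for each object reported by an S3 event notification."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Hand the newly arrived object to the transformation tier.
        glue.start_job_run(
            JobName="curate-orders",                           # placeholder Glue job name
            Arguments={"--input_path": f"s3://{bucket}/{key}"},
        )
```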

This is without mentioning the plethora of third-party orchestration and ETL management software available as SaaS solutions and on the AWS Marketplace.

Common options utilised for transforming the data itself include:

  1. Athena CREATE TABLE AS (CTAS) queries (see the sketch after this list)
  2. AWS Glue Spark or PySpark jobs
  3. AWS Glue Elastic Views
  4. Pig/Hadoop/Sqoop, Spark jobs running on EMR, EMR Serverless or EMR on AWS Fargate
  5. Spark jobs running on Amazon EKS
  6. AWS Lambda functions
  7. S3 Batch Operations
  8. AWS Batch jobs
  9. Traditional ETL scripts on EC2
  10. Amazon SageMaker Data Wrangler
  11. AWS Data Pipeline
  12. AWS Glue DataBrew
  13. Redshift Views
  14. AWS Database Migration Service
  15. Amazon Kinesis Data Firehose Transformations
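
As a sketch of option 1, the snippet below submits an Athena CTAS query via boto3 that rewrites raw JSON into partitioned Parquet. The databases, tables, columns and bucket locations are placeholders, and the sketch assumes the source table is already registered in the Glue catalogue.

```python
import boto3

athena = boto3.client("athena")

# Illustrative only: a CTAS query that converts a raw JSON table into
# partitioned Parquet laid out for efficient analytical queries.
ctas = """
CREATE TABLE curated.orders_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://example-curated-bucket/orders_parquet/',
    partitioned_by = ARRAY['order_date']
) AS
SELECT order_id, customer_id, total, order_date
FROM raw.orders_json
WHERE status = 'COMPLETE'
"""

athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={"Database": "curated"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
```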

This list is by no means comprehensive: it fails to address streaming workloads; there is no mention of third party offerings; it doesn’t look more deeply at higher level tools like dbt that compile to views and stored procedures and significantly reduce the effort to build certain transformations.

These lists only provide an overview, some of the landmarks on the AWS data services map. There are many considerations necessary to ensure a successful deployment: what skills exist in your organisation, what DevOps and deployment practices are in place, the quality and completeness of the integration between your orchestration tool and the transformation processes utilised, and how the storage tiers are configured and data formatted to maximise efficiency and minimise costs. The interplay between the people, the orchestration, the transformation and the business processes around data initiatives is critical to success.

With this endless array of alternatives and combinations for data engineering teams to contend with, it is hardly surprising that many stick to the tools and techniques they are familiar with from on-premise and, in doing so, miss out on the advantages cloud native data solutions offer; many others wallow in analysis paralysis; and others still battle unstructured adoption of tools without a longer term strategy. While there is no one-size-fits-all answer, there are well trodden pathways and best practices
that a partner like TechConnect can utilise to accelerate the successful delivery of an organisation-wide data program. With many successful data projects in our history, and the hard won learnings that come from delivering real world solutions, TechConnect can create and evolve a strategy for extracting value from your data and share in a long term relationship to help grow the data capabilities of your organisation into the future.