The concept of a data lake is not new, but with the proliferation and adoption of cloud providers the capacity for many companies to adopt the model has exploded. A data lake is a centralised store for all kinds of business data:
- unstructured – images, videos, PDFs, Word documents
- semi-structured – JSON, XML, spreadsheets
- structured – CSVs, RDBMS tables, tabular spreadsheets
typically stored in a format as close to the raw source as possible. This is to take advantage of the elastic storage and to minimise bugs arising from transforming the source data before landing in permanent storage. Of these, many organisations will find that a large proportion of their business-critical data will be hosted within one or more relational database management systems (RDBMS). These sources, therefore, will be high on any priority list to be extracted and landed into the data lake to break down the data silos built around them and to democratise access to the data contained within.
In any data engineering exercise, it is important to assess the 5Vs for each data set. The 5Vs are:
- Velocity – how quickly is the data created and changed, and what latency requirements must your pipelines satisfy?
- Volume – how much data is there, both in total and per item?
- Value – what value does the data provide to the business, and how does this change as the data ages?
- Variety – how many different shapes of data are there (for an RDBMS the variety should be very low, with the occasional NULL field or free-text column), and how does this change over time (how do you deal with slowly changing dimensions)?
- Veracity – how consistent and well understood is your data? Is the process by which it is created, moved and transformed documented, validated and repeatable?
With these attributes understood for the dataset in question, the appropriate pattern of ingest into the lake should be straightforward to determine.
If the data is high velocity (requires low latency) and/or every change to every item represents value to the business, then a replication solution such as AWS DMS (Database Migration Service) is appropriate. The challenge with this approach is that the data lands in Amazon S3 in CSV or Parquet format as change records: the operation ([I]nsert, [U]pdate or [D]elete), the table name and the database schema name, followed by the column values for the new row state (or the deleted row's data). In general, this format is not ideal for consumption by BI services or for direct import into downstream data warehouses or data marts. Consequently, some form of “hydration” into a point-in-time state will be necessary.
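The hydration step can be sketched as replaying the change stream in order. This is a minimal illustration, assuming each change record has already been parsed into an (operation, primary key, row values) tuple; a real pipeline would first parse the CSV or Parquet files that AWS DMS writes to Amazon S3.

```python
# Minimal sketch of "hydrating" a CDC change stream into point-in-time state.
# Assumes records are pre-parsed tuples of (operation, primary_key, row_values);
# real AWS DMS output would be read from CSV/Parquet files in Amazon S3 first.

def hydrate(changes):
    """Replay Insert/Update/Delete operations to recover the latest table state."""
    state = {}
    for op, key, row in changes:
        if op in ("I", "U"):
            state[key] = row          # insert, or overwrite with the new row state
        elif op == "D":
            state.pop(key, None)      # remove deleted rows
    return state

changes = [
    ("I", 1, {"name": "Alice", "balance": 100}),
    ("I", 2, {"name": "Bob",   "balance": 50}),
    ("U", 1, {"name": "Alice", "balance": 120}),
    ("D", 2, None),
]
print(hydrate(changes))   # only Alice's final state survives
```

At scale the same replay is what table formats such as Apache Hudi perform internally, keyed on the record's primary key.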
If the velocity is lower (hourly, daily or above) then batch processing is a valid option. This can take several forms:
If the volume is low, then a simple and effective approach is to ingest the entire table according to the schedule. This ensures that historical updates are captured in addition to new data. It can be achieved using AWS DMS, AWS Glue or the database's native export capabilities.
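The full-table approach reduces to re-reading every row on each scheduled run. A minimal sketch, using an in-memory SQLite database as a stand-in for the source RDBMS; a real job would write the snapshot to Amazon S3 (e.g. as Parquet) rather than return it as a list.

```python
# Sketch of a scheduled full-table ingest. SQLite stands in for the source
# RDBMS; the table name and columns here are illustrative only.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

def snapshot_table(conn, table):
    """Copy every row on each run; updates to old rows are captured for free."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

rows = snapshot_table(src, "orders")
print(rows)
```

Because the whole table is rewritten each time, no delta logic or hydration step is needed, which is why this pattern is attractive whenever the volume allows it.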
If the volume or velocity is such that a complete table copy is impractical, then a full initial load followed by incremental delta capture is necessary. How the deltas are defined will depend on the data available within the table or on domain knowledge of how the upstream table changes.
If the data in the table is immutable then the deltas can be captured by tracking the highest key ingested so far. AWS Glue with job bookmarking enabled is an implementation of this pattern.
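The bookmark pattern can be sketched as follows, assuming a monotonically increasing key; SQLite again stands in for the source, and the table and column names are illustrative. This mirrors what AWS Glue job bookmarking manages for you.

```python
# Sketch of delta capture for an immutable (append-only) table: remember the
# highest key seen so far and fetch only rows above it on the next run.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
src.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

def fetch_delta(conn, bookmark):
    """Return rows with id above the stored bookmark, plus the new bookmark."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (bookmark,)
    ).fetchall()
    new_bookmark = rows[-1][0] if rows else bookmark
    return rows, new_bookmark

first, bookmark = fetch_delta(src, 0)           # initial load: everything
src.execute("INSERT INTO events VALUES (3, 'c')")
delta, bookmark = fetch_delta(src, bookmark)    # next run: only the new row
print(delta)
```

The bookmark itself must be persisted between runs (Glue stores it as job state); losing it forces a full reload.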
However, many tables are transactional and support updates, which tracking a high-water mark on the key will not detect. If the intermediate states of records don't matter at the scheduled granularity, then a column in the source table that captures the date-time each record was last updated can be used instead. The delta is then defined as the rows with an updated time after the last data ingest. With appropriate indexing this lookup can be made efficient, and only changed rows will be imported. The challenge then is how these deltas are applied and how data hydration is achieved. Using Apache Hudi via Amazon EMR, or via a custom AWS Glue connector, handles this transparently provided the Amazon S3 destination is defined as an Apache Hudi table.
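A minimal sketch of the watermark-plus-upsert flow, assuming an `updated_at` column on the source table; SQLite stands in for the source, and a plain dictionary stands in for the hydrated lake-side table that Apache Hudi would maintain in practice.

```python
# Sketch of delta capture via an "updated_at" column: pull only rows modified
# since the last ingest, then upsert them into the lake-side copy by key.
import sqlite3

src = sqlite3.connect(":memory:")
src.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL, updated_at TEXT)"
)
src.executemany("INSERT INTO accounts VALUES (?, ?, ?)",
                [(1, 100.0, "2024-01-01"), (2, 50.0, "2024-01-02")])

def fetch_updated_since(conn, watermark):
    """An index on updated_at keeps this lookup cheap on the source system."""
    return conn.execute(
        "SELECT id, balance, updated_at FROM accounts WHERE updated_at > ?",
        (watermark,),
    ).fetchall()

lake = {}                                    # stand-in for the hydrated table
def apply_delta(rows):
    for id_, balance, updated_at in rows:    # upsert: insert or overwrite by key
        lake[id_] = {"balance": balance, "updated_at": updated_at}

apply_delta(fetch_updated_since(src, ""))            # initial load: all rows
src.execute(
    "UPDATE accounts SET balance = 120.0, updated_at = '2024-01-03' WHERE id = 1"
)
apply_delta(fetch_updated_since(src, "2024-01-02"))  # only the changed row returns
print(lake[1]["balance"])
```

Note that intermediate states between runs are lost by design, which is exactly the trade-off the paragraph above describes.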
If “updated at” columns are not already present in the source table, then business rules of thumb can be utilised instead. For instance, if it is known that certain business processes run on a known schedule (e.g. financial reconciliations) such that the data is frozen after review, then this can be used to define the deltas. No data for ids existing prior to the last freeze needs to be brought over; any records with ids after it are replaced according to the schedule. The challenge, as with updates generally, is how to integrate these deltas into the existing Amazon S3 storage. In this instance, if the Apache Hudi solutions cannot be brought to bear, then Amazon S3 table partitioning can ensure the deltas are applied efficiently. By partitioning the data in Amazon S3 according to the data freeze schedule, only the most recent partition needs to be overwritten. The downside to this approach is that the freeze schedule may not produce a partition scheme that aligns with the query patterns of downstream analysis services. To overcome this, additional data transforms can restructure the data as required.
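The freeze-schedule partitioning can be sketched as routing each record to a partition prefix derived from its freeze period, so that a refresh only rewrites the open partition. The S3 key layout and the monthly freeze period here are assumptions for illustration.

```python
# Sketch of partitioning by a data-freeze schedule: records are routed to
# S3-style key prefixes by freeze period (assumed monthly here), so each
# refresh only overwrites the most recent, still-mutable partition.
from collections import defaultdict
from datetime import date

def partition_key(record_date):
    """Map a record to its freeze-period partition (hypothetical path layout)."""
    return f"s3://lake/finance/freeze_period={record_date:%Y-%m}/"

def write_partitioned(records):
    partitions = defaultdict(list)
    for record_date, row in records:
        partitions[partition_key(record_date)].append(row)
    return partitions

records = [
    (date(2024, 1, 15), {"txn": 1}),   # frozen after the January review
    (date(2024, 2, 3),  {"txn": 2}),   # current period, still mutable
]
partitions = write_partitioned(records)
# Only the latest partition needs overwriting on each scheduled refresh:
latest = max(partitions)
print(latest)
```

Downstream jobs that need a different layout (e.g. partitioned by customer or region) would then repartition from these freeze-aligned prefixes, as the closing sentence above suggests.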
These ingest patterns should cover most of the combinations of value, variety, velocity, veracity and volume data scenarios arising from importing relational data into an Amazon S3 data lake.