
Migrating Relational Data into an Amazon S3 Data Lake

5 V's of Big Data

The concept of a data lake is not new, but with the proliferation of cloud providers the capacity for many companies to adopt the model has exploded. A data lake is a centralised store for all kinds of business data:

  • unstructured – images, videos, PDFs, Word documents
  • semi-structured – JSON, XML, spreadsheets
  • structured – CSVs, RDBMS tables, tabular spreadsheets

typically stored in a format as close to the raw source as possible. This takes advantage of elastic cloud storage and minimises bugs arising from transforming the source data before it lands in permanent storage. Many organisations will find that a large proportion of their business-critical data is hosted within one or more relational database management systems (RDBMS). These sources, therefore, will be high on any priority list to be extracted and landed into the data lake, both to break down the data silos built around them and to democratise access to the data they contain.

In any data engineering exercise, it is important to make assessments of the 5Vs for each data set. The 5Vs are:

Velocity – how quickly is the data created and changed, and what latency requirements must your pipelines satisfy?

Volume – how much data is there, both in total and per item?

Value – what value does the data provide to the business, and how does this change as the data ages?

Variety – how many different shapes of data are there (for an RDBMS the variety should be very low, with the occasional NULL field or free-text column), and how does this change over time (how do you deal with slowly changing dimensions)?

Veracity – how consistent and well understood is your data? Is the process by which it is created, moved and transformed documented, validated and repeatable?

With these attributes understood for the dataset in question, the appropriate pattern of ingest into the lake should be straightforward to determine.

If the data is high velocity (requires low latency) and/or every change to every item represents value to the business, then a replication solution like AWS DMS (Database Migration Service) is appropriate. The challenge with such an approach is that the data will land in Amazon S3 in CSV or Parquet format as change records: the operation ([I]nsert, [U]pdate or [D]elete), table name and database schema name, followed by the column values for the new row state (or the deleted row’s data). In general, this format will not be ideal for consumption by BI services or for direct import into downstream data warehouses or data marts. Consequently, some form of “hydration” into a point-in-time state will be necessary.
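
As an illustration, here is a minimal PySpark sketch of the hydration step. The S3 paths and column names are hypothetical: it assumes DMS was configured to add a transaction timestamp column (here ts) and that order_id is the primary key. Only the latest state of each row is kept, and rows whose final change is a delete are dropped.

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("cdc-hydrate").getOrCreate()

    # Raw DMS change records: an Op column (I/U/D), a transaction timestamp and the row values.
    cdc = spark.read.parquet("s3://my-lake/raw/dms/sales/orders/")

    # Keep only the most recent change per primary key.
    latest_change = Window.partitionBy("order_id").orderBy(F.col("ts").desc())
    hydrated = (cdc
                .withColumn("rn", F.row_number().over(latest_change))
                .filter("rn = 1")
                .filter(F.col("Op") != "D")   # rows whose final operation was a delete no longer exist
                .drop("rn", "Op", "ts"))

    # Point-in-time state, ready for BI tools or a downstream warehouse.
    hydrated.write.mode("overwrite").parquet("s3://my-lake/curated/sales/orders/")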

If the velocity is lower (hourly, daily or above) then batch processing is a valid option. This can take several forms:

If the volume is low, then a simple and effective approach is to ingest the entire table on the schedule. This ensures that historical updates are captured in addition to new data. It can be achieved using AWS DMS, AWS Glue or the database's native export capabilities.
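
For the low-volume case, a whole-table snapshot can be as simple as a scheduled Spark (or AWS Glue) job that reads the table over JDBC and overwrites the previous copy in Amazon S3. The connection details and paths below are placeholders, and the appropriate JDBC driver is assumed to be on the Spark classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("full-table-snapshot").getOrCreate()

    # Read the whole table from the source RDBMS on each scheduled run.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db-host:5432/sales")
              .option("dbtable", "public.orders")
              .option("user", "ingest_user")
              .option("password", "change-me")   # in practice, fetch from a secrets manager
              .load())

    # Replace the previous snapshot in the lake.
    orders.write.mode("overwrite").parquet("s3://my-lake/raw/sales/orders/")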

If the volume or velocity is such that a complete table copy is impractical, then a full initial copy followed by incremental delta capture is necessary. How the deltas are defined will depend on the data available within the table or on domain knowledge of how the upstream table changes.

If the data in the table is immutable (append-only), then the deltas can be captured by tracking the last key ingested from the table and only importing rows beyond it. AWS Glue with job bookmarking enabled is an implementation of this pattern.
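
A minimal AWS Glue job script sketching this pattern is shown below. Database, table, key and path names are placeholders. With job bookmarks enabled on the job, Glue records the highest value of the bookmark key seen so far and returns only new rows on each subsequent run.

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Bookmark on the immutable table's key so only unseen rows are read this run.
    events = glue_context.create_dynamic_frame.from_catalog(
        database="sales",
        table_name="events",
        transformation_ctx="events",   # required for bookmarking to track this source
        additional_options={"jobBookmarkKeys": ["event_id"],
                            "jobBookmarkKeysSortOrder": "asc"},
    )

    glue_context.write_dynamic_frame.from_options(
        frame=events,
        connection_type="s3",
        connection_options={"path": "s3://my-lake/raw/sales/events/"},
        format="parquet",
    )

    job.commit()   # advances the bookmark to the latest key ingested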

However, many tables are transactional and support updates that tracking a table watermark will not identify. If the intermediate states of records don’t matter at the scheduled granularity, then a column in the source table that captures the date-time each record was last updated can be used. The delta is then defined as the rows with an updated time after the last ingest. With appropriate indexing this lookup can be made efficient, so only changed rows are imported. The challenge then is how these deltas are applied and how data hydration is achieved. Using Apache Hudi via Amazon EMR or a custom AWS Glue connector handles this transparently, provided the Amazon S3 destination is defined as an Apache Hudi table.
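
The sketch below illustrates the pattern with Apache Hudi on Spark (for example on Amazon EMR, with the Hudi Spark bundle on the classpath). Table, column and path names are hypothetical: rows updated since the last ingest are selected via an updated_at column and upserted into a Hudi table keyed on order_id, which keeps the point-in-time state in Amazon S3 hydrated.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("updated-at-delta").getOrCreate()

    last_ingest = "2021-06-01 00:00:00"   # persisted between runs in practice

    delta_query = (f"(SELECT * FROM public.orders "
                   f"WHERE updated_at > '{last_ingest}') AS delta")

    delta = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://db-host:5432/sales")
             .option("dbtable", delta_query)
             .option("user", "ingest_user")
             .option("password", "change-me")
             .load())

    # Upsert the changed rows into the Hudi table backing the lake.
    (delta.write.format("hudi")
          .option("hoodie.table.name", "orders")
          .option("hoodie.datasource.write.recordkey.field", "order_id")
          .option("hoodie.datasource.write.precombine.field", "updated_at")
          .option("hoodie.datasource.write.operation", "upsert")
          .mode("append")
          .save("s3://my-lake/curated/sales/orders/"))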

If “updated at” columns are not already present in the source table, then business rules of thumb can be used instead. For instance, if certain business processes run on a known schedule (e.g. financial reconciliations) such that the data is frozen after review, this can be used to define the deltas: no data for ids created prior to the last freeze needs to be brought over, and any records with ids after it are replaced on each scheduled run. The challenge, as for updates generally, is how to integrate the deltas into the existing Amazon S3 storage. If the Apache Hudi solutions cannot be brought to bear, then Amazon S3 table partitioning can ensure the deltas are applied efficiently: by partitioning the data in Amazon S3 according to the data freeze schedule, only the most recent partition needs to be overwritten. The downside is that the freeze schedule may not produce a partition scheme that aligns with the query patterns of downstream analysis services; additional data transforms can restructure the data as required.
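
As a sketch of the partition-overwrite approach, assume the lake is partitioned by a hypothetical reconciliation_period column that matches the freeze schedule; only the current, still-mutable partition is rewritten on each run, and frozen partitions are never touched.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-refresh").getOrCreate()

    current_period = "2021-06"   # the only period that can still change under the freeze rules

    delta = spark.read.parquet("s3://my-lake/raw/finance/ledger_delta/")

    # Rewrite just the latest partition; earlier (frozen) partitions stay as they are.
    (delta.filter(delta.reconciliation_period == current_period)
          .drop("reconciliation_period")
          .write.mode("overwrite")
          .parquet(f"s3://my-lake/curated/finance/ledger/reconciliation_period={current_period}/"))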

These ingest patterns should cover most of the combinations of value, variety, velocity, veracity and volume data scenarios arising from importing relational data into an Amazon S3 data lake.


Digital Transformation with Data


Harnessing Data to drive effective digital transformation

The COVID-19 pandemic has made clear that businesses need to be prepared for flexible, remote working practices.

As lockdowns forced offices to close and people headed home to limit the potential spread of the virus, many organisations found they weren’t prepared to provision the necessary work from home (WFH) technology and processes for their staff to continue with business as usual.

As a result, businesses have been required to undertake (or accelerate) a significant digital transformation journey to get up to speed. As these transformation journeys roll out, the need to harness data effectively becomes more critical than ever for a successful, long term change. Here’s what you need to consider.

A strategic approach

Before beginning a digital transformation, it’s critical to have a strategy in place to explain how you will manage, store, secure and use your data. Yet, this is a step that’s often forgotten in the rush to transform and digitise processes.

A data strategy should be driven by the needs of your business. Your strategy will also define how to make decisions about the use of data, more capably manage data flow, and secure information effectively.

Any successful plan will identify realistic goals along with a road map for rolling it out. This ensures that you’re properly prepared for every step of the journey.

Beginning the journey

A digital transformation unshackles an organisation from the past. It empowers you to move into the future, free of outdated technology and slower manual processes.

For example, take mobile and cloud technology. While we were once restricted to an office environment for productive working, it’s now possible for geographically diverse teams to collaborate as efficiently as they would in a traditional office setting. Files, apps, and other resources can all be accessed remotely, and meetings held virtually, giving workplaces and workforces the ability to be truly flexible.

However, the reality of a digital transformation is that with staff spread across locations, there are a range of new infrastructure management issues to consider. Chief among these is data security.

Keeping data safe is vital as users access business networks and devices remotely, often without the protection provided by robust on-site architecture. It’s important to decide how you’ll service and secure company devices, and how you’ll make sure users and the data they handle and generate are protected, and to implement those systems early.

Harnessing the power of data

With a clear data strategy in place and your digital journey underway, you can start to take advantage of the power of your data and use it to drive improved decision-making internally and externally.

Artificial intelligence (AI) and machine learning (ML) can be used to sort your unstructured data, learning as they go to uncover valuable insights. Once the data has been cleansed, you can enrich it by adding third-party data or public datasets to uncover more hidden insight.

The adoption of AI and ML also frees your people for bigger picture tasks. Instead of manually sorting through stacks of data, they can concentrate on delivering valuable and creative work powered by the insights you’ve identified – ultimately working towards the goals outlined in your data strategy.

Gathering data helps to deliver external benefits to your business too. It can improve customer service by identifying current pain points or uncover new customer segments for targeting – the possibilities are endless!

The lesson in the journey

Businesses shouldn’t underestimate the change that needs to be undertaken in digital transformation journeys. They require significant planning and thought before beginning. However, while the challenge is large, data can make the journey less difficult and more successful.

Accessible, accurate and relevant data enables businesses to make better informed decisions and deliver actionable insights. And by establishing a data strategy up front, you can better understand, apply and secure your data to meet the needs of your organisation.

If you’re asking yourself questions such as “Are we doing things the right way?” or “Can we do this better?” why not get in touch and let’s explore how TechConnect can deliver results for your business as you undertake a digital transformation.

What’s the difference between Artificial Intelligence (AI) & Machine Learning (ML)?

The field of Artificial Intelligence encompasses all efforts at imbuing computational devices with capabilities that have traditionally been viewed as requiring human-level intelligence. 

This includes:

  • Chess, Go and generalised game playing
  • Planning and goal-directed behaviour in dynamic and complex environments 
  • Theorem proving, proof assistants and symbolic reasoning 
  • Computer vision  
  • Natural language understanding and translation 
  • Deductive, inductive and abductive reasoning 
  • Learning from experience and existing data 
  • Understanding and emulating emotion 
  • Fuzzy and probabilistic (Bayesian) reasoning
  • Communication, teamwork, negotiation and argumentation between self-interested agents 
  • Early advances in signal processing (text to speech) 
  • Music understanding and creation 

Like intelligence itself, it defies precise definition.

As a field, it predates Machine Learning, which was initially seen as one of its sub-fields. Many things that now seem obvious, or are no longer considered AI, have their roots in the field. Many database models (hierarchical, network and relational) have their roots in AI research. Optimisation and scheduling were early problems tackled under the umbrella of AI. Minsky’s Frame model reads like an early description of Object Oriented programming. LISP, Prolog and many other programming languages and language features emerged as tools for, or as a result of, AI research.

Neural networks (now a sub-field of machine learning) first emerged in the form of the perceptron in the late 1950s and were heavily studied until it was demonstrated that a single perceptron is unable to compute XOR. However, with the advent in the 1980s of error back-propagation over networks of perceptron-like units (a way to systematically train the weights between neurons), it was shown that neural networks have computational power equivalent to universal Turing machines: if a function can be computed on a Turing machine, a suitably configured neural network can also implement it.
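
A minimal sketch of the XOR point: with hand-chosen (not learned) weights, a two-layer network of step-activation units computes XOR by combining an OR unit and a NAND unit with an AND unit, something a single perceptron cannot do.

    import numpy as np

    def step(x):
        """Heaviside step activation used by the classic perceptron."""
        return (x > 0).astype(int)

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

    # Hidden layer: first unit behaves as OR, second as NAND (weights chosen by hand).
    hidden_w = np.array([[1.0, -1.0],
                         [1.0, -1.0]])
    hidden_b = np.array([-0.5, 1.5])

    # Output unit: AND of the two hidden units.
    output_w = np.array([1.0, 1.0])
    output_b = -1.5

    hidden = step(X @ hidden_w + hidden_b)
    output = step(hidden @ output_w + output_b)
    print(output)   # [0 1 1 0] -- XOR of each input pair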

With the advent of Deep Learning in the 2010s the popularity of machine learning has soared, as great successes have been achieved using the approach. Due to limits on computational power, traditional neural networks were trained on meticulously human-engineered features of the datasets, not the raw datasets themselves. With progress in cloud computing, GPUs and distributed learning it became possible to create much larger and deeper neural networks, to the point that large raw datasets could be used directly for training and prediction. In doing so, the neural networks extract their own features from the data as part of the process. Many of the recent advances are due to this (in addition to better neuron activation functions, faster training algorithms and new network architectures). The successes have also inspired people to use Deep Learning as a means of solving some of the other problems in general AI (as discussed above), which may explain why many perceive a convergence, or confusion, between AI and Machine Learning.

Machine Learning using Convolutional Neural Networks

Machine Learning with Amazon SageMaker

Computers are generally programmed to do what the developer dictates and will only behave predictably under the specified scenarios.

In recent years, people have increasingly turned to computers to perform tasks that can’t be achieved with traditional programming and that previously had to be done manually by humans. Machine Learning gives computers the ability to ‘learn’ and act on information based on observations, without being explicitly programmed.

TechConnect entered the recent Get 2 the Core challenge on Unearthed’s crowd-sourcing platform. This is TechConnect’s story, as part of the crowd-sourcing approach, and does not imply or assert in any way that Newcrest Mining endorses Amazon Web Services or the work TechConnect has performed in this challenge.

Business problem

Currently a team at Newcrest Mining manually crops photographs of drill core samples before the photos can be fed into a system which detects the material type. This is extremely time-consuming due to the large number of photos, which is why Newcrest Mining turned to crowd sourcing via Unearthed, a platform bringing data scientists, start-ups and the energy and natural resources industry together.

Being able to automatically identify bounding box co-ordinates of the samples within an image would save 80-90% of the time spent preparing the photos.

Input Image

Expected Output Image

 

Before we can begin implementing an object-detection process, we first need to address a variety of issues with the photographs themselves:

  • Not all photos are straight
  • Not all core trays are in a fixed position relative to the camera
  • Not all photos are taken perpendicular to the core trays, introducing perspective distortion
  • Not all photos are high-resolution

In addition to the object detection, we need an image-classification process to classify each image into a group based on the factors above. The groups are defined as:

Group 0 – Core trays are positioned correctly in the images with no distortion. This is the ideal case
Group 1 – Core trays are misaligned in the image
Group 2 – Core trays have perspective distortion
Group 3 – Core trays are misaligned and have perspective distortion
Group 4 – The photo has a low aspect ratio
Group 5 – The photo has a low aspect ratio and is misaligned

CNN Image Detection with Amazon SageMaker

Solution

We set out to solve this problem using Machine Learning – in particular, supervised learning. In supervised learning the system is provided with the input data and the desired classification/label for each data point. The system learns a model that reliably outputs the correct label for previously seen inputs and the most likely label for unseen ones.

This differs from unsupervised learning. When utilising unsupervised techniques, the target label is unknown and the system must group or derive the label from the inherent properties within the data set itself.

The Supervised Machine Learning process works by:

  1. Obtain, prepare & label the input data
  2. Create a model
  3. Train the model
  4. Test the model
  5. Deploy & use the model

There are many specific algorithms for supervised learning that are appropriate for different learning tasks. The object detection and classification problem of identifying core samples in images is particularly suited to a technique known as convolutional neural networks. The model ‘learns’ by assigning and constantly adjusting internal weights and biases for each input of the training data to produce the specified output. The weights and biases become more accurate with more training data.

Amazon SageMaker provides a hosted platform that enabled us to quickly build, train, test and deploy our model.

Newcrest Mining provided a large collection of their photographs which contain core samples. A large subset of the photos also contained the expected output, which we used to train our model.

The expected output is a set of four (X, Y) coordinates per core sample in the photograph. The coordinates represent the corners of the bounding box that surrounds the core sample. Multiple sets of coordinates are expected for photos that contain multiple core samples.
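
As an illustration, the helper below (hypothetical names throughout) converts the four corner coordinates supplied for each core sample into the axis-aligned box format (left, top, width, height) used in the JSON annotation files expected by the SageMaker built-in object detection algorithm applied later in this post.

    import json

    def corners_to_annotation(file_name, image_width, image_height, samples):
        """samples: a list of core samples, each a list of four (x, y) corner tuples."""
        annotations = []
        for corners in samples:
            xs = [x for x, _ in corners]
            ys = [y for _, y in corners]
            left, top = min(xs), min(ys)
            annotations.append({"class_id": 0,
                                "left": left,
                                "top": top,
                                "width": max(xs) - left,
                                "height": max(ys) - top})
        return {"file": file_name,
                "image_size": [{"width": image_width, "height": image_height, "depth": 3}],
                "annotations": annotations,
                "categories": [{"class_id": 0, "name": "core"}]}

    record = corners_to_annotation("tray_0001.jpg", 4000, 3000,
                                   [[(112, 430), (935, 418), (941, 602), (118, 611)]])
    print(json.dumps(record, indent=2))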

The Process

We uploaded the supplied data to an Amazon S3 bucket, using separate prefixes for the images we were given expected output for and those without. S3 is an ideal store for the raw images, with high durability, effectively unlimited capacity and direct integration with many other AWS products.

We further randomly split the photos with the expected output into a training dataset (70%) and a testing dataset (30%).
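
A shuffled 70/30 split is straightforward to reproduce; the sketch below uses placeholder S3 keys and a fixed seed so the split is repeatable.

    import random

    # Placeholder keys for the photographs that came with expected output.
    labelled_keys = [f"labelled/tray_{i:04d}.jpg" for i in range(1, 1001)]

    random.seed(42)                 # fixed seed keeps the split reproducible
    random.shuffle(labelled_keys)

    split_point = int(len(labelled_keys) * 0.7)
    train_keys = labelled_keys[:split_point]   # 70% used for training
    test_keys = labelled_keys[split_point:]    # 30% held back for testing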

We created a Jupyter notebook on an Amazon SageMaker notebook instance to host and execute our code. By default the Jupyter notebook instance provides access to a wide variety of common data science tools such as numpy, tensorflow and matplotlib in addition to the Amazon SageMaker and AWS python SDKs. This allowed us to immediately focus on our particular problem of creating SageMaker compatible datasets with which we could build and test our models.

We trained our model by feeding the training dataset, along with the expected output, into an existing SageMaker built-in object detection model to fine-tune it for our specific problem. SageMaker exposes a collection of hyperparameters which influence how the model ‘learns’. Adjusting the hyperparameter values affects the overall accuracy of the model and how long the training takes. As the training proceeded we were able to monitor the changes to the primary accuracy metric and pre-emptively cancel any training configurations that did not perform well. This saved us considerable time and money by allowing us to abort poor configurations early.
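
The sketch below shows roughly what this looks like with the SageMaker Python SDK and the built-in object detection algorithm. The bucket names, IAM role and hyperparameter values are placeholders, not the settings used in the challenge.

    import sagemaker
    from sagemaker import image_uris
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    region = session.boto_region_name
    role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder role

    # Container image for the built-in object detection algorithm in this region.
    container = image_uris.retrieve("object-detection", region, version="latest")

    estimator = Estimator(image_uri=container,
                          role=role,
                          instance_count=1,
                          instance_type="ml.p3.2xlarge",
                          output_path="s3://my-bucket/models/core-detection/",
                          sagemaker_session=session)

    # A handful of the algorithm's hyperparameters; the values are illustrative only.
    estimator.set_hyperparameters(num_classes=1,
                                  num_training_samples=700,
                                  epochs=30,
                                  learning_rate=0.001,
                                  mini_batch_size=16)

    # Channels expected by the algorithm when using JSON annotation files.
    estimator.fit({"train": "s3://my-bucket/train/",
                   "validation": "s3://my-bucket/validation/",
                   "train_annotation": "s3://my-bucket/train_annotation/",
                   "validation_annotation": "s3://my-bucket/validation_annotation/"})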

We then tested the accuracy of our model by feeding it the testing data – data it had never seen – without the expected output, then comparing the model’s output to the expected output.

After the first round of training we had our benchmark for accuracy. From there we were able to tune the model by iteratively adjusting the hyperparameters and model parameters and by augmenting the dataset with additional examples, then retraining and retesting. Setting the hyperparameter values is more of an art form than a science – trial and error is often the best way.

We used a technique which dynamically assigned values to the learning rate after each epoch, similar to a harmonic progression:

Harmonic Progression

This technique allowed us to start with large values so the model converged quickly at first, then reduce the learning rate by an increasingly smaller amount after each epoch as the model got closer to an optimal solution. After many iterations of tuning, training and testing we had improved the overall accuracy of the model compared with our benchmark, and with our project deadline fast approaching we decided that it was as accurate as possible in the timeframe that we had.
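
Conceptually, the schedule looked something like the sketch below (the base rate is illustrative only): the learning rate shrinks like a harmonic progression, so early epochs take large steps and later epochs make increasingly fine adjustments.

    def harmonic_learning_rate(base_lr, epoch):
        """Learning rate for a given epoch, decaying like a harmonic progression:
        base_lr, base_lr / 2, base_lr / 3, ..."""
        return base_lr / (epoch + 1)

    for epoch in range(6):
        print(epoch, round(harmonic_learning_rate(0.01, epoch), 5))
    # 0.01, 0.005, 0.00333, 0.0025, 0.002, 0.00167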

We then used our model to classify and detect the objects in the remaining photographs that didn’t exist in the training set.  The following images show the bounding boxes around the cores that our model predicted:

CNN Bounding

Lessons Learned

Before we began we had extremely high expectations of how accurate our model would be. In reality it wasn’t as accurate as we had hoped.
We discussed things that could have made the model more accurate, train faster or both, including:

  • Tuning the hyperparameters using SageMaker’s automated hyperparameter tuning tooling
  • Copying the data across multiple regions to gain better access to the specific machine types we required for training
  • Increasing the size of the training dataset by:
    • Requesting more photographs
    • Duplicating the provided photographs and modifying them slightly (a small sketch of this follows the list). This included:
      • including duplicate copies of images and labels
      • including copies after converting the images to greyscale
      • including copies after changing the aspect ratio of the images
      • including copies after mirroring the images
  • Splitting the problem into separate, simpler machine learnable stages
  • Strategies for identifying the corners of the cores when they are not a rectangle in the image
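
A small sketch of the duplication-with-modification idea mentioned above, using Pillow (file names are placeholders); each variant would be added to the training set alongside suitably adjusted labels.

    from PIL import Image, ImageOps

    def augment(path):
        """Return a few simple variants of a core-tray photograph."""
        image = Image.open(path)
        return {"greyscale": ImageOps.grayscale(image),
                "mirrored": ImageOps.mirror(image),
                # Squash the height to change the aspect ratio.
                "stretched": image.resize((image.width, int(image.height * 0.75)))}

    for name, variant in augment("tray_0001.jpg").items():
        variant.save(f"tray_0001_{name}.jpg")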

During these discussions we realised we hadn’t defined a cut-off for when we would consider our model to be ‘accurate enough’.

As a general rule the accuracy of the models you build improves most rapidly in the first few iterations; after that, the rate of improvement slows significantly. Each subsequent improvement requires lengthier training, more sophisticated algorithms and models, more sophisticated feature engineering or a substantial change of approach entirely. This trend is depicted in the following chart:

Learning accuracy over time

Depending on the use case, a model with an accuracy of 90% often requires significantly less training time, engineering effort and sophistication than a model with an accuracy of 93%. The acceptance criteria for a model needs to carefully balance these considerations to maximise the overall return on investment for the project.

In our case time was the factor that dictated when we stopped training and started using the model to produce the outputs for unseen photographs.

 

Thank you to the team at TechConnect that volunteered to try Amazon SageMaker to address the Get 2 the Core challenge posted by Newcrest Mining on the Unearthed portal. Also big thanks for sharing lessons learned and putting this blog together!

Intensive Care Unit - Data Collection

Precision Medicine Data Platform

Recently TechConnect and IntelliHQ attended the eHealth Expo 2018. IntelliHQ are specialists in Machine Learning in the health space, and are the innovators behind the development of a cloud-based precision medicine data platform. TechConnect are IntelliHQ’s cloud technology partners, and our strong relationship with Amazon Web Services and the AWS life sciences team has enabled us to deliver the first steps towards building out the precision medicine data platform.

This video certainly sums up the goals of IntelliHQ and how TechConnect are partnering to deliver solutions in life sciences on the Amazon Web Services cloud platform.

Achieving this level of integration with the General Electric Carescape High Speed Data Interface is a first in Australia and potentially a first outside of America. TechConnect have designed a lightweight service to connect to the GE Carescape and push the high-fidelity data to Amazon Kinesis Data Firehose and then on to cost-effective persistent storage on Amazon S3.
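
A minimal sketch of the push side of such a connector using boto3 (the stream name, record shape and values are hypothetical): each high-fidelity sample is serialised and delivered to a Kinesis Data Firehose stream, which batches the records into objects on Amazon S3.

    import json
    import boto3

    firehose = boto3.client("firehose", region_name="ap-southeast-2")

    # Hypothetical waveform sample read from the Carescape HSDI.
    sample = {"bed": "ICU-07",
              "channel": "ECG_II",
              "timestamp": "2018-08-01T03:14:15.926Z",
              "values": [0.12, 0.14, 0.13, 0.11]}

    firehose.put_record(
        DeliveryStreamName="carescape-hsdi-stream",            # placeholder stream name
        Record={"Data": (json.dumps(sample) + "\n").encode("utf-8")},
    )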

With the raw data stored on Amazon S3, data lake principles can be applied to enrich and process data for research and ultimately help save more lives in a proactive way. The diagram below shows a high level architecture that supports the data collection and machine learning capability inside the precision medicine data platform.

 

GE Carescape HSDI to Cloud Connector

This software, named Panacea, will be made available as an open source project.

Be sure to explore the following two sources of further information:

Check out Dr Brent Richards’ presentation at the recent eHealth Expo 2018 as well as a selection of other speakers located here.

AIkademi seeks to develop the capabilities of individuals, organisations and communities to embrace the opportunities emerging from machine learning.
