Migrating Relational Data into an Amazon S3 Data Lake

The concept of a data lake is not new, but with the proliferation and adoption of cloud providers the capacity for many companies to adopt the model has exploded. A data lake is a centralised store for all kinds of business data:

  • unstructured – images, videos, PDFs, Word documents
  • semi-structured – JSON, XML, spreadsheets
  • structured – CSVs, RDBMS tables, tabular spreadsheets

typically stored in a format as close to the raw source as possible. This is to take advantage of the elastic storage and to minimise bugs arising from transforming the source data before landing in permanent storage. Of these, many organisations will find that a large proportion of their business-critical data will be hosted within one or more relational database management systems (RDBMS). These sources, therefore, will be high on any priority list to be extracted and landed into the data lake to break down the data silos built around them and to democratise access to the data contained within.

In any data engineering exercise, it is important to make assessments of the 5Vs for each data set. The 5Vs are:

Velocity: how quickly is the data created and changed, and what latency requirements must your pipelines satisfy?

Volume: how much data is there, both in total and per item?

Value: what value does the data provide to the business, and how does this change as the data ages?

Variety: how many different shapes of data are there (for an RDBMS the variety should be very low, with the occasional NULL field or free-text column), and how does this change over time (how do you deal with slowly changing dimensions)?

Veracity: how consistent and well understood is your data? Is the process by which it is created, moved and transformed documented, validated and repeatable?

With these attributes understood for the dataset in question, the particular pattern of ingest into the lake should be straightforward to determine.

If the data is high velocity (requires low latency) and/or every change to every item represents value to the business, then a replication solution like AWS Database Migration Service (AWS DMS) is appropriate. The challenge with such an approach is that the data will land in Amazon S3 in CSV or Parquet format, with each record carrying the operation ([I]nsert, [U]pdate or [D]elete), table name and database schema name, followed by the column values for the new row state (or the deleted row data). In general, this format will not be ideal for consumption via BI services or for direct import into downstream data warehouses or data marts. Consequently, some form of “hydration” into a point-in-time state will be necessary.
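The hydration step can be illustrated with a minimal sketch. It assumes the change records have already been parsed into dictionaries and arrive in commit order; the field names `op` and `id` and the `hydrate` helper are hypothetical and are not part of the AWS DMS output format.

```python
from typing import Dict, Iterable


def hydrate(changes: Iterable[dict], key: str = "id") -> Dict[object, dict]:
    """Replay ordered change records into the latest state per primary key."""
    state: Dict[object, dict] = {}
    for change in changes:
        # Everything except the operation flag is treated as row data.
        row = {k: v for k, v in change.items() if k != "op"}
        if change["op"] in ("I", "U"):
            state[row[key]] = row        # insert, or overwrite with the new row state
        elif change["op"] == "D":
            state.pop(row[key], None)    # remove the row if it is present
        else:
            raise ValueError(f"unknown operation: {change['op']!r}")
    return state


# Hypothetical change records in the order they were committed.
changes = [
    {"op": "I", "id": 1, "status": "new"},
    {"op": "I", "id": 2, "status": "new"},
    {"op": "U", "id": 1, "status": "paid"},
    {"op": "D", "id": 2, "status": "new"},
]
snapshot = hydrate(changes)
```

Here `snapshot` holds only the rows that exist after all changes are replayed. Stopping the replay at an earlier record would give the state at that earlier point in time.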

If the velocity is lower (hourly, daily or above) then batch processing is a valid option. This can take several forms:

If the volume is low, then a simple and effective approach is to ingest the entire table according to the schedule. This ensures that historical updates are captured in addition to new data. This can be achieved using AWS DMS, AWS Glue or the database’s native export capabilities.

If the volume or velocity is such that a complete table copy is impractical, then a full backup followed by incremental delta capture is necessary. How the deltas are defined will depend on the data available within the table or on domain knowledge of how the upstream table changes.

If the data in the table is immutable, then the deltas can be captured by keeping the last compound key ingested for the table and fetching only the rows beyond it. AWS Glue with job bookmarks enabled is an implementation of this pattern.
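A minimal sketch of the watermark pattern follows. It uses an in-memory SQLite table as a stand-in for the source database, and a single integer key rather than a compound key to keep it short. The `events` table and the `fetch_new_rows` helper are hypothetical, and this is not how AWS Glue implements job bookmarks internally.

```python
import sqlite3


def fetch_new_rows(conn, last_id):
    """Return rows appended since the stored watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = rows[-1][0] if rows else last_id
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

# First run: no watermark yet, so every row is ingested.
first_batch, watermark = fetch_new_rows(conn, last_id=0)

# A new row is appended upstream; the second run picks up only that row.
conn.execute("INSERT INTO events VALUES (?, ?)", (3, "c"))
second_batch, watermark = fetch_new_rows(conn, watermark)
```

The watermark must be persisted between runs (for example alongside the ingested data) so that a restarted job resumes from the right place.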

However, many tables are transactional and support updates, which tracking a table watermark will not identify. If the intermediate states of records don’t matter at the scheduled granularity, then a column in the source table that captures the date-time each record was last updated can be used. The delta is then defined by the rows with an updated time after the last data ingest. With appropriate indexing this lookup can be made efficient and only changed rows will be imported. The challenge then is how these deltas are applied and how data hydration is achieved. Using Apache Hudi via Amazon EMR or via a custom AWS Glue connector enables this transparently, provided the Amazon S3 destination is defined as an Apache Hudi table.
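The delta extraction and its application can be sketched as below. SQLite stands in for the source database and a plain dictionary stands in for the lake table, so the `apply_delta` upsert only illustrates the idea and is not the Apache Hudi API. The `orders` table, its columns and both helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)"
)
# An index on the updated-at column keeps the delta lookup efficient.
conn.execute("CREATE INDEX idx_orders_updated_at ON orders (updated_at)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "paid", "2021-01-02T08:30:00"),  # updated since the last ingest
        (2, "new", "2021-01-01T10:00:00"),   # unchanged
        (3, "new", "2021-01-02T09:00:00"),   # created since the last ingest
    ],
)


def extract_delta(conn, last_ingest):
    """Fetch only rows changed after the last ingest (ISO 8601 strings sort by time)."""
    return conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_ingest,),
    ).fetchall()


def apply_delta(snapshot, delta):
    """Upsert changed rows into the existing snapshot, keyed by primary key."""
    merged = dict(snapshot)
    for row_id, status, updated_at in delta:
        merged[row_id] = (status, updated_at)
    return merged


# State of the lake after the previous ingest.
snapshot = {
    1: ("new", "2021-01-01T09:00:00"),
    2: ("new", "2021-01-01T10:00:00"),
}
delta = extract_delta(conn, "2021-01-02T00:00:00")
snapshot = apply_delta(snapshot, delta)
```

Only rows 1 and 3 are read from the source; row 2 is left untouched in the snapshot.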

If “updated at” columns are not already present in the source table, then business rules of thumb can be used instead. For instance, if it is known that certain business processes run on a known schedule (e.g. financial reconciliations) such that the data is frozen after review, then this can be used to define the deltas. No data for ids existing prior to the last freeze needs to be brought over. Any records with later ids are replaced according to the schedule. The challenge, as for updates generally, is how to integrate them into the existing Amazon S3 storage. In this instance, if the Apache Hudi solutions cannot be brought to bear, then Amazon S3 table partitioning can ensure the deltas are applied efficiently. If the data in Amazon S3 is partitioned according to the data freeze schedule, then only the most recent partition of data needs to be overwritten. The downside to this approach is that the freeze schedule may not result in a partition scheme that aligns with the query patterns of downstream analysis services. To overcome this, additional data transforms can restructure the data as required.
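The partition overwrite can be sketched with a dictionary standing in for the partitioned Amazon S3 prefix. The partition names, the `last_frozen_id` parameter and the `refresh_open_partition` helper are all hypothetical, and the sketch assumes ids increase over time so that everything after the last freeze has a larger id.

```python
def refresh_open_partition(lake, source_rows, last_frozen_id, open_partition="open"):
    """Overwrite only the partition holding rows created after the last freeze."""
    fresh = [row for row in source_rows if row["id"] > last_frozen_id]
    updated = dict(lake)              # shallow copy: frozen partitions are reused as-is
    updated[open_partition] = fresh   # the open partition is replaced wholesale
    return updated


# Lake contents after the previous ingest: one frozen partition, one open one.
lake = {
    "freeze=2021-01": [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}],
    "open": [{"id": 3, "amount": 30}],
}

# Current source table: id 3 has changed and id 4 is new.
source_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 35},
    {"id": 4, "amount": 40},
]

lake_after = refresh_open_partition(lake, source_rows, last_frozen_id=2)
```

The frozen partition is never rewritten, so the cost of each scheduled run depends only on the volume of data since the last freeze.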

These ingest patterns should cover most of the combinations of value, variety, velocity, veracity and volume data scenarios arising from importing relational data into an Amazon S3 data lake.
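As a rough summary, the decision path described above can be written as a small function. The parameter names and returned labels are illustrative only, and a real assessment would weigh the 5Vs with more nuance than five booleans.

```python
def choose_ingest_pattern(
    low_latency: bool,
    every_change_matters: bool,
    small_table: bool,
    immutable: bool,
    has_updated_at: bool,
) -> str:
    """Map the assessment of a table onto one of the ingest patterns above."""
    if low_latency or every_change_matters:
        return "replication (change data capture)"
    if small_table:
        return "scheduled full-table copy"
    if immutable:
        return "incremental load by key watermark"
    if has_updated_at:
        return "incremental load by updated-at column"
    return "incremental load by business rule (data freeze)"


# Example: a large, mutable table refreshed daily that has an updated-at column.
pattern = choose_ingest_pattern(
    low_latency=False,
    every_change_matters=False,
    small_table=False,
    immutable=False,
    has_updated_at=True,
)
```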

How can businesses improve mobile customer experience?

GETTING CLOSER TO YOUR MOBILE USERS – NO MATTER WHERE THEY ARE IN THE WORLD.

A couple of years ago one of our clients approached us with a difficult question:

“How can we understand and improve mobile game performance and player experience in targeted locations around the World?”

This was not easy to answer.

Simulating current game performance

The first step was figuring out how to simulate game performance on mobile devices across different geographical regions and mobile networks.

We considered simulating latency and throughput via selected data centre virtual machines with traffic shaping, to introduce errors and real-world experiences, but we quickly shelved this idea as it does not account for the many factors that may impact mobile network performance.

You may ask, “Why is this a challenge?”, so think about these problem statements:

What is a person’s experience in Accra, Ghana or Novosibirsk, Siberia?
Why are some geographies or locations performing poorly?
Which mobile networks show the worst player experience, and why?
How do I measure any improvements once implemented?

We concluded that there was no solution available for our customer. We promptly set about building one.

The idea, created by the team, was to build a device that would mimic a customer’s mobile phone and could be deployed on site at each targeted location. Using physical mobile phones was not a viable option due to battery, heat and other reliability issues.

The devices needed to meet the following criteria:

  • Capable of running continuously, 24/7.
  • Compatible with local SIM cards for the mobile networks.
  • Equipped with processor capacity close to that of a mobile phone.
  • Small, transportable and robust.
  • Completely automated, with almost no user intervention.

Our team tested a number of microcomputer options and finally settled on a customised version of the Raspberry Pi 4. Additional 4G modules, external aerials, a cooling system and a custom-designed aluminium housing were added to the package, delivering the capacity of 4 mobile networks (SIMs) per housing. The result was named GeoStream™.

Each GeoStream™ enables 4 mobile networks to be performance tested and monitored per location.

Data-driven decisions are the core of TechConnect’s capability, delivering solutions to our customers and unlocking value from data. It, therefore, goes without saying that we must collect and analyse the data. GeoStream™ provides:

  • Data capture, with results sent back to a centralised datalake in Australia.
  • Support for deploying multiple devices per location for broader testing and data collection.
  • Secure, cooled housing to deliver robustness in remote locations.
  • A reporting and analytics platform to analyse the data and enable data-driven decisions.
  • A scheduling platform for scheduling activity and runbooks.

Collecting the data

Custom Python scripts automatically connect the devices back to the datalake and control system in Sydney. Data is analysed via a web-based front-end application.

The GeoStream™ front-end is built on Angular and utilises a number of Amazon Web Services capabilities. The front-end gives administrators the ability to assign role-based access controls for roles such as admin, tester, scheduler and analytics.

The first revision of GeoStream™ was released in late 2019, with the very first unit being deployed to Ghana. GeoStream™ has helped deliver improvements across both the MTN and Vodafone networks in Accra. The Ghana unit continues serving analytics and test result data to the datalake in Sydney. GeoStream™ revision 2 was released in July 2020 and is destined for Canada, New Zealand, India, Taiwan, London and Estonia later in 2020, with many more locations expected to follow.

GeoStream™ has delivered an in-country solution to mobile network performance analysis, addressing many of the challenges that such analysis faces. The product provides a commercially viable tool that gives our client an accurate view of their mobile users’ experience across all their applications, no matter where they are and at any time of day.

“We love to innovate and when innovation intersects customer value we have made a real impact.” Mike Cunningham – TechConnect CEO.

TechConnect plans to roll out many GeoStream™ devices in conjunction with our private content delivery network, named Slipstream, to improve mobile player experience for our customer’s customer.

Digital Transformation with Data

Harnessing Data to drive effective digital transformation

The COVID-19 pandemic has made clear that businesses need to be prepared for flexible, remote working practices.

As lockdowns forced offices to close and people headed home to limit the potential spread of the virus, many organisations found they weren’t prepared to provision the necessary work from home (WFH) technology and processes for their staff to continue with business as usual.

As a result, businesses have been required to undertake (or accelerate) a significant digital transformation journey to get up to speed. As these transformation journeys roll out, the need to harness data effectively becomes more critical than ever for a successful, long-term change. Here’s what you need to consider.

A strategic approach

Before beginning a digital transformation, it’s critical to have a strategy in place to explain how you will manage, store, secure and use your data. Yet, this is a step that’s often forgotten in the rush to transform and digitise processes.

A data strategy should be driven by the needs of your business. Your strategy will also define how to make decisions about the use of data, more capably manage data flow, and secure information effectively.

Any successful plan will identify realistic goals along with a road map for rolling it out. This ensures that you’re properly prepared for every step of the journey.

Beginning the journey

A digital transformation unshackles an organisation from the past. It empowers you to move into the future, free of outdated technology and slower manual processes.

For example, take mobile and cloud technology. While we were once restricted to an office environment for productive working, it’s now possible for geographically diverse teams to collaborate as efficiently as they would in a traditional office setting. Files, apps, and other resources can all be accessed remotely, and meetings held virtually, giving workplaces and workforces the ability to be truly flexible.

However, the reality of a digital transformation is that with staff spread across locations, there are a range of new infrastructure management issues to consider. Chief among these is data security.

Keeping data safe is vital as users access business networks and devices remotely, often without the protection provided by robust on-site architecture. It’s important to decide how you’ll service and secure company devices and how you’ll protect users and the data they handle and generate, and to implement those systems early.

Harnessing the power of data

With a clear data strategy in place and your digital journey underway, you can start to take advantage of the power of your data and use it to drive improved decision-making internally and externally.

Artificial intelligence (AI) and machine learning (ML) can be used to sort your unstructured data, learning as they go to uncover valuable insights. Once the data has been cleansed, you can enrich it by adding third-party data or public datasets to uncover more hidden insight.

The adoption of AI and ML also frees your people for bigger picture tasks. Instead of manually sorting through stacks of data, they can concentrate on delivering valuable and creative work powered by the insights you’ve identified – ultimately working towards the goals outlined in your data strategy.

Gathering data helps to deliver external benefits to your business too. It can improve customer service by identifying current pain points or uncover new customer segments for targeting – the possibilities are endless!

The lesson in the journey

Businesses shouldn’t underestimate the change that needs to be undertaken in digital transformation journeys. They require significant planning and thought before beginning. However, while the challenge is large, data can make the journey less difficult and more successful.

Accessible, accurate and relevant data enables businesses to make better informed decisions and deliver actionable insights. And by establishing a data strategy up front, you can better understand, apply and secure your data to meet the needs of your organisation.

If you’re asking yourself questions such as “Are we doing things the right way?” or “Can we do this better?” why not get in touch and let’s explore how TechConnect can deliver results for your business as you undertake a digital transformation.

TechConnect achieves AWS Data and Analytics Competency

AWS Data and Analytics Competency

Proves technical proficiency, operational excellence, security, reliability and 360-degree customer delivery capability; cites major client projects with Virgin’s Velocity Frequent Flyer and IntelliHQ.

TechConnect IT Solutions (TechConnect), a leading provider of cloud services and an Amazon Web Services (AWS) Advanced Consulting Partner, today announces it has been awarded AWS Data and Analytics Competency certification; the only Advanced Consulting Partner in Australia to achieve this prestigious competency level.

The AWS Competency Program recognises partners who demonstrate technical proficiency and proven customer success in specialised solution areas. TechConnect undertook a rigorous partner validation process to be awarded the certification, including an independent audit of its technical, organisational, governance and customer capabilities; along with scrutiny of large scale, in-production customer deployments.

Customer case studies that were reviewed as part of the certification process include a customer insights project with Velocity Frequent Flyer; a predictive medicine data platform for IntelliHQ that uses heart rate variability to predict patient outcomes; and a big data analytics project with Kamala Tech that gave technical users data to form insights across the business including areas such as data science, machine learning, marketing systems, reporting and self-service capabilities.

As healthcare comes under more and more pressure to deliver quality personalised care under constrained budgets, the industry is seeing innovation in the use of data to drive efficiencies and better patient outcomes. As Machine Learning and Artificial Intelligence (AI) emerge as a driver for businesses to do more with less, healthcare can deliver better care with the same resources by using data to drive out those efficiencies.

“We partner with industry in AI as we need as many talented and gifted people in this space as possible,” said Dr Brent Richards, Medical Director of Innovation – Gold Coast Hospital and Health Service (GCHHS). “There is a lot that the industry can bring that healthcare specifically does not have in terms of hardware, software and talent.” Dr Richards played a key role in the project with IntelliHQ, which is a partnership between Gold Coast Health, industry and universities to transform healthcare through AI, enhancing patient outcomes and improving quality of care, while maximising cost-effectiveness.

Oliver Rees, Chief Analytics Officer with Virgin’s Velocity Frequent Flyer program has commended TechConnect’s expertise and cites the direct benefits for member experience. “Velocity have always been a company that is passionate about using insights to understand and improve on their members’ experience. Velocity worked with TechConnect to build a platform that would allow Velocity to combine member insights in a single location to make it easier for members to then receive relevant program offers,” said Oliver Rees, Chief Analytics Officer – Velocity Frequent Flyer.

“In achieving this level of competency with AWS, TechConnect has demonstrated our ability to help customers solve their most challenging data problems within large scale production deployments. We proved that we have deep expertise in designing, implementing, and managing Data and Analytics applications on the AWS platform and have delivered solutions seamlessly in the AWS Cloud environment,” said Clinton Thomson, Director of TechConnect IT Solutions.

TechConnect is a fast-growing Australian company, headquartered in Queensland and serving clients around Australia and worldwide. TechConnect helps customers extract business value from data and it has plans to grow its team to 100+ people over the next three to five years, creating graduate employment and professional development opportunities in Queensland and throughout Australia. The company has offices in Brisbane and the Gold Coast and has a graduate pathways program for top students in the STEM fields.