How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions

This is a guest post co-authored by Michael Davies from Open Universities Australia.

At Open Universities Australia (OUA), we empower students to explore a vast array of degrees from renowned Australian universities, all delivered through online learning. We offer students alternative pathways to achieve their educational aspirations, providing them with the flexibility and accessibility to reach their academic goals. Since our founding in 1993, we have supported over 500,000 students to achieve their goals by providing pathways to over 2,600 subjects at 25 universities across Australia.

As a not-for-profit organization, cost is a crucial consideration for OUA. While reviewing our contract for the third-party tool we had been using for our extract, transform, and load (ETL) pipelines, we realized that we could replicate much of the same functionality using Amazon Web Services (AWS) services such as AWS Glue, Amazon AppFlow, and AWS Step Functions. We also recognized that we could consolidate our source code (much of which was stored in the ETL tool itself) into a code repository that could be deployed using the AWS Cloud Development Kit (AWS CDK). By doing so, we had an opportunity not only to reduce costs but also to improve the visibility and maintainability of our data pipelines.

In this post, we show you how we used AWS services to replace our existing third-party ETL tool, improving the team's productivity and producing a significant reduction in our ETL operational costs.

Our approach

The migration initiative consisted of two main parts: building the new architecture and migrating data pipelines from the existing tool to the new architecture. Often, we would work on both in parallel, testing one component of the architecture while developing another at the same time.

From early in our migration journey, we began to define a few guiding principles that we would apply throughout the development process. These were:

  • Simple and modular – Use simple, reusable design patterns with as few moving parts as possible. Structure the code base to prioritize ease of use for developers.
  • Cost-effective – Use resources in an efficient, cost-effective way. Aim to minimize situations where resources sit idle while waiting for other processes to complete.
  • Business continuity – As much as possible, make use of existing code rather than reinventing the wheel. Roll out updates in stages to minimize potential disruption to existing business processes.

Architecture overview

The following Diagram 1 shows the high-level architecture for the solution.

Diagram 1: Overall architecture of the solution, using AWS Step Functions, Amazon Redshift, and Amazon S3

The following AWS services were used to shape our new ETL architecture:

  • Amazon Redshift – A fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift served as our central data repository, where we would store data, apply transformations, and make data available for use in analytics and business intelligence (BI). Note: The provisioned cluster itself was deployed separately from the ETL architecture and remained unchanged throughout the migration process.
  • AWS Cloud Development Kit (AWS CDK) – The AWS CDK is an open-source software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. Our infrastructure was defined as code using the AWS CDK. As a result, we simplified the way we defined the resources we wanted to deploy while using our preferred coding language for development.
  • AWS Step Functions – With AWS Step Functions, you can create workflows, also called state machines, to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning pipelines. AWS Step Functions can call over 200 AWS services, including AWS Glue, AWS Lambda, and Amazon Redshift. We used AWS Step Functions state machines to define, orchestrate, and execute our data pipelines.
  • Amazon EventBridge – We used Amazon EventBridge, the serverless event bus service, to define the event-based rules and schedules that would trigger our AWS Step Functions state machines (a minimal scheduling sketch follows this list).
  • AWS Glue – A data integration service, AWS Glue consolidates major data integration capabilities into a single service. These include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's also serverless, which means there's no infrastructure to manage, and it includes the ability to run Python scripts. We used it for executing long-running scripts, such as for ingesting data from an external API.
  • AWS Lambda – AWS Lambda is a highly scalable, serverless compute service. We used it for executing simple scripts, such as for parsing a single text file.
  • Amazon AppFlow – Amazon AppFlow enables simple integration with software as a service (SaaS) applications. We used it to define flows that would periodically load data from selected operational systems into our data warehouse.
  • Amazon Simple Storage Service (Amazon S3) – An object storage service offering industry-leading scalability, data availability, security, and performance. Amazon S3 served as our staging area, where we would store raw data before loading it into other services such as Amazon Redshift. We also used it as a repository for storing code that could be retrieved and used by other services.
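To show how some of these pieces fit together, the following is a minimal sketch in Python, assuming AWS CDK v2, of an EventBridge schedule that starts a Step Functions state machine. The state machine ARN, schedule, and construct names are placeholders for illustration, not our actual resources.

from aws_cdk import Stack, aws_events as events, aws_events_targets as targets, aws_stepfunctions as sfn
from constructs import Construct


class PipelineScheduleStack(Stack):
    """Starts a data pipeline state machine on a fixed schedule via EventBridge."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Placeholder ARN; the pipeline state machine is defined elsewhere in the app.
        pipeline = sfn.StateMachine.from_state_machine_arn(
            self, "NightlyPipeline",
            "arn:aws:states:ap-southeast-2:123456789012:stateMachine:nightly-load",
        )

        # EventBridge rule that starts the state machine once a day.
        events.Rule(
            self, "NightlySchedule",
            schedule=events.Schedule.cron(minute="0", hour="16"),
            targets=[targets.SfnStateMachine(pipeline)],
        )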

Where practical, we made use of the file structure of our code base for defining resources. We set up our AWS CDK app to refer to the contents of a specific directory and define a resource (for example, an AWS Step Functions state machine or an AWS Glue job) for each file it found in that directory. We also made use of configuration files so we could customize the attributes of specific resources as required.
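The following is a minimal sketch of that convention in Python, assuming a recent AWS CDK v2 release. The directory layout, the shape of the config file, and the construct names are assumptions for illustration; our actual implementation covers more resource types and attributes.

import json
from pathlib import Path

from aws_cdk import Duration, Stack, aws_stepfunctions as sfn
from constructs import Construct


class PipelineStack(Stack):
    """Defines one Step Functions state machine per definition file found in a directory."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        base_dir = Path(__file__).parent
        definitions_dir = base_dir / "pipelines"                      # one ASL JSON file per pipeline
        config = json.loads((base_dir / "config.json").read_text())   # optional per-pipeline overrides

        for definition_file in sorted(definitions_dir.glob("*.json")):
            name = definition_file.stem
            overrides = config.get(name, {})

            sfn.StateMachine(
                self, f"{name}StateMachine",
                state_machine_name=overrides.get("name", name),
                definition_body=sfn.DefinitionBody.from_string(definition_file.read_text()),
                timeout=Duration.hours(overrides.get("timeout_hours", 4)),
            )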

Details on specific patterns

In the architecture shown in Diagram 1 above, we illustrated several flows by which data could be ingested into or unloaded from our Amazon Redshift data warehouse. In this section, we highlight four specific patterns in more detail that were used in the final solution.

Pattern 1: Data transformation, load, and unload

Several of our data pipelines included significant data transformation steps, which were primarily performed through SQL statements executed by Amazon Redshift. Others required ingesting data into or unloading data from the data warehouse, which could be performed efficiently using COPY or UNLOAD statements executed by Amazon Redshift.

In line with our goal of using resources efficiently, we sought to avoid running these statements from within the context of an AWS Glue job or AWS Lambda function, because those processes would remain idle while waiting for the SQL statement to complete. Instead, we opted for an approach where SQL execution tasks would be orchestrated by an AWS Step Functions state machine, which would send the statements to Amazon Redshift and periodically check their progress before marking them as either successful or failed. The following Diagram 2 shows this workflow.

Diagram 2: Data transformation, load, and unload pattern using AWS Lambda and Amazon Redshift within an AWS Step Functions state machine
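As a simplified illustration of this pattern, the following Python sketch shows two Lambda handlers that such a state machine might call through the Amazon Redshift Data API: one to submit a statement and one to poll its status between Wait states. The event fields and error handling are assumptions, not our production code.

import boto3

# Both handlers run as tasks inside the state machine. The state machine submits the
# SQL, then loops through a Wait state and check_statement until the statement
# finishes, so no Glue or Lambda compute sits idle while Redshift does the work.
redshift_data = boto3.client("redshift-data")


def submit_statement(event, context):
    """Submit a SQL statement to Amazon Redshift via the Data API and return its ID."""
    response = redshift_data.execute_statement(
        ClusterIdentifier=event["cluster_id"],
        Database=event["database"],
        SecretArn=event["secret_arn"],   # warehouse credentials held in AWS Secrets Manager
        Sql=event["sql"],
    )
    return {"statement_id": response["Id"]}


def check_statement(event, context):
    """Check whether a previously submitted statement has finished."""
    response = redshift_data.describe_statement(Id=event["statement_id"])
    status = response["Status"]          # SUBMITTED | PICKED | STARTED | FINISHED | FAILED | ABORTED
    if status in ("FAILED", "ABORTED"):
        raise RuntimeError(f"Statement {event['statement_id']} failed: {response.get('Error')}")
    return {"statement_id": event["statement_id"], "status": status}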

Pattern 2: Data replication using AWS Glue

In cases where we needed to replicate data from a third-party source, we used AWS Glue to run a script that would query the relevant API, parse the response, and store the relevant data in Amazon S3. From there, we used Amazon Redshift to ingest the data using a COPY statement. The following Diagram 3 shows this workflow.

Diagram 3: Copying from an external API to Amazon Redshift with AWS Glue

Note: Another option for this step would be to use Amazon Redshift auto-copy, but this wasn't available at the time of development.
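The following is a simplified sketch of what such an AWS Glue Python script might look like. The API endpoint, pagination fields, bucket, and key are placeholders; the subsequent COPY statement is issued separately through the Pattern 1 state machine.

import json

import boto3
import urllib3

# Placeholder endpoint, bucket, and key for illustration only.
API_URL = "https://api.example.com/v1/enrolments"
BUCKET = "example-staging-bucket"
KEY = "external/enrolments/enrolments.json"

http = urllib3.PoolManager()
s3 = boto3.client("s3")

records, page = [], 1
while True:
    # Pull one page of results from the external API.
    response = http.request("GET", f"{API_URL}?page={page}")
    body = json.loads(response.data)
    records.extend(body["results"])
    if not body.get("next_page"):        # stop when the API reports no further pages
        break
    page += 1

# Stage the data as newline-delimited JSON so a later COPY ... FORMAT JSON 'auto'
# statement can load it into Amazon Redshift.
payload = "\n".join(json.dumps(record) for record in records)
s3.put_object(Bucket=BUCKET, Key=KEY, Body=payload.encode("utf-8"))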

Pattern 3: Data replication using Amazon AppFlow

For certain applications, we were able to use Amazon AppFlow flows in place of AWS Glue jobs. As a result, we could abstract away some of the complexity of querying external APIs directly. We configured our Amazon AppFlow flows to store the output data in Amazon S3, then used an EventBridge rule based on an End Flow Run Report event (an event that is published when a flow run is complete) to trigger a load into Amazon Redshift using a COPY statement. The following Diagram 4 shows this workflow.

By using Amazon S3 as an intermediate data store, we gained greater control over how the data was processed when it was loaded into Amazon Redshift, compared with loading the data directly into the data warehouse using Amazon AppFlow.

Diagram 4: Using Amazon AppFlow to integrate external data into Amazon S3 and copy it to Amazon Redshift
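As an illustration, the following AWS CDK (Python) sketch wires an EventBridge rule to the End Flow Run Report event and starts the state machine that runs the COPY. The flow name, state machine ARN, and detail filter are placeholder assumptions; check the event payload in your own account for the exact detail fields.

from aws_cdk import Stack, aws_events as events, aws_events_targets as targets, aws_stepfunctions as sfn
from constructs import Construct


class AppFlowTriggerStack(Stack):
    """Starts the Redshift COPY state machine when an AppFlow flow run completes."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        copy_state_machine = sfn.StateMachine.from_state_machine_arn(
            self, "CopyStateMachine",
            "arn:aws:states:ap-southeast-2:123456789012:stateMachine:appflow-copy",  # placeholder ARN
        )

        # Match the End Flow Run Report event that Amazon AppFlow publishes when a
        # flow run finishes. The flow-name filter is illustrative.
        events.Rule(
            self, "FlowCompletedRule",
            event_pattern=events.EventPattern(
                source=["aws.appflow"],
                detail_type=["AppFlow End Flow Run Report"],
                detail={"flow-name": ["salesforce-accounts"]},
            ),
            targets=[targets.SfnStateMachine(copy_state_machine)],
        )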

Pattern 4: Reverse ETL

Although most of our workflows involve data being brought into the data warehouse from external sources, in some cases we needed data to be exported to external systems instead. This way, we could run SQL queries with complex logic drawing on multiple data sources and use this logic to support operational requirements, such as identifying which groups of students should receive specific communications.

In this flow, shown in the following Diagram 5, we start by running an UNLOAD statement in Amazon Redshift to unload the relevant data to files in Amazon S3. From there, each file is processed by an AWS Lambda function, which performs any necessary transformations and sends the data to the external application through one or more API calls.

Diagram 5: Reverse ETL workflow, sending data back out to external data sources
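The following is a minimal sketch of the per-file Lambda function in this pattern. It assumes the UNLOAD wrote newline-delimited JSON and that the external system exposes a simple REST endpoint; the URL, field names, and payload shape are placeholders, not our actual integration.

import json

import boto3
import urllib3

s3 = boto3.client("s3")
http = urllib3.PoolManager()

EXTERNAL_API_URL = "https://api.example.com/v1/communications"   # placeholder endpoint


def handler(event, context):
    """Read one unloaded file from Amazon S3 and push its rows to an external API."""
    # The state machine passes the bucket and key of a single unloaded file per invocation.
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    lines = obj["Body"].read().decode("utf-8").splitlines()

    for line in lines:
        record = json.loads(line)                        # assumes UNLOAD wrote JSON lines
        payload = {
            "student_id": record["student_id"],          # placeholder field names
            "campaign": record["campaign_code"],
        }
        http.request(
            "POST",
            EXTERNAL_API_URL,
            body=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )

    return {"records_sent": len(lines)}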

Results

The re-architecture and migration process took 5 months to complete, from the initial concept to the successful decommissioning of the previous third-party tool. Most of the architectural effort was completed by a single full-time employee, with others on the team primarily assisting with the migration of pipelines to the new architecture.

We achieved significant cost reductions, with final expenses on AWS native services representing only a small percentage of the projected costs of continuing with the third-party ETL tool. Moving to a code-based approach also gave us greater visibility of our pipelines and made maintaining them quicker and easier. Overall, the transition was seamless for our end users, who were able to view the same data and dashboards both during and after the migration, with minimal disruption along the way.

Conclusion

By using the scalability and cost-effectiveness of AWS services, we were able to optimize our data pipelines, reduce our operational costs, and improve our agility.

Pete Allen, an analytics engineer from Open Universities Australia, says, "Modernizing our data architecture with AWS has been transformative. Transitioning from an external platform to an in-house, code-based analytics stack has vastly improved our scalability, flexibility, and performance. With AWS, we can now process and analyze data with much faster turnaround, lower costs, and higher availability, enabling rapid development and deployment of data solutions, leading to deeper insights and better business decisions."

About the Authors

Michael Davies is a Data Engineer at OUA. He has extensive experience across the education industry, with a particular focus on building robust and efficient data architecture and pipelines.

Emma Arrigo is a Solutions Architect at AWS, focusing on education customers across Australia. She specializes in leveraging cloud technology and machine learning to address complex business challenges in the education sector. Emma's passion for data extends beyond her professional life, as evidenced by her dog named Data.
