Migrate from Standard brokers to Express brokers in Amazon MSK using Amazon MSK Replicator

Amazon Managed Streaming for Apache Kafka (Amazon MSK) now offers a new broker type called Express brokers. It’s designed to deliver up to 3 times more throughput per broker, scale up to 20 times faster, and reduce recovery time by 90% compared to Standard brokers running Apache Kafka. Express brokers come preconfigured with Kafka best practices by default, support Kafka APIs, and provide the same low latency performance that Amazon MSK customers expect, so you can continue using existing client applications without any changes. Express brokers provide straightforward operations with hands-free storage management by offering unlimited storage without pre-provisioning, eliminating disk-related bottlenecks. To learn more about Express brokers, refer to Introducing Express brokers for Amazon MSK to deliver high throughput and faster scaling for your Kafka clusters.

Creating a new cluster with Express brokers is straightforward, as described in Amazon MSK Express brokers. However, if you have an existing MSK cluster, you need to migrate to a new Express-based cluster. In this post, we discuss how you should plan and perform the migration to Express brokers for your existing MSK workloads on Standard brokers. Express brokers offer a different user experience and a different shared responsibility boundary, so using them on an existing cluster is not possible. However, you can use Amazon MSK Replicator to copy all data and metadata from your existing MSK cluster to a new cluster comprising Express brokers.

MSK Replicator offers a built-in replication capability to seamlessly replicate data from one cluster to another. It automatically scales the underlying resources, so you can replicate data on demand without having to monitor or scale capacity. MSK Replicator also replicates Kafka metadata, including topic configurations, access control lists (ACLs), and consumer group offsets.

In the following sections, we discuss how to use MSK Replicator to replicate the data from a Standard broker MSK cluster to an Express broker MSK cluster and the steps involved in migrating the client applications from the old cluster to the new cluster.

Planning your migration

Migrating from Standard brokers to Express brokers requires thorough planning and careful consideration of various factors. In this section, we discuss key aspects to address during the planning phase.

Assessing the source cluster’s infrastructure and needs

It’s crucial to evaluate the capacity and health of the current (source) cluster to make sure it can handle additional consumption during migration, because MSK Replicator will retrieve data from the source cluster. Key checks include:

CPU utilization – The combined CPU User and CPU System utilization per broker should remain below 60%.
Network throughput – The cluster-to-cluster replication process adds extra egress traffic, because it might need to replicate the existing data (based on business requirements) along with the incoming data. For instance, if the ingress volume is X GB/day and data is retained in the cluster for 2 days, replicating the data from the earliest offset would cause the total egress volume for replication to be 2X GB. The cluster must accommodate this increased egress volume.
Let’s take an example where your existing source cluster has an average data ingress of 100 MBps and a peak data ingress of 400 MBps with retention of 48 hours. Let’s assume you have one consumer of the data you produce to your Kafka cluster, which means that your egress traffic will be the same as your ingress traffic. Based on this requirement, you can use the Amazon MSK sizing guide to calculate the broker capacity you need to safely handle this workload. In the spreadsheet, you will need to provide your average and maximum ingress/egress traffic in the cells, as shown in the following screenshot.

Because you need to replicate all the data produced in your Kafka cluster, the consumption will be higher than the regular workload. Taking this into account, your overall egress traffic will be at least twice the size of your ingress traffic.

However, when you run a replication tool, the resulting egress traffic will be higher than twice the ingress because you also need to replicate the existing data along with the new incoming data in the cluster. In the preceding example, you have an average ingress of 100 MBps and you retain data for 48 hours, which means that you have a total of approximately 18 TB of existing data in your source cluster (100 MBps × 48 hours ≈ 17.3 TB, rounded up) that needs to be copied over on top of the new data that’s coming through. Let’s further assume that your goal for the replicator is to catch up in 30 hours. In this case, your replicator needs to copy data at 260 MBps (100 MBps for ingress traffic + 160 MBps (17.3 TB/30 hours) for existing data) to catch up in 30 hours. The following figure illustrates this process.

Therefore, in the sizing guide’s egress cells, you need to add an additional 260 MBps to your average data out and peak data out to estimate the size of the cluster you should provision to complete the replication safely and on time.

Replication tools act as a consumer to the source cluster, so there is a chance that this replication consumer can consume higher bandwidth, which can negatively impact the existing application clients’ produce and consume requests. To control the replication consumer throughput, you can use a consumer-side Kafka quota in the source cluster to limit the replicator throughput. This makes sure that the replicator consumer will throttle when it goes beyond the limit, thereby safeguarding the other consumers. However, if the quota is set too low, the replication throughput will suffer and the replication might never end. Based on the preceding example, you can set a quota for the replicator to be at least 260 MBps, otherwise the replication will not finish in 30 hours.
Volume throughput – Data replication might involve reading from the earliest offset (based on business requirement), impacting your primary storage volume, which in this case is Amazon Elastic Block Store (Amazon EBS). The VolumeReadBytes and VolumeWriteBytes metrics should be checked to make sure the source cluster volume throughput has additional bandwidth to handle any additional read from the disk. Depending on the cluster size and replication data volume, you should provision storage throughput in the cluster. With provisioned storage throughput, you can increase the Amazon EBS throughput up to 1000 MBps depending on the broker size. The maximum volume throughput can be specified depending on broker size and type, as mentioned in Manage storage throughput for Standard brokers in a Amazon MSK cluster. Based on the preceding example, the replicator will start reading from the disk and the volume throughput of 260 MBps will be shared across all the brokers. However, existing consumers can lag, which will cause reading from the disk, thereby increasing the storage read throughput. Also, there is storage write throughput due to incoming data from the producer. In this scenario, enabling provisioned storage throughput will increase the overall EBS volume throughput (read + write) so that existing producer and consumer performance doesn’t get impacted due to the replicator reading data from EBS volumes.
Balanced partitions – Make sure partitions are well-distributed across brokers, with no skewed leader partitions.
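
The catch-up arithmetic from the network throughput example can be sketched in a few lines of shell. This is a minimal sketch and not part of the Amazon MSK sizing guide: the function name replication_rate_mbps is hypothetical, and it assumes a constant average ingress rate and rounds the backlog rate up to a whole MBps.

```shell
#!/usr/bin/env bash
# Estimate the throughput a replicator needs to catch up within a deadline.
# Args: average ingress (MBps), retention (hours), catch-up window (hours).
replication_rate_mbps() {
  local ingress_mbps=$1 retention_hours=$2 catchup_hours=$3
  # Existing data (MB) = ingress rate x retention period in seconds.
  local backlog_mb=$(( ingress_mbps * retention_hours * 3600 ))
  # Rate needed to drain the backlog within the window, rounded up.
  local catchup_seconds=$(( catchup_hours * 3600 ))
  local backlog_mbps=$(( (backlog_mb + catchup_seconds - 1) / catchup_seconds ))
  # The replicator must also keep up with new incoming data.
  echo $(( ingress_mbps + backlog_mbps ))
}

# Values from the example: 100 MBps ingress, 48-hour retention, 30-hour goal.
replication_rate_mbps 100 48 30   # prints 260
```

Shortening the catch-up window raises the required rate, so the same function can be used to see how a tighter deadline changes the egress headroom the source cluster needs.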

Depending on the assessment, you might need to vertically scale up or horizontally scale out the source cluster before migration.
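
The consumer-side quota mentioned under network throughput could be applied with Kafka’s kafka-configs.sh tool. The following is a minimal sketch under several assumptions: the client ID replicator-client is a hypothetical placeholder (this post does not state how the replicator identifies itself to the source cluster), BS_SRC and the client_iam.properties path are the ones used in the deployment steps of this post, and 1 MB is taken as 1,048,576 bytes. Kafka enforces client quotas per broker, so the sketch applies the full 260 MBps as a per-broker ceiling; a tighter per-broker value depends on how evenly partition leaders are spread.

```shell
#!/usr/bin/env bash
# Convert a rate in MBps to the bytes-per-second value that Kafka quotas use
# (assumes 1 MB = 1,048,576 bytes).
mbps_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical client ID; replace it with the identity the replicator
# actually presents to the source cluster.
REPLICATOR_CLIENT_ID="replicator-client"
QUOTA_BYTES=$(mbps_to_bytes 260)

# Print the quota command for review instead of running it directly.
echo /home/ec2-user/kafka/bin/kafka-configs.sh \
  --bootstrap-server "$BS_SRC" \
  --alter --add-config "consumer_byte_rate=${QUOTA_BYTES}" \
  --entity-type clients --entity-name "$REPLICATOR_CLIENT_ID" \
  --command-config /home/ec2-user/kafka/config/client_iam.properties
```

Printing the command first makes it easy to check the computed byte rate and the entity name before changing quotas on a production cluster.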

Assessing the target cluster’s infrastructure and needs

Use the same sizing tool to estimate the size of your Express broker cluster. Typically, fewer Express brokers might be needed compared to Standard brokers for the same workload because, depending on the instance size, Express brokers allow up to three times more ingress throughput.

Configuring Express brokers

Express brokers employ opinionated and optimized Kafka configurations, so it’s important to differentiate between configurations that are read-only and those that are read/write during planning. Read/write broker-level configurations should be configured separately as a pre-migration step in the target cluster. Although MSK Replicator will replicate most topic-level configurations, certain topic-level configurations are always set to default values in an Express cluster: replication-factor, min.insync.replicas, and unclean.leader.election.enable. If the default values differ from the source cluster, these configurations will be overridden.

As part of the metadata, MSK Replicator also copies certain ACL types, as mentioned in Metadata replication. It doesn’t explicitly copy the write ACLs except the deny ones. Therefore, if you’re using SASL/SCRAM or mTLS authentication with ACLs rather than AWS Identity and Access Management (IAM) authentication, write ACLs need to be explicitly created in the target cluster.

Client connectivity to the target cluster

Deployment of the target cluster can occur within the same virtual private cloud (VPC) or a different one. Consider any changes to client connectivity, including updates to security groups and IAM policies, during the planning phase.

Migration strategy: All at once vs. wave

Two migration strategies can be adopted:

All at once – All topics are replicated to the target cluster simultaneously, and all clients are migrated at once. Although this approach simplifies the process, it generates significant egress traffic and involves risks to multiple clients if issues arise. However, if there is any failure, you can roll back by redirecting the clients to use the source cluster. It’s recommended to perform the cutover during non-business hours and communicate with stakeholders beforehand.
Wave – Migration is broken into phases, moving a subset of clients (based on business requirements) in each wave. After each phase, the target cluster’s performance can be evaluated before proceeding. This reduces risks and builds confidence in the migration but requires meticulous planning, especially for large clusters with many microservices.

Each strategy has its pros and cons. Choose the one that aligns best with your business needs. For insights, refer to Goldman Sachs’ migration strategy to move from on-premises Kafka to Amazon MSK.

Cutover plan

Although MSK Replicator facilitates seamless data replication with minimal downtime, it’s essential to devise a clear cutover plan. This includes coordinating with stakeholders, stopping producers and consumers in the source cluster, and restarting them in the target cluster. If a failure occurs, you can roll back by redirecting the clients to use the source cluster.

Schema registry

When migrating from a Standard broker to an Express broker cluster, schema registry considerations remain unaffected. Clients can continue using existing schemas for both producing and consuming data with Amazon MSK.

Solution overview

In this setup, two Amazon MSK provisioned clusters are deployed: one with Standard brokers (source) and the other with Express brokers (target). Both clusters are located in the same AWS Region and VPC, with IAM authentication enabled. MSK Replicator is used to replicate topics, data, and configurations from the source cluster to the target cluster. The replicator is configured to maintain identical topic names across both clusters, providing seamless replication without requiring client-side changes.

During the first phase, the source MSK cluster handles client requests. Producers write to the clickstream topic in the source cluster, and a consumer group with the group ID clickstream-consumer reads from the same topic. The following diagram illustrates this architecture.

When data replication to the target MSK cluster is complete, we need to evaluate the health of the target cluster. After confirming the cluster is healthy, we need to migrate the clients in a controlled manner. First, we need to stop the producers, reconfigure them to write to the target cluster, and then restart them. Then, we need to stop the consumers after they have processed all remaining records in the source cluster, reconfigure them to read from the target cluster, and restart them. The following diagram illustrates the new architecture.

After verifying that all clients are functioning correctly with the target cluster using Express brokers, we can safely decommission the source MSK cluster with Standard brokers and the MSK Replicator.

Deployment steps

In this section, we discuss the step-by-step process to replicate data from an MSK Standard broker cluster to an Express broker cluster using MSK Replicator and also the client migration strategy. For the purpose of the blog, “” migration strategy is used.

Provision the MSK cluster

Download the AWS CloudFormation template to provision the MSK cluster. Deploy it in us-east-1 with the stack name migration.

This will create the VPC, subnets, and two Amazon MSK provisioned clusters: one with Standard brokers (source) and another with Express brokers (target) within the VPC configured with IAM authentication. It will also create a Kafka client Amazon Elastic Compute Cloud (Amazon EC2) instance from which we can use the Kafka command line to create and view Kafka topics and produce and consume messages to and from the topic.

Configure the MSK client

On the Amazon EC2 console, connect to the EC2 instance named migration-KafkaClientInstance1 using Session Manager, a capability of AWS Systems Manager.

After you log in, you need to configure the source MSK cluster bootstrap address to create a topic and publish data to the cluster. You can get the bootstrap address for IAM authentication from the details page for the MSK cluster (migration-standard-broker-src-cluster) on the Amazon MSK console, under View Client Information. You also need to update the producer.properties and consumer.properties files to reflect the bootstrap address of the Standard broker cluster.

sudo su - ec2-user

export BS_SRC=<>
sed -i "s/BOOTSTRAP_SERVERS_CONFIG=/BOOTSTRAP_SERVERS_CONFIG=${BS_SRC}/g" producer.properties
sed -i "s/bootstrap.servers=/bootstrap.servers=${BS_SRC}/g" consumer.properties

Create a topic

Create a clickstream topic using the following commands:

/home/ec2-user/kafka/bin/kafka-topics.sh --bootstrap-server=$BS_SRC \
--create --replication-factor 3 --partitions 3 \
--topic clickstream \
--command-config=/home/ec2-user/kafka/config/client_iam.properties

Produce and consume messages to and from the topic

Run the clickstream producer to generate events in the clickstream topic:

cd /home/ec2-user/clickstream-producer-for-apache-kafka/

java -jar target/KafkaClickstreamClient-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/producer.properties -nt 8 -rf 3600 -iam \
-gsr -gsrr <> -grn default-registry -gar

Open another Session Manager instance and from that shell, run the clickstream consumer to consume from the topic:

cd /home/ec2-user/clickstream-consumer-for-apache-kafka/

java -jar target/KafkaClickstreamConsumer-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/consumer.properties -nt 3 -rf 3600 -iam \
-gsr -gsrr <> -grn default-registry

Keep the producer and consumer running. If not interrupted, the producer and consumer will run for 60 minutes before they exit. The -rf parameter controls how long the producer and consumer will run, in seconds.

Create an MSK replicator

To create an MSK replicator, complete the following steps:

On the Amazon MSK console, choose Replicators in the navigation pane.
Choose Create replicator.
In the Replicator details section, enter a name and optional description.

In the Source cluster section, provide the following information:
1. For Cluster region, choose us-east-1.
2. For MSK cluster, enter the MSK cluster Amazon Resource Name (ARN) for the Standard broker.

After the source cluster is selected, it automatically selects the subnets and the security group associated with the source cluster. You can also select additional security groups.

Make sure that the security groups have outbound rules to allow traffic to your cluster’s security groups. Also make sure that your cluster’s security groups have inbound rules that accept traffic from the replicator security groups provided here.

In the Target cluster section, for MSK cluster, enter the MSK cluster ARN for the Express broker.

After the target cluster is selected, it automatically selects the subnets and the security group associated with the target cluster. You can also select additional security groups.

Now let’s provide the replicator settings.

In the Replicator settings section, provide the following information:
1. For the purpose of the example, we keep the topics to replicate at the default value, which replicates all topics from the source cluster to the target cluster.
2. For Replicator starting position, we configure it to replicate from the earliest offset, so that we can get all the events from the start of the source topics.
3. To make the topic names in the target cluster identical to those in the source cluster, we select Keep the same topic names for Copy settings. This makes sure that the MSK clients don’t need to add a prefix to the topic names.
4. For this example, we keep the Consumer Group Replication setting as default (make sure it’s enabled to allow redirected clients to resume processing data from the last processed offset).
5. We set Target Compression type as None.

The Amazon MSK console will automatically create the required IAM policies. If you’re deploying using the AWS Command Line Interface (AWS CLI), SDK, or AWS CloudFormation, you have to create the IAM policy and use it as per your deployment process.

Choose Create to create the replicator.

The process will take around 15–20 minutes to deploy the replicator. When the MSK replicator is running, this will be reflected in the status.

Monitor replication

When the MSK replicator is up and running, monitor the MessageLag metric. This metric indicates how many messages are yet to be replicated from the source MSK cluster to the target MSK cluster. The MessageLag metric should come down to 0.

Migrate clients from source to target cluster

When the MessageLag metric reaches 0, it indicates that all messages have been replicated from the source MSK cluster to the target MSK cluster. At this stage, you can cut over client applications from the source to the target cluster. Before initiating this step, confirm the health of the target cluster by reviewing the Amazon MSK metrics in Amazon CloudWatch and making sure that the client applications are functioning properly. Then complete the following steps:

Stop the producers writing data to the source (old) cluster with Standard brokers and reconfigure them to write to the target (new) cluster with Express brokers.
Before migrating the consumers, make sure that the MaxOffsetLag metric for the consumers has dropped to 0, confirming that they have processed all existing data in the source cluster.
When this condition is met, stop the consumers and reconfigure them to read from the target cluster.

Offset lag occurs when the consumer consumes more slowly than the producer produces data. The flat line in the following metric visualization shows that the producer has stopped producing to the source cluster while the consumer attached to it continues to consume the existing data and eventually consumes all of it, so the metric goes to 0.

Now you can update the bootstrap address in producer.properties and consumer.properties to point to the target Express-based MSK cluster. You can get the bootstrap address for IAM authentication from the MSK cluster (migration-express-broker-dest-cluster) on the Amazon MSK console under View Client Information.

export BS_TGT=<>
sed -i "s/BOOTSTRAP_SERVERS_CONFIG=.*/BOOTSTRAP_SERVERS_CONFIG=${BS_TGT}/g" producer.properties
sed -i "s/bootstrap.servers=.*/bootstrap.servers=${BS_TGT}/g" consumer.properties

Run the clickstream producer to generate events in the clickstream topic:

cd /home/ec2-user/clickstream-producer-for-apache-kafka/

java -jar target/KafkaClickstreamClient-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/producer.properties -nt 8 -rf 60 -iam \
-gsr -gsrr <> -grn default-registry -gar

In another Session Manager instance and from that shell, run the clickstream consumer to consume from the topic:

cd /home/ec2-user/clickstream-consumer-for-apache-kafka/

java -jar target/KafkaClickstreamConsumer-1.0-SNAPSHOT.jar -t clickstream \
-pfp /home/ec2-user/consumer.properties -nt 3 -rf 60 -iam \
-gsr -gsrr <> -grn default-registry

We can see that the producers and consumers are now producing to and consuming from the target Express-based MSK cluster. The producers and consumers will run for 60 seconds before they exit.

The following screenshot shows the messages that the producer wrote to the new Express-based MSK cluster over 60 seconds.

Migrate stateful applications

Stateful applications such as Apache Spark and Apache Flink use their own checkpointing mechanisms to store consumer offsets instead of relying on Kafka’s consumer group offset mechanism. When migrating topics from a source cluster to a target cluster, the Kafka offsets in the source will differ from those in the target. As a result, migrating a stateful application along with its state requires careful consideration, because the existing offsets are incompatible with the replicated target cluster’s offsets. So, you need to rebuild the state by reprocessing all the replicated data in the target cluster.

Migrate Kafka Streams and KSQL applications

Kafka Streams and KSQL applications rely on internal topics for execution. For example, changelog topics are used for state management. It’s advisable to not replicate these internal changelog topics to the target MSK cluster. Instead, the Kafka Streams application should be configured to start from the earliest offset of the source topics in the target cluster. This allows the state to be rebuilt. However, this method results in duplicate processing, because all the data in the topic is reprocessed. Therefore, the target destination (such as a database) must be idempotent to handle these duplicates effectively.

Express brokers don’t allow configuring segment.bytes to optimize performance. Therefore, the internal topics need to be manually created before the Kafka Streams application is migrated to the new Express-based cluster. For more information, refer to Using Kafka Streams with MSK Express brokers and MSK Serverless.

Migrate Apache Spark applications

Spark stores offsets in its checkpoint location, which should be a file system compatible with HDFS, such as Amazon Simple Storage Service (Amazon S3). After migrating the Spark application to the target MSK cluster, you should remove the checkpoint location, causing the Spark application to lose its state. To rebuild the state, configure the Spark application to start processing from the earliest offset of the source topics in the target cluster. This will lead to reprocessing all the data from the start of the topic and therefore will generate duplicate data. Consequently, the target destination (such as a database) must be idempotent to effectively handle these duplicates.

Migrate Apache Flink applications

Flink stores consumer offsets within the state of its Kafka source operator. When checkpoints are completed, the Kafka source commits the current consuming offset to provide consistency between Flink’s checkpoint state and the offsets committed on Kafka brokers. Unlike other systems, Flink applications don’t rely on the __consumer_offsets topic to track offsets; instead, they use the offsets stored in Flink’s state.

During Flink application migration, one approach is to start the application without a Savepoint. This approach discards the entire state and reverts to reading from the last committed offset of the consumer group. However, this prevents the application from accurately rebuilding the state of downstream Flink operators, leading to discrepancies in computation results. To address this, you can either avoid replicating the consumer group of the Flink application or assign a new consumer group to the application when restarting it in the target cluster. Additionally, configure the application to start reading from the earliest offset of the source topics. This enables reprocessing all data from the source topics and rebuilding the state. However, this method will result in duplicate data, so the target system (such as a database) must be idempotent to handle these duplicates effectively.

Alternatively, you can reset the state of the Kafka source operator. Flink uses operator IDs (UIDs) to map the state to specific operators. When restarting the application from a Savepoint, Flink matches the state to operators based on their assigned IDs. It’s recommended to assign a unique ID to each operator to enable seamless state restoration from Savepoints. To reset the state of the Kafka source operator, change its operator ID. Passing the operator ID as a parameter in a configuration file can simplify this process. Restart the Flink application with the parameter --allowNonRestoredState (if you are running self-managed Flink). This will reset only the state of the Kafka source operator, leaving other operator states unaffected. As a result, the Kafka source operator resumes from the last committed offset of the consumer group, avoiding full reprocessing and state rebuilding. Although this might still produce some duplicates in the output, it results in no data loss. This approach is applicable only when using the DataStream API to build Flink applications.
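
For self-managed Flink, the restart described above could look like the following. This is a minimal sketch that only prints the command for review: the helper name flink_restart_cmd, the Savepoint path, and the JAR name are hypothetical, and it assumes the Kafka source operator ID has already been changed in the application.

```shell
#!/usr/bin/env bash
# Build the restart command for a self-managed Flink job whose Kafka source
# operator ID was changed, so that operator's old state is skipped on restore.
flink_restart_cmd() {
  local savepoint_path=$1 job_jar=$2
  # -s restores from the Savepoint; --allowNonRestoredState lets Flink skip
  # Savepoint state that no longer maps to any operator in the job.
  echo flink run -s "$savepoint_path" --allowNonRestoredState "$job_jar"
}

# Hypothetical Savepoint location and job JAR, printed for review.
flink_restart_cmd "s3://my-bucket/savepoints/savepoint-1" "clickstream-job.jar"
```

Without --allowNonRestoredState, Flink refuses to start when the Savepoint contains state for an operator ID that is no longer present in the job, which is exactly the situation a changed Kafka source ID creates.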

Conclusion

Migrating from a Standard broker MSK cluster to an Express broker MSK cluster using MSK Replicator provides a seamless, efficient transition with minimal downtime. By following the steps and strategies discussed in this post, you can take advantage of the high-performance, cost-effective benefits of Express brokers while maintaining data consistency and application uptime.

Ready to optimize your Kafka infrastructure? Start planning your migration to Amazon MSK Express brokers today and experience improved scalability, speed, and reliability. For more details, refer to the Amazon MSK Developer Guide.

About the Author

Subham Rakshit is a Senior Streaming Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build streaming architectures so they can get value from analyzing their streaming data. His two little daughters keep him occupied most of the time outside work, and he loves solving jigsaw puzzles with them. Connect with him on LinkedIn.