Amazon SageMaker HyperPod introduces Amazon EKS support



Today, we’re happy to announce Amazon Elastic Kubernetes Service (EKS) support in Amazon SageMaker HyperPod, purpose-built infrastructure engineered with resilience at its core for foundation model (FM) development. This new capability lets customers orchestrate HyperPod clusters using EKS, combining the power of Kubernetes with Amazon SageMaker HyperPod’s resilient environment designed for training large models. Amazon SageMaker HyperPod helps efficiently scale across more than a thousand artificial intelligence (AI) accelerators, reducing training time by up to 40%.

Amazon SageMaker HyperPod now allows customers to manage their clusters using a Kubernetes-based interface. This integration enables seamless switching between Slurm and Amazon EKS to optimize various workloads, including training, fine-tuning, experimentation, and inference. The CloudWatch Observability EKS add-on provides comprehensive monitoring capabilities, offering insights into CPU, network, disk, and other low-level node metrics on a unified dashboard. This enhanced observability extends to resource utilization across the entire cluster, node-level metrics, pod-level performance, and container-specific utilization data, facilitating efficient troubleshooting and optimization.
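The CloudWatch Observability add-on mentioned above is enabled on the EKS cluster itself. Here is a minimal sketch; the cluster name is a placeholder, and the call requires AWS credentials for your account:

```shell
# Enable the CloudWatch Observability add-on on an EKS cluster.
# The cluster name passed in is a placeholder; substitute your own.
enable_observability() {
  aws eks create-addon \
    --cluster-name "$1" \
    --addon-name amazon-cloudwatch-observability
}

# Example invocation (requires AWS credentials and an existing EKS cluster):
# enable_observability my-eks-cluster
```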

Launched at re:Invent 2023, Amazon SageMaker HyperPod has become a go-to solution for AI startups and enterprises looking to efficiently train and deploy large-scale models. It is compatible with SageMaker’s distributed training libraries, which offer Model Parallel and Data Parallel software optimizations that help reduce training time by up to 20%. SageMaker HyperPod automatically detects and repairs or replaces faulty instances, enabling data scientists to train models uninterrupted for weeks or months. This lets data scientists focus on model development rather than managing infrastructure.

The integration of Amazon EKS with Amazon SageMaker HyperPod takes advantage of Kubernetes, which has become popular for machine learning (ML) workloads due to its scalability and rich open-source tooling. Organizations often standardize on Kubernetes for building applications, including those required for generative AI use cases, because it allows reuse of capabilities across environments while meeting compliance and governance standards. Today’s announcement enables customers to scale and optimize resource utilization across more than a thousand AI accelerators. This flexibility enhances the developer experience, containerized app management, and dynamic scaling for FM training and inference workloads.

Amazon EKS support in Amazon SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, ensuring uninterrupted training for large-scale and long-running jobs. Job management can be streamlined with the optional HyperPod CLI, designed for Kubernetes environments, though customers can also use their own CLI tools. Integration with Amazon CloudWatch Container Insights provides advanced observability, offering deeper insights into cluster performance, health, and utilization. Additionally, data scientists can use tools like Kubeflow for automated ML workflows. The integration also includes Amazon SageMaker managed MLflow, providing a robust solution for experiment tracking and model management.

At a high level, an Amazon SageMaker HyperPod cluster is created by the cloud admin using the HyperPod cluster API and is fully managed by the HyperPod service, removing the undifferentiated heavy lifting involved in building and optimizing ML infrastructure. Amazon EKS is used to orchestrate these HyperPod nodes, similar to how Slurm orchestrates HyperPod nodes, providing customers with a familiar Kubernetes-based administrator experience.

Let’s explore how to get started with Amazon EKS support in Amazon SageMaker HyperPod.
I start by preparing the scenario, checking the prerequisites, and creating an Amazon EKS cluster with a single AWS CloudFormation stack following the Amazon SageMaker HyperPod EKS workshop, configured with VPC and storage resources.

To create and manage Amazon SageMaker HyperPod clusters, I can use either the AWS Management Console or the AWS Command Line Interface (AWS CLI). Using the AWS CLI, I specify my cluster configuration in a JSON file. I choose the Amazon EKS cluster created previously as the orchestrator of the SageMaker HyperPod cluster. Then, I create the cluster worker nodes, which I call “worker-group-1”, with a private subnet, NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks I add InstanceStress and InstanceConnectivity to enable deep health checks.

cat > eli-cluster-config.json << EOL
{
    "ClusterName": "example-hp-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 32,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
  ....
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "ResilienceConfig": {
        "NodeRecovery": "Automatic"
    }
}
EOL

You can add InstanceStorageConfigs to provision and mount additional Amazon EBS volumes on HyperPod nodes.
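For example, a worker group entry could include a fragment like the following; the 500 GiB volume size is an arbitrary placeholder, not a recommendation:

```json
"InstanceStorageConfigs": [
    {
        "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
        }
    }
]
```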

To create the cluster using the SageMaker HyperPod APIs, I run the following AWS CLI command:

aws sagemaker create-cluster \
--cli-input-json file://eli-cluster-config.json

The command returns the ARN of the new HyperPod cluster.

{
"ClusterArn": "arn:aws:sagemaker:us-east-2:ACCOUNT-ID:cluster/wccy5z4n4m49"
}

I then verify the HyperPod cluster status in the SageMaker console, waiting until the status changes to InService.

Alternatively, you can check the cluster status using the AWS CLI by running the describe-cluster command:

aws sagemaker describe-cluster --cluster-name my-hyperpod-cluster
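To avoid checking manually, the describe-cluster call can be wrapped in a small polling loop. This is a minimal sketch; the cluster name and the 30-second interval are placeholders:

```shell
# Poll the HyperPod cluster status until it reaches InService.
# Cluster name and sleep interval below are placeholders.
wait_for_cluster() {
  local name="$1"
  local status=""
  while true; do
    status=$(aws sagemaker describe-cluster \
      --cluster-name "$name" \
      --query ClusterStatus --output text)
    echo "Cluster status: $status"
    if [ "$status" = "InService" ]; then
      break
    fi
    sleep 30
  done
}

# wait_for_cluster my-hyperpod-cluster
```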

Once the cluster is ready, I can access the SageMaker HyperPod cluster nodes. For most operations, I can use kubectl commands to manage resources and jobs from my development environment, using the full power of Kubernetes orchestration while benefiting from SageMaker HyperPod’s managed infrastructure. For advanced troubleshooting or direct node access, I use AWS Systems Manager (SSM) to log into individual nodes, following the instructions in the Access your SageMaker HyperPod cluster nodes page.

To run jobs on the SageMaker HyperPod cluster orchestrated by EKS, I follow the steps outlined in the Run jobs on SageMaker HyperPod cluster through Amazon EKS page. You can use the HyperPod CLI and the native kubectl command to find available HyperPod clusters and submit training jobs (Pods). For managing ML experiments and training runs, you can use the Kubeflow Training Operator, Kueue, and Amazon SageMaker managed MLflow.
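As a hypothetical smoke test, a single Pod requesting one GPU can be submitted with kubectl. The Pod name, container image, and GPU count below are placeholders, not values from this walkthrough:

```shell
# Write a minimal Pod spec that requests one GPU and runs nvidia-smi.
# Name, image, and GPU count are placeholders.
cat > gpu-smoke-test.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Submit and inspect (requires a kubeconfig pointing at the EKS cluster):
# kubectl apply -f gpu-smoke-test.yaml
# kubectl logs gpu-smoke-test
```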

Finally, in the SageMaker console, I can view the Status and Kubernetes version of recently added EKS clusters, providing a comprehensive overview of my SageMaker HyperPod environment.

And I can monitor cluster performance and health metrics using Amazon CloudWatch Container Insights.

Things to know
Here are some key things you should know about Amazon EKS support in Amazon SageMaker HyperPod:

Resilient Environment – This integration provides a more resilient training environment with deep health checks, automated node recovery, and job auto-resume. SageMaker HyperPod automatically detects, diagnoses, and recovers from faults, allowing you to continually train foundation models for weeks or months without disruption. This can reduce training time by up to 40%.

Enhanced GPU Observability – Amazon CloudWatch Container Insights provides detailed metrics and logs for your containerized applications and microservices. This enables comprehensive monitoring of cluster performance and health.

Scientist-Friendly Tooling – This launch includes a custom HyperPod CLI for job management, Kubeflow Training Operators for distributed training, Kueue for scheduling, and integration with SageMaker managed MLflow for experiment tracking. It also works with SageMaker’s distributed training libraries, which provide Model Parallel and Data Parallel optimizations to significantly reduce training time. These libraries, combined with auto-resumption of jobs, enable efficient and uninterrupted training of large models.

Flexible Resource Utilization – This integration enhances the developer experience and scalability for FM workloads. Data scientists can efficiently share compute capacity across training and inference tasks. You can use your existing Amazon EKS clusters or create and attach new ones to HyperPod compute, and bring your own tools for job submission, queuing, and monitoring.

To get started with Amazon SageMaker HyperPod on Amazon EKS, you can explore resources such as the SageMaker HyperPod EKS Workshop, the aws-do-hyperpod project, and the awsome-distributed-training project. This launch is generally available in the AWS Regions where Amazon SageMaker HyperPod is available, except Europe (London). For pricing information, visit the Amazon SageMaker Pricing page.

This blog post was a collaborative effort. I would like to thank Manoj Ravi, Adhesh Garg, Tomonori Shimomura, Alex Iankoulski, Anoop Saha, and the entire team for their significant contributions in compiling and refining the information presented here. Their collective expertise was crucial in creating this comprehensive article.

– Eli.
