Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distributed AI

April 2, 2025


Foundation model (FM) training and inference have led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. They require efficient systems for distributing workloads across multiple GPU accelerated servers, and optimizing developer velocity as well as performance.

Ray is an open source framework that makes it straightforward to create, deploy, and optimize distributed Python jobs. At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. It provides a set of high-level APIs for tasks, actors, and data that abstract away the complexities of distributed computing, enabling developers to focus on the core logic of their applications. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application. Ray's key features include efficient task scheduling, fault tolerance, and automatic resource management, making it a powerful tool for building a wide range of distributed applications, from ML models to real-time data processing pipelines. With its growing ecosystem of libraries and tools, Ray has become a popular choice for organizations looking to use the power of distributed computing to tackle complex and data-intensive problems.
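As a quick illustration of these primitives, the following minimal sketch (not part of the original walkthrough; the function and class names are illustrative) runs a stateless task and a stateful actor:

import ray

ray.init()  # start Ray locally, or connect to an existing cluster

# A task: a stateless function executed remotely
@ray.remote
def square(x):
    return x * x

# An actor: a stateful worker process
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1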

Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of Ray provides a powerful framework to scale up your generative AI workloads.

In this post, we demonstrate the steps involved in running Ray jobs on SageMaker HyperPod.

Overview of Ray

This section provides a high-level overview of the Ray tools and frameworks for AI/ML workloads. We primarily focus on ML training use cases.

Ray is an open source distributed computing framework designed to run highly scalable and parallel Python applications. Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure through a single, flexible framework—enabling AI workloads from data processing, to model training, to model serving and beyond.

For distributed jobs, Ray provides intuitive tools for parallelizing and scaling ML workflows. It allows developers to focus on their training logic without the complexities of resource allocation, task scheduling, and inter-node communication.

At a high level, Ray is made up of three layers:

  • Ray Core: The foundation of Ray, providing primitives for parallel and distributed computing
  • Ray AI libraries:
    • Ray Train – A library that simplifies distributed training by offering built-in support for popular ML frameworks like PyTorch, TensorFlow, and Hugging Face
    • Ray Tune – A library for scalable hyperparameter tuning
    • Ray Serve – A library for distributed model deployment and serving
  • Ray clusters: A distributed computing platform where worker nodes run user code as Ray tasks and actors, often in the cloud

In this post, we dive deep into running Ray clusters on SageMaker HyperPod. A Ray cluster consists of a single head node and a number of connected worker nodes. The head node orchestrates task scheduling, resource allocation, and communication between nodes. The Ray worker nodes execute the distributed workloads using Ray tasks and actors, such as model training or data preprocessing.

Ray clusters and Kubernetes clusters pair well together. By running a Ray cluster on Kubernetes using the KubeRay operator, both Ray users and Kubernetes administrators benefit from the smooth path from development to production. For this use case, we use a SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS).

The KubeRay operator allows you to run a Ray cluster on a Kubernetes cluster. KubeRay creates the following custom resource definitions (CRDs):

  • RayCluster – The primary resource for managing Ray instances on Kubernetes. The nodes in a Ray cluster manifest as pods in the Kubernetes cluster.
  • RayJob – A single executable job designed to run on an ephemeral Ray cluster. It serves as a higher-level abstraction for submitting tasks or batches of tasks to be executed by the Ray cluster. A RayJob also manages the lifecycle of the Ray cluster, making it ephemeral by automatically spinning up the cluster when the job is submitted and shutting it down when the job is complete.
  • RayService – A Ray cluster and a Serve application that runs on top of it in a single Kubernetes manifest. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.
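For context, a RayJob manifest couples an entrypoint with the cluster it should run on; the following minimal sketch is illustrative only (the entrypoint script my_job.py is a placeholder) and is not used in the rest of this post:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob
spec:
  entrypoint: python my_job.py        # placeholder entrypoint bundled in the image
  shutdownAfterJobFinishes: true      # tear the ephemeral cluster down when the job completes
  rayClusterSpec:                     # same structure as the RayCluster manifest shown later in this post
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.42.1-py310-gpu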

For the remainder of this post, we don't focus on RayJob or RayService; we focus on creating a persistent Ray cluster to run distributed ML training jobs.

When Ray clusters are paired with SageMaker HyperPod clusters, they unlock enhanced resiliency and auto-resume capabilities, which we dive deeper into later in this post. This combination provides a solution for handling dynamic workloads, maintaining high availability, and providing seamless recovery from node failures, which is crucial for long-running jobs.

Overview of SageMaker HyperPod

In this section, we introduce SageMaker HyperPod and its built-in resiliency features that provide infrastructure stability.

Generative AI workloads such as training, inference, and fine-tuning involve building, maintaining, and optimizing large clusters of thousands of GPU accelerated instances. For distributed training, the goal is to efficiently parallelize workloads across these instances in order to maximize cluster utilization and minimize time to train. For large-scale inference, it's important to minimize latency, maximize throughput, and seamlessly scale across these instances for the best user experience. SageMaker HyperPod is a purpose-built infrastructure to address these needs. It removes the undifferentiated heavy lifting involved in building, maintaining, and optimizing a large GPU accelerated cluster. It also provides flexibility to fully customize your training or inference environment and compose your own software stack. You can use either Slurm or Amazon EKS for orchestration with SageMaker HyperPod.

Due to their massive size and the need to train on large amounts of data, FMs are often trained and deployed on large compute clusters composed of thousands of AI accelerators such as GPUs and AWS Trainium. A single failure in one of these thousand accelerators can interrupt the entire training process, requiring manual intervention to identify, isolate, debug, repair, and recover the faulty node in the cluster. This workflow can take several hours for each failure, and as the scale of the cluster grows, it's common to see a failure every few days or even every few hours. SageMaker HyperPod provides resiliency against infrastructure failures by applying agents that continuously run health checks on cluster instances, fix the unhealthy instances, reload the last valid checkpoint, and resume the training—without user intervention. As a result, you can train your models up to 40% faster. You can also SSH into an instance in the cluster for debugging and gather insights on hardware-level optimization during multi-node training. Orchestrators like Slurm or Amazon EKS facilitate efficient allocation and management of resources, provide optimal job scheduling, monitor resource utilization, and automate fault tolerance.

Solution overview

This section provides an overview of how to run Ray jobs for multi-node distributed training on SageMaker HyperPod. We go over the architecture and the process of creating a SageMaker HyperPod cluster, installing the KubeRay operator, and deploying a Ray training job.

Although this post provides a step-by-step guide to manually create the cluster, feel free to check out the aws-do-ray project, which aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. It uses Docker to containerize the tools necessary to deploy and manage Ray clusters, jobs, and services. In addition to the aws-do-ray project, we'd like to highlight the Amazon SageMaker HyperPod EKS workshop, which offers an end-to-end experience for running various workloads on SageMaker HyperPod clusters. There are several examples of training and inference workloads in the GitHub repository awsome-distributed-training.

As introduced earlier in this post, KubeRay simplifies the deployment and management of Ray applications on Kubernetes. The following diagram illustrates the solution architecture.

SMHP EKS Architecture

Create a SageMaker HyperPod cluster

Prerequisites

Before deploying Ray on SageMaker HyperPod, you need a HyperPod cluster.

If you want to deploy HyperPod on an existing EKS cluster, follow the instructions here, which include:

  • EKS cluster – You can associate SageMaker HyperPod compute to an existing EKS cluster that satisfies the set of prerequisites. Alternatively, and recommended, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the GitHub repo for instructions on setting up an EKS cluster.
  • Custom resources – Running multi-node distributed training requires various resources, such as device plugins, Container Storage Interface (CSI) drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to Install packages on the Amazon EKS cluster using Helm for installation instructions.

The following provides an example workflow for creating a HyperPod cluster on an existing EKS cluster after deploying the prerequisites. This is for reference only and not required for the quick deploy option.

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        {
            "InstanceGroupName": "head-group",
            "InstanceType": "ml.m5.2xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "${SECURITY_GROUP_ID}"
        ],
        "Subnets": [
            "${SUBNET_ID}"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

  • "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs SageMaker HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
  • "NodeRecovery": "Automatic" – Enables SageMaker HyperPod automatic node recovery

You can create a SageMaker HyperPod compute with the following AWS Command Line Interface (AWS CLI) command (AWS CLI version 2.17.47 or newer is required):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
{
"ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                    ListClusters                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                 ClusterSummaries                                                                 ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||                           ClusterArn                           |        ClusterName        | ClusterStatus  |           CreationTime             ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||  arn:aws:sagemaker:us-west-2:xxxxxxxxxxxx:cluster/zsmyi57puczf |         ml-cluster        |   InService     |  2025-03-03T16:45:05.320000+00:00  ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|

Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status of the nodes transitions to Running.
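You can also inspect the individual instances from the CLI; the following sketch assumes the cluster name ml-cluster from the configuration above:

aws sagemaker list-cluster-nodes --cluster-name ml-cluster --output table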

Create an FSx for Lustre shared file system

For us to deploy the Ray cluster, we need the SageMaker HyperPod cluster to be up and running, and additionally we need a shared storage volume (for example, an Amazon FSx for Lustre file system). This is a shared file system that the SageMaker HyperPod nodes can access. This file system can be provisioned statically before launching your SageMaker HyperPod cluster or dynamically afterwards.

Specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it's required for multi-node clusters. Using a local path raises an error during checkpointing for multi-node clusters.

The Amazon FSx for Lustre CSI driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.

Create an IAM OIDC identity provider for your cluster with the following command:

eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve

Deploy the FSx for Lustre CSI driver:

helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
  --namespace kube-system

This Helm chart includes a service account named fsx-csi-controller-sa that gets deployed in the kube-system namespace.

Use the eksctl CLI to create an AWS Identity and Access Management (IAM) role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS managed policy:

eksctl create iamserviceaccount \
  --name fsx-csi-controller-sa \
  --override-existing-serviceaccounts \
  --namespace kube-system \
  --cluster $EKS_CLUSTER_NAME \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
  --approve \
  --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
  --region $AWS_REGION

The --override-existing-serviceaccounts flag lets eksctl know that the fsx-csi-controller-sa service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the existing service account instead.

Annotate the driver's service account with the Amazon Resource Name (ARN) of the AmazonEKSFSxLustreCSIDriverFullAccess IAM role that was created:

SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)

kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
  eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true

This annotation lets the driver know which IAM role it should use to interact with the FSx for Lustre service on your behalf.

Verify that the service account has been properly annotated:

kubectl get serviceaccount -n kube-system fsx-csi-controller-sa -o yaml

Restart the fsx-csi-controller deployment for the changes to take effect:

kubectl rollout restart deployment fsx-csi-controller -n kube-system

The FSx for Lustre CSI driver presents you with two options for provisioning a file system:

  • Dynamic provisioning – This option uses Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with the desired storage specifications. The CSI driver automatically provisions the FSx for Lustre file system for you based on the PVC request. This allows for straightforward scaling and eliminates the need to manually create file systems.
  • Static provisioning – In this method, you manually create the FSx for Lustre file system before using the CSI driver. You will need to configure details like the subnet ID and security groups for the file system. Then, you can use the driver to mount this pre-created file system in your container as a volume.

For this example, we use dynamic provisioning. Start by creating a storage class that uses the fsx.csi.aws.com provisioner:

cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: ${SUBNET_ID}
  securityGroupIds: ${SECURITYGROUP_ID}
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.12"
mountOptions:
  - flock
EOF

kubectl apply -f storageclass.yaml

  • SUBNET_ID: The subnet ID where the FSx for Lustre file system will be created. It should be the same private subnet that was used for HyperPod creation.
  • SECURITYGROUP_ID: The security group IDs that will be attached to the file system. It should be the same security group ID that is used for HyperPod and EKS.

Next, create a PVC named fsx-claim that uses the fsx-sc storage class:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF

kubectl apply -f pvc.yaml

This PVC will start the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.
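Provisioning the file system typically takes several minutes; one way to watch for the claim to be bound is with kubectl:

kubectl get pvc fsx-claim
# When the STATUS column shows Bound, the FSx for Lustre file system is ready to be mounted
kubectl get pv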

Create the Ray cluster

Now that we have both the SageMaker HyperPod cluster and the FSx for Lustre file system created, we can set up the Ray cluster:

  1. Set up dependencies. We will create a new namespace in our Kubernetes cluster and install the KubeRay operator using a Helm chart.

We recommend using KubeRay operator version 1.2.0 or higher, which supports automatic Ray Pod eviction and replacement in case of failures (for example, hardware issues on EKS or SageMaker HyperPod nodes).

# Create KubeRay namespace
kubectl create namespace kuberay
# Deploy the KubeRay operator with the Helm chart repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and the KubeRay operator v1.2.0
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0 --namespace kuberay
# The KubeRay operator pod will be deployed onto the head pod
kubectl get pods --namespace kuberay

  2. Create a Ray container image for the Ray cluster manifest. With the recent deprecation of the `rayproject/ray-ml` images starting from Ray version 2.31.0, it's necessary to create a custom container image for our Ray cluster. Therefore, we build on top of the `rayproject/ray:2.42.1-py310-gpu` image, which has all necessary Ray dependencies, and include our training dependencies to build our own custom image. Feel free to modify this Dockerfile as you wish.

First, create a Dockerfile that builds upon the base Ray GPU image and includes only the required dependencies:

cat <<'EOF' > Dockerfile

FROM rayproject/ray:2.42.1-py310-gpu
# Install Python dependencies for PyTorch, Ray, Hugging Face, and more
RUN pip install --no-cache-dir \
    torch torchvision torchaudio \
    numpy \
    pytorch-lightning \
    transformers datasets evaluate tqdm click \
    ray[train] ray[air] \
    ray[train-torch] ray[train-lightning] \
    torchdata \
    torchmetrics \
    torch_optimizer \
    accelerate \
    scikit-learn \
    Pillow==9.5.0 \
    protobuf==3.20.3

RUN pip install --upgrade datasets transformers

# Set the user
USER ray
WORKDIR /home/ray

# Verify ray installation
RUN which ray && \
    ray --version
  
# Default command
CMD [ "/bin/bash" ]
 
EOF

Then, build and push the image to your container registry (Amazon ECR) using the provided script:

export AWS_REGION=$(aws configure get region)
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

echo "This process may take 10-15 minutes to complete..."

echo "Building image..."

docker build --platform linux/amd64 -t ${REGISTRY}aws-ray-custom:latest .

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "aws-ray-custom" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    aws ecr create-repository --repository-name aws-ray-custom
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REGISTRY

echo "Pushing image to $REGISTRY ..."

# Push image to registry
docker image push ${REGISTRY}aws-ray-custom:latest

Now, our Ray container image is in Amazon ECR with all necessary Ray dependencies, as well as the code library dependencies.

  3. Create a Ray cluster manifest. We use a Ray cluster to host our training jobs. The Ray cluster is the primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The Ray cluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS or SageMaker HyperPod cluster.

Note that there are two distinct sections in the cluster manifest. While the `headGroupSpec` defines the head node of the Ray cluster, the `workerGroupSpecs` define the worker nodes of the Ray cluster. While a job could technically run on the head node as well, it's common to separate the head node from the actual worker nodes where jobs are executed. Therefore, the instance for the head node can typically be a smaller instance (in this case, we chose an ml.m5.2xlarge). Because the head node also manages cluster-level metadata, it can be beneficial to have it run on a non-GPU node to minimize the risk of node failure (as GPUs can be a potential source of node failure).

cat <<'EOF' > raycluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayml
  labels:
    controller-tools.k8s.io: "1.0"
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # Pod template
    template:
      spec:
        # nodeSelector:
        #   node.kubernetes.io/instance-type: "ml.m5.2xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-head
          image: ${REGISTRY}aws-ray-custom:latest     ## IMAGE: Here you may choose which image your head pod will run
          env:                                ## ENV: Here is where you can pass variables to the head pod
            - name: RAY_GRAFANA_IFRAME_HOST   ## PROMETHEUS AND GRAFANA
              value: http://localhost:3000
            - name: RAY_GRAFANA_HOST
              value: http://prometheus-grafana.prometheus-system.svc:80
            - name: RAY_PROMETHEUS_HOST
              value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your head pod
              cpu: 1
              memory: 8Gi
            requests:                                  ## REQUESTS: Set resource requests for your head pod
              cpu: 1
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          volumeMounts:                                ## VOLUMEMOUNTS
          - name: fsx-storage
            mountPath: /fsx
          - name: ray-logs
            mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: fsx-storage
            persistentVolumeClaim:
              claimName: fsx-claim
  workerGroupSpecs:
  # The pod replicas in this group are typed worker
  - replicas: 4                                    ## REPLICAS: How many worker pods you want
    minReplicas: 1
    maxReplicas: 10
    # Logical group name; can be helpful for identifying the worker group
    groupName: gpu-group
    rayStartParams:
      num-gpus: "8"
    # Pod template
    template:
      spec:
        # nodeSelector:
        #   node.kubernetes.io/instance-type: "ml.p5.48xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-worker
          image: ${REGISTRY}aws-ray-custom:latest             ## IMAGE: Here you may choose which image your worker pods will run
          env:
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32
            requests:                                  ## REQUESTS: Set resource requests for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32
          volumeMounts:                                ## VOLUMEMOUNTS
          - name: ray-logs
            mountPath: /tmp/ray
          - name: fsx-storage
            mountPath: /fsx
        volumes:
        - name: fsx-storage
          persistentVolumeClaim:
            claimName: fsx-claim
        - name: ray-logs
          emptyDir: {}
EOF

  4. Deploy the Ray cluster:
envsubst < raycluster.yaml | kubectl apply -f -
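As a quick sanity check (assuming the manifest was applied in the default namespace), you can confirm that the RayCluster resource and its pods are up before moving on:

kubectl get rayclusters
# One head pod and four worker pods should eventually reach the Running state
kubectl get pods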

  5. Optionally, expose the Ray dashboard using port forwarding:
# Gets the name of the kubectl service that runs the head pod
export SERVICEHEAD=$(kubectl get service | grep head-svc | awk '{print $1}' | head -n 1)
# Port forwards the dashboard from the head pod service
kubectl port-forward --address 0.0.0.0 service/${SERVICEHEAD} 8265:8265 > /dev/null 2>&1 &

Now, you can go to http://localhost:8265/ to access the Ray dashboard.

  6. To launch a training job, there are a few options:
    1. Use the Ray Jobs submission SDK, where you can submit jobs to the Ray cluster through the Ray dashboard port (8265 by default), where Ray listens for job requests. To learn more, see Quickstart using the Ray Jobs CLI.
    2. Execute a Ray job in the head pod, where you exec directly into the head pod and then submit your job. To learn more, see RayCluster Quickstart.

For this example, we use the first method and submit the job through the SDK. Therefore, we simply run from a local environment where the training code is available in --working-dir. Relative to this path, we specify the main training Python script.
Within the working-dir folder, we can also include any additional scripts needed to run the training.

The fsdp-ray.py example is located at aws-do-ray/Container-Root/ray/raycluster/jobs/fsdp-ray/fsdp-ray.py in the aws-do-ray GitHub repo.

# Run from within the jobs/ folder
ray job submit --address http://localhost:8265 --working-dir "fsdp-ray" -- python3 fsdp-ray.py
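The submit command prints a job submission ID; as a sketch, you can follow the job with the Ray Jobs CLI (the submission ID below is a placeholder):

# Stream the logs of the submitted job
ray job logs --address http://localhost:8265 --follow raysubmit_XXXXXXXXXXXX
# Check the job status
ray job status --address http://localhost:8265 raysubmit_XXXXXXXXXXXX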

For our Python training script to run, we need to make sure our training scripts are correctly set up to use Ray. This includes the following steps:

  • Configure a model to run distributed and on the correct CPU/GPU device
  • Configure a data loader to shard data across the workers and place data on the correct CPU or GPU device
  • Configure a training function to report metrics and save checkpoints
  • Configure scaling and CPU or GPU resource requirements for a training job
  • Launch a distributed training job with a TorchTrainer class

For further details on how to adjust your existing training script to get the most out of Ray, refer to the Ray documentation.
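As a minimal sketch of these five steps (using a toy model and dataset rather than the FSDP example above), a Ray Train script might look like the following; on the GPU worker pods you would set use_gpu=True and num_workers to match the available GPUs:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # 1. Model: wrapped for DDP and moved to the correct device by Ray Train
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))

    # 2. Data loader: sharded across workers, batches placed on the correct device
    dataset = TensorDataset(torch.randn(256, 4), torch.randn(256, 1))
    loader = ray.train.torch.prepare_data_loader(DataLoader(dataset, batch_size=config["batch_size"]))

    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # 3. Training function: report metrics (and optionally checkpoints) back to Ray Train
    for epoch in range(config["num_epochs"]):
        for X, y in loader:
            loss = criterion(model(X), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        train.report({"epoch": epoch, "loss": loss.item()})


# 4. Scaling: two CPU workers here for illustration
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"batch_size": 32, "num_epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
)
# 5. Launch the distributed training job
result = trainer.fit()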

The following diagram illustrates the complete architecture you have built after completing these steps.

Ray on Hyperpod EKS Architecture

Implement training job resiliency with the job auto resume functionality

Ray is designed with robust fault tolerance mechanisms to provide resilience in distributed systems where failures are inevitable. These failures generally fall into two categories: application-level failures, which stem from bugs in user code or external system issues, and system-level failures, caused by node crashes, network disruptions, or internal bugs in Ray. To address these challenges, Ray provides tools and strategies that enable applications to detect, recover, and adapt seamlessly, providing reliability and performance in distributed environments. In this section, we look at two of the most common types of failures, and how to implement fault tolerance for them, which SageMaker HyperPod complements: Ray Train worker failures and Ray worker node failures.

  • Ray Train worker – This is a worker process specifically used for training tasks within Ray Train, Ray's distributed training library. These workers handle individual tasks or shards of a distributed training job. Each worker is responsible for processing a portion of the data, training a subset of the model, or performing computation during distributed training. They are coordinated by the Ray Train orchestration logic to collectively train a model.
  • Ray worker node – At the Ray level, this is a Ray node in a Ray cluster. It is part of the Ray cluster infrastructure and is responsible for running tasks, actors, and other processes as orchestrated by the Ray head node. Each worker node can host multiple Ray processes that execute tasks or manage distributed objects. At the Kubernetes level, a Ray worker node is a Kubernetes pod that is managed by the KubeRay operator. For this post, we will be talking about the Ray worker nodes at the Kubernetes level, so we refer to them as pods.

At the time of writing, there are no official updates regarding head pod fault tolerance and auto resume capabilities. Though head pod failures are rare, in the unlikely event of such a failure, you will need to manually restart your training job. However, you can still resume progress from the last saved checkpoint. To minimize the risk of hardware-related head pod failures, it's advised to place the head pod on a dedicated, CPU-only SageMaker HyperPod node, because GPU failures are a common training job failure point.

Ray Train worker failures

Ray Train is designed with fault tolerance to handle worker failures, such as RayActorErrors. When a failure occurs, the affected workers are stopped, and new ones are automatically started to maintain operations. However, for training progress to continue seamlessly after a failure, saving and loading checkpoints is essential. Without proper checkpointing, the training script will restart, but all progress will be lost. Checkpointing is therefore a critical component of Ray Train's fault tolerance mechanism and needs to be implemented in your code.

Automatic recovery

When a failure is detected, Ray shuts down the failed workers and provisions new ones. While this happens, we can tell the training function to always keep retrying until training can continue. Each instance of recovery from a worker failure is considered a retry. We can set the number of retries through the max_failures attribute of the FailureConfig, which is set in the RunConfig passed to the Trainer (for example, TorchTrainer). See the following code:

from ray.train import RunConfig, FailureConfig
# Tries to recover a run up to this many times.
run_config = RunConfig(failure_config=FailureConfig(max_failures=2))
# No limit on the number of retries.
run_config = RunConfig(failure_config=FailureConfig(max_failures=-1))

For more information, see Handling Failures and Node Preemption.

Checkpoints

A checkpoint in Ray Train is a lightweight interface representing a directory stored either locally or remotely. For example, a cloud-based checkpoint might point to s3://my-bucket/checkpoint-dir, and a local checkpoint might point to /tmp/checkpoint-dir. To learn more, see Saving checkpoints during training.

To save a checkpoint in the training loop, you first need to write your checkpoint to a local directory, which can be temporary. When saving, you can use checkpoint utilities from other frameworks like torch.save, pl.Trainer.save_checkpoint, accelerator.save_model, save_pretrained, tf.keras.Model.save, and more. Then you create a checkpoint from the directory using Checkpoint.from_directory. Finally, report the checkpoint to Ray Train using ray.train.report(metrics, checkpoint=...). The metrics reported alongside the checkpoint are used to keep track of the best-performing checkpoints. Reporting uploads the checkpoint to persistent storage.

If you save checkpoints with ray.train.report(..., checkpoint=...) and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage isn't set up. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location.

Finally, clean up the local temporary directory to free up disk space (for example, by exiting the tempfile.TemporaryDirectory context). We can save a checkpoint every epoch or every few iterations.

The following diagram illustrates this setup.

Ray Checkpointing Architecture

The following code is an example of saving checkpoints using native PyTorch:

import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

import ray.train.torch
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    n = 100
    # create a toy dataset
    # data   : X - dim = (n, 4)
    # target : Y - dim = (n, 1)
    X = torch.Tensor(np.random.normal(0, 1, size=(n, 4)))
    Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1)))
    # toy neural network : 1-layer
    # Wrap the model in DDP
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    criterion = nn.MSELoss()

    optimizer = Adam(model.parameters(), lr=3e-4)
    for epoch in range(config["num_epochs"]):
        y = model.forward(X)
        loss = criterion(y, Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        metrics = {"loss": loss.item()}

        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            checkpoint = None

            should_checkpoint = epoch % config.get("checkpoint_freq", 1) == 0
            # In standard DDP training, where the model is the same across all ranks,
            # only the global rank 0 worker needs to save and report the checkpoint
            if train.get_context().get_world_rank() == 0 and should_checkpoint:
                torch.save(
                    model.module.state_dict(),  # NOTE: Unwrap the model.
                    os.path.join(temp_checkpoint_dir, "model.pt"),
                )
                checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

            train.report(metrics, checkpoint=checkpoint)


trainer = TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 5},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

Ray Train also comes with CheckpointConfig, a way to configure checkpointing options:

from ray.train import RunConfig, CheckpointConfig
# Example 1: Only keep the 2 *most recent* checkpoints and delete the others.
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2))
# Example 2: Only keep the 2 *best* checkpoints and delete the others.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        # *Best* checkpoints are determined by these params:
        checkpoint_score_attribute="mean_accuracy",
        checkpoint_score_order="max",
    ),
    # This will store checkpoints on S3.
    storage_path="s3://remote-bucket/location",
)

To restore the training state from a checkpoint if your training job were to fail and retry, you should modify your training loop to auto resume and then restore the Ray Train job. By pointing to the path of your saved checkpoints, you can restore your trainer and continue training. Here's a quick example:

from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    path="~/ray_results/dl_trainer_restore",  # Can also be a cloud storage path like S3
    datasets=get_datasets(),
)
result = restored_trainer.fit()

To streamline recovery, you can add auto resume logic to your script. This checks whether a valid experiment directory exists and restores the trainer if available. If not, it starts a new experiment:

from ray import train
from ray.train.torch import TorchTrainer

experiment_path = "~/ray_results/dl_restore_autoresume"
if TorchTrainer.can_restore(experiment_path):
    trainer = TorchTrainer.restore(experiment_path, datasets=get_datasets())
else:
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        datasets=get_datasets(),
        scaling_config=train.ScalingConfig(num_workers=2),
        run_config=train.RunConfig(
            storage_path="~/ray_results",
            name="dl_restore_autoresume",
        ),
    )
result = trainer.fit()

To summarize, to provide fault tolerance and auto resume when using the Ray Train library, set the max_failures parameter in the FailureConfig (we recommend setting it to -1 to make sure it keeps retrying until the SageMaker HyperPod node is rebooted or replaced), and make sure you have enabled checkpointing in your code.

Ray worker pod failures

In addition to the aforementioned mechanisms to recover from Ray Train worker failures, Ray also provides fault tolerance at the worker pod level. When a worker pod fails (this includes scenarios in which the raylet process fails), the running tasks and actors on it fail and the objects owned by worker processes of that pod are lost. In this case, the tasks, actors, and objects fault tolerance mechanisms kick in and try to recover the failures using other worker pods.

These mechanisms are implicitly handled by the Ray Train library. To learn more about the underlying fault tolerance for tasks, actors, and objects (implemented at the Ray Core level), see Fault Tolerance.

In practice, this means that in case of a worker pod failure, the following occurs:

  • If there is a free worker pod in the Ray cluster, Ray recovers the failed worker pod by replacing it with the free worker pod.
  • If there is no free worker pod, but there are free SageMaker HyperPod nodes in the underlying SageMaker HyperPod cluster, Ray schedules a new worker pod onto one of the free SageMaker HyperPod nodes. This pod joins the running Ray cluster and the failure is recovered using this new worker pod.

In the context of KubeRay, Ray worker nodes are represented by Kubernetes pods, and failures at this level can include issues such as pod eviction or preemption caused by software-level factors.

However, another critical scenario to consider is hardware failures. If the underlying SageMaker HyperPod node becomes unavailable due to a hardware issue, such as a GPU error, it will inevitably cause the Ray worker pod running on that node to fail as well. At that point, the fault tolerance and auto-healing mechanisms of your SageMaker HyperPod cluster kick in and reboot or replace the faulty node. After the new healthy node is added to the SageMaker HyperPod cluster, Ray schedules a new worker pod onto that SageMaker HyperPod node and recovers the interrupted training. In this case, both the Ray fault tolerance mechanism and the SageMaker HyperPod resiliency features work together seamlessly and make sure that even in case of a hardware failure, your ML training workload can auto resume and pick up from where it was interrupted.

As you've seen, there are various built-in resiliency and fault-tolerance mechanisms that allow your Ray Train workload on SageMaker HyperPod to recover and auto resume. Because these mechanisms essentially recover by restarting the training job, it's crucial that checkpointing is implemented in the training script. It's also generally advised to save the checkpoints on a shared and persistent path, such as an Amazon Simple Storage Service (Amazon S3) bucket or FSx for Lustre file system.
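Putting these recommendations together, a minimal sketch of the corresponding RunConfig might look like the following; the experiment name is illustrative, and /fsx is the FSx for Lustre mount path defined in the RayCluster manifest above:

from ray.train import RunConfig, FailureConfig, CheckpointConfig

run_config = RunConfig(
    name="hyperpod-ray-train-demo",                  # illustrative experiment name
    storage_path="/fsx/ray_results",                 # shared FSx for Lustre mount from the manifest above
    failure_config=FailureConfig(max_failures=-1),   # keep retrying until the faulty node is rebooted or replaced
    checkpoint_config=CheckpointConfig(num_to_keep=2),
)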

Clean up

To delete the SageMaker HyperPod cluster created in this post, you can either use the SageMaker AI console or use the following AWS CLI command:

aws sagemaker delete-cluster --cluster-name ml-cluster

Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.

If you used the CloudFormation stack to create resources, you can delete it using the following command:

aws cloudformation delete-stack --stack-name <stack-name>

Conclusion

This post demonstrated how to set up and deploy Ray clusters on SageMaker HyperPod, highlighting key considerations such as storage configuration and fault tolerance and auto resume mechanisms.

Running Ray jobs on SageMaker HyperPod offers a powerful solution for distributed AI/ML workloads, combining the flexibility of Ray with the robust infrastructure of SageMaker HyperPod. This integration provides enhanced resiliency and auto resume capabilities, which are crucial for long-running and resource-intensive tasks. By using Ray's distributed computing framework and the built-in features of SageMaker HyperPod, you can efficiently manage complex ML workflows, especially training workloads as covered in this post. As AI/ML workloads continue to grow in scale and complexity, the combination of Ray and SageMaker HyperPod offers a scalable, resilient, and efficient platform for tackling the most demanding computational challenges in machine learning.

To get started with SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. To learn more about the aws-do-ray framework, refer to the GitHub repo.


About the Authors

Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on the automotive and manufacturing sector, specializing in helping organizations architect, optimize, and scale artificial intelligence and machine learning solutions, with particular expertise in autonomous vehicle technologies. Prior to AWS, he attended Boston University and graduated with a degree in Computer Engineering.

Florian Stahl is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in artificial intelligence, machine learning, and generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a data scientist, Florian focuses on working with customers in the autonomous vehicle space, bringing deep technical expertise to help organizations design and implement sophisticated machine learning solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their machine learning investments on AWS.

Anoop Saha is a Sr. GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges.
