{"id":956,"date":"2025-04-02T22:21:27","date_gmt":"2025-04-02T22:21:27","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=956"},"modified":"2025-04-02T22:21:27","modified_gmt":"2025-04-02T22:21:27","slug":"ray-jobs-on-amazon-sagemaker-hyperpod-scalable-and-resilient-distributed-ai","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=956","title":{"rendered":"Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distributed AI"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div id=\"\">\n<p>Foundation model (FM) training and inference have led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. They require efficient systems for distributing workloads across multiple GPU-accelerated servers and for optimizing developer velocity as well as performance.<\/p>\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/index.html\" target=\"_blank\" rel=\"noopener\">Ray<\/a> is an open source framework that makes it straightforward to create, deploy, and optimize distributed Python jobs. At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. It provides a set of high-level APIs for tasks, actors, and data that abstract away the complexities of distributed computing, enabling developers to focus on the core logic of their applications. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application. 
Ray\u2019s key features include efficient task scheduling, fault tolerance, and automatic resource management, making it a powerful tool for building a wide range of distributed applications, from ML models to real-time data processing pipelines. With its growing ecosystem of libraries and tools, Ray has become a popular choice for organizations looking to use the power of distributed computing to tackle complex and data-intensive problems.<\/p>\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/aws.amazon.com\/sagemaker-ai\/hyperpod\/\" target=\"_blank\" rel=\"noopener\">Amazon SageMaker HyperPod<\/a> is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same-spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of Ray provides a powerful framework to scale up your generative AI workloads.<\/p>\n<p>In this post, we demonstrate the steps involved in running Ray jobs on SageMaker HyperPod.<\/p>\n<h2>Overview of Ray<\/h2>\n<p>This section provides a high-level overview of the Ray tools and frameworks for AI\/ML workloads. We primarily focus on ML training use cases.<\/p>\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/ray-project\/ray\" target=\"_blank\" rel=\"noopener\">Ray<\/a> is an open-source distributed computing framework designed to run highly scalable and parallel Python applications. Ray manages, executes, and optimizes compute needs across AI workloads. 
It unifies infrastructure through a single, flexible framework, enabling AI workloads from data processing, to model training, to model serving and beyond.<\/p>\n<p>For distributed jobs, Ray provides intuitive tools for parallelizing and scaling ML workflows. It allows developers to focus on their training logic without the complexities of resource allocation, task scheduling, and inter-node communication.<\/p>\n<p>At a high level, Ray is made up of three layers:<\/p>\n<ul>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/ray-core\/walkthrough.html\" target=\"_blank\" rel=\"noopener\">Ray Core<\/a>: The foundation of Ray, providing primitives for parallel and distributed computing<\/li>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/ray-overview\/getting-started.html\" target=\"_blank\" rel=\"noopener\">Ray AI libraries<\/a>:\n<ul>\n<li><strong>Ray Train<\/strong> \u2013 A library that simplifies distributed training by offering built-in support for popular ML frameworks like PyTorch, TensorFlow, and Hugging Face<\/li>\n<li><strong>Ray Tune<\/strong> \u2013 A library for scalable hyperparameter tuning<\/li>\n<li><strong>Ray Serve<\/strong> \u2013 A library for distributed model deployment and serving<\/li>\n<\/ul>\n<\/li>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/key-concepts.html#id3\" target=\"_blank\" rel=\"noopener\">Ray clusters<\/a>: A distributed computing platform where worker nodes run user code as Ray tasks and actors, generally in the cloud<\/li>\n<\/ul>\n<p>In this post, we dive deep into running Ray clusters on SageMaker HyperPod. 
A Ray cluster consists of a single <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/key-concepts.html#cluster-head-node\" target=\"_blank\" rel=\"noopener\">head node<\/a> and a number of connected <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/key-concepts.html#cluster-worker-nodes\" target=\"_blank\" rel=\"noopener\">worker nodes<\/a>. The head node orchestrates task scheduling, resource allocation, and communication between nodes. The Ray worker nodes execute the distributed workloads using Ray tasks and actors, such as model training or data preprocessing.<\/p>\n<p>Ray clusters and Kubernetes clusters pair well together. By running a Ray cluster on Kubernetes using the KubeRay operator, both Ray users and Kubernetes administrators benefit from the smooth path from development to production. For this use case, we use a SageMaker HyperPod cluster orchestrated through <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/aws.amazon.com\/eks\/\" target=\"_blank\" rel=\"noopener\">Amazon Elastic Kubernetes Service<\/a> (Amazon EKS).<\/p>\n<p>The <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/ray-project\/kuberay\" target=\"_blank\" rel=\"noopener\">KubeRay operator<\/a> enables you to run a Ray cluster on a Kubernetes cluster. KubeRay creates the following custom resource definitions (CRDs):<\/p>\n<ul>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/kubernetes\/getting-started\/rayjob-quick-start.html\" target=\"_blank\" rel=\"noopener\">RayCluster<\/a> \u2013 The primary resource for managing Ray instances on Kubernetes. 
The nodes in a Ray cluster manifest as pods in the Kubernetes cluster.<\/li>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/kubernetes\/user-guides\/config.html\" target=\"_blank\" rel=\"noopener\">RayJob<\/a> \u2013 A single executable job designed to run on an ephemeral Ray cluster. It serves as a higher-level abstraction for submitting tasks or batches of tasks to be executed by the Ray cluster. A RayJob also manages the lifecycle of the Ray cluster, making it ephemeral by automatically spinning up the cluster when the job is submitted and shutting it down when the job is complete.<\/li>\n<li><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/kubernetes\/getting-started\/rayservice-quick-start.html\" target=\"_blank\" rel=\"noopener\">RayService<\/a> \u2013 A Ray cluster and a Serve application that runs on top of it, combined into a single Kubernetes manifest. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.<\/li>\n<\/ul>\n<p>For the remainder of this post, we don\u2019t focus on RayJob or RayService; we focus on creating a persistent Ray cluster to run distributed ML training jobs.<\/p>\n<p>When Ray clusters are paired with SageMaker HyperPod clusters, Ray clusters unlock enhanced resiliency and auto-resume capabilities, which we&#8217;ll dive deeper into later in this post. 
This combination provides a solution for handling dynamic workloads, maintaining high availability, and providing seamless recovery from node failures, which is crucial for long-running jobs.<\/p>\n<h2>Overview of SageMaker HyperPod<\/h2>\n<p>In this section, we introduce SageMaker HyperPod and its built-in resiliency features to provide infrastructure stability.<\/p>\n<p>Generative AI workloads such as training, inference, and fine-tuning involve building, maintaining, and optimizing large clusters of thousands of GPU-accelerated instances. For distributed training, the goal is to efficiently parallelize workloads across these instances in order to maximize cluster utilization and minimize time to train. For large-scale inference, it\u2019s important to minimize latency, maximize throughput, and seamlessly scale across those instances for the best user experience. SageMaker HyperPod is a purpose-built infrastructure to address these needs. It removes the undifferentiated heavy lifting involved in building, maintaining, and optimizing a large GPU-accelerated cluster. It also provides flexibility to fully customize your training or inference environment and compose your own software stack. You can use either Slurm or Amazon EKS for orchestration with SageMaker HyperPod.<\/p>\n<p>Due to their massive size and the need to train on large amounts of data, FMs are often trained and deployed on large compute clusters composed of thousands of AI accelerators such as GPUs and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/aws.amazon.com\/machine-learning\/trainium\/\" target=\"_blank\" rel=\"noopener\">AWS Trainium<\/a>. A single failure in one of these thousands of accelerators can interrupt the entire training process, requiring manual intervention to identify, isolate, debug, repair, and recover the faulty node in the cluster. 
This workflow can take several hours for each failure, and as the scale of the cluster grows, it\u2019s common to see a failure every few days or even every few hours. SageMaker HyperPod provides resiliency against infrastructure failures by applying agents that continuously run health checks on cluster instances, fix the bad instances, reload the last valid checkpoint, and resume the training without user intervention. As a result, you can train your models up to 40% faster. You can also SSH into an instance in the cluster for debugging and gather insights on hardware-level optimization during multi-node training. Orchestrators like Slurm or Amazon EKS facilitate efficient allocation and management of resources, provide optimal job scheduling, monitor resource utilization, and automate fault tolerance.<\/p>\n<h2>Solution overview<\/h2>\n<p>This section provides an overview of how to run Ray jobs for multi-node distributed training on SageMaker HyperPod. We go over the architecture and the process of creating a SageMaker HyperPod cluster, installing the KubeRay operator, and deploying a Ray training job.<\/p>\n<p>Although this post provides a step-by-step guide to manually create the cluster, feel free to check out the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/bit.ly\/aws-do-ray\" target=\"_blank\" rel=\"noopener\">aws-do-ray<\/a> project, which aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. It uses <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.docker.com\/get-docker\/\" target=\"_blank\" rel=\"noopener\">Docker<\/a> to containerize the tools necessary to deploy and manage Ray clusters, jobs, and services. 
In addition to the aws-do-ray project, we\u2019d like to highlight the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/catalog.workshops.aws\/sagemaker-hyperpod-eks\/en-US\/\" target=\"_blank\" rel=\"noopener\">Amazon SageMaker HyperPod EKS workshop<\/a>, which offers an end-to-end experience for running various workloads on SageMaker HyperPod clusters. There are multiple examples of training and inference workloads in the GitHub repository <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/aws-samples\/awsome-distributed-training\/tree\/main\/1.architectures\/7.sagemaker-hyperpod-eks\" target=\"_blank\" rel=\"noopener\">awsome-distributed-training<\/a>.<\/p>\n<p>As introduced earlier in this post, KubeRay simplifies the deployment and management of Ray applications on Kubernetes. The following diagram illustrates the solution architecture.<\/p>\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/colors.drawio.png\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-102587 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/colors.drawio.png\" alt=\"SMHP EKS Architecture\" width=\"1025\" height=\"887\"\/><\/a><\/p>\n<h2>Create a SageMaker HyperPod cluster<\/h2>\n<h3>Prerequisites<\/h3>\n<p>Before deploying Ray on SageMaker HyperPod, you need a HyperPod cluster.<\/p>\n<p>If you prefer to deploy HyperPod on an existing EKS cluster, follow the instructions <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-hyperpod-eks-operate-cli-command-create-cluster.html\" target=\"_blank\" rel=\"noopener\">here<\/a>, which include:<\/p>\n<ul>\n<li><strong>EKS cluster <\/strong>\u2013 You can associate SageMaker HyperPod compute to an existing EKS cluster that satisfies the 
set of <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-hyperpod-eks-prerequisites.html\" target=\"_blank\" rel=\"noopener\">prerequisites<\/a>. Alternatively, and recommended, you can deploy a ready-made EKS cluster with a single <a rel=\"nofollow\" target=\"_blank\" href=\"http:\/\/aws.amazon.com\/cloudformation\" target=\"_blank\" rel=\"noopener\">AWS CloudFormation<\/a> <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/aws-samples\/awsome-distributed-training\/blob\/main\/1.architectures\/7.sagemaker-hyperpod-eks\/cfn-templates\/hyperpod-eks-full-stack.yaml\" target=\"_blank\" rel=\"noopener\">template<\/a>. Refer to the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/aws-samples\/awsome-distributed-training\/tree\/main\/1.architectures\/7.sagemaker-hyperpod-eks\" target=\"_blank\" rel=\"noopener\">GitHub repo<\/a> for instructions on setting up an EKS cluster.<\/li>\n<li><strong>Custom resources <\/strong>\u2013 Running multi-node distributed training requires various resources, such as device plugins, Container Storage Interface (CSI) drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/aws\/sagemaker-hyperpod-cli\/tree\/main\/helm_chart\" target=\"_blank\" rel=\"noopener\">HyperPodHelmCharts<\/a> simplify the process using <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/helm.sh\/\" target=\"_blank\" rel=\"noopener\">Helm<\/a>, one of the most commonly used package managers for Kubernetes. 
Refer to <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html\" target=\"_blank\" rel=\"noopener\">Install packages on the Amazon EKS cluster using Helm<\/a> for installation instructions.<\/li>\n<\/ul>\n<p>The following provides an example workflow for creating a HyperPod cluster on an existing EKS cluster after deploying the prerequisites. <strong>This is for reference only and not required for the quick deploy option<\/strong>.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cat &gt; cluster-config.json &lt;&lt; EOL\n{\n    \"ClusterName\": \"ml-cluster\",\n    \"Orchestrator\": {\n        \"Eks\": {\n            \"ClusterArn\": \"${EKS_CLUSTER_ARN}\"\n        }\n    },\n    \"InstanceGroups\": [\n        {\n            \"InstanceGroupName\": \"worker-group-1\",\n            \"InstanceType\": \"ml.p5.48xlarge\",\n            \"InstanceCount\": 4,\n            \"LifeCycleConfig\": {\n                \"SourceS3Uri\": \"s3:\/\/amzn-s3-demo-bucket\",\n                \"OnCreate\": \"on_create.sh\"\n            },\n            \"ExecutionRole\": \"${EXECUTION_ROLE}\",\n            \"ThreadsPerCore\": 1,\n            \"OnStartDeepHealthChecks\": [\n                \"InstanceStress\",\n                \"InstanceConnectivity\"\n            ]\n        },\n        {\n            \"InstanceGroupName\": \"head-group\",\n            \"InstanceType\": \"ml.m5.2xlarge\",\n            \"InstanceCount\": 1,\n            \"LifeCycleConfig\": {\n                \"SourceS3Uri\": \"s3:\/\/amzn-s3-demo-bucket\",\n                \"OnCreate\": \"on_create.sh\"\n            },\n            \"ExecutionRole\": \"${EXECUTION_ROLE}\",\n            \"ThreadsPerCore\": 1\n        }\n    ],\n    \"VpcConfig\": {\n        \"SecurityGroupIds\": [\n            \"${SECURITY_GROUP_ID}\"\n        ],\n        \"Subnets\": [\n            \"${SUBNET_ID}\"\n        ]\n    },\n    \"NodeRecovery\": \"Automatic\"\n}\nEOL<\/code><\/pre>\n<\/p><\/div>\n<p>The provided configuration file contains two key highlights:<\/p>\n<ul>\n<li><strong>\u201cOnStartDeepHealthChecks\u201d: [\u201cInstanceStress\u201d, \u201cInstanceConnectivity\u201d]<\/strong> \u2013 Instructs SageMaker HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added<\/li>\n<li><strong>\u201cNodeRecovery\u201d: \u201cAutomatic\u201d<\/strong> \u2013 Enables SageMaker HyperPod automated node recovery<\/li>\n<\/ul>\n<p>You can create the SageMaker HyperPod compute with the following <a rel=\"nofollow\" target=\"_blank\" href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener\">AWS Command Line Interface<\/a> (AWS CLI) command (AWS CLI version 2.17.47 or newer is required):<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">aws sagemaker create-cluster \\\n    --cli-input-json file:\/\/cluster-config.json\n{\n\"ClusterArn\": \"arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster\/wccy5z4n4m49\"\n}<\/code><\/pre>\n<\/p><\/div>\n<p>To verify the cluster status, you can use the following command:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">aws sagemaker list-clusters --output table<\/code><\/pre>\n<\/p><\/div>\n<p>This command displays the cluster details, including the cluster name, status, and creation time:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-code\">------------------------------------------------------------------------------------------------------------------------------------------------------\n|                                                                    ListClusters                                                                    |\n+----------------------------------------------------------------------------------------------------------------------------------------------------+\n||                    
                                             ClusterSummaries                                                                 ||\n|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|\n||                           ClusterArn                           |        ClusterName        | ClusterStatus  |           CreationTime             ||\n|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|\n||  arn:aws:sagemaker:us-west-2:xxxxxxxxxxxx:cluster\/zsmyi57puczf |         ml-cluster        |   InService     |  2025-03-03T16:45:05.320000+00:00  ||\n|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|<\/code><\/pre>\n<\/p><\/div>\n<p>Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status for the nodes transitions to <code>Running<\/code>.<\/p>\n<h2>Create an FSx for Lustre shared file system<\/h2>\n<p>To deploy the Ray cluster, we need the SageMaker HyperPod cluster to be up and running, and additionally we need a shared storage volume (for example, an <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/aws.amazon.com\/fsx\/lustre\/\" target=\"_blank\" rel=\"noopener\">Amazon FSx for Lustre<\/a> file system). This is a shared file system that the SageMaker HyperPod nodes can access. This file system can be provisioned statically before launching your SageMaker HyperPod cluster or dynamically afterwards.<\/p>\n<p>Specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it is required for multi-node clusters. 
Using a local path will <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/user-guides\/persistent-storage.html#multinode-local-storage-warning\" target=\"_blank\" rel=\"noopener\">raise an error<\/a> during checkpointing for multi-node clusters.<\/p>\n<p>The <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/kubernetes-sigs\/aws-fsx-csi-driver\" target=\"_blank\" rel=\"noopener\">Amazon FSx for Lustre CSI driver<\/a> uses <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.aws.amazon.com\/eks\/latest\/userguide\/iam-roles-for-service-accounts.html\" target=\"_blank\" rel=\"noopener\">IAM roles for service accounts (IRSA)<\/a> to authenticate AWS API calls. To use IRSA, an <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.aws.amazon.com\/eks\/latest\/userguide\/enable-iam-roles-for-service-accounts.html\" target=\"_blank\" rel=\"noopener\">IAM OpenID Connect (OIDC) provider<\/a> needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.<\/p>\n<p>Create an IAM OIDC identity provider for your cluster with the following command:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve<\/code><\/pre>\n<\/p><\/div>\n<p>Deploy the FSx for Lustre CSI driver:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">helm repo add aws-fsx-csi-driver https:\/\/kubernetes-sigs.github.io\/aws-fsx-csi-driver\nhelm repo update\nhelm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver\/aws-fsx-csi-driver \\\n  --namespace kube-system <\/code><\/pre>\n<\/p><\/div>\n<p>This Helm chart includes a service account named <code>fsx-csi-controller-sa<\/code> that gets deployed in the <code>kube-system<\/code> namespace.<\/p>\n<p>Use the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/eksctl.io\/usage\/iamserviceaccounts\/\" target=\"_blank\" 
rel=\"noopener\">eksctl CLI<\/a> to create an <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/aws.amazon.com\/iam\/\" target=\"_blank\" rel=\"noopener\">AWS Identity and Access Management<\/a> (IAM) role bound to the service account used by the driver, attaching the <code>AmazonFSxFullAccess<\/code> AWS managed policy:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">eksctl create iamserviceaccount \\\n  --name fsx-csi-controller-sa \\\n  --override-existing-serviceaccounts \\\n  --namespace kube-system \\\n  --cluster $EKS_CLUSTER_NAME \\\n  --attach-policy-arn arn:aws:iam::aws:policy\/AmazonFSxFullAccess \\\n  --approve \\\n  --role-name AmazonEKSFSxLustreCSIDriverFullAccess \\\n  --region $AWS_REGION<\/code><\/pre>\n<\/p><\/div>\n<p>The <code>--override-existing-serviceaccounts<\/code> flag lets eksctl know that the <code>fsx-csi-controller-sa<\/code> service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the existing service account instead.<\/p>\n<p>Annotate the driver\u2019s service account with the Amazon Resource Name (ARN) of the <code>AmazonEKSFSxLustreCSIDriverFullAccess<\/code> IAM role that was created:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)\n\nkubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \\\n  eks.amazonaws.com\/role-arn=${SA_ROLE_ARN} --overwrite=true<\/code><\/pre>\n<\/p><\/div>\n<p>This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.<\/p>\n<p>Verify that the service account has been properly annotated:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">kubectl get serviceaccount -n kube-system fsx-csi-controller-sa -o 
yaml<\/code><\/pre>\n<\/p><\/div>\n<p>Restart the <code>fsx-csi-controller<\/code> deployment for the changes to take effect:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">kubectl rollout restart deployment fsx-csi-controller -n kube-system<\/code><\/pre>\n<\/p><\/div>\n<p>The <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/kubernetes-sigs\/aws-fsx-csi-driver\" target=\"_blank\" rel=\"noopener\">FSx for Lustre CSI driver<\/a> presents you with two options for provisioning a file system:<\/p>\n<ul>\n<li><strong>Dynamic provisioning<\/strong> \u2013 This option uses Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with the desired storage specifications. The CSI driver automatically provisions the FSx for Lustre file system for you based on the PVC request. This allows for straightforward scaling and eliminates the need to manually create file systems.<\/li>\n<li><strong>Static provisioning<\/strong> \u2013 In this method, you manually create the FSx for Lustre file system before using the CSI driver. You need to configure details like subnet ID and security groups for the file system. Then, you can use the driver to mount this pre-created file system inside your container as a volume.<\/li>\n<\/ul>\n<p>For this example, we use dynamic provisioning. 
Start by creating a storage class that uses the <code>fsx.csi.aws.com<\/code> provisioner:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cat &lt;&lt;EOF &gt; storageclass.yaml\nkind: StorageClass\napiVersion: storage.k8s.io\/v1\nmetadata:\n  name: fsx-sc\nprovisioner: fsx.csi.aws.com\nparameters:\n  subnetId: ${SUBNET_ID}\n  securityGroupIds: ${SECURITYGROUP_ID}\n  deploymentType: PERSISTENT_2\n  automaticBackupRetentionDays: \"0\"\n  copyTagsToBackups: \"true\"\n  perUnitStorageThroughput: \"250\"\n  dataCompressionType: \"LZ4\"\n  fileSystemTypeVersion: \"2.12\"\nmountOptions:\n  - flock\nEOF\n\nkubectl apply -f storageclass.yaml<\/code><\/pre>\n<\/p><\/div>\n<ul>\n<li><code>SUBNET_ID<\/code>: The ID of the subnet in which the FSx for Lustre file system is created. It should be the same private subnet that was used for HyperPod creation.<\/li>\n<li><code>SECURITYGROUP_ID<\/code>: The security group IDs that will be attached to the file system. It should be the same security group ID that is used in HyperPod and EKS.<\/li>\n<\/ul>\n<p>Next, create a PVC named <code>fsx-claim<\/code> that uses the <code>fsx-sc<\/code> storage class:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cat &lt;&lt;EOF &gt; pvc.yaml\napiVersion: v1\nkind: PersistentVolumeClaim\nmetadata:\n  name: fsx-claim\nspec:\n  accessModes:\n  - ReadWriteMany\n  storageClassName: fsx-sc\n  resources:\n    requests:\n      storage: 1200Gi\nEOF\n\nkubectl apply -f pvc.yaml<\/code><\/pre>\n<\/p><\/div>\n<p>This PVC will start the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.<\/p>\n<h2>Create the Ray cluster<\/h2>\n<p>Now that we have both the SageMaker HyperPod cluster and the FSx for Lustre file system created, we can set up the Ray cluster:<\/p>\n<ol>\n<li>Set up dependencies. 
We will create a new namespace in our Kubernetes cluster and install the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/ray-project\/kuberay\" target=\"_blank\" rel=\"noopener\">KubeRay<\/a> operator using a Helm chart.<\/li>\n<\/ol>\n<p>We recommend using KubeRay operator version 1.2.0 or higher, which supports automatic Ray Pod eviction and replacement in case of failures (for example, hardware issues on EKS or SageMaker HyperPod nodes).<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\"># Create KubeRay namespace\nkubectl create namespace kuberay\n# Deploy the KubeRay operator with the Helm chart repository\nhelm repo add kuberay https:\/\/ray-project.github.io\/kuberay-helm\/\nhelm repo update\n# Install both CRDs and KubeRay operator v1.2.0\nhelm install kuberay-operator kuberay\/kuberay-operator --version 1.2.0 --namespace kuberay\n# KubeRay operator pod will be deployed onto head pod\nkubectl get pods --namespace kuberay<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"2\">\n<li>Create a Ray container image for the Ray cluster manifest. With the recent <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/ray-project\/ray\/issues\/46378\" target=\"_blank\" rel=\"noopener\">deprecation<\/a> of the <code>rayproject\/ray-ml<\/code> images starting from Ray version 2.31.0, it\u2019s necessary to create a custom container image for our Ray cluster. Therefore, we build on top of the <code><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/hub.docker.com\/layers\/rayproject\/ray\/2.42.1-py310-gpu\/images\/sha256-1c67c9417c2cd1b8c2347827c02a358b563d30fb4e83a3d7d328bd356adcec29\" target=\"_blank\" rel=\"noopener\">rayproject\/ray:2.42.1-py310-gpu<\/a><\/code> image, which has all necessary Ray dependencies, and include our training dependencies to build our own custom image. 
Feel free to modify this Dockerfile as you wish.<\/li>\n<\/ol>\n<p>First, create a <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.docker.com\/build\/concepts\/dockerfile\/\" target=\"_blank\" rel=\"noopener\">Dockerfile<\/a> that builds upon the base Ray GPU image and includes only the necessary dependencies:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cat &lt;&lt;'EOF' &gt; Dockerfile\n \nFROM rayproject\/ray:2.42.1-py310-gpu\n# Install Python dependencies for PyTorch, Ray, Hugging Face, and more\nRUN pip install --no-cache-dir \\\n    torch torchvision torchaudio \\\n    numpy \\\n    pytorch-lightning \\\n    transformers datasets evaluate tqdm click \\\n    ray[train] ray[air] \\\n    ray[train-torch] ray[train-lightning] \\\n    torchdata \\\n    torchmetrics \\\n    torch_optimizer \\\n    accelerate \\\n    scikit-learn \\\n    Pillow==9.5.0 \\\n    protobuf==3.20.3\n \nRUN pip install --upgrade datasets transformers\n \n# Set the user\nUSER ray\nWORKDIR \/home\/ray\n \n# Verify ray installation\nRUN which ray &amp;&amp; \\\n    ray --version\n  \n# Default command\nCMD [ \"\/bin\/bash\" ]\n \nEOF\n<\/code><\/pre>\n<\/p><\/div>\n<p>Then, build and push the image to your container registry (<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/aws.amazon.com\/ecr\/\" target=\"_blank\" rel=\"noopener\">Amazon ECR<\/a>) using the provided script:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">export AWS_REGION=$(aws configure get region)\nexport ACCOUNT=$(aws sts get-caller-identity --query Account --output text)\nexport REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com\/\n \necho \"This process may take 10-15 minutes to complete...\"\n \necho \"Building image...\"\n \ndocker build --platform linux\/amd64 -t ${REGISTRY}aws-ray-custom:latest .\n \n# Create registry if needed\nREGISTRY_COUNT=$(aws ecr describe-repositories | grep \"aws-ray-custom\" | wc -l)\nif [ \"$REGISTRY_COUNT\" == \"0\" ]; then\n    aws ecr create-repository --repository-name aws-ray-custom\nfi\n \n# Login to registry\necho \"Logging in to $REGISTRY ...\"\naws ecr get-login-password --region $AWS_REGION| docker login --username AWS --password-stdin $REGISTRY\n \necho \"Pushing image to $REGISTRY ...\"\n \n# Push image to registry\ndocker image push ${REGISTRY}aws-ray-custom:latest \n<\/code><\/pre>\n<\/p><\/div>\n<p>Now, our Ray container image is in Amazon ECR with all necessary Ray dependencies, as well as code library dependencies.<\/p>\n<ol start=\"3\">\n<li>Create a Ray cluster manifest. We use a Ray cluster to host our training jobs. The Ray cluster is the primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The Ray cluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS or SageMaker HyperPod cluster.<\/li>\n<\/ol>\n<p>Note that there are two distinct sections in the cluster manifest. The <code>headGroupSpec<\/code> defines the head node of the Ray cluster, and the <code>workerGroupSpecs<\/code> define the worker nodes of the Ray cluster. Although a job could technically run on the head node as well, it is common to separate the head node from the actual worker nodes where jobs are executed. Therefore, the instance for the head node can typically be a smaller instance (for example, we chose an m5.2xlarge). 
Because the head node also manages cluster-level metadata, it can be beneficial to run it on a non-GPU node to minimize the risk of node failure (GPUs can be a potential source of node failure).<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">cat &lt;&lt;'EOF' &gt; raycluster.yaml\napiVersion: ray.io\/v1alpha1\nkind: RayCluster\nmetadata:\n  name: rayml\n  labels:\n    controller-tools.k8s.io: \"1.0\"\nspec:\n  # Ray head pod template\n  headGroupSpec:\n    # The `rayStartParams` are used to configure the `ray start` command.\n    # See https:\/\/github.com\/ray-project\/kuberay\/blob\/master\/docs\/guidance\/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.\n    # See https:\/\/docs.ray.io\/en\/latest\/cluster\/cli.html#ray-start for all available options in `rayStartParams`.\n    rayStartParams:\n      dashboard-host: '0.0.0.0'\n    #pod template\n    template:\n      spec:\n        #        nodeSelector:  \n        #node.kubernetes.io\/instance-type: \"ml.m5.2xlarge\"\n        securityContext:\n          runAsUser: 0\n          runAsGroup: 0\n          fsGroup: 0\n        containers:\n        - name: ray-head\n          image: ${REGISTRY}aws-ray-custom:latest     ## IMAGE: Here you may choose which image your head pod will run\n          env:                                ## ENV: Here is where you can send stuff to the head pod\n            - name: RAY_GRAFANA_IFRAME_HOST   ## PROMETHEUS AND GRAFANA\n              value: http:\/\/localhost:3000\n            - name: RAY_GRAFANA_HOST\n              value: http:\/\/prometheus-grafana.prometheus-system.svc:80\n            - name: RAY_PROMETHEUS_HOST\n              value: http:\/\/prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090\n          lifecycle:\n            preStop:\n              exec:\n                command: [\"\/bin\/sh\",\"-c\",\"ray stop\"]\n          resources:\n            
limits:                                    ## LIMITS: Set resource limits for your head pod\n              cpu: 1\n              memory: 8Gi\n            requests:                                    ## REQUESTS: Set resource requests for your head pod\n              cpu: 1\n              memory: 8Gi\n          ports:\n          - containerPort: 6379\n            name: gcs-server\n          - containerPort: 8265 # Ray dashboard\n            name: dashboard\n          - containerPort: 10001\n            name: client\n          - containerPort: 8000\n            name: serve\n          volumeMounts:                                    ## VOLUMEMOUNTS\n          - name: fsx-storage\n            mountPath: \/fsx\n          - name: ray-logs\n            mountPath: \/tmp\/ray\n        volumes:\n          - name: ray-logs\n            emptyDir: {}\n          - name: fsx-storage\n            persistentVolumeClaim:\n              claimName: fsx-claim\n  workerGroupSpecs:\n  # the pod replicas in this worker group\n  - replicas: 4                                    ## REPLICAS: How many worker pods you want \n    minReplicas: 1\n    maxReplicas: 10\n    # logical group name, here called gpu-group\n    groupName: gpu-group\n    rayStartParams:\n      num-gpus: \"8\"\n    #pod template\n    template:\n      spec:\n        #nodeSelector:\n        # node.kubernetes.io\/instance-type: \"ml.p5.48xlarge\"\n        securityContext:\n          runAsUser: 0\n          runAsGroup: 0\n          fsGroup: 0\n        containers:\n        - name: ray-worker\n          image: ${REGISTRY}aws-ray-custom:latest             ## IMAGE: Here you may choose which image your worker pods will run\n          env:\n          lifecycle:\n            preStop:\n              exec:\n                command: [\"\/bin\/sh\",\"-c\",\"ray stop\"]\n          resources:\n            limits:                               
     ## LIMITS: Set resource limits for your worker pods\n              nvidia.com\/gpu: 8\n              #vpc.amazonaws.com\/efa: 32  \n            requests:                                    ## REQUESTS: Set resource requests for your worker pods\n              nvidia.com\/gpu: 8\n              #vpc.amazonaws.com\/efa: 32\n          volumeMounts:                                    ## VOLUMEMOUNTS\n          - name: ray-logs\n            mountPath: \/tmp\/ray\n          - name: fsx-storage\n            mountPath: \/fsx\n        volumes:\n        - name: fsx-storage\n          persistentVolumeClaim:\n            claimName: fsx-claim\n        - name: ray-logs\n          emptyDir: {}\nEOF<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"4\">\n<li>Deploy the Ray cluster:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">envsubst &lt; raycluster.yaml | kubectl apply -f -<\/code><\/pre>\n<\/p><\/div>\n<ol start=\"5\">\n<li>Optionally, expose the Ray dashboard using port forwarding:<\/li>\n<\/ol>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\"># Gets the name of the Kubernetes service that fronts the head pod\nexport SERVICEHEAD=$(kubectl get service | grep head-svc | awk '{print $1}' | head -n 1)\n# Port forwards the dashboard from the head pod service\nkubectl port-forward --address 0.0.0.0 service\/${SERVICEHEAD} 8265:8265 &gt; \/dev\/null 2&gt;&amp;1 &amp;<\/code><\/pre>\n<\/p><\/div>\n<p>Now, you can go to <code>http:\/\/localhost:8265\/<\/code> to visit the Ray Dashboard.<\/p>\n<ol start=\"6\">\n<li>To launch a training job, there are a few options:\n<ol type=\"a\">\n<li>Use the Ray jobs submission SDK, where you can submit jobs to the Ray cluster through the Ray dashboard port (8265 by default) where Ray listens for job requests. 
To learn more, see <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/running-applications\/job-submission\/quickstart.html#jobs-quickstart\" target=\"_blank\" rel=\"noopener\">Quickstart using the Ray Jobs CLI<\/a>.<\/li>\n<li>Execute a Ray job in the head pod, where you exec directly into the head pod and then submit your job. To learn more, see <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/cluster\/kubernetes\/getting-started\/raycluster-quick-start.html\" target=\"_blank\" rel=\"noopener\">RayCluster Quickstart<\/a>.<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p>For this example, we use the first method and submit the job through the SDK. Therefore, we simply run from a local environment where the training code is available in <code>--working-dir<\/code>. Relative to this path, we specify the main training Python script as the entrypoint after the <code>--<\/code> separator (here, <code>fsdp-ray.py<\/code>).<br \/>Within the <code>working-dir<\/code> folder, we can also include additional scripts we might need to run the training.<\/p>\n<p>The <code>fsdp-ray.py<\/code> example is located in <code>aws-do-ray\/Container-Root\/ray\/raycluster\/jobs\/fsdp-ray\/fsdp-ray.py<\/code> in the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/aws-samples\/aws-do-ray\/tree\/main\" target=\"_blank\" rel=\"noopener\">aws-do-ray GitHub repo<\/a>.<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\"># Within jobs\/ folder\nray job submit --address http:\/\/localhost:8265 --working-dir \"fsdp-ray\" -- python3 fsdp-ray.py<\/code><\/pre>\n<\/p><\/div>\n<p>For our Python training script to run, we need to make sure our training scripts are correctly set up to use Ray. 
This includes the following steps:<\/p>\n<ul>\n<li>Configure a model to run distributed and on the correct CPU\/GPU device<\/li>\n<li>Configure a data loader to shard data across the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/releases-2.8.0\/train\/overview.html#train-overview-worker\" target=\"_blank\" rel=\"noopener\">workers<\/a> and place data on the correct CPU or GPU device<\/li>\n<li>Configure a <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/releases-2.8.0\/train\/overview.html#train-overview-training-function\" target=\"_blank\" rel=\"noopener\">training function<\/a> to report metrics and save checkpoints<\/li>\n<li>Configure <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/releases-2.8.0\/train\/overview.html#train-overview-scaling-config\" target=\"_blank\" rel=\"noopener\">scaling<\/a> and CPU or GPU resource requirements for a training job<\/li>\n<li>Launch a distributed training job with a <code><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/releases-2.8.0\/train\/api\/doc\/ray.train.torch.TorchTrainer.html#ray.train.torch.TorchTrainer\" target=\"_blank\" rel=\"noopener\">TorchTrainer<\/a><\/code> class<\/li>\n<\/ul>\n<p>For further details on how to adjust your existing training script to get the most out of Ray, refer to the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/train.html\" target=\"_blank\" rel=\"noopener\">Ray documentation<\/a>.<\/p>\n<p>The following diagram illustrates the complete architecture you have built after completing these steps.<\/p>\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/ray-hyperpod-arch-1.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-102588 size-large\" 
src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/ray-hyperpod-arch-1-1024x537.png\" alt=\"Ray on Hyperpod EKS Architecture\" width=\"1024\" height=\"537\"\/><\/a><\/p>\n<h2>Implement training job resiliency with the job auto resume functionality<\/h2>\n<p>Ray is designed with robust fault tolerance mechanisms to provide resilience in distributed systems where failures are inevitable. These failures generally fall into two categories: application-level failures, which stem from bugs in user code or external system issues, and system-level failures, caused by node crashes, network disruptions, or internal bugs in Ray. To address these challenges, Ray provides tools and strategies that enable applications to detect, recover, and adapt seamlessly, providing reliability and performance in distributed environments. In this section, we look at two of the most common types of failures, Ray Train worker failures and Ray worker node failures, and at how to implement fault tolerance for them in a way that complements SageMaker HyperPod.<\/p>\n<ul>\n<li><strong>Ray Train worker<\/strong> \u2013 This is a worker process specifically used for training tasks within Ray Train, Ray\u2019s distributed training library. These workers handle individual tasks or shards of a distributed training job. Each worker is responsible for processing a portion of the data, training a subset of the model, or performing computation during distributed training. They are coordinated by the Ray Train orchestration logic to collectively train a model.<\/li>\n<li><strong>Ray worker node<\/strong> \u2013 At the Ray level, this is a Ray node in a Ray cluster. It\u2019s part of the Ray cluster infrastructure and is responsible for running tasks, actors, and other processes as orchestrated by the Ray head node. 
Each worker node can host multiple Ray processes that execute tasks or manage distributed objects. At the Kubernetes level, a Ray worker node is a Kubernetes pod that is managed by a KubeRay operator. For this post, we will be talking about the Ray worker nodes at the Kubernetes level, so we will refer to them as <em>pods<\/em>.<\/li>\n<\/ul>\n<p>At the time of writing, there are no official updates regarding head pod fault tolerance and auto resume capabilities. Though head pod failures are rare, in the unlikely event of such a failure, you will need to manually restart your training job. However, you can still resume progress from the last saved checkpoint. To minimize the risk of hardware-related head pod failures, it\u2019s advised to place the head pod on a dedicated, CPU-only SageMaker HyperPod node, because GPU failures are a common training job failure point.<\/p>\n<h3>Ray Train worker failures<\/h3>\n<p>Ray Train is designed with fault tolerance to handle worker failures, such as <code>RayActorErrors<\/code>. When a failure occurs, the affected workers are stopped, and new ones are automatically started to maintain operations. However, for training progress to continue seamlessly after a failure, saving and loading checkpoints is essential. Without proper checkpointing, the training script will restart, but all progress will be lost. Checkpointing is therefore a critical component of Ray Train\u2019s fault tolerance mechanism and needs to be implemented in your code.<\/p>\n<h4>Automatic recovery<\/h4>\n<p>When a failure is detected, Ray shuts down failed workers and provisions new ones. While this happens, we can tell the training function to keep retrying until training can continue. Each instance of recovery from a worker failure is considered a retry. 
We can set the number of retries through the <code>max_failures<\/code> attribute of the <code><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/api\/doc\/ray.train.FailureConfig.html\" target=\"_blank\" rel=\"noopener\">FailureConfig<\/a><\/code>, which is set in the <code><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/api\/doc\/ray.train.RunConfig.html\" target=\"_blank\" rel=\"noopener\">RunConfig<\/a><\/code> passed to the <code>Trainer<\/code> (for example, <code><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/api\/doc\/ray.train.torch.TorchTrainer.html\" target=\"_blank\" rel=\"noopener\">TorchTrainer<\/a><\/code>). See the following code:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from ray.train import RunConfig, FailureConfig\n# Tries to recover a run up to this many times.\nrun_config = RunConfig(failure_config=FailureConfig(max_failures=2))\n# No limit on the number of retries.\nrun_config = RunConfig(failure_config=FailureConfig(max_failures=-1))<\/code><\/pre>\n<\/p><\/div>\n<p>For more information, see <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/user-guides\/fault-tolerance.html\" target=\"_blank\" rel=\"noopener\">Handling Failures and Node Preemption<\/a>.<\/p>\n<h4>Checkpoints<\/h4>\n<p>A <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/api\/doc\/ray.train.Checkpoint.html#ray.train.Checkpoint\" target=\"_blank\" rel=\"noopener\">checkpoint<\/a> in Ray Train is a lightweight interface representing a directory stored either locally or remotely. For example, a cloud-based checkpoint might point to <code>s3:\/\/my-bucket\/checkpoint-dir<\/code>, and a local checkpoint might point to <code>\/tmp\/checkpoint-dir<\/code>. 
To learn more, see <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/user-guides\/checkpoints.html#train-dl-saving-checkpoints\" target=\"_blank\" rel=\"noopener\">Saving checkpoints during training<\/a>.<\/p>\n<p>To save a checkpoint in the training loop, you first need to write your checkpoint to a local directory, which can be temporary. When saving, you can use checkpoint utilities from other frameworks like <code>torch.save<\/code>, <code>pl.Trainer.save_checkpoint<\/code>, <code>accelerator.save_model<\/code>, <code>save_pretrained<\/code>, <code>tf.keras.Model.save<\/code>, and more. Then you create a checkpoint from the directory using <code>Checkpoint.from_directory<\/code>. Finally, report the checkpoint to Ray Train using <code>ray.train.report(metrics, checkpoint=...)<\/code>. The metrics reported alongside the checkpoint are used to keep track of the best-performing checkpoints. Reporting will upload the checkpoint to <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/user-guides\/persistent-storage.html#persistent-storage-guide\" target=\"_blank\" rel=\"noopener\">persistent storage<\/a>.<\/p>\n<p>If you save checkpoints with <code><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/api\/doc\/ray.train.report.html#ray.train.report\" target=\"_blank\" rel=\"noopener\">ray.train.report(..., checkpoint=...)<\/a><\/code> and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage is not set up. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location.<\/p>\n<p>Finally, clean up the local temporary directory to free up disk space (for example, by exiting the <code>tempfile.TemporaryDirectory<\/code> context). 
We can save a checkpoint every epoch or every few iterations.<\/p>\n<p>The following diagram illustrates this setup.<\/p>\n<p><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/checkpoint-arch-1.png\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-102589 size-large\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/checkpoint-arch-1-1024x207.png\" alt=\"Ray Checkpointing Architecture\" width=\"1024\" height=\"207\"\/><\/a><\/p>\n<p>The following code is an example of saving checkpoints using native PyTorch:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">import os\nimport tempfile\n\nimport numpy as np\nimport torch\nimport torch.nn as nn\nfrom torch.optim import Adam\n\nimport ray.train.torch\nfrom ray import train\nfrom ray.train import Checkpoint, ScalingConfig\nfrom ray.train.torch import TorchTrainer\n\n\ndef train_func(config):\n    n = 100\n    # create a toy dataset\n    # data   : X - dim = (n, 4)\n    # target : Y - dim = (n, 1)\n    X = torch.Tensor(np.random.normal(0, 1, size=(n, 4)))\n    Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1)))\n    # toy neural network : 1-layer\n    # Wrap the model in DDP\n    model = ray.train.torch.prepare_model(nn.Linear(4, 1))\n    criterion = nn.MSELoss()\n\n    optimizer = Adam(model.parameters(), lr=3e-4)\n    for epoch in range(config[\"num_epochs\"]):\n        y = model.forward(X)\n        loss = criterion(y, Y)\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n\n        metrics = {\"loss\": loss.item()}\n\n        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:\n            checkpoint = None\n\n            should_checkpoint = epoch % config.get(\"checkpoint_freq\", 1) == 0\n            # In standard DDP 
training, where the model is the same across all ranks,\n            # only the global rank 0 worker needs to save and report the checkpoint\n            if train.get_context().get_world_rank() == 0 and should_checkpoint:\n                torch.save(\n                    model.module.state_dict(),  # NOTE: Unwrap the model.\n                    os.path.join(temp_checkpoint_dir, \"model.pt\"),\n                )\n                checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)\n\n            train.report(metrics, checkpoint=checkpoint)\n\n\ntrainer = TorchTrainer(\n    train_func,\n    train_loop_config={\"num_epochs\": 5},\n    scaling_config=ScalingConfig(num_workers=2),\n)\nresult = trainer.fit()\n<\/code><\/pre>\n<\/p><\/div>\n<p>Ray Train also comes with <code><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/train\/api\/doc\/ray.train.CheckpointConfig.html#ray.train.CheckpointConfig\" target=\"_blank\" rel=\"noopener\">CheckpointConfig<\/a><\/code>, a way to configure checkpointing options:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from ray.train import RunConfig, CheckpointConfig\n# Example 1: Only keep the 2 *most recent* checkpoints and delete the others.\nrun_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2))\n# Example 2: Only keep the 2 *best* checkpoints and delete the others.\nrun_config = RunConfig(\n    checkpoint_config=CheckpointConfig(\n        num_to_keep=2,\n        # *Best* checkpoints are determined by these params:\n        checkpoint_score_attribute=\"mean_accuracy\",\n        checkpoint_score_order=\"max\",\n    ),\n    # This will store checkpoints on S3.\n    storage_path=\"s3:\/\/remote-bucket\/location\",\n)<\/code><\/pre>\n<\/p><\/div>\n<p>To <a rel=\"nofollow\" target=\"_blank\" 
href=\"https:\/\/docs.ray.io\/en\/latest\/train\/user-guides\/fault-tolerance.html#train-fault-tolerance\" target=\"_blank\" rel=\"noopener\">restore training state from a checkpoint<\/a> if your training job were to fail and retry, you should modify your training loop to auto resume and then restore a Ray Train job. By pointing to the path of your saved checkpoints, you can restore your trainer and continue training. Here\u2019s a quick example:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from ray.train.torch import TorchTrainer\n\nrestored_trainer = TorchTrainer.restore(\n    path=\"~\/ray_results\/dl_trainer_restore\",  # Can also be a cloud storage path like S3\n    datasets=get_datasets(),\n)\nresult = restored_trainer.fit()\n<\/code><\/pre>\n<\/p><\/div>\n<p>To streamline recovery, you can add auto resume logic to your script. This checks if a valid experiment directory exists and restores the trainer if available. 
If not, it starts a new experiment:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-python\">from ray import train\nfrom ray.train.torch import TorchTrainer\n\nexperiment_path = \"~\/ray_results\/dl_restore_autoresume\"\nif TorchTrainer.can_restore(experiment_path):\n    trainer = TorchTrainer.restore(experiment_path, datasets=get_datasets())\nelse:\n    trainer = TorchTrainer(\n        train_loop_per_worker=train_loop_per_worker,\n        datasets=get_datasets(),\n        scaling_config=train.ScalingConfig(num_workers=2),\n        run_config=train.RunConfig(\n            storage_path=\"~\/ray_results\",\n            name=\"dl_restore_autoresume\",\n        ),\n    )\nresult = trainer.fit()<\/code><\/pre>\n<\/p><\/div>\n<p>To summarize, to provide fault tolerance and auto resume when using Ray Train libraries, set your <code>max_failures<\/code> parameter in the <code>FailureConfig<\/code> (we recommend setting it to <code>-1<\/code> to make sure it will keep retrying until the SageMaker HyperPod node is rebooted or replaced), and make sure you have enabled checkpointing in your code.<\/p>\n<h3>Ray worker pod failures<\/h3>\n<p>In addition to the aforementioned mechanisms to recover from Ray Train worker failures, Ray also provides fault tolerance at the worker pod level. When a worker pod fails (this includes scenarios in which the <code>raylet<\/code> process fails), the running tasks and actors on it will fail and the objects owned by worker processes of this pod will be lost. 
In this case, the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/ray-core\/fault_tolerance\/tasks.html#fault-tolerance-tasks\" target=\"_blank\" rel=\"noopener\">tasks<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/ray-core\/fault_tolerance\/actors.html#fault-tolerance-actors\" target=\"_blank\" rel=\"noopener\">actors<\/a>, and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/ray-core\/fault_tolerance\/objects.html#fault-tolerance-objects\" target=\"_blank\" rel=\"noopener\">objects<\/a> fault tolerance mechanisms will start and try to recover from the failures using other worker pods.<\/p>\n<p>These mechanisms will be implicitly handled by the Ray Train library. To learn more about the underlying fault tolerance for tasks, actors, and objects (implemented at the Ray Core level), see <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.ray.io\/en\/latest\/ray-core\/fault-tolerance.html\" target=\"_blank\" rel=\"noopener\">Fault Tolerance<\/a>.<\/p>\n<p>In practice, this means that in case of a worker pod failure, the following occurs:<\/p>\n<ul>\n<li>If there is a free worker pod in the Ray cluster, Ray will recover the failed worker pod by replacing it with the free worker pod.<\/li>\n<li>If there is no free worker pod, but in the underlying SageMaker HyperPod cluster there are free SageMaker HyperPod nodes, Ray will schedule a new worker pod onto one of the free SageMaker HyperPod nodes. 
This pod will join the running Ray cluster and the failure will be recovered using this new worker pod.<\/li>\n<\/ul>\n<p>In the context of KubeRay, Ray worker nodes are represented by Kubernetes pods, and failures at this level can include issues such as pod eviction or preemption caused by software-level factors.<\/p>\n<p>However, another critical scenario to consider is hardware failures. If the underlying SageMaker HyperPod node becomes unavailable due to a hardware issue, such as a GPU error, it would inevitably cause the Ray worker pod running on that node to fail as well. At that point, the fault tolerance and auto-healing mechanisms of your SageMaker HyperPod cluster start and will reboot or replace the faulty node. After the new healthy node is added into the SageMaker HyperPod cluster, Ray will schedule a new worker pod onto the SageMaker HyperPod node and recover the interrupted training. In this case, both the Ray fault tolerance mechanism and the SageMaker HyperPod resiliency features work together seamlessly and make sure that even in case of a hardware failure, your ML training workload can auto resume and pick up from where it was interrupted.<\/p>\n<p>As you have seen, there are various built-in resiliency and fault-tolerance mechanisms that allow your Ray Train workload on SageMaker HyperPod to recover and auto resume. Because these mechanisms will essentially recover by restarting the training job, it is crucial that checkpointing is implemented in the training script. 
It is also generally advised to save the checkpoints on a shared and persistent path, such as an <a rel=\"nofollow\" target=\"_blank\" href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener\">Amazon Simple Storage Service<\/a> (Amazon S3) bucket or FSx for Lustre file system.<\/p>\n<h2>Clean up<\/h2>\n<p>To delete your SageMaker HyperPod cluster created in this post, you can either use the SageMaker AI console or use the following AWS CLI command:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">aws sagemaker delete-cluster --cluster-name &lt;cluster_name&gt;<\/code><\/pre>\n<\/p><\/div>\n<p>Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.<\/p>\n<p>If you used the CloudFormation stack to create resources, you can delete it using the following command:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-bash\">aws cloudformation delete-stack --stack-name &lt;stack_name&gt;<\/code><\/pre>\n<\/p><\/div>\n<h2>Conclusion<\/h2>\n<p>This post demonstrated how to set up and deploy Ray clusters on SageMaker HyperPod, highlighting key considerations such as storage configuration and fault tolerance and auto resume mechanisms.<\/p>\n<p>Running Ray jobs on SageMaker HyperPod offers a powerful solution for distributed AI\/ML workloads, combining the flexibility of Ray with the robust infrastructure of SageMaker HyperPod. This integration provides enhanced resiliency and auto resume capabilities, which are crucial for long-running and resource-intensive tasks. By using Ray\u2019s distributed computing framework and the built-in features of SageMaker HyperPod, you can efficiently manage complex ML workflows, specifically training workloads as covered in this post. 
As AI\/ML workloads continue to grow in scale and complexity, the combination of Ray and SageMaker HyperPod offers a scalable, resilient, and efficient platform for tackling the most demanding computational challenges in machine learning.<\/p>\n<p>To get started with SageMaker HyperPod, refer to the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/catalog.workshops.aws\/sagemaker-hyperpod-eks\/en-US\" target=\"_blank\" rel=\"noopener\">Amazon EKS Support in Amazon SageMaker HyperPod workshop<\/a> and the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-hyperpod.html\" target=\"_blank\" rel=\"noopener\">Amazon SageMaker HyperPod Developer Guide<\/a>. To learn more about the aws-do-ray framework, refer to the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/aws-samples\/aws-do-ray\" target=\"_blank\" rel=\"noopener\">GitHub repo<\/a>.<\/p>\n<hr\/>\n<h3>About the Authors<\/h3>\n<p style=\"clear: both\"><strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/mvincig.jpg\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-102615 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/mvincig.jpg\" alt=\"mvincig\" width=\"100\" height=\"133\"\/><\/a>Mark Vinciguerra<\/strong> is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on the Automotive and Manufacturing sector, specializing in helping organizations architect, optimize, and scale artificial intelligence and machine learning solutions, with particular expertise in autonomous vehicle technologies. 
Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.<\/p>\n<p style=\"clear: both\"><strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/flostahl.jpg\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-102616 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/flostahl.jpg\" alt=\"flostahl\" width=\"100\" height=\"133\"\/><\/a>Florian Stahl<\/strong> is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in Artificial Intelligence, Machine Learning, and Generative AI solutions, helping customers optimize and scale their AI\/ML workloads on AWS. With a background as a Data Scientist, Florian focuses on working with customers in the Autonomous Vehicle space, bringing deep technical expertise to help organizations design and implement sophisticated machine learning solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their machine learning investments on AWS.<\/p>\n<p style=\"clear: both\"><strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/anoopx.jpg\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-102618 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/anoopx.jpg\" alt=\"anoopx\" width=\"100\" height=\"133\"\/><\/a>Anoop Saha<\/strong> is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. 
He partners with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.<\/p>\n<p style=\"clear: both\"><strong><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/iankouls.jpg\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" loading=\"lazy\" class=\"alignleft wp-image-102617 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2025\/03\/26\/iankouls.jpg\" alt=\"iankouls\" width=\"100\" height=\"133\"\/><\/a>Alex Iankoulski<\/strong> is a Principal Solutions Architect, ML\/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world\u2019s biggest challenges.<\/p>\n<p>       \n      <\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Foundation model (FM) training and inference has led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. 
They require efficient systems for distributing workloads across multiple GPU accelerated servers, and optimizing [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":958,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[387,741,738,737,736,740,388,739],"class_list":["post-956","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-amazon","tag-distributed","tag-hyperpod","tag-jobs","tag-ray","tag-resilient","tag-sagemaker","tag-scalable"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/956","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=956"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/956\/revisions"}],"predecessor-version":[{"id":957,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/956\/revisions\/957"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/958"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=956"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=956"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=956"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}",
"templated":true}]}}<!-- This website is optimized by Airlift. Learn more: https://airlift.net. Template:. Learn more: https://airlift.net. Template: 69d9690a190636c2e0989534. Config Timestamp: 2026-04-10 21:18:02 UTC, Cached Timestamp: 2026-06-13 15:20:51 UTC -->