No-code data preparation for time series forecasting using Amazon SageMaker Canvas

Time series forecasting helps businesses predict future trends based on historical data patterns, whether it's for sales projections, inventory management, or demand forecasting. Traditional approaches require extensive knowledge of statistical methods and data science techniques to process raw time series data.

Amazon SageMaker Canvas offers no-code solutions that simplify data wrangling, making time series forecasting accessible to all users regardless of their technical background. In this post, we explore how SageMaker Canvas and SageMaker Data Wrangler provide no-code data preparation techniques that empower users of all backgrounds to prepare data and build time series forecasting models in a single interface with confidence.

Solution overview

Using SageMaker Data Wrangler for data preparation allows you to modify data for predictive analytics without programming knowledge. In this solution, we demonstrate the steps associated with this process. The solution includes the following:

  • Data import from diverse sources
  • Automated no-code algorithmic suggestions for data preparation
  • Step-by-step processes for preparation and analysis
  • Visual interfaces for data visualization and analysis
  • Export capabilities after data preparation
  • Built-in security and compliance features

In this post, we focus on data preparation for time series forecasting using SageMaker Canvas.

Walkthrough

The following is a walkthrough of the solution for data preparation using Amazon SageMaker Canvas. For the walkthrough, you use the consumer electronics synthetic dataset found in this SageMaker Canvas Immersion Day lab, which we encourage you to try. This consumer electronics related time series (RTS) dataset primarily contains historical price data that corresponds to sales transactions over time. The dataset is designed to complement target time series (TTS) data to improve prediction accuracy in forecasting models, particularly for consumer electronics sales, where price changes can significantly impact buying behavior. It can be used for demand forecasting, price optimization, and market analysis in the consumer electronics sector.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Solution walkthrough

Below, we provide the solution walkthrough and explain how users can take a dataset, prepare the data without writing code using Data Wrangler, and run and train a time series forecasting model using SageMaker Canvas.

Sign in to the AWS Management Console and go to Amazon SageMaker AI and then to Canvas. On the Get started page, choose the Import and prepare option. You will see several options to import your dataset into SageMaker Data Wrangler. First, select Tabular Data, because we will be using this data for time series forecasting. The following options are available to select from:

  1. Local upload
  2. Canvas Datasets
  3. Amazon S3
  4. Amazon Redshift
  5. Amazon Athena
  6. Databricks
  7. MySQL
  8. PostgreSQL
  9. SQL Server
  10. RDS

For this demo, select Local upload. When you use this option, the data is stored on the SageMaker instance, specifically on an Amazon Elastic File System (Amazon EFS) storage volume in the SageMaker Studio environment. This storage is tied to the SageMaker Studio instance, but for more permanent data storage purposes, Amazon Simple Storage Service (Amazon S3) is a good option when working with SageMaker Data Wrangler. For long-term data management, Amazon S3 is recommended.
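
If you choose the Amazon S3 route instead, a short boto3 sketch like the following is one way to stage the dataset. The bucket name and object key are assumptions, not values from this walkthrough; substitute your own.

# Minimal sketch: copy the local CSV to Amazon S3 for more durable storage.
import boto3

s3 = boto3.client("s3")
bucket = "my-canvas-datasets"              # assumed bucket name
key = "canvas/consumer_electronics.csv"    # assumed object key

s3.upload_file("consumer_electronics.csv", bucket, key)
print(f"Uploaded to s3://{bucket}/{key}")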

Select the consumer_electronics.csv file from the prerequisites. After selecting the file to import, you can use the Import settings panel to set your desired configurations. For the purposes of this demo, leave the options at their default values.

Import tabular data screen with sampling methods and sampling size

After the import is complete, use the Data flow options to modify the newly imported data. For forecasting, you may need to clean up the data so the service can correctly interpret the values and disregard any errors. SageMaker Canvas has various options to accomplish this, including Chat for data prep, which applies data modifications through natural language, and Add Transform. Chat for data prep may be best for users who prefer natural language processing (NLP) interactions and may not be familiar with technical data transformations. Add Transform is best for data professionals who know which transformations they want to apply to their data.

For time series forecasting using Amazon SageMaker Canvas, data must be prepared in a certain way for the service to correctly forecast and understand the data. To make a time series forecast using SageMaker Canvas, the linked documentation lists the following requirements:

  • A timestamp column with all values having the datetime type.
  • A target column that has the values that you're using to forecast future values.
  • An item ID column that contains unique identifiers for each item in your dataset, such as SKU numbers.

The datetime values in the timestamp column must use one of the following formats (a short sketch for normalizing timestamps follows the list):

  • YYYY-MM-DD HH:MM:SS
  • YYYY-MM-DDTHH:MM:SSZ
  • YYYY-MM-DD
  • MM/DD/YY
  • MM/DD/YY HH:MM
  • MM/DD/YYYY
  • YYYY/MM/DD HH:MM:SS
  • YYYY/MM/DD
  • DD/MM/YYYY
  • DD/MM/YY
  • DD-MM-YY
  • DD-MM-YYYY
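
If your raw file uses a different or inconsistent timestamp format, a small pandas pass like the following can normalize it before import. The file name and the ts column name match the dataset used later in this walkthrough, but treat them as assumptions for your own data.

# Sketch: coerce the timestamp column to the YYYY-MM-DD HH:MM:SS format Canvas accepts.
import pandas as pd

df = pd.read_csv("consumer_electronics.csv")             # assumed file name
df["ts"] = pd.to_datetime(df["ts"], errors="coerce")     # parse mixed formats
df = df.dropna(subset=["ts"])                            # drop rows that could not be parsed
df["ts"] = df["ts"].dt.strftime("%Y-%m-%d %H:%M:%S")
df.to_csv("consumer_electronics_prepared.csv", index=False)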

You can make forecasts for the following intervals:

  • 1 min
  • 5 min
  • 15 min
  • 30 min
  • 1 hour
  • 1 day
  • 1 week
  • 1 month
  • 1 year

For this example, remove the $ in the data by using the Chat for data prep option. Give the chat a prompt such as "Can you get rid of the $ in my data", and it will generate code to accommodate your request and modify the data, giving you a no-code solution to prepare the data for future modeling and predictive analysis. Choose Add to Steps to accept this code and apply the changes to the data.
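
The generated code is roughly equivalent to the following pandas snippet. The price column name is an assumption based on this dataset, and the exact code Canvas produces may differ.

# Illustrative equivalent of the generated transform: strip "$" and cast to float.
import pandas as pd

df = pd.read_csv("consumer_electronics_prepared.csv")    # assumed file name
df["price"] = (
    df["price"].astype(str).str.replace("$", "", regex=False).astype(float)
)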

Chat for data prep options

You can also convert values to the float data type and check for missing data in your uploaded CSV file using either the Chat for data prep or Add Transform options. To drop missing values using Add Transform:

  1. Select Add Transform from the interface
  2. Choose Handle Missing from the transform options
  3. Select Drop missing from the available operations
  4. Choose the columns you want to check for missing values
  5. Select Preview to verify the changes
  6. Choose Add to confirm and apply the transformation

SageMaker Data Wrangler interface displaying consumer electronics data, column distributions, and options to handle missing values across all columns

For time series forecasting, inferring missing values and resampling the dataset to a certain frequency (hourly, daily, or weekly) are also important. In SageMaker Data Wrangler, the frequency of the data can be changed by choosing Add Transform, selecting Time Series, choosing Resample from the Transform dropdown, and then selecting the Timestamp dropdown, ts in this example. You can then select advanced options. For example, choose Frequency unit and then select the desired frequency from the list.

SageMaker Data Wrangler interface featuring consumer electronics data, column-wise visualizations, and time series resampling configuration

SageMaker Data Wrangler provides several methods to handle missing values in time series data through its Handle missing transform. You can choose from options such as forward fill or backward fill, which are particularly useful for maintaining the temporal structure of the data. These operations can also be applied using natural language commands in Chat for data prep, allowing flexible and efficient handling of missing values in time series forecasting preparation.
Data preprocessing interface displaying retail demand dataset with visualization, statistics, and imputation configuration
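
Outside of Canvas, the same resample-and-fill logic looks roughly like the following pandas sketch; it is only an illustration of what the transforms do. The ts, item_id, and price column names are assumptions based on this dataset.

# Sketch: resample each item to a daily frequency, then forward/backward fill gaps.
import pandas as pd

df = pd.read_csv("consumer_electronics_prepared.csv", parse_dates=["ts"])
daily = (
    df.set_index("ts")
      .groupby("item_id")["price"]   # assumed item identifier and target columns
      .resample("D")
      .mean()
)
daily = daily.groupby(level="item_id").ffill()   # forward fill within each item
daily = daily.groupby(level="item_id").bfill()   # backward fill any leading gaps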

To create the data flow, choose Create model. Then, choose Run Validation, which checks the data to confirm the processing steps were performed correctly. After this step of data transformation, you can access more options by selecting the purple plus sign. The options include Get data insights, Chat for data prep, Combine data, Create model, and Export.

Data Wrangler interface displaying validated data flow from local upload to drop missing step, with additional data preparation options

The prepared data can then be connected to SageMaker AI for time series forecasting techniques, in this case, to predict future demand based on the historical data that has been prepared for machine learning.

When using SageMaker, it's also important to consider data storage and security. For the local import feature, data is stored on Amazon EFS volumes and encrypted by default. For more permanent storage, Amazon S3 is recommended. S3 provides security features such as server-side encryption (SSE-S3, SSE-KMS, or SSE-C), fine-grained access controls through AWS Identity and Access Management (IAM) roles and bucket policies, and the ability to use VPC endpoints for added network security. To help ensure data protection in either case, it's important to implement proper access controls, use encryption for data at rest and in transit, regularly audit access logs, and follow the principle of least privilege when assigning permissions.
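
As an illustration of the storage-side controls, the following boto3 sketch turns on default SSE-KMS encryption and blocks public access for a bucket. The bucket name is a placeholder, and your organization's policies may call for different settings.

# Sketch: enable default SSE-KMS encryption and block public access on a bucket.
import boto3

s3 = boto3.client("s3")
bucket = "my-canvas-datasets"   # assumed bucket name

s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}]
    },
)
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)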

In this next step, you learn how to train a model using SageMaker Canvas. Building on the previous step, select the purple plus sign, select Create Model, and then select Export to create a model. After selecting a column to predict (select price for this example), you go to the Build screen, with options such as Quick build and Standard build. Based on the column selected, the model will predict future values from the data being used.

SageMaker Canvas Version 1 model configuration interface for 3+ category price prediction with 20k sample dataset analysis

Clean up

To avoid incurring future charges, delete the SageMaker Data Wrangler data flow and the S3 buckets, if used for storage.

  1. In the SageMaker console, navigate to Canvas
  2. Select Import and prepare
  3. Find your data flow in the list
  4. Choose the three dots (⋮) menu next to your flow
  5. Select Delete to remove the data flow
    SageMaker Data Wrangler dashboard with recent data flow, last update time, and options to manage flows and create models

If you used S3 for storage:

  1. Open the Amazon S3 console
  2. Navigate to your bucket
  3. Select the bucket used for this project
  4. Select Delete
  5. Type the bucket name to confirm deletion
  6. Choose Delete bucket

Conclusion

In this post, we showed you how Amazon SageMaker Data Wrangler offers a no-code solution for time series data preparation, traditionally a task requiring technical expertise. By using the intuitive interface of the Data Wrangler console and natural language-powered tools, even users who don't have a technical background can effectively prepare their data for future forecasting needs. This democratization of data preparation not only saves time and resources but also empowers a wider range of professionals to engage in data-driven decision-making.


About the author

Muni T. Bondu is a Solutions Architect at Amazon Web Services (AWS), based in Austin, Texas. She holds a Bachelor of Science in Computer Science, with concentrations in Artificial Intelligence and Human-Computer Interaction, from the Georgia Institute of Technology.

How iFood built a platform to run hundreds of machine learning models with Amazon SageMaker Inference

Headquartered in São Paulo, Brazil, iFood is a national private company and the leader in food-tech in Latin America, processing millions of orders monthly. iFood has stood out for its strategy of incorporating cutting-edge technology into its operations. With the support of AWS, iFood has developed a robust machine learning (ML) inference infrastructure, using services such as Amazon SageMaker to efficiently create and deploy ML models. This partnership has allowed iFood not only to optimize its internal processes, but also to offer innovative solutions to its delivery partners and restaurants.

iFood's ML platform comprises a set of tools, processes, and workflows developed with the following objectives:

  • Accelerate the development and training of AI/ML models, making them more reliable and reproducible
  • Make sure that deploying these models to production is reliable, scalable, and traceable
  • Facilitate the testing, monitoring, and evaluation of models in production in a transparent, accessible, and standardized way

To achieve these objectives, iFood uses SageMaker, which simplifies the training and deployment of models. Additionally, the integration of SageMaker features in iFood's infrastructure automates critical processes, such as generating training datasets, training models, deploying models to production, and continuously monitoring their performance.

In this post, we show how iFood uses SageMaker to revolutionize its ML operations. By harnessing the power of SageMaker, iFood streamlines the entire ML lifecycle, from model training to deployment. This integration not only simplifies complex processes but also automates critical tasks.

AI inference at iFood
iFood has harnessed the power of a robust AI/ML platform to elevate the customer experience across its various touchpoints. Using cutting-edge AI/ML capabilities, the company has developed a suite of transformative solutions to address a multitude of customer use cases:

  • Personalized recommendations – At iFood, AI-powered recommendation models analyze a customer's past order history, preferences, and contextual factors to suggest the most relevant restaurants and menu items. This personalized approach helps customers discover new cuisines and dishes tailored to their tastes, enhancing satisfaction and driving increased order volumes.
  • Intelligent order tracking – iFood's AI systems track orders in real time, predicting delivery times with a high degree of accuracy. By understanding factors like traffic patterns, restaurant preparation times, and courier locations, the AI can proactively notify customers of their order status and expected arrival, reducing uncertainty and anxiety during the delivery process.
  • Automated customer service – To handle the thousands of daily customer inquiries, iFood has developed an AI-powered chatbot that can quickly resolve common issues and questions. This intelligent virtual agent understands natural language, accesses relevant data, and provides personalized responses, delivering fast and consistent support without overburdening the human customer service team.
  • Grocery shopping assistance – Integrating advanced language models, iFood's app allows customers to simply speak or type their recipe needs or grocery list, and the AI automatically generates a detailed shopping list. This voice-enabled grocery planning feature saves customers time and effort, enhancing their overall shopping experience.

Through these diverse AI-powered initiatives, iFood is able to anticipate customer needs, streamline key processes, and deliver a consistently exceptional experience, further strengthening its position as the leading food-tech platform in Latin America.

Solution overview

The following diagram illustrates iFood's legacy architecture, which had separate workflows for data science and engineering teams, creating challenges in efficiently deploying accurate, real-time machine learning models into production systems.

In the past, the data science and engineering teams at iFood operated independently. Data scientists would build models using notebooks, adjust weights, and publish them onto services. Engineering teams would then struggle to integrate these models into production systems. This disconnection between the two teams made it challenging to deploy accurate real-time ML models.

To overcome this challenge, iFood built an internal ML platform that helped bridge this gap. This platform has streamlined the workflow, providing a seamless experience for creating, training, and delivering models for inference. It provides a centralized integration where data scientists can build, train, and deploy models seamlessly in an integrated way, considering the development workflow of the teams. Engineering teams can consume these models and integrate them into applications from both an online and offline perspective, enabling a more efficient and streamlined workflow.

By breaking down the barriers between data science and engineering, AWS AI platforms empowered iFood to use the full potential of their data and accelerate the development of AI applications. The automated deployment and scalable inference capabilities provided by SageMaker made sure that models were readily available to power intelligent applications and provide accurate predictions on demand. This centralization of ML services as a product has been a game changer for iFood, allowing them to focus on building high-performing models rather than the intricate details of inference.

One of the core capabilities of iFood's ML platform is the ability to provide the infrastructure to serve predictions. Several use cases are supported by the inference made available through ML Go!, which is responsible for deploying SageMaker pipelines and endpoints. The former are used to schedule offline prediction jobs, and the latter are employed to create model services that are consumed by the application services. The following diagram illustrates iFood's updated architecture, which incorporates an internal ML platform built to streamline workflows between data science and engineering teams, enabling efficient deployment of machine learning models into production systems.

Integrating model deployment into the service development process was a key initiative to enable data scientists and ML engineers to deploy and maintain these models. The ML platform empowers the building and evolution of ML systems. Several other integrations with other important platforms, like the feature platform and data platform, were delivered to improve the experience for users as a whole. The process of consuming ML-based decisions was streamlined, but it doesn't end there. iFood's ML platform, ML Go!, is now focusing on new inference capabilities, supported by recent features that the iFood team helped ideate and develop. The following diagram illustrates the final architecture of iFood's ML platform, showcasing how model deployment is integrated into the service development process, the platform's connections with the feature and data platforms, and its focus on new inference capabilities.

One of the biggest changes is the creation of a single abstraction for connecting with SageMaker endpoints and jobs, called ML Go! Gateway, and also the separation of concerns within the endpoints through the Inference Components feature, making serving faster and more efficient. In this new inference structure, the endpoints are also managed by the ML Go! CI/CD, leaving the pipelines to deal only with model promotions rather than the infrastructure itself. This reduces the lead time to change and the change failure ratio over deployments.

Using SageMaker Inference model serving containers

One of the key features of modern machine learning platforms is the standardization of machine learning and AI services. By encapsulating models and dependencies as Docker containers, these platforms ensure consistency and portability across different environments and stages of ML. Using SageMaker, data scientists and developers can use pre-built Docker containers, making it straightforward to deploy and manage ML services. As a project progresses, they can spin up new instances and configure them according to their specific requirements. SageMaker provides Docker containers that are designed to work seamlessly with SageMaker. These containers provide a standardized and scalable environment for running ML workloads on SageMaker.

SageMaker provides a set of pre-built containers for popular ML frameworks and algorithms, such as TensorFlow, PyTorch, XGBoost, and many others. These containers are optimized for performance and include all the necessary dependencies and libraries pre-installed, making it straightforward to get started with your ML projects. In addition to the pre-built containers, SageMaker provides options to bring your own custom containers, which include your specific ML code, dependencies, and libraries. This can be particularly useful if you're using a less common framework or have specific requirements that aren't met by the pre-built containers.
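
For reference, the SageMaker Python SDK can look up the URI of a pre-built container for you. The framework version, Python version, Region, and instance type below are examples only; verify the combinations available in your Region and SDK release.

# Sketch: resolve a pre-built SageMaker PyTorch inference container image URI.
from sagemaker import image_uris

image_uri = image_uris.retrieve(
    framework="pytorch",
    region="us-east-1",           # assumed Region
    version="2.1",                # example framework version
    py_version="py310",           # example Python version
    instance_type="ml.g5.xlarge", # example instance type
    image_scope="inference",
)
print(image_uri)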

iFood was highly focused on using custom containers for the training and deployment of ML workloads, providing a consistent and reproducible environment for ML experiments, and making it straightforward to track and replicate results. The first step in this journey was to standardize the ML custom code, which is the piece of code that the data scientists should focus on. Without a notebook, and with BruceML, the way to create the code to train and serve models changed, to be encapsulated from the start as container images. BruceML was responsible for creating the scaffolding required to seamlessly integrate with the SageMaker platform, allowing the teams to take advantage of its various features, such as hyperparameter tuning, model deployment, and monitoring. By standardizing ML services and using containerization, modern platforms democratize ML, enabling iFood to rapidly build, deploy, and scale intelligent applications.

Automating model deployment and ML system retraining

When running ML models in production, it's critical to have a robust and automated process for deploying and recalibrating these models across different use cases. This helps make sure the models remain accurate and performant over time. The team at iFood understood this challenge well; it's not just the model that gets deployed. Instead, they rely on another concept to keep things running well: ML pipelines.

Using Amazon SageMaker Pipelines, they were able to build a CI/CD system for ML to deliver automated retraining and model deployment. They also integrated this entire system with the company's existing CI/CD pipeline, making it efficient while maintaining the good DevOps practices used at iFood. It starts with the ML Go! CI/CD pipeline pushing the latest code artifacts containing the model training and deployment logic. It includes the training process, which uses different containers for implementing the entire pipeline. When training is complete, the inference pipeline can be executed to begin the model deployment. It can be an entirely new model, or the promotion of a new version to increase the performance of an existing one. Every model available for deployment is also secured and registered automatically by ML Go! in Amazon SageMaker Model Registry, providing versioning and tracking capabilities.

The final step depends on the intended inference requirements. For batch prediction use cases, the pipeline creates a SageMaker batch transform job to run large-scale predictions. For real-time inference, the pipeline deploys the model to a SageMaker endpoint, carefully selecting the appropriate container variant and instance type to handle the expected production traffic and latency needs. This end-to-end automation has been a game changer for iFood, allowing them to rapidly iterate on their ML models and deploy updates and recalibrations quickly and confidently across their various use cases. SageMaker Pipelines has provided a streamlined way to orchestrate these complex workflows, making sure model operationalization is efficient and reliable.

Running inference under different SLA formats

iFood uses the inference capabilities of SageMaker to power its intelligent applications and deliver accurate predictions to its customers. By integrating the robust inference options available in SageMaker, iFood has been able to seamlessly deploy ML models and make them available for real-time and batch predictions. For iFood's online, real-time prediction use cases, the company uses SageMaker hosted endpoints to deploy their models. These endpoints are integrated into iFood's customer-facing applications, allowing for immediate inference on incoming data from users. SageMaker handles the scaling and management of these endpoints, making sure that iFood's models are readily available to provide accurate predictions and enhance the user experience.

In addition to real-time predictions, iFood also uses SageMaker batch transform to perform large-scale, asynchronous inference on datasets. This is particularly useful for iFood's data preprocessing and batch prediction requirements, such as generating recommendations or insights for their restaurant partners. SageMaker batch transform jobs enable iFood to efficiently process vast amounts of data, further enhancing their data-driven decision-making.
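
In practice, the two consumption patterns look roughly like the following boto3 calls. This is only a sketch; the endpoint, model, and S3 names are placeholders rather than iFood's actual resources.

# Sketch: real-time inference against a hosted endpoint, and a batch transform job.
import json
import boto3

# Real time: invoke a deployed SageMaker endpoint
runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="recommender-endpoint",           # assumed endpoint name
    ContentType="application/json",
    Body=json.dumps({"customer_id": "123"}),
)
print(response["Body"].read())

# Batch: score a dataset asynchronously with a batch transform job
sm = boto3.client("sagemaker")
sm.create_transform_job(
    TransformJobName="restaurant-insights-batch",  # assumed job name
    ModelName="recommender-model",                 # assumed SageMaker model name
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": "s3://my-bucket/batch-input/"}},
        "ContentType": "text/csv",
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/batch-output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)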

Building upon the success of standardizing on SageMaker Inference, iFood has been instrumental in partnering with the SageMaker Inference team to build and enhance key AI inference capabilities within the SageMaker platform. Since the early days of ML, iFood has provided the SageMaker Inference team with valuable input and expertise, enabling the introduction of several new features and optimizations:

  • Cost and performance optimizations for generative AI inference – iFood helped the SageMaker Inference team develop innovative techniques to optimize the use of accelerators, enabling SageMaker Inference to reduce foundation model (FM) deployment costs by 50% on average and latency by 20% on average with inference components. This breakthrough delivers significant cost savings and performance improvements for customers running generative AI workloads on SageMaker.
  • Scaling improvements for AI inference – iFood's expertise in distributed systems and auto scaling has also helped the SageMaker team develop advanced capabilities to better handle the scaling requirements of generative AI models. These improvements reduce auto scaling times by up to 40% and auto scaling detection by six times, making sure that customers can rapidly scale their inference workloads on SageMaker to meet spikes in demand without compromising performance.
  • Streamlined generative AI model deployment for inference – Recognizing the need for simplified model deployment, iFood collaborated with AWS to introduce the ability to deploy open source large language models (LLMs) and FMs with just a few clicks. This user-friendly functionality removes the complexity traditionally associated with deploying these advanced models, empowering more customers to harness the power of AI.
  • Scale-to-zero for inference endpoints – iFood played a crucial role in collaborating with SageMaker Inference to develop and launch the scale-to-zero feature for SageMaker inference endpoints. This innovative capability allows inference endpoints to automatically shut down when not in use and rapidly spin up on demand when new requests arrive. This feature is especially beneficial for dev/test environments, low-traffic applications, and inference use cases with varying demand, because it eliminates idle resource costs while maintaining the ability to quickly serve requests when needed. The scale-to-zero functionality represents a major advancement in cost-efficiency for AI inference, making it more accessible and economically viable for a wider range of use cases.
  • Packaging AI model inference more efficiently – To further simplify the AI model lifecycle, iFood worked with AWS to enhance SageMaker's capabilities for packaging LLMs and models for deployment. These improvements make it straightforward to prepare and deploy these AI models, accelerating their adoption and integration.
  • Multi-model endpoints for GPU – iFood collaborated with the SageMaker Inference team to launch multi-model endpoints for GPU-based instances. This enhancement lets you deploy multiple AI models on a single GPU-enabled endpoint, significantly improving resource utilization and cost-efficiency. By taking advantage of iFood's expertise in GPU optimization and model serving, SageMaker now offers a solution that can dynamically load and unload models on GPUs, reducing infrastructure costs by up to 75% for customers with multiple models and varying traffic patterns.
  • Asynchronous inference – Recognizing the need to handle long-running inference requests, the team at iFood worked closely with the SageMaker Inference team to develop and launch Asynchronous Inference in SageMaker. This feature lets you process large payloads or time-consuming inference requests without the constraints of real-time API calls. iFood's experience with large-scale distributed systems helped shape this solution, which now allows for better management of resource-intensive inference tasks, and the ability to handle inference requests that might take several minutes to complete. This capability has opened up new use cases for AI inference, particularly in industries dealing with complex data processing tasks such as genomics, video analysis, and financial modeling.

By closely partnering with the SageMaker Inference team, iFood has played a pivotal role in driving the rapid evolution of AI inference and generative AI inference capabilities in SageMaker. The features and optimizations introduced through this collaboration are empowering AWS customers to unlock the transformative potential of inference with greater ease, cost-effectiveness, and performance.

"At iFood, we have been at the forefront of adopting transformative machine learning and AI technologies, and our partnership with the SageMaker Inference product team has been instrumental in shaping the future of AI applications. Together, we've developed strategies to efficiently manage inference workloads, allowing us to run models with speed and price-performance. The lessons we've learned supported us in the creation of our internal platform, which can serve as a blueprint for other organizations looking to harness the power of AI inference. We believe the features we have built in collaboration will broadly help other enterprises who run inference workloads on SageMaker, unlocking new frontiers of innovation and business transformation, by solving recurring and important problems in the universe of machine learning engineering."

– Daniel Vieira, ML Platform manager at iFood

Conclusion

Using the capabilities of SageMaker, iFood transformed its approach to ML and AI, unleashing new possibilities for enhancing the customer experience. By building a robust and centralized ML platform, iFood has bridged the gap between its data science and engineering teams, streamlining the model lifecycle from development to deployment. The integration of SageMaker features has enabled iFood to deploy ML models for both real-time and batch-oriented use cases. For real-time, customer-facing applications, iFood uses SageMaker hosted endpoints to provide immediate predictions and enhance the user experience. Additionally, the company uses SageMaker batch transform to efficiently process large datasets and generate insights for its restaurant partners. This flexibility in inference options has been key to iFood's ability to power a diverse range of intelligent applications.

The automation of deployment and retraining through ML Go!, supported by SageMaker Pipelines and SageMaker Inference, has been a game changer for iFood. It has enabled the company to rapidly iterate on its ML models, deploy updates with confidence, and maintain the ongoing performance and reliability of its intelligent applications. Moreover, iFood's strategic partnership with the SageMaker Inference team has been instrumental in driving the evolution of AI inference capabilities within the platform. Through this collaboration, iFood has helped shape cost and performance optimizations, scaling improvements, and simplified model deployment solutions, all of which now benefit a wider range of AWS customers.

By taking advantage of the capabilities SageMaker offers, iFood has been able to unlock the transformative potential of AI and ML, delivering innovative solutions that enhance the customer experience and strengthen its position as the leading food-tech platform in Latin America. This journey serves as a testament to the power of cloud-based AI infrastructure and the value of strategic partnerships in driving technology-driven business transformation.

By following iFood's example, you can unlock the full potential of SageMaker for your business, driving innovation and staying ahead in your industry.


About the authors

Daniel Vieira is a seasoned Machine Learning Engineering Manager at iFood, with a strong academic background in computer science, holding both a bachelor's and a master's degree from the Federal University of Minas Gerais (UFMG). With over a decade of experience in software engineering and platform development, Daniel leads iFood's ML platform, building a robust, scalable ecosystem that drives impactful ML solutions across the company. In his spare time, Daniel enjoys music, philosophy, and learning about new things while drinking a cup of coffee.

Debora Fanin serves as a Senior Customer Solutions Manager at AWS for the Digital Native Business segment in Brazil. In this role, Debora manages customer transformations, creating cloud adoption strategies to support cost-effective, timely deployments. Her responsibilities include designing change management plans, guiding solution-focused decisions, and addressing potential risks to align with customer objectives. Debora's academic path includes a Master's degree in Administration at FEI and certifications such as Amazon Solutions Architect Associate and Agile credentials. Her professional history spans IT and project management roles across diverse sectors, where she developed expertise in cloud technologies, data science, and customer relations.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and Amazon SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Gopi Mudiyala is a Senior Technical Account Manager at AWS. He helps customers in the financial services industry with their operations in AWS. As a machine learning enthusiast, Gopi works to help customers succeed in their ML journey. In his spare time, he likes to play badminton, spend time with family, and travel.

Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distributed AI

Foundation model (FM) training and inference has led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. They require efficient systems for distributing workloads across multiple GPU accelerated servers, and optimizing developer velocity as well as performance.

Ray is an open source framework that makes it straightforward to create, deploy, and optimize distributed Python jobs. At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. It provides a set of high-level APIs for tasks, actors, and data that abstract away the complexities of distributed computing, enabling developers to focus on the core logic of their applications. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application. Ray's key features include efficient task scheduling, fault tolerance, and automatic resource management, making it a powerful tool for building a wide range of distributed applications, from ML models to real-time data processing pipelines. With its growing ecosystem of libraries and tools, Ray has become a popular choice for organizations looking to use the power of distributed computing to tackle complex and data-intensive problems.

Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of Ray provides a powerful framework to scale up your generative AI workloads.

In this post, we demonstrate the steps involved in running Ray jobs on SageMaker HyperPod.

Overview of Ray

This section provides a high-level overview of the Ray tools and frameworks for AI/ML workloads. We primarily focus on ML training use cases.

Ray is an open source distributed computing framework designed to run highly scalable and parallel Python applications. Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure through a single, flexible framework, enabling AI workloads from data processing to model training to model serving and beyond.

For distributed jobs, Ray provides intuitive tools for parallelizing and scaling ML workflows. It allows developers to focus on their training logic without the complexities of resource allocation, task scheduling, and inter-node communication.

At a high level, Ray is made up of three layers:

  • Ray Core: The foundation of Ray, providing primitives for parallel and distributed computing
  • Ray AI libraries:
    • Ray Train – A library that simplifies distributed training by offering built-in support for popular ML frameworks like PyTorch, TensorFlow, and Hugging Face
    • Ray Tune – A library for scalable hyperparameter tuning
    • Ray Serve – A library for distributed model deployment and serving
  • Ray clusters: A distributed computing platform where worker nodes run user code as Ray tasks and actors, often in the cloud

In this post, we dive deep into running Ray clusters on SageMaker HyperPod. A Ray cluster consists of a single head node and a number of connected worker nodes. The head node orchestrates task scheduling, resource allocation, and communication between nodes. The Ray worker nodes execute the distributed workloads using Ray tasks and actors, such as model training or data preprocessing.
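
The following minimal, self-contained sketch shows the two primitives in action. You can run it on a laptop, or on a cluster node where ray.init() attaches to the running cluster.

# Minimal Ray example: a remote task and a remote actor.
import ray

ray.init()  # starts a local Ray instance, or attaches to an existing cluster

@ray.remote
def square(x: int) -> int:
    return x * x

@ray.remote
class Counter:
    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> int:
        self.value += 1
        return self.value

print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]
counter = Counter.remote()
print(ray.get(counter.increment.remote()))            # 1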

Ray clusters and Kubernetes clusters pair well together. By running a Ray cluster on Kubernetes using the KubeRay operator, both Ray users and Kubernetes administrators benefit from a smooth path from development to production. For this use case, we use a SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS).

The KubeRay operator lets you run a Ray cluster on a Kubernetes cluster. KubeRay creates the following custom resource definitions (CRDs):

  • RayCluster – The primary resource for managing Ray instances on Kubernetes. The nodes in a Ray cluster manifest as pods in the Kubernetes cluster.
  • RayJob – A single executable job designed to run on an ephemeral Ray cluster. It serves as a higher-level abstraction for submitting tasks or batches of tasks to be executed by the Ray cluster. A RayJob also manages the lifecycle of the Ray cluster, making it ephemeral by automatically spinning up the cluster when the job is submitted and shutting it down when the job is complete.
  • RayService – A Ray cluster and a Serve application that runs on top of it in a single Kubernetes manifest. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.

For the remainder of this post, we don't focus on RayJob or RayService; we focus on creating a persistent Ray cluster to run distributed ML training jobs.

When Ray clusters are paired with SageMaker HyperPod clusters, they unlock enhanced resiliency and auto-resume capabilities, which we dive deeper into later in this post. This combination provides a solution for handling dynamic workloads, maintaining high availability, and providing seamless recovery from node failures, which is crucial for long-running jobs.

Overview of SageMaker HyperPod

In this section, we introduce SageMaker HyperPod and its built-in resiliency features that provide infrastructure stability.

Generative AI workloads such as training, inference, and fine-tuning involve building, maintaining, and optimizing large clusters of thousands of GPU accelerated instances. For distributed training, the goal is to efficiently parallelize workloads across these instances in order to maximize cluster utilization and minimize time to train. For large-scale inference, it's important to minimize latency, maximize throughput, and seamlessly scale across these instances for the best user experience. SageMaker HyperPod is a purpose-built infrastructure that addresses these needs. It removes the undifferentiated heavy lifting involved in building, maintaining, and optimizing a large GPU accelerated cluster. It also provides the flexibility to fully customize your training or inference environment and compose your own software stack. You can use either Slurm or Amazon EKS for orchestration with SageMaker HyperPod.

Due to their massive size and the need to train on large amounts of data, FMs are often trained and deployed on large compute clusters composed of thousands of AI accelerators such as GPUs and AWS Trainium. A single failure in one of these thousands of accelerators can interrupt the entire training process, requiring manual intervention to identify, isolate, debug, repair, and recover the faulty node in the cluster. This workflow can take several hours for each failure, and as the scale of the cluster grows, it's common to see a failure every few days or even every few hours. SageMaker HyperPod provides resiliency against infrastructure failures by applying agents that continuously run health checks on cluster instances, fix unhealthy instances, reload the last valid checkpoint, and resume training, all without user intervention. As a result, you can train your models up to 40% faster. You can also SSH into an instance in the cluster for debugging and gather insights on hardware-level optimization during multi-node training. Orchestrators like Slurm or Amazon EKS facilitate efficient allocation and management of resources, provide optimal job scheduling, monitor resource utilization, and automate fault tolerance.

Solution overview

This section provides an overview of how to run Ray jobs for multi-node distributed training on SageMaker HyperPod. We go over the architecture and the process of creating a SageMaker HyperPod cluster, installing the KubeRay operator, and deploying a Ray training job.

Although this post provides a step-by-step guide to manually create the cluster, feel free to check out the aws-do-ray project, which aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. It uses Docker to containerize the tools necessary to deploy and manage Ray clusters, jobs, and services. In addition to the aws-do-ray project, we'd like to highlight the Amazon SageMaker HyperPod EKS workshop, which offers an end-to-end experience for running various workloads on SageMaker HyperPod clusters. There are several examples of training and inference workloads in the GitHub repository awsome-distributed-training.

As introduced earlier in this post, KubeRay simplifies the deployment and management of Ray applications on Kubernetes. The following diagram illustrates the solution architecture.

SMHP EKS Architecture

Create a SageMaker HyperPod cluster

Prerequisites

Before deploying Ray on SageMaker HyperPod, you need a HyperPod cluster:

If you prefer to deploy HyperPod on an existing EKS cluster, follow the instructions here, which include:

  • EKS cluster – You can associate SageMaker HyperPod compute to an existing EKS cluster that satisfies the set of prerequisites. Alternatively, and recommended, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the GitHub repo for instructions on setting up an EKS cluster.
  • Custom resources – Running multi-node distributed training requires various resources, such as device plugins, Container Storage Interface (CSI) drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to Install packages on the Amazon EKS cluster using Helm for installation instructions.

The following is an example workflow for creating a HyperPod cluster on an existing EKS cluster after deploying the prerequisites. This is for reference only and not required for the quick deploy option.

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        {
            "InstanceGroupName": "head-group",
            "InstanceType": "ml.m5.2xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "${SECURITY_GROUP_ID}"
        ],
        "Subnets": [
            "${SUBNET_ID}"
        ]
    },
    "NodeRecovery": "Automated"
}
EOL

The provided configuration file contains two key highlights:

  • "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs SageMaker HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
  • "NodeRecovery": "Automatic" – Enables SageMaker HyperPod automatic node recovery

You can create a SageMaker HyperPod compute with the following AWS Command Line Interface (AWS CLI) command (AWS CLI version 2.17.47 or newer is required):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
{
"ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                    ListClusters                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                 ClusterSummaries                                                                 ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||                           ClusterArn                           |        ClusterName        | ClusterStatus  |           CreationTime             ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||  arn:aws:sagemaker:us-west-2:xxxxxxxxxxxx:cluster/zsmyi57puczf |         ml-cluster        |   InService     |  2025-03-03T16:45:05.320000+00:00  ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|

Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status for the nodes transitions to Running.
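
If you prefer to check programmatically, a boto3 sketch along these lines works as well. The cluster name comes from the configuration file above; verify the response field names against your boto3 version.

# Sketch: check HyperPod cluster and node status with boto3.
import boto3

sm = boto3.client("sagemaker")
print(sm.describe_cluster(ClusterName="ml-cluster")["ClusterStatus"])

for node in sm.list_cluster_nodes(ClusterName="ml-cluster")["ClusterNodeSummaries"]:
    print(node["InstanceGroupName"], node["InstanceId"], node["InstanceStatus"]["Status"])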

Create an FSx for Lustre shared file system

To deploy the Ray cluster, we need the SageMaker HyperPod cluster to be up and running, and we also need a shared storage volume (for example, an Amazon FSx for Lustre file system). This is a shared file system that the SageMaker HyperPod nodes can access. The file system can be provisioned statically before launching your SageMaker HyperPod cluster or dynamically afterwards.

Specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it's required for multi-node clusters. Using a local path raises an error during checkpointing for multi-node clusters.
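
As a concrete illustration, a Ray Train entry point can point its run storage at the shared file system. The /fsx/ray-results mount path and the worker count below are assumptions; use whatever path your FSx for Lustre volume is mounted at.

# Sketch: direct Ray Train checkpoints and results to the shared FSx for Lustre mount.
import ray.train
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Real training code goes here; report at least one metric or checkpoint per epoch.
    ray.train.report({"loss": 0.0})


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),              # assumed worker count
    run_config=RunConfig(name="demo-run", storage_path="/fsx/ray-results"), # shared storage path
)
result = trainer.fit()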

The Amazon FSx for Lustre CSI driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.

Create an IAM OIDC identity provider for your cluster with the following command:

eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve

Deploy the FSx for Lustre CSI driver:

helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
  --namespace kube-system

This Helm chart includes a service account named fsx-csi-controller-sa that gets deployed in the kube-system namespace.

Use the eksctl CLI to create an AWS Identity and Access Management (IAM) role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS managed policy:

eksctl create iamserviceaccount \
  --name fsx-csi-controller-sa \
  --override-existing-serviceaccounts \
  --namespace kube-system \
  --cluster $EKS_CLUSTER_NAME \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
  --approve \
  --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
  --region $AWS_REGION

The --override-existing-serviceaccounts flag lets eksctl know that the fsx-csi-controller-sa service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the existing service account instead.

Annotate the driver's service account with the Amazon Resource Name (ARN) of the AmazonEKSFSxLustreCSIDriverFullAccess IAM role that was created:

SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)

kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
  eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true

This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.

Verify that the service account has been properly annotated:

kubectl get serviceaccount -n kube-system fsx-csi-controller-sa -o yaml

Restart the fsx-csi-controller deployment for the changes to take effect:

kubectl rollout restart deployment fsx-csi-controller -n kube-system

The FSx for Lustre CSI driver presents you with two options for provisioning a file system:

  • Dynamic provisioning – This option uses Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with the desired storage specifications. The CSI driver automatically provisions the FSx for Lustre file system for you based on the PVC request. This allows for straightforward scaling and eliminates the need to manually create file systems.
  • Static provisioning – In this method, you manually create the FSx for Lustre file system before using the CSI driver. You will need to configure details like the subnet ID and security groups for the file system. Then, you can use the driver to mount this pre-created file system in your container as a volume.

For this example, we use dynamic provisioning. Start by creating a storage class that uses the fsx.csi.aws.com provisioner:

cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: ${SUBNET_ID}
  securityGroupIds: ${SECURITYGROUP_ID}
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.12"
mountOptions:
  - flock
EOF

kubectl apply -f storageclass.yaml

  • SUBNET_ID: The subnet ID in which the FSx for Lustre file system will be created. It should be the same private subnet that was used for HyperPod creation.
  • SECURITYGROUP_ID: The security group IDs that will be attached to the file system. It should be the same security group ID that is used in HyperPod and EKS.

Next, create a persistent volume claim (PVC) named fsx-claim that uses the fsx-sc storage class:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF

kubectl apply -f pvc.yaml

This PVC will start the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.

Create the Ray cluster

Now that we have both the SageMaker HyperPod cluster and the FSx for Lustre file system created, we can set up the Ray cluster:

  1. Set up dependencies. We will create a new namespace in our Kubernetes cluster and install the KubeRay operator using a Helm chart.

We recommend using KubeRay operator version 1.2.0 or higher, which supports automatic Ray Pod eviction and replacement in case of failures (for example, hardware issues on EKS or SageMaker HyperPod nodes).

# Create KubeRay namespace
kubectl create namespace kuberay
# Deploy the KubeRay operator with the Helm chart repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and KubeRay operator v1.2.0
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0 --namespace kuberay
# KubeRay operator pod will be deployed onto head pod
kubectl get pods --namespace kuberay

  1. Create a Ray container image for the Ray cluster manifest. With the recent deprecation of the `rayproject/ray-ml` images starting from Ray version 2.31.0, it's necessary to create a custom container image for our Ray cluster. Therefore, we'll build on top of the `rayproject/ray:2.42.1-py310-gpu` image, which has all necessary Ray dependencies, and include our training dependencies to build our own custom image. Feel free to modify this Dockerfile as you wish.

First, create a Dockerfile that builds upon the base Ray GPU image and contains only the required dependencies:

cat <<'EOF' > Dockerfile

FROM rayproject/ray:2.42.1-py310-gpu
# Install Python dependencies for PyTorch, Ray, Hugging Face, and more
RUN pip install --no-cache-dir \
    torch torchvision torchaudio \
    numpy \
    pytorch-lightning \
    transformers datasets evaluate tqdm click \
    ray[train] ray[air] \
    ray[train-torch] ray[train-lightning] \
    torchdata \
    torchmetrics \
    torch_optimizer \
    accelerate \
    scikit-learn \
    Pillow==9.5.0 \
    protobuf==3.20.3

RUN pip install --upgrade datasets transformers

# Set the user
USER ray
WORKDIR /home/ray

# Verify ray installation
RUN which ray && \
    ray --version

# Default command
CMD [ "/bin/bash" ]

EOF

Then, build and push the image to your container registry (Amazon ECR) using the provided script:

export AWS_REGION=$(aws configure get region)
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

echo "This process may take 10-15 minutes to complete..."

echo "Building image..."

docker build --platform linux/amd64 -t ${REGISTRY}aws-ray-custom:latest .

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "aws-ray-custom" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    aws ecr create-repository --repository-name aws-ray-custom
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REGISTRY

echo "Pushing image to $REGISTRY ..."

# Push image to registry
docker image push ${REGISTRY}aws-ray-custom:latest

Now, our Ray container image is in Amazon ECR with all necessary Ray dependencies, as well as code library dependencies.

  1. Create a Ray cluster manifest. We use a Ray cluster to host our training jobs. The Ray cluster is the primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The Ray cluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS or SageMaker HyperPod cluster.

Note that there are two distinct sections in the cluster manifest. While the `headGroupSpec` defines the head node of the Ray cluster, the `workerGroupSpecs` define the worker nodes of the Ray cluster. While a job could technically run on the head node as well, it's common to separate the head node from the actual worker nodes where jobs are executed. Therefore, the instance for the head node can typically be a smaller instance (for example, we chose an m5.2xlarge). Because the head node also manages cluster-level metadata, it can be beneficial to have it run on a non-GPU node to minimize the risk of node failure (because GPUs can be a potential source of node failure).

cat <<'EOF' > raycluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayml
  labels:
    controller-tools.k8s.io: "1.0"
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    #pod template
    template:
      spec:
        #        nodeSelector:
        #node.kubernetes.io/instance-type: "ml.m5.2xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-head
          image: ${REGISTRY}aws-ray-custom:latest     ## IMAGE: Here you may choose which image your head pod will run
          env:                                ## ENV: Here is where you can send environment variables to the head pod
            - name: RAY_GRAFANA_IFRAME_HOST   ## PROMETHEUS AND GRAFANA
              value: http://localhost:3000
            - name: RAY_GRAFANA_HOST
              value: http://prometheus-grafana.prometheus-system.svc:80
            - name: RAY_PROMETHEUS_HOST
              value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your head pod
              cpu: 1
              memory: 8Gi
            requests:                                    ## REQUESTS: Set resource requests for your head pod
              cpu: 1
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          volumeMounts:                                    ## VOLUMEMOUNTS
          - name: fsx-storage
            mountPath: /fsx
          - name: ray-logs
            mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: fsx-storage
            persistentVolumeClaim:
              claimName: fsx-claim
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 4                                    ## REPLICAS: How many worker pods you want
    minReplicas: 1
    maxReplicas: 10
    # logical group name, for this called gpu-group, also can be helpful
    groupName: gpu-group
    rayStartParams:
      num-gpus: "8"
    #pod template
    template:
      spec:
        #nodeSelector:
        # node.kubernetes.io/instance-type: "ml.p5.48xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-worker
          image: ${REGISTRY}aws-ray-custom:latest             ## IMAGE: Here you may choose which image your worker pods will run
          env:
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32
            requests:                                    ## REQUESTS: Set resource requests for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32
          volumeMounts:                                    ## VOLUMEMOUNTS
          - name: ray-logs
            mountPath: /tmp/ray
          - name: fsx-storage
            mountPath: /fsx
        volumes:
        - name: fsx-storage
          persistentVolumeClaim:
            claimName: fsx-claim
        - name: ray-logs
          emptyDir: {}
EOF

  1. Deploy the Ray cluster:
envsubst < raycluster.yaml | kubectl apply -f -

  1. Optionally, expose the Ray dashboard using port forwarding:
# Gets the name of the Kubernetes service that runs the head pod
export SERVICEHEAD=$(kubectl get service | grep head-svc | awk '{print $1}' | head -n 1)
# Port forwards the dashboard from the head pod service
kubectl port-forward --address 0.0.0.0 service/${SERVICEHEAD} 8265:8265 > /dev/null 2>&1 &

Now, you can go to http://localhost:8265/ to visit the Ray Dashboard.

  1. To launch a training job, there are multiple options:
    1. Use the Ray jobs submission SDK, where you can submit jobs to the Ray cluster through the Ray dashboard port (8265 by default) where Ray listens for job requests. To learn more, see Quickstart using the Ray Jobs CLI.
    2. Execute a Ray job in the head pod, where you exec directly into the head pod and then submit your job. To learn more, see RayCluster Quickstart.

For this example, we use the first method and submit the job through the SDK. Therefore, we simply run from a local environment where the training code is available in --working-dir. Relative to this path, we specify the main training Python script (here, fsdp-ray.py). Within the working-dir folder, we can also include additional scripts we might need to run the training.

The fsdp-ray.py example is located at aws-do-ray/Container-Root/ray/raycluster/jobs/fsdp-ray/fsdp-ray.py in the aws-do-ray GitHub repo.

# Inside jobs/ folder
ray job submit --address http://localhost:8265 --working-dir "fsdp-ray" -- python3 fsdp-ray.py
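If you prefer to stay in Python instead of the CLI, the Ray Jobs Python SDK submits through the same dashboard port. The following is a minimal sketch, assuming the dashboard is port-forwarded to localhost:8265 and reusing the fsdp-ray working directory and script from the command above:

# Sketch: submit the same job through the Ray Jobs Python SDK instead of the CLI.
# Assumes the Ray dashboard is reachable at http://localhost:8265 (see the port-forward step above).
import time
from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://localhost:8265")

job_id = client.submit_job(
    entrypoint="python3 fsdp-ray.py",          # main training script inside the working dir
    runtime_env={"working_dir": "fsdp-ray"},   # uploaded to the cluster at submission time
)

# Poll until the job reaches a terminal state, then print its logs.
while True:
    status = client.get_job_status(job_id)
    if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
        break
    time.sleep(10)
print(client.get_job_logs(job_id))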

For our Python training script to run, we need to make sure our training scripts are correctly set up to use Ray. This includes the following steps:

  • Configure a model to run distributed and on the correct CPU/GPU device
  • Configure a data loader to shard data across the workers and place data on the correct CPU or GPU device
  • Configure a training function to report metrics and save checkpoints
  • Configure scaling and CPU or GPU resource requirements for a training job
  • Launch a distributed training job with a TorchTrainer class

For further details on how to adjust your existing training script to get the most out of Ray, refer to the Ray documentation.
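As an illustration of the first two bullets, the Ray Train utilities below wrap an ordinary PyTorch model and DataLoader so they are placed on the right device and sharded across workers. This is only a minimal sketch with toy data, meant to run inside a training function launched by a TorchTrainer (as in the full example later in this post), not the actual script used here:

# Minimal sketch of a training function that prepares a model and data loader for Ray Train.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import ray.train.torch


def train_func(config):
    # Toy dataset: 256 samples, 4 features, 1 regression target.
    dataset = TensorDataset(torch.randn(256, 4), torch.randn(256, 1))
    # Shards the data loader across workers and moves batches to the right device.
    loader = ray.train.torch.prepare_data_loader(DataLoader(dataset, batch_size=32))
    # Wraps the model in DDP and moves it to the right device.
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for X, y in loader:
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()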

The following diagram illustrates the complete architecture you've built after completing these steps.

Ray on HyperPod EKS Architecture

Implement training job resiliency with the job auto resume functionality

Ray is designed with robust fault tolerance mechanisms to provide resilience in distributed systems where failures are inevitable. These failures generally fall into two categories: application-level failures, which stem from bugs in user code or external system issues, and system-level failures, caused by node crashes, network disruptions, or internal bugs in Ray. To address these challenges, Ray provides tools and strategies that enable applications to detect, recover, and adapt seamlessly, providing reliability and performance in distributed environments. In this section, we look at two of the most common types of failures and how to implement fault tolerance for them, which SageMaker HyperPod complements: Ray Train worker failures and Ray worker node failures.

  • Ray Train worker – This is a worker process specifically used for training tasks within Ray Train, Ray's distributed training library. These workers handle individual tasks or shards of a distributed training job. Each worker is responsible for processing a portion of the data, training a subset of the model, or performing computation during distributed training. They are coordinated by the Ray Train orchestration logic to collectively train a model.
  • Ray worker node – At the Ray level, this is a Ray node in a Ray cluster. It is part of the Ray cluster infrastructure and is responsible for running tasks, actors, and other processes as orchestrated by the Ray head node. Each worker node can host multiple Ray processes that execute tasks or manage distributed objects. At the Kubernetes level, a Ray worker node is a Kubernetes pod that is managed by a KubeRay operator. For this post, we will be talking about the Ray worker nodes at the Kubernetes level, so we will refer to them as pods.

At the time of writing, there are no official updates regarding head pod fault tolerance and auto resume capabilities. Though head pod failures are rare, in the unlikely event of such a failure, you will need to manually restart your training job. However, you can still resume progress from the last saved checkpoint. To minimize the risk of hardware-related head pod failures, it's advised to place the head pod on a dedicated, CPU-only SageMaker HyperPod node, because GPU failures are a common training job failure point.

Ray Train worker failures

Ray Train is designed with fault tolerance to handle worker failures, such as RayActorErrors. When a failure occurs, the affected workers are stopped, and new ones are automatically started to maintain operations. However, for training progress to continue seamlessly after a failure, saving and loading checkpoints is essential. Without proper checkpointing, the training script will restart, but all progress will be lost. Checkpointing is therefore a critical component of Ray Train's fault tolerance mechanism and needs to be implemented in your code.

Automatic recovery

When a failure is detected, Ray shuts down the failed workers and provisions new ones. While this happens, we can tell the training function to always keep retrying until training can continue. Each instance of recovery from a worker failure is considered a retry. We can set the number of retries through the max_failures attribute of the FailureConfig, which is set in the RunConfig passed to the Trainer (for example, TorchTrainer). See the following code:

from ray.train import RunConfig, FailureConfig
# Tries to recover a run up to this many times.
run_config = RunConfig(failure_config=FailureConfig(max_failures=2))
# No limit on the number of retries.
run_config = RunConfig(failure_config=FailureConfig(max_failures=-1))

For more information, see Handling Failures and Node Preemption.

Checkpoints

A checkpoint in Ray Train is a lightweight interface representing a directory saved either locally or remotely. For example, a cloud-based checkpoint might point to s3://my-bucket/checkpoint-dir, and a local checkpoint might point to /tmp/checkpoint-dir. To learn more, see Saving checkpoints during training.

To save a checkpoint in the training loop, you first need to write your checkpoint to a local directory, which can be temporary. When saving, you can use checkpoint utilities from other frameworks like torch.save, pl.Trainer.save_checkpoint, accelerator.save_model, save_pretrained, tf.keras.Model.save, and more. You then create a checkpoint from the directory using Checkpoint.from_directory. Finally, report the checkpoint to Ray Train using ray.train.report(metrics, checkpoint=...). The metrics reported alongside the checkpoint are used to keep track of the best-performing checkpoints. Reporting will upload the checkpoint to persistent storage.

If you save checkpoints with ray.train.report(..., checkpoint=...) and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage isn't set up. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location.

Finally, clean up the local temporary directory to free up disk space (for example, by exiting the tempfile.TemporaryDirectory context). We can save a checkpoint every epoch or every few iterations.

The following diagram illustrates this setup.

Ray Checkpointing Architecture

The following code is an example of saving checkpoints using native PyTorch:

import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

import ray.train.torch
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    n = 100
    # create a toy dataset
    # data   : X - dim = (n, 4)
    # target : Y - dim = (n, 1)
    X = torch.Tensor(np.random.normal(0, 1, size=(n, 4)))
    Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1)))
    # toy neural network : 1-layer
    # Wrap the model in DDP
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    criterion = nn.MSELoss()

    optimizer = Adam(model.parameters(), lr=3e-4)
    for epoch in range(config["num_epochs"]):
        y = model.forward(X)
        loss = criterion(y, Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        metrics = {"loss": loss.item()}

        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            checkpoint = None

            should_checkpoint = epoch % config.get("checkpoint_freq", 1) == 0
            # In standard DDP training, where the model is the same across all ranks,
            # only the global rank 0 worker needs to save and report the checkpoint
            if train.get_context().get_world_rank() == 0 and should_checkpoint:
                torch.save(
                    model.module.state_dict(),  # NOTE: Unwrap the model.
                    os.path.join(temp_checkpoint_dir, "model.pt"),
                )
                checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

            train.report(metrics, checkpoint=checkpoint)


trainer = TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 5},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()
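The Result object returned by trainer.fit() exposes the final reported metrics and the latest checkpoint, which is a quick way to confirm that checkpoints were actually written. A small sketch continuing from the trainer above:

# Sketch: inspect the outcome of the training run above.
print(result.metrics)       # final metrics dict reported via train.report()
print(result.checkpoint)    # latest reported Checkpoint (None if none was reported)
if result.checkpoint:
    with result.checkpoint.as_directory() as ckpt_dir:
        print("Checkpoint files live in:", ckpt_dir)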

Ray Train also comes with CheckpointConfig, a way to configure checkpointing options:

from ray.train import RunConfig, CheckpointConfig
# Example 1: Only keep the 2 *most recent* checkpoints and delete the others.
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2))
# Example 2: Only keep the 2 *best* checkpoints and delete the others.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        # *Best* checkpoints are determined by these params:
        checkpoint_score_attribute="mean_accuracy",
        checkpoint_score_order="max",
    ),
    # This will store checkpoints on S3.
    storage_path="s3://remote-bucket/location",
)

To restore training state from a checkpoint if your training job were to fail and retry, you should modify your training loop to auto resume and then restore a Ray Train job. By pointing to the path of your saved checkpoints, you can restore your trainer and continue training. Here's a quick example:

from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    path="~/ray_results/dl_trainer_restore",  # Can also be a cloud storage path like S3
    datasets=get_datasets(),
)
result = restored_trainer.fit()

To streamline recovery, you can add auto resume logic to your script. This checks whether a valid experiment directory exists and restores the trainer if available. If not, it starts a new experiment:

from ray import train
from ray.train.torch import TorchTrainer

experiment_path = "~/ray_results/dl_restore_autoresume"
if TorchTrainer.can_restore(experiment_path):
    trainer = TorchTrainer.restore(experiment_path, datasets=get_datasets())
else:
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        datasets=get_datasets(),
        scaling_config=train.ScalingConfig(num_workers=2),
        run_config=train.RunConfig(
            storage_path="~/ray_results",
            name="dl_restore_autoresume",
        ),
    )
result = trainer.fit()

To summarize, to provide fault tolerance and auto resume when using the Ray Train libraries, set your max_failures parameter in the FailureConfig (we recommend setting it to -1 to make sure it will keep retrying until the SageMaker HyperPod node is rebooted or replaced), and make sure you have enabled checkpointing in your code, as shown in the combined sketch that follows.
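Putting these pieces together, a combined configuration might look like the following sketch. The experiment name and storage path are placeholders; the storage path could be an S3 URI or a shared file system such as the /fsx mount defined in the Ray cluster manifest.

# Sketch: RunConfig combining unlimited retries with checkpoint retention.
from ray.train import RunConfig, FailureConfig, CheckpointConfig

run_config = RunConfig(
    name="fsdp-training",                            # hypothetical experiment name
    storage_path="/fsx/ray_results",                 # shared, persistent storage for checkpoints
    failure_config=FailureConfig(max_failures=-1),   # retry until the node is rebooted or replaced
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="loss",
        checkpoint_score_order="min",
    ),
)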

Ray worker pod failures

In addition to the aforementioned mechanisms to recover from Ray Train worker failures, Ray also provides fault tolerance at the worker pod level. When a worker pod fails (this includes scenarios in which the raylet process fails), the running tasks and actors on it will fail, and the objects owned by worker processes of this pod will be lost. In this case, the tasks, actors, and objects fault tolerance mechanisms will start and try to recover the failures using other worker pods.

These mechanisms will be implicitly handled by the Ray Train library. To learn more about the underlying fault tolerance for tasks, actors, and objects (implemented at the Ray Core level), see Fault Tolerance.

In practice, this means that in case of a worker pod failure, the following occurs:

  • If there is a free worker pod in the Ray cluster, Ray will recover the failed worker pod by replacing it with the free worker pod.
  • If there is no free worker pod, but there are free SageMaker HyperPod nodes in the underlying SageMaker HyperPod cluster, Ray will schedule a new worker pod onto one of the free SageMaker HyperPod nodes. This pod will join the running Ray cluster, and the failure will be recovered using this new worker pod.

In the context of KubeRay, Ray worker nodes are represented by Kubernetes pods, and failures at this level can include issues such as pod eviction or preemption caused by software-level factors.

However, another critical scenario to consider is hardware failures. If the underlying SageMaker HyperPod node becomes unavailable because of a hardware issue, such as a GPU error, it will inevitably cause the Ray worker pod running on that node to fail as well. Then the fault tolerance and auto-healing mechanisms of your SageMaker HyperPod cluster start and will reboot or replace the faulty node. After the new healthy node is added to the SageMaker HyperPod cluster, Ray will schedule a new worker pod onto that SageMaker HyperPod node and recover the interrupted training. In this case, both the Ray fault tolerance mechanism and the SageMaker HyperPod resiliency features work together seamlessly and make sure that even in case of a hardware failure, your ML training workload can auto resume and pick up from where it was interrupted.

As you've seen, there are various built-in resiliency and fault-tolerance mechanisms that allow your Ray Train workload on SageMaker HyperPod to recover and auto resume. Because these mechanisms essentially recover by restarting the training job, it's crucial that checkpointing is implemented in the training script. It is also generally advised to save the checkpoints on a shared and persistent path, such as an Amazon Simple Storage Service (Amazon S3) bucket or FSx for Lustre file system.

Clean up

To delete the SageMaker HyperPod cluster created in this post, you can either use the SageMaker AI console or use the following AWS CLI command:

aws sagemaker delete-cluster --cluster-name 

Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
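If you prefer to check programmatically rather than on the console, a quick boto3 call can confirm that the cluster is gone. This is a sketch and assumes your credentials target the same Region as the cluster:

# Sketch: confirm that no SageMaker HyperPod clusters remain in this Region.
import boto3

sagemaker = boto3.client("sagemaker")
clusters = sagemaker.list_clusters()["ClusterSummaries"]
print("Remaining HyperPod clusters:", [c["ClusterName"] for c in clusters])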

If you used the CloudFormation stack to create resources, you can delete it using the following command:

aws cloudformation delete-stack --stack-name 

Conclusion

This post demonstrated how to set up and deploy Ray clusters on SageMaker HyperPod, highlighting key considerations such as storage configuration and fault tolerance and auto resume mechanisms.

Running Ray jobs on SageMaker HyperPod offers a powerful solution for distributed AI/ML workloads, combining the flexibility of Ray with the robust infrastructure of SageMaker HyperPod. This integration provides enhanced resiliency and auto resume capabilities, which are crucial for long-running and resource-intensive tasks. By using Ray's distributed computing framework and the built-in features of SageMaker HyperPod, you can efficiently manage complex ML workflows, especially training workloads as covered in this post. As AI/ML workloads continue to grow in scale and complexity, the combination of Ray and SageMaker HyperPod offers a scalable, resilient, and efficient platform for tackling the most demanding computational challenges in machine learning.

To get started with SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. To learn more about the aws-do-ray framework, refer to the GitHub repo.


About the Authors

Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on the automotive and manufacturing sector, specializing in helping organizations architect, optimize, and scale artificial intelligence and machine learning solutions, with particular expertise in autonomous vehicle technologies. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.

Florian Stahl is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in Artificial Intelligence, Machine Learning, and Generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a Data Scientist, Florian focuses on working with customers in the Autonomous Vehicle space, bringing deep technical expertise to help organizations design and implement sophisticated machine learning solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their machine learning investments on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He partners with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large companies, primarily focusing on silicon and system architecture of AI infrastructure.

Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges.

Amazon SageMaker JumpStart adds fine-tuning support for models in a private model hub https://techtrendfeed.com/?p=725 https://techtrendfeed.com/?p=725#respond Thu, 27 Mar 2025 08:43:21 +0000 https://techtrendfeed.com/?p=725

Amazon SageMaker JumpStart is a machine learning (ML) hub that offers pre-trained models, solution templates, and algorithms to help developers quickly get started with machine learning. Within SageMaker JumpStart, the private model hub feature allows organizations to create their own internal repository of ML models, enabling teams to share and manage models securely within their organization.

Today, we're announcing an enhanced private hub feature with several new capabilities that give organizations greater control over their ML assets. These enhancements include the ability to fine-tune SageMaker JumpStart models directly within the private hub, support for adding and managing custom-trained models, deep linking capabilities for associated notebooks, and improved model version management. These new features streamline the ML workflow by combining the convenience of pre-built solutions with the flexibility of custom development, while maintaining enterprise-grade security and governance.

For enterprise customers, the ability to curate and fine-tune both pre-built and custom models is crucial for successful AI implementation. Model curation provides quality control, compliance, and security while preventing duplicate efforts across teams. When enterprises fine-tune curated models, they can specialize general-purpose solutions for their specific industry needs and gain competitive advantages through improved performance on their proprietary data. Similarly, the ability to fine-tune custom models enables organizations to continuously improve their AI solutions, adapt to changing business conditions, and preserve institutional knowledge, while maintaining cost-efficiency.

A typical enterprise scenario involves centralized data science teams creating foundation models (FMs), evaluating the performance against open source FMs, and iterating on performance. After they develop their custom FM, it can serve as a baseline for the entire organization, and individual departments (such as legal, finance, or customer service) can fine-tune these models using their department-specific data that may be subject to different privacy requirements or access controls. This hub-and-spoke approach to model development maximizes resource efficiency while allowing for specialized optimization at the department level. This comprehensive approach to model management, now supported by the enhanced private hub features in SageMaker JumpStart, enables enterprises to balance standardization with customization while maintaining proper governance and control over their ML assets.

Solution overview

SageMaker JumpStart has introduced several new enhancements to its private model hub feature, allowing administrators greater control and flexibility in managing their organization's ML models. These enhancements include:

  • Fine-tuning of models referenced in the private hub – Administrators can now add models from the SageMaker JumpStart catalog to their private hub and fine-tune them using Amazon SageMaker training jobs, without having to create the models from scratch.
  • Support for custom models – In addition to the pre-trained SageMaker JumpStart models, administrators can now add their own custom-trained models to the private hub and fine-tune them as needed.
  • Deep linking of notebooks – Administrators can now deep link to specific notebooks associated with the models in the private hub, making it straightforward for users to access and work with the models.
  • Updating models in the private hub – The private hub now supports updating models over time as new versions or iterations become available, allowing organizations to stay current with the latest model improvements.

These new capabilities give AWS customers more control over their ML infrastructure and enable faster model deployment and experimentation, while still maintaining the appropriate access controls and permissions within their organization.

In the following sections, we provide guidance on how to use these new private model hub features using the Amazon SageMaker SDK and Amazon SageMaker Studio console.

To learn more about how to manage models using private hubs, see Manage Amazon SageMaker JumpStart foundation model access with private hubs.

Prerequisites

To use the SageMaker Python SDK and run the code associated with this post, you need the following prerequisites:

  • An AWS account that contains your AWS resources
  • An AWS Identity and Access Management (IAM) role with access to SageMaker Studio notebooks
  • SageMaker JumpStart enabled in a SageMaker Studio domain

Create a private hub, curate models, and configure access control

This section provides a step-by-step guide for administrators to create a private hub, curate models, and configure access control for your organization's users.

  1. Because the feature has been integrated in the latest SageMaker Python SDK, to use the model granular access control feature with a private hub, let's first update the SageMaker Python SDK:
    !pip3 install sagemaker --force-reinstall --quiet

  2. Next, import the SageMaker and Boto3 libraries:
    import boto3
    from sagemaker import Session
    from sagemaker.session import Hub

  3. Configure your private hub:
    HUB_NAME="CompanyHub"
    HUB_DISPLAY_NAME="Allowlisted Models"
    HUB_DESCRIPTION="These are allowlisted models taken from the SageMaker Public Hub"
    REGION="" # for example, "us-west-2"

In the preceding code, HUB_NAME specifies the name of your hub. HUB_DISPLAY_NAME is the display name for your hub that will be shown to users in UI experiences. HUB_DESCRIPTION is the description for your hub that will be shown to users.

Use an AWS Region where SageMaker JumpStart is available, as of March 2025: us-west-2, us-east-1, us-east-2, eu-west-1, eu-central-1, eu-central-2, eu-north-1, eu-south-2, me-south-1, me-central-1, ap-south-1, ap-south-2, eu-west-3, af-south-1, sa-east-1, ap-east-1, ap-northeast-2, ap-northeast-3, ap-southeast-3, ap-southeast-4, ap-southeast-5, ap-southeast-7, eu-west-2, eu-south-1, ap-northeast-1, us-west-1, ap-southeast-1, ap-southeast-2, ca-central-1, ca-west-1, cn-north-1, cn-northwest-1, il-central-1, mx-central-1, us-gov-east-1, us-gov-west-1.

  1. Set up a Boto3 client for SageMaker:
    sm_client = boto3.client('sagemaker')
    session = Session(sagemaker_client=sm_client)
    session.get_caller_identity_arn()

  2. Check whether the following policies have already been added to your admin IAM role; if not, you can add them as inline policies (use the Region configured in Step 3):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:GetObjectTagging"
                ],
                "Resource": [
                    "arn:aws:s3:::jumpstart-cache-prod-",
                    "arn:aws:s3:::jumpstart-cache-prod-/*"
                ],
                "Effect": "Allow"
            }
        ]
    }

In addition to setting up IAM permissions for the admin role, you need to scope down permissions for your users so they can't access public contents.

  1. Use the following policy to deny access to the public hub for your users. These can be added as inline policies in the user's IAM role (use the Region configured in Step 3):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": "s3:*",
                "Effect": "Deny",
                "Resource": [
                    "arn:aws:s3:::jumpstart-cache-prod-",
                    "arn:aws:s3:::jumpstart-cache-prod-/*"
                ],
                "Condition": {
                    "StringNotLike": {"s3:prefix": ["*.ipynb", "*/eula.txt"]}
                }
            },
            {
                "Action": "sagemaker:*",
                "Effect": "Deny",
                "Resource": [
                    "arn:aws:sagemaker::aws:hub/SageMakerPublicHub",
                    "arn:aws:sagemaker::aws:hub-content/SageMakerPublicHub/*/*"
                ]
            }
        ]
    }

After you have set up the private hub configuration and permissions, you're ready to create the private hub.

  1. Use the following code to create the private hub within your AWS account in the Region you specified earlier:
    hub = Hub(hub_name=HUB_NAME, sagemaker_session=session)
    
    try:
      hub.create(
          description=HUB_DESCRIPTION,
          display_name=HUB_DISPLAY_NAME
      )
      print(f"Successfully created Hub with name {HUB_NAME} in {REGION}")
    except Exception as e:
      if "ResourceInUse" in str(e):
        print(f"A hub with the name {HUB_NAME} already exists in your account.")
      else:
        raise e

  2. Use describe() to verify the configuration of your hub. After your private hub is set up, you can add a reference to models from the SageMaker JumpStart public hub to your private hub. No model artifacts need to be managed by the customer. The SageMaker team will manage version and security updates. For a list of available models, refer to Built-in Algorithms with pre-trained Model Table.
  3. To search programmatically, run the following command:
    from sagemaker.jumpstart.filters import Or
    
    filter_value = Or(
    "framework == meta",
    "framework == deepseek"
    )
    models = []
    next_token = None
    
    while True:
        response = hub.list_sagemaker_public_hub_models(
            filter=filter_value,
            next_token=next_token
        )
        models.extend(response["hub_content_summaries"])
        next_token = response.get("next_token")
        
        if not next_token:
            break
    print(models)

The filter argument is optional. For a list of filters you can apply, refer to the following GitHub repo.

  1. Use the retrieved models from the preceding command to create model references for your private hub:
    for model in models:
        print(f"Adding {model.get('hub_content_name')} to Hub")
        hub.create_model_reference(model_arn=model.get("hub_content_arn"), 
                                   model_name=model.get("hub_content_name"))

The SageMaker JumpStart private hub provides other useful features for managing and interacting with the curated models. Administrators can inspect the metadata of a specific model using the hub.describe_model(model_name=) command. To list the available models in the private hub, you can use a simple loop:

response = hub.list_models()
models = response["hub_content_summaries"]
while response["next_token"]:
    response = hub.list_models(next_token=response["next_token"])
    models.extend(response["hub_content_summaries"])

for model in models:
    print(model.get('HubContentArn'))
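To round this out, the describe_model call mentioned above can be applied to any of the listed references. A small sketch, assuming at least one model was returned by the loop:

# Sketch: inspect the metadata of one curated model reference (placeholder selection).
details = hub.describe_model(model_name=models[0].get('HubContentName'))
print(details)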

If you need to remove a specific model reference from the private hub, use the following command:

hub.delete_model_reference("")

If you want to delete the private hub from your account and Region, you will need to delete all the HubContents first, then delete the private hub. Use the following code:

for model in models:
    hub.delete_model_reference(model_name=model.get('HubContentName'))
    
hub.delete()

Fine-tune models referenced in the private hub

This section walks through how to interact with allowlisted models in SageMaker JumpStart. We demonstrate how to list the available models, identify a model from the public hub, and fine-tune the model using the SageMaker Python SDK as well as the SageMaker Studio UI.

User experience using the SageMaker Python SDK

To interact with your models using the SageMaker Python SDK, complete the following steps:

  1. Just like the admin process, the first step is to force reinstall the SageMaker Python SDK:
    !pip3 install sagemaker --force-reinstall --quiet

  2. When interacting with the SageMaker SDK functions, add references to the hub_arn:
    model_id="meta-vlm-llama-3-2-11b-vision"
    model_version="2.1.8"
    hub_arn=""
    
    from sagemaker import hyperparameters
    
    my_hyperparameters = hyperparameters.retrieve_default(
        model_id=model_id, model_version=model_version, hub_arn=hub_arn
    )
    print(my_hyperparameters)
    hyperparameters.validate(
        model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters, hub_arn=hub_arn
    )

  3. You can then start a training job by specifying the model ID, version, and hub name:
    from sagemaker.jumpstart.estimator import JumpStartEstimator
    
    estimator = JumpStartEstimator(
        model_id=model_id,
        hub_name=hub_arn,
        model_version=model_version,
        environment={"accept_eula": "false"},  # Please change to {"accept_eula": "true"}
        disable_output_compression=True,
        instance_type="ml.p4d.24xlarge",
        hyperparameters=my_hyperparameters,
    )
    estimator.fit({"training": train_data_location})

For a custom model, see the example notebooks in GitHub.
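After the training job completes, you can deploy the fine-tuned model from the same estimator object. The following is a minimal sketch; the payload format depends on the model you chose, and gated models may also require accepting the EULA at deployment time:

# Sketch: deploy the fine-tuned model, run a quick test inference, then clean up.
predictor = estimator.deploy()  # provisions a real-time endpoint for the fine-tuned model
response = predictor.predict({"inputs": "Describe this image."})  # payload shape is model-dependent
print(response)
predictor.delete_endpoint()     # avoid ongoing charges when you're done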

User experience in SageMaker Studio

Complete the following steps to interact with allowlisted models using SageMaker Studio:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane or in the Prebuilt and automated solutions section.
  2. Choose one of the model hubs you have access to.

If the user has access to multiple hubs, you will see a list of hubs, as shown in the following screenshot.

If the user has access to only one hub, you will be redirected to the model listing.

  1. To fine-tune a model, choose Train (this option will be enabled if it's supported).
  2. Modify your training job configurations like training data, instance type, and hyperparameters, and choose Submit.

Deep link notebooks in the private hub

You can now also access the notebook associated with the model in your curated hub.

  1. Choose your model, then choose Preview notebooks.
  2. Choose Open in JupyterLab to start the deep link workflow.
  3. Select a running JupyterLab space and choose Open notebook.

You will need to upgrade your space to use a SageMaker distribution of at least 2.4.1. For more information on how to upgrade your SageMaker distribution, see Update the SageMaker Distribution Image.

This will automatically open the selected notebook in your JupyterLab instance, with your private HubName inputted into the necessary classes.

Update models in the private hub

Modify your existing private HubContent by calling the new sagemaker:UpdateHubContent API. You can now update an existing HubContent version in-place without needing to delete and re-add it. We don't support updating the HubContentDocument at this time, because backward-incompatible changes can be introduced that fundamentally alter the performance and usage of the model itself. Refer to the public API documentation for more details.

# client here is a Boto3 SageMaker client, for example client = boto3.client("sagemaker")
client.update_hub_content(
    hub_content_name="my-model",
    hub_content_version="1.0.0",
    hub_content_type="Model",
    hub_name="my-hub",
    support_status="DEPRECATED"
)

Additionally, you can modify your ModelReferences by calling the new sagemaker:UpdateHubContentReference API. Refer to the public API documentation for more usage details.

client.update_hub_content_reference(
    hub_content_name="your-model",
    hub_content_type="ModelReference",
    hub_name="my-hub",
    min_version="1.2.0"
)

Conclusion

This post demonstrated the new enhancements to the SageMaker JumpStart private model hub feature, which gives enterprise customers greater control and flexibility in managing their ML assets. The key capabilities introduced include the ability to fine-tune pre-built SageMaker JumpStart models directly within the private hub, support for importing and fine-tuning custom-trained models, deep linking to associated notebooks for streamlined access and collaboration, and improved model version management through APIs. These features enable enterprises to curate a centralized repository of trusted, specialized ML models, while still providing the flexibility for individual teams and departments to fine-tune and adapt these models to their specific needs. The seamless integration with SageMaker Studio further streamlines the model development and deployment workflow, empowering enterprises to accelerate their ML initiatives while maintaining the appropriate security and control over their ML assets.

Now that you've seen how the enhanced private model hub features in Amazon SageMaker JumpStart can give your organization greater control and flexibility over managing your machine learning assets, start using these capabilities to curate a centralized repository of trusted models and accelerate your AI initiatives.


About the Authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Niris Okram is a senior academic research specialist solutions architect at AWS. He has extensive experience working with public, private, and research customers in various fields related to cloud. He is passionate about designing and building systems to accelerate the customer's mission on the AWS Cloud.

Benjamin Crabtree is a software engineer with the Amazon SageMaker and Bedrock teams. He is passionate about democratizing the new and frequent breakthroughs in AI. Ben received his undergraduate degree from the University of Michigan and now lives in Brooklyn, NY.

Banu Nagasundaram leads product, engineering, and strategic partnerships for SageMaker JumpStart, SageMaker's machine learning and GenAI hub. She is passionate about building solutions that help customers accelerate their AI journey and unlock business value.
