{"id":14631,"date":"2026-05-10T15:44:17","date_gmt":"2026-05-10T15:44:17","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=14631"},"modified":"2026-05-10T15:44:17","modified_gmt":"2026-05-10T15:44:17","slug":"maxtext-expands-put-up-coaching-capabilities-introducing-sft-and-rl-on-single-host-tpus","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=14631","title":{"rendered":"MaxText Expands Post-Training Capabilities: Introducing SFT and RL on Single-Host TPUs"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p data-block-key=\"qa03p\">In the rapidly evolving landscape of large language models (LLMs), pre-training is only the first step. To transform a base model into a specialized assistant or a high-performing reasoning engine, post-training is essential. Today, we&#8217;re excited to announce new features in <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/AI-Hypercomputer\/maxtext\">MaxText<\/a> that streamline this process: <b>Supervised Fine-Tuning (SFT)<\/b> and <b>Reinforcement Learning (RL)<\/b>, now available on single-host TPU configurations (such as v5p-8 and v6e-8).<\/p>\n<p data-block-key=\"c3kr5\">By leveraging the power of JAX and the efficiency of the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/google\/tunix\/tree\/main\">Tunix<\/a> library, MaxText provides a high-performance, scalable path for developers to refine their models using the latest post-training techniques. 
You can explore the full documentation for <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/maxtext.readthedocs.io\/en\/maxtext-v0.2.1\/tutorials\/posttraining\/sft.html\">SFT<\/a> and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/maxtext.readthedocs.io\/en\/maxtext-v0.2.1\/tutorials\/posttraining\/rl.html\">RL<\/a> to start your post-training journey on TPUs today.<\/p>\n<h3 data-block-key=\"ft5uk\" id=\"supervised-fine-tuning-(sft):-precision-tuning-made-simple\"><b>Supervised Fine-Tuning (SFT): Precision Tuning Made Simple<\/b><\/h3>\n<p data-block-key=\"ac7tu\">Supervised Fine-Tuning is the primary technique for adapting a pre-trained model to follow specific instructions or excel at niche tasks. With the new single-host SFT support, users can now take an existing MaxText or Hugging Face checkpoint and fine-tune it on labeled datasets with minimal setup.<\/p>\n<p data-block-key=\"fc6fo\"><b>Key Highlights:<\/b><\/p>\n<ul>\n<li data-block-key=\"e959h\"><b>Seamless Integration:<\/b> Native support for Hugging Face datasets (e.g., ultrachat_200k).<\/li>\n<li data-block-key=\"8ml3h\"><b>Flexible Checkpoints:<\/b> Use existing MaxText checkpoints or convert Hugging Face models (like Gemma 3) directly within the ecosystem.<\/li>\n<li data-block-key=\"9cds9\"><b>Optimized Execution:<\/b> Powered by Tunix, a JAX-based library specifically designed for post-training efficiency.<\/li>\n<\/ul>\n<h3 data-block-key=\"7na1g\" id=\"reinforcement-learning-(rl):-advancing-reasoning-capabilities\"><b>Reinforcement Learning (RL): Advancing Reasoning Capabilities<\/b><\/h3>\n<p data-block-key=\"8cols\">For tasks requiring complex logic and reasoning\u2014such as math or coding\u2014Reinforcement Learning is a game-changer. MaxText now supports several state-of-the-art RL algorithms on single-host TPUs, using <b>vLLM<\/b> for high-throughput inference during the training loop. 
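To make the group-based methods described next concrete, here is a minimal, illustrative Python sketch of their two core computations: the group-relative advantage that lets GRPO drop the separate value model, and the clipped sequence-level importance ratio at the heart of GSPO. The function names are hypothetical and are not the Tunix or MaxText API.

```python
# Illustrative sketch only -- these helpers are hypothetical and NOT part of
# the MaxText/Tunix API; they just show the core arithmetic of the methods.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each response's reward against the
    mean/std of the group sampled for the same prompt (no value model)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_sequence_ratio(ratio, eps=0.2):
    """GSPO-style clipping: a PPO-like clip applied to one importance
    ratio per sequence rather than one ratio per token."""
    return max(1.0 - eps, min(ratio, 1.0 + eps))

# Four responses sampled for one prompt; reward 1.0 = correct, 0.0 = wrong.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Above-mean responses receive positive advantage and below-mean responses
# negative, pushing the policy toward the better responses in each group.
```

The real updates in Tunix operate on token log-probabilities and batched JAX arrays; this sketch only mirrors the normalization and clipping logic described in the text.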
For example,<\/p>\n<ol>\n<li data-block-key=\"9008o\"><b>Group Relative Policy Optimization (GRPO)<\/b> GRPO is a memory-efficient variant of PPO (Proximal Policy Optimization). It eliminates the need for a separate value-function model, instead generating multiple responses per prompt and computing relative advantages within the group. This significantly reduces the hardware footprint, making advanced RL accessible on a single TPU host.<\/li>\n<li data-block-key=\"2868s\"><b>Group Sequence Policy Optimization (GSPO)<\/b> GSPO focuses on sequence-level importance ratios and clipping. It improves training stability and efficiency by rewarding model behavior at the sequence level, making it particularly effective for improving performance on benchmarks like GSM8K.<\/li>\n<\/ol>\n<h3 data-block-key=\"r5tvs\" id=\"getting-started\"><b>Getting Started<\/b><\/h3>\n<p data-block-key=\"42c9\">To begin using these new features, ensure you have the latest post-training dependencies installed:<\/p>\n<\/div>\n<div>\n<pre><code class=\"language-shell\">uv pip install maxtext[tpu-post-train]==0.2.1 --resolution=lowest&#13;\ninstall_maxtext_tpu_post_train_extra_deps<\/code><\/pre>\n<p>\n        Shell\n    <\/p>\n<\/div>\n<div>\n<h4 data-block-key=\"u2bfq\" id=\"running-sft:\"><b>Running SFT:<\/b><\/h4>\n<p data-block-key=\"6hb1l\">You can launch an SFT run using the train_sft module, specifying your model, dataset, and output directory:<\/p>\n<\/div>\n<div>\n<pre><code class=\"language-shell\">python3 -m maxtext.trainers.post_train.sft.train_sft \\\\&#13;\n   model_name=${MODEL?} \\\\&#13;\n   load_parameters_path=${MAXTEXT_CKPT_PATH?} \\\\&#13;\n   run_name=${RUN_NAME?} \\\\&#13;\n   base_output_directory=${BASE_OUTPUT_DIRECTORY?}<\/code><\/pre>\n<p>\n        Shell\n    <\/p>\n<\/div>\n<div>\n<h4 data-block-key=\"t6ar5\" id=\"\"><b>Running RL (GRPO\/GSPO):<\/b><\/h4>\n<p data-block-key=\"867d5\">For RL, the train_rl module 
handles loading the policy and reference models, executes the training, and provides automated evaluation on reasoning benchmarks:<\/p>\n<\/div>\n<div>\n<pre><code class=\"language-shell\">python3 -m maxtext.trainers.post_train.rl.train_rl \\\\&#13;\n  model_name=${MODEL?} \\\\&#13;\n  load_parameters_path=${MAXTEXT_CKPT_PATH?} \\\\&#13;\n  run_name=${RUN_NAME?} \\\\&#13;\n  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \\\\&#13;\n  loss_algo=gspo-token \\\\&#13;\n  chips_per_vm=${CHIPS_PER_VM?}<\/code><\/pre>\n<p>\n        Shell\n    <\/p>\n<\/div>\n<div>\n<h3 data-block-key=\"86gcl\" id=\"what's-next\"><b>What\u2019s Next?<\/b><\/h3>\n<p data-block-key=\"a5onj\">While single-host support provides a powerful entry point for many developers, MaxText is built for scale. These same workflows are designed to transition seamlessly to multi-host configurations for those training larger models on massive datasets. Stay tuned for more updates from us in this direction.<\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>In the rapidly evolving landscape of large language models (LLMs), pre-training is only the first step. To transform a base model into a specialized assistant or a high-performing reasoning engine, post-training is essential. 
Today, we&#8217;re excited to announce new features in MaxText that streamline this process: Supervised Fine-Tuning (SFT) and Reinforcement Learning [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":14633,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[56],"tags":[610,3550,979,9025,9026,9027,9028,7308],"class_list":["post-14631","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","tag-capabilities","tag-expands","tag-introducing","tag-maxtext","tag-posttraining","tag-sft","tag-singlehost","tag-tpus"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14631","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=14631"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14631\/revisions"}],"predecessor-version":[{"id":14632,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/14631\/revisions\/14632"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/14633"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=14631"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=14631"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=14631"}],"c
uries":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}