MaxText Expands Post-Training Capabilities: Introducing SFT and RL on Single-Host TPUs

In the rapidly evolving landscape of large language models (LLMs), pre-training is only the first step. To transform a base model into a specialized assistant or a high-performing reasoning engine, post-training is essential. Today, we're excited to announce new features in MaxText that streamline this process: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are now available on single-host TPU configurations (such as v5p-8 and v6e-8).

By leveraging the power of JAX and the efficiency of the Tunix library, MaxText provides a high-performance, scalable path for developers to refine their models using the latest post-training techniques. You can explore the full documentation for SFT and RL to start your post-training journey on TPUs today.

Supervised Fine-Tuning (SFT): Precision Tuning Made Easy

Supervised Fine-Tuning is the primary method for adapting a pre-trained model to follow specific instructions or excel at niche tasks. With the new single-host SFT support, users can now take an existing MaxText or Hugging Face checkpoint and fine-tune it on labeled datasets with minimal setup.

Key Highlights:

Seamless Integration: Native support for Hugging Face datasets (e.g., ultrachat_200k).
Flexible Checkpoints: Use existing MaxText checkpoints or convert Hugging Face models (like Gemma 3) directly within the ecosystem.
Optimized Execution: Powered by Tunix, a JAX-based library specifically designed for post-training efficiency.

Reinforcement Learning (RL): Advancing Reasoning Capabilities

For tasks requiring complex logic and reasoning, such as math or coding, Reinforcement Learning is a game-changer. MaxText now supports several state-of-the-art RL algorithms on single-host TPUs, using vLLM for high-throughput inference during the training loop. For example:

Group Relative Policy Optimization (GRPO): GRPO is a memory-efficient variant of PPO (Proximal Policy Optimization). It eliminates the need for a separate value function model, instead generating multiple responses per prompt and calculating relative advantages within the group. This significantly reduces the hardware footprint, making advanced RL accessible on a single TPU host.
Group Sequence Policy Optimization (GSPO): GSPO focuses on sequence-level importance ratios and clipping. It improves training stability and efficiency by rewarding model behavior at the sequence level, making it particularly effective for improving performance on benchmarks like GSM8K.
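
To make the two ideas above concrete, here is a minimal pure-Python sketch of group-relative advantages (GRPO) and a length-normalized sequence-level importance ratio with clipping (GSPO). It is an illustration only, not the MaxText or Tunix implementation: the function names, the reward values, and the clip range of 0.2 are all assumptions chosen for readability.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each response relative to the other responses for the same prompt.

    No value model is needed: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_importance_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level ratio: exp(mean_t(log p_new - log p_old))."""
    diffs = [n - o for n, o in zip(new_logprobs, old_logprobs)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_sequence_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipping applied once per sequence rather than per token.

    clip_eps=0.2 is an illustrative value, not a recommended setting.
    """
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Hypothetical rewards for four sampled responses to one prompt
# (for example, 1.0 when the final answer is correct, else 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical per-token log-probabilities for one response under the
# current policy and under the policy that generated the sample.
ratio = sequence_importance_ratio([-0.9, -1.8, -0.4], [-1.0, -2.0, -0.5])
objective = clipped_sequence_objective(ratio, advantages[0])
print(advantages, ratio, objective)
```

Responses that beat the group average receive positive advantages and the rest receive negative ones, so the only extra cost over plain sampling is generating several responses per prompt.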

Getting Started

To begin using these new features, ensure you have the latest post-training dependencies installed:

uv pip install maxtext[tpu-post-train]==0.2.1 --resolution=lowest
install_maxtext_tpu_post_train_extra_deps


Running SFT:

You can launch an SFT run using the train_sft module, specifying your model, dataset, and output directory:

python3 -m maxtext.trainers.post_train.sft.train_sft \
   model_name=${MODEL?} \
   load_parameters_path=${MAXTEXT_CKPT_PATH?} \
   run_name=${RUN_NAME?} \
   base_output_directory=${BASE_OUTPUT_DIRECTORY?}

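
The `${VAR?}` expansions in the command above make the shell stop with an error when a variable is unset, so a run never starts with a blank argument. If you prefer to launch from Python, a similar check might look like the sketch below. The helper name `build_sft_command` and the placeholder values are hypothetical and not part of MaxText; only the module path and the four argument names come from the command above.

```python
import shlex

SFT_MODULE = "maxtext.trainers.post_train.sft.train_sft"

# Config key -> environment variable, mirroring the shell command.
SFT_ARGS = {
    "model_name": "MODEL",
    "load_parameters_path": "MAXTEXT_CKPT_PATH",
    "run_name": "RUN_NAME",
    "base_output_directory": "BASE_OUTPUT_DIRECTORY",
}


def build_sft_command(env):
    """Return the argv list for an SFT run, failing early on unset variables."""
    missing = [var for var in SFT_ARGS.values() if var not in env]
    if missing:
        raise KeyError("unset variables: " + ", ".join(missing))
    args = [f"{key}={env[var]}" for key, var in SFT_ARGS.items()]
    return ["python3", "-m", SFT_MODULE, *args]


# Placeholder values for illustration only; substitute your own.
example_env = {
    "MODEL": "my-model",
    "MAXTEXT_CKPT_PATH": "/path/to/checkpoint",
    "RUN_NAME": "sft-demo",
    "BASE_OUTPUT_DIRECTORY": "/path/to/output",
}
print(shlex.join(build_sft_command(example_env)))
```

In practice you could pass `os.environ` instead of `example_env` and hand the returned list to `subprocess.run`.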

Running RL (GRPO/GSPO):

For RL, the train_rl module handles the loading of policy and reference models, executes the training, and provides automated evaluation on reasoning benchmarks:

python3 -m maxtext.trainers.post_train.rl.train_rl \
  model_name=${MODEL?} \
  load_parameters_path=${MAXTEXT_CKPT_PATH?} \
  run_name=${RUN_NAME?} \
  base_output_directory=${BASE_OUTPUT_DIRECTORY?} \
  loss_algo=gspo-token \
  chips_per_vm=${CHIPS_PER_VM?}


What's Next?

While single-host support provides a powerful entry point for many developers, MaxText is built for scale. These same workflows are designed to transition seamlessly to multi-host configurations for those training larger models and using massive datasets. Please stay tuned for more updates from us in this direction.