The art and science of hyperparameter optimization on Amazon Nova Forge

Large language models (LLMs) deliver strong results on general tasks, but they often struggle with specialized work that requires understanding proprietary data, internal processes, or domain-specific terminology. Amazon Nova Forge addresses this by enabling you to build your own frontier models using Amazon Nova. You can start development from early model checkpoints, blend proprietary data with Amazon Nova-curated training data, and host custom models securely on AWS. A key capability is data mixing, which blends your training data with curated datasets. This helps the model absorb your domain while retaining broad reasoning, instruction-following, and language capabilities, and it prevents the catastrophic forgetting that often undermines domain customization.

Successful customization requires careful hyperparameter tuning. Learning rate, data mixing ratio, checkpoint selection, and training techniques all interact in ways that can silently undermine a training run. If any of them is wrong, you trade one problem for another. This post covers the art (strategic trade-offs) and science (metric-driven decisions) of hyperparameter tuning on Amazon Nova Forge to help you avoid expensive failed training runs.

Fine-tuning for domain-specific tasks means improving performance in one area without degrading the model’s general capabilities, and getting that balance right is harder than it looks. This post walks through how to navigate that balance, from selecting the right customization technique for your data and task, to configuring the training parameters that most influence outcomes, like learning rate, batch size, and checkpointing. We also cover the common mistakes that lead to wasted training runs and how to catch them early, so you can improve domain performance without degrading general capabilities or burning through compute on avoidable failures.

By the end, you’ll know how to improve domain performance without degrading general capabilities and how to avoid the expensive failures that come from getting the balance wrong.

The hyperparameter tuning challenge

Achieving this balance is harder than it looks. Three fundamental challenges make hyperparameter tuning particularly difficult on domain-specialized models.

Challenge 1: Catastrophic forgetting

When you train a model on narrow domain data, the model can overwrite general capabilities it learned during pre-training. This phenomenon, called catastrophic forgetting, shows up as degraded performance on tasks outside your training domain. The model becomes highly specialized but loses instruction-following ability, reasoning capability, and broad knowledge. In production, this means a customer service model fine-tuned on your support tickets might no longer reason about ambiguous requests or maintain coherent multi-turn conversations.

This creates a stability-flexibility trade-off. Ideally, the model is flexible enough to learn an organization’s domain but stable enough to retain general capabilities. Nova Forge addresses this through data mixing, which blends your training data with curated datasets during training, and checkpoint selection, which lets you choose how much existing alignment to preserve.

Challenge 2: Finding the right learning rate

The learning rate controls how much the model’s weights change in response to each batch of training examples. It’s the most sensitive hyperparameter across all customization techniques. A learning rate that’s too high causes the model to overshoot the optimal state, destabilize during training, or forget base capabilities rapidly. A learning rate that’s too low wastes compute on very slow convergence. The right value depends on your data distribution, mixing ratio, and training technique.

Nova Forge provides calibrated service defaults for each training technique that account for these interactions. When you use data mixing, the sensitivity increases further. Deviating from the default learning rate when mixing Nova data with your own data is the most common source of training instability, so these service defaults are the recommended starting point.

Challenge 3: Baseline performance constraints

Reinforcement fine-tuning (RFT) is a technique that improves model behavior by generating multiple candidate responses and scoring them against quality criteria. The model learns by comparing its own outputs and reinforcing the better ones. RFT works best within a specific range of baseline task accuracy, measured by how often the model produces correct or high-quality responses before fine-tuning. If baseline accuracy is too low (the model rarely produces correct responses), there aren’t enough good examples for reward-guided exploration to learn from. If baseline accuracy is already very high, additional training yields diminishing returns and risks degrading existing performance. This means RFT can’t close large competence gaps where the model fundamentally lacks the knowledge or reasoning ability to attempt a task. It refines and strengthens behaviors the model can already partially exhibit, rather than teaching entirely new capabilities from scratch.

The Nova Forge pipeline addresses both bounds. For low-baseline scenarios, run supervised fine-tuning (SFT) first to establish the foundational capabilities needed for effective reward-based learning. For high-baseline tasks, make sure that your reward function has discriminative power within the model’s quality range. If most responses already score highly, RFT has no meaningful signal to optimize against.

The Nova Forge customization pipeline

Understanding these challenges frames how the Amazon Nova Forge customization pipeline is designed to address them. Nova Forge provides three complementary customization techniques, each serving a distinct purpose in the model development lifecycle.

Technique	What it does	When to use	Input data
Continued pre-training (CPT)	Expands foundation model (FM) knowledge through self-supervised learning on large quantities of unlabeled, domain-specific proprietary data. CPT teaches the model domain terminology and patterns from your text corpus.	You need the model to understand specialized vocabulary, industry concepts, or organizational knowledge that doesn’t exist in the base model.	Large volumes of unlabeled domain text. Nova Forge supports CPT with data mixing and three checkpoint options (pre-trained, mid-trained, and post-trained), each suited to different data scales and downstream requirements.
Supervised fine-tuning (SFT)	Customizes model behavior using a training dataset of input-output pairs specific to your target tasks. SFT teaches the model “given X, output Y” behavior through demonstrations.	You need the model to follow specific response formats, adopt particular tones, or perform structured tasks like classification or extraction.	1,000–10,000 high-quality demonstrations per task. Quality, consistency, and diversity matter more than volume. Nova Forge supports SFT with data mixing using Amazon Nova-curated datasets, including reasoning-instruction-following categories that preserve general capabilities.
Reinforcement fine-tuning (RFT)	Steers model output toward preferred outcomes using reward signals. RFT optimizes the model within a behavioral neighborhood established by prior training for single-turn or multi-turn conversational tasks.	You have a clear reward function that can evaluate response quality and want to push performance beyond what SFT alone achieves.	Prompts and a reward function. Nova Forge supports bringing your own external reward environment through AWS Lambda, enabling custom verification logic for domain-specific quality assessment.

When all three stages are used together (CPT, then SFT, then RFT), they produce the strongest results. However, each stage is optional, and the right pipeline depends on your data availability, task type, and starting point. CPT is only needed when the base model lacks domain vocabulary or knowledge your task requires. SFT and RFT can be used independently or combined depending on what your task demands.

Figure 1: The Amazon Nova Forge customization pipeline. CPT teaches domain knowledge from unlabeled text, SFT teaches task-specific behavior from demonstrations, and RFT optimizes performance using reward signals. Each stage is optional, and the full pipeline (CPT, then SFT, then RFT) produces the strongest results when all three are applicable to your use case.

Amazon SageMaker AI offers different environments for customization: SageMaker Serverless provides a UI-driven experience with automatic compute provisioning, SageMaker AI training jobs (SMTJ) provide a fully managed experience without cluster management, and Amazon SageMaker HyperPod offers specialized environments for advanced distributed training scenarios.

Strategic decisions

With the customization pipeline in view, the next step is understanding the qualitative trade-offs that shape your configuration. These strategic decisions matter as much as any individual hyperparameter value: checkpoint selection, data mixing, and training mode.

Checkpoint selection (most impactful decision)

For CPT, checkpoint selection is more impactful than any hyperparameter. Amazon Nova Forge provides three checkpoint options, each suited to different data scales and downstream requirements.

Pre-trained checkpoints are the most flexible and offer the fastest convergence. These checkpoints accept new patterns readily and work best for large-scale CPT with substantial token budgets exceeding 100 billion tokens. When using pre-trained checkpoints with large datasets, you can use a higher learning rate (such as 1e-4) to accelerate knowledge absorption. You then need to gradually reduce the learning rate back to approximately 1e-6 for model stability before running SFT, which lets the model “settle” into what it learned without overshooting. Be aware that pre-trained checkpoints have no instruction tuning. After CPT, you must run SFT to make the model useful for downstream tasks.
Mid-trained checkpoints balance flexibility and alignment. They accept domain knowledge while retaining some instruction-following behavior. Use mid-trained checkpoints for medium-sized datasets where you want faster domain adaptation than post-trained but more stability than pre-trained. Mid-trained checkpoints work well for Full Rank training, which updates every parameter in the model during fine-tuning, with large, structured datasets.
Post-trained checkpoints are the most resistant to new patterns but preserve instruction-following and general capabilities. Use post-trained checkpoints for smaller-scale CPT where preserving alignment matters more than maximizing domain knowledge absorption. Post-trained checkpoints are the recommended starting point for LoRA (Low-Rank Adaptation), which freezes the original model weights and trains small adapter matrices on top, and for other parameter-efficient fine-tuning methods, because they maintain the model’s existing capabilities while allowing targeted adaptation. For small datasets or later-stage checkpoints, use conservative learning rate values from the service defaults.

Figure 2: Checkpoint selection for continued pre-training. Pre-trained checkpoints offer maximum flexibility for large datasets but require SFT afterward to restore instruction-following. Post-trained checkpoints preserve alignment and suit smaller datasets or parameter-efficient methods like LoRA.

Data mixing strategy

Without data mixing, training on narrow domain data can cause the model to become unstable, resulting in erratic training behavior (gradient instability or loss spikes) or a sudden degradation in performance.

When configuring data mixing, keep your customer data at around 50 percent of the total mix for most use cases. For SFT, always include the “reasoning-instruction-following” category in your Nova data mix. This single category significantly improves general benchmark performance after fine-tuning. Skipping this category is a common cause of degraded reasoning performance in fine-tuned models.

Data mixing is very sensitive to learning rate. Deviating from the default learning rate when using data mixing causes instability. This is the most common mistake practitioners make. If you observe training instability with data mixing, the learning rate is the first suspect.

Finding the optimal mixing ratio requires experimentation. Keep your domain data constant and vary the Nova data proportion across multiple runs. Domain performance typically stays constant while general capabilities keep improving as more Nova data is mixed in. Place your highest-quality data toward the end of training for better convergence.
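The sweep described above comes down to simple arithmetic: with the domain sample count fixed, each target Nova share implies a specific amount of Nova data. The following is a minimal sketch of that arithmetic only. The function names and the sample-count framing are assumptions for illustration and do not reflect the Nova Forge recipe format.

```python
def nova_samples_for_fraction(domain_samples: int, nova_fraction: float) -> int:
    """Nova samples needed so Nova data makes up `nova_fraction` of the total mix.

    Hypothetical helper: solves nova / (nova + domain) = nova_fraction for nova.
    """
    if not 0.0 <= nova_fraction < 1.0:
        raise ValueError("nova_fraction must be in [0, 1)")
    return round(domain_samples * nova_fraction / (1.0 - nova_fraction))


def mixing_sweep(domain_samples: int, nova_fractions: list[float]) -> list[dict]:
    """Build one run plan per Nova share while holding domain data constant."""
    plan = []
    for fraction in nova_fractions:
        nova = nova_samples_for_fraction(domain_samples, fraction)
        plan.append({
            "nova_fraction": fraction,
            "domain_samples": domain_samples,
            "nova_samples": nova,
            "total_samples": domain_samples + nova,
        })
    return plan


if __name__ == "__main__":
    # Illustrative numbers only: 10,000 domain samples, three Nova shares.
    for run in mixing_sweep(10_000, [0.25, 0.5, 0.75]):
        print(run)
```

Comparing domain and general benchmark scores across the planned runs then shows where additional Nova data stops improving general capabilities.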

Training mode: Low-Rank Adaptation (LoRA) vs Full Rank

Amazon Nova Forge supports two training modes that determine how model parameters are updated during training:

LoRA updates only adapter layers, offering lower compute costs, faster iteration, and compatibility with on-demand inference. LoRA achieves near Full Rank performance for most tasks while being more forgiving of suboptimal hyperparameters. The default alpha scaling factor of 64 works for most tasks. Increase alpha if LoRA is under-adapting to your data, or decrease it if LoRA is over-adapting and losing general capabilities. Use post-trained checkpoints as your starting point for LoRA training.
Full Rank updates all model parameters, providing maximum adaptation capacity. Full Rank requires Amazon Bedrock Provisioned Throughput for deployment (On-Demand is only available for LoRA-based customization) and more compute during training. Use Full Rank when you have validated your pipeline and your deployment architecture justifies the additional cost. Mid-trained checkpoints work well for Full Rank training with large, structured datasets.

Start with LoRA to validate your pipeline, data quality, and reward function (for RFT). Graduate to Full Rank when you have confirmed the approach works and your production requirements justify it (for example, model performance or cost constraints).

Recommended workflow

Applying these strategic decisions to your specific situation depends on what data and goals you have. The following paths map your starting conditions to the right sequence of techniques.

If you have labeled demonstrations and a verifiable reward function (SFT then RFT):

Start with SFT using LoRA to teach the target behavior and establish baseline competency.
Enable data mixing with “reasoning-instruction-following” included to preserve the model’s ability to follow structured prompts and produce well-formatted outputs during domain adaptation.
Use default learning rates without modification.
Monitor validation loss to select the best SFT checkpoint.
Graduate to RFT on the SFT checkpoint to optimize further through reward signals.
Consider Full Rank training only after validating the approach with LoRA.
Test thoroughly on both your domain task and general benchmarks before production deployment (see the Experiments and insights section for an example).

If you can define verifiable outcomes but cannot easily label responses at scale (RFT only):

Evaluate base model performance on a representative sample of your task first.
Proceed with RFT directly if the base model achieves more than approximately 5 percent positive reward.
Fall back to SFT if reward scores are consistently near zero. The model needs baseline competency before reward-guided learning can take effect.
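The RFT-only check above can be expressed as a small gate over base-model reward scores. This is a sketch under stated assumptions: the function name is hypothetical, "positive reward" is taken to mean a score greater than zero, and the 5 percent threshold is the approximate figure from the guidance above, not a hard service limit.

```python
def choose_starting_technique(rewards: list[float],
                              min_positive_rate: float = 0.05) -> tuple[str, float]:
    """Suggest RFT or SFT from base-model rewards on a representative sample.

    Assumption: a reward greater than zero counts as a positive outcome.
    """
    if not rewards:
        raise ValueError("need at least one reward score")
    positive_rate = sum(1 for r in rewards if r > 0) / len(rewards)
    technique = "RFT" if positive_rate > min_positive_rate else "SFT"
    return technique, positive_rate


if __name__ == "__main__":
    # Illustrative sample: 100 prompts, 10 of which earned a positive reward.
    sample_rewards = [0.0] * 90 + [1.0] * 10
    print(choose_starting_technique(sample_rewards))
```

Run the check on a sample large enough to be representative of your task, because a handful of prompts can put the positive rate on either side of the threshold by chance.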

If the base model lacks domain vocabulary or knowledge your task requires, start with CPT:

Run CPT to absorb domain knowledge from unlabeled text.
Follow with SFT. Pre-trained checkpoints used for CPT have no instruction tuning, so SFT is required after CPT to make the model useful.
Optionally follow with RFT to further optimize performance.

Parameter configuration

With strategic decisions made, you can now optimize the specific hyperparameters that govern how each technique executes. This section provides guidance for each technique.

Learning rate configuration

Learning rate controls how quickly the model updates based on training signals. Service defaults represent tested configurations that work across diverse use cases.

For CPT: Start at service defaults. For large datasets exceeding one trillion tokens, you can use a higher learning rate (such as 1e-4) to accelerate knowledge absorption, but you need a ramp-down stage to reduce the learning rate back to approximately 1e-6 for model stability before SFT. The constant_steps parameter controls how many steps the model trains at the peak learning rate before this ramp-down stage begins. Increase constant_steps for very large token runs where more steps at full learning rate help domain absorption. For smaller datasets or later-stage checkpoints, use the default (lower) learning rate from the start.
For SFT: Stick with service defaults, especially with data mixing. The recommended learning rate is 1e-5 for LoRA and 5e-6 for Full Rank SFT. Deviating from the default learning rate when mixing Nova data causes instability. If you observe training instability with data mixing, the learning rate is the first suspect.
For RFT: Start at service defaults. Adjust in small multiplier increments only if needed. If reward drops suddenly and doesn’t recover, the learning rate is likely too high. Even a small multiplier increase can drop performance below baseline.

Configure warmup steps to approximately 15 percent of your total training steps. Warmup stabilizes initial training by gradually increasing the learning rate rather than starting at the full value.
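To make the shape of such a schedule concrete, the following sketch computes a warmup step count from the 15 percent guideline and a per-step learning rate with a warmup phase, a hold at the peak, and a ramp-down. Only constant_steps and the 1e-4 and 1e-6 values come from the text above. The function names, the linear warmup, and the cosine ramp-down are assumptions for illustration, and the schedule Nova Forge actually applies may differ.

```python
import math


def warmup_steps_for(total_steps: int, warmup_fraction: float = 0.15) -> int:
    """Warmup steps as a fraction of total training steps (about 15 percent)."""
    return round(total_steps * warmup_fraction)


def learning_rate_at(step: int, total_steps: int, peak_lr: float, final_lr: float,
                     warmup_steps: int, constant_steps: int) -> float:
    """Learning rate at a zero-based step: linear warmup, hold, then cosine ramp-down.

    Assumption: the decay shape is cosine; only the three phases are from the text.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    hold_end = warmup_steps + constant_steps
    if step < hold_end:
        return peak_lr
    decay_steps = total_steps - hold_end
    if decay_steps <= 0:
        return peak_lr
    progress = min(1.0, (step - hold_end) / decay_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000
    warmup = warmup_steps_for(total)
    for s in (0, warmup, 500, total - 1):
        print(s, learning_rate_at(s, total, 1e-4, 1e-6, warmup, constant_steps=300))
```

With a 312-step run, the helper gives 47 warmup steps, which matches the warmup used in the experiments later in this post.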

Batch size and training duration

Batch size (controlled by global_batch_size) is the batch parameter across all training techniques (CPT, SFT, RFT) and all environments (SageMaker Serverless, SMTJ, HyperPod). It defines the number of training samples processed per optimizer step. For CPT and SFT, this is straightforward: one sample equals one input-output pair (SFT) or one token sequence (CPT). RFT introduces an additional parameter, number_generation, that controls how many candidate responses are generated per prompt for reward scoring. This parameter doesn’t exist in CPT or SFT recipes, because those techniques train directly on provided input-output pairs rather than generating candidates. When number_generation is present, batch size semantics differ between environments. Getting this wrong leads to unexpected behavior.

On SMTJ (RFT only): Batch size means prompts per step. Each prompt generates N candidate responses (controlled by number_generation). Total samples per step equals batch size multiplied by the number of generations.
On SageMaker HyperPod (RFT only): Batch size means total samples per step (prompts multiplied by generations). Translate carefully when moving configurations between environments.
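The translation between the two conventions is a single multiplication or division. The sketch below illustrates it with hypothetical helper names. Only the semantics of global_batch_size and number_generation come from the text above.

```python
def smtj_to_hyperpod_batch(prompts_per_step: int, number_generation: int) -> int:
    """SMTJ batch size (prompts per step) -> HyperPod batch size (total samples)."""
    return prompts_per_step * number_generation


def hyperpod_to_smtj_batch(total_samples_per_step: int, number_generation: int) -> int:
    """HyperPod batch size (total samples) -> SMTJ batch size (prompts per step)."""
    if total_samples_per_step % number_generation != 0:
        raise ValueError("total samples must be a multiple of number_generation")
    return total_samples_per_step // number_generation


if __name__ == "__main__":
    # Illustrative values: 32 prompts per step with 8 generations per prompt.
    total = smtj_to_hyperpod_batch(32, 8)
    print(total, hyperpod_to_smtj_batch(total, 8))
```

Copying a global_batch_size value across environments without this conversion changes the effective samples per step by a factor of number_generation.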

For CPT, target 2-20 million tokens per step. Use 20 million for large token budgets and 2 million for smaller budgets. Calculate global batch size as the nearest power of two to tokens per step divided by max sequence length. For example, 4 million tokens per step with a sequence length of 4096 yields a batch size of approximately 1024. Smaller batch sizes produce noisier gradients, which can help generalization and enable faster iteration. Larger batch sizes produce smoother gradients but may over-smooth domain-specific signals. Start with moderate batch sizes for stability.
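The batch size rule above can be checked with a few lines of arithmetic. This sketch assumes "nearest" means nearest in absolute distance, and the function names are hypothetical.

```python
import math


def nearest_power_of_two(x: float) -> int:
    """Power of two closest to x in absolute distance (ties go to the larger)."""
    if x < 1:
        raise ValueError("x must be at least 1")
    lower = 2 ** math.floor(math.log2(x))
    upper = lower * 2
    return lower if (x - lower) < (upper - x) else upper


def cpt_global_batch_size(tokens_per_step: int, max_seq_len: int) -> int:
    """Global batch size for CPT from a tokens-per-step target."""
    return nearest_power_of_two(tokens_per_step / max_seq_len)


if __name__ == "__main__":
    # 4,000,000 / 4096 is about 976.6, and the nearest power of two is 1024.
    print(cpt_global_batch_size(4_000_000, 4096))
```

The same helper gives 512 for a 2 million token target at a sequence length of 4096.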

Match your max sequence length to your data distribution. Don’t exceed what your data needs. Smaller context lengths increase token throughput and reduce training costs. For CPT, process at most one epoch of your dataset. Avoid repeating data, as multiple epochs on limited CPT data lead to overfitting and loss of general capabilities. Monitor validation loss to track progress. For SFT, Full Rank training typically needs fewer epochs than LoRA. LoRA training can tolerate slightly more epochs. Monitor validation loss to detect overfitting and select the best checkpoint.

RFT-specific parameters

RFT introduces additional parameters not present in CPT or SFT.

Number of generations controls how many candidate responses the model generates per prompt for the reward function to compare. Fewer candidates mean faster training but less signal diversity. Too many candidates add noise without improving signal and nearly double training time. Moderate values hit the best accuracy-to-time ratio. Increase if your task has high variance in response quality. Decrease for rapid reward function iteration during development.
KL-Divergence Loss Coefficient constrains how far the model’s policy can drift from its original behavior. This parameter is available on SMTJ only. A low coefficient lets the model explore freely but risks finding shortcuts that game the reward function. A high coefficient prevents meaningful learning by pulling the model back to its starting point. Increase if KL divergence spikes during training to balance genuine learning against behavioral drift.
Reasoning Effort controls how much chain-of-thought reasoning the model performs before answering. High reasoning effort produces the best accuracy but increases latency and serving cost. Low reasoning effort offers faster inference with modest accuracy trade-offs. Use high for maximum accuracy during validation, then consider reducing for latency-sensitive production deployments.
Lambda Concurrency Limit (SMTJ only) controls parallel AWS Lambda functions for reward evaluation. Increase significantly for fast reward functions to avoid evaluation throughput becoming a bottleneck.

Remember that batch size semantics differ between platforms. On SMTJ, global_batch_size means prompts per step where each generates N candidates. On SageMaker HyperPod, global_batch_size means total samples (prompts multiplied by generations). Translate carefully between environments.

Regularization parameters

Regularization parameters help prevent overfitting, especially on smaller datasets.

Weight decay defaults to zero. Increase modestly if you observe overfitting on small datasets. Weight decay applies L2 regularization to constrain parameter magnitudes.
Dropout (hidden and attention) defaults to zero. Increase hidden dropout modestly for smaller datasets to reduce overfitting. Increase attention dropout cautiously, as high values can hurt complex reasoning capabilities.
Clip ratio and age tolerance are advanced SageMaker HyperPod parameters. Clip ratio limits how much the policy can change in a single training step. Age tolerance determines how long training data stays valid before being considered too stale. Refit frequency controls how often the model collects fresh training data. Defaults work for most use cases. Only adjust these advanced settings if you understand the specific stability issue you’re addressing.

Experiments and insights

With these hyperparameters in mind, we ran a series of HPO experiments using Amazon Nova 2.0 across public benchmarks including CoCoHD, MedReason, and LLaVA-CoT. The following table summarizes the experimental configurations and key findings for each parameter sweep.

Dataset	LoRA Rank	Alpha	GBS	LR	Max Steps	Warmup	Base Target Perf.	SFT Target Perf.	Ranking	Perf Diff
MedReason	32	64	32	1.00E-05	312	47	57.38%	63.54%	2	10.75% ↑
MedReason	64	64	32	1.00E-05	312	47	57.38%	63.78%	1	11.16% ↑
MedReason	32	64	32	5.00E-06	312	47	57.38%	63.33%
MedReason	32	64	32	1.00E-05	624	94	57.38%	61.42%
LLaVA-CoT	64	64	32	1.00E-05	312	47	16.22%	68.47%	1	322.13% ↑
LLaVA-CoT	32	128	32	1.00E-05	312	47	16.22%	65.77%	2	305.49% ↑

We ran LoRA SFT on Amazon Nova 2 Lite using Nova Forge on MedReason with rank 32, alpha 64, batch size 32, 15 percent warmup, and 1 epoch, sweeping only the learning rate to isolate its effect on target accuracy. The service default of 1e-5 produced the best result at 63.54 percent, a 10.75 percent relative lift over the v4 base. Dropping the learning rate to 5e-6 reduced target performance without meaningfully protecting general capabilities, as MMLU, IFEval, and GPQA scores were within noise of the 1e-5 run. Doubling to 2 epochs at the same learning rate dropped accuracy to 61.42 percent, confirming that overtraining on narrow domain data erodes both domain and general performance.

We varied LoRA rank (32 vs 64) and alpha (64 vs 128) on a multimodal reasoning task (LLaVA-CoT) where the base model starts at only 16.22 percent accuracy. The best configuration, rank 64 with alpha 64, lifted accuracy to 68.47 percent, a 322 percent relative improvement over the base. Doubling alpha to 128 at rank 32 produced a similar target gain at 65.77 percent, but at a meaningfully higher general-capability regression cost. For tasks where the baseline accuracy is low, increasing rank is a higher-leverage adjustment than increasing alpha. Alpha should be increased only when LoRA is under-adapting, and decreased if the model is losing general capabilities.
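The Perf Diff figures in the table are relative improvements over the base accuracy, not percentage-point differences. A short sketch of that calculation follows. The function name is hypothetical, and small differences from the published figures can arise because the table accuracies are rounded to two decimals.

```python
def relative_improvement(base_accuracy: float, tuned_accuracy: float) -> float:
    """Relative improvement of tuned over base accuracy, in percent."""
    if base_accuracy <= 0:
        raise ValueError("base_accuracy must be positive")
    return (tuned_accuracy - base_accuracy) / base_accuracy * 100.0


if __name__ == "__main__":
    # LLaVA-CoT rows from the table: base 16.22 percent.
    print(round(relative_improvement(16.22, 68.47), 2))  # rank 64, alpha 64
    print(round(relative_improvement(16.22, 65.77), 2))  # rank 32, alpha 128
```

Reporting the absolute gain in percentage points alongside the relative figure helps when the base accuracy is low, because relative improvements grow quickly from a small denominator.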

No single hyperparameter configuration works best for all use cases. These recommended defaults are strong starting points, not guarantees of optimal performance.

Common pitfalls and how to avoid them

The following table summarizes the most common mistakes practitioners should avoid when tuning Amazon Nova Forge models.

Pitfall	Symptom	Solution
Skipping SFT before RFT	RFT produces no improvement or degrades performance	Run SFT first to get the model into the right behavioral neighborhood before RFT optimization.
Deviating from default LR with data mixing	Training instability, loss spikes, capability collapse	Stick with service defaults when using data mixing. This is the most common mistake.
Poor reward function quality	Accuracy decreases despite training, or model games the metric	Refine your reward function before changing any training parameter. Validate with at least two independent judges.
Multiple epochs on limited CPT data	Overfitting, loss of general capabilities, memorization	Process at most one epoch of your CPT dataset. Monitor validation loss to detect overfitting early.
Mismatched reasoning settings	Inference behavior doesn’t match training behavior	Match `reasoning_enabled` between training and inference. If you train with reasoning, infer with reasoning.

When tuning models with Nova Forge, invest in your reward function before anything else. A poor reward function will decrease accuracy regardless of other hyperparameter choices, while a refined one produces consistent gains on identical infrastructure. Make sure that your reward function has discriminative power within the model’s quality range, because if everything scores high, RFT has no gradient to optimize.
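One rough way to check discriminative power before a run is to score several candidate responses per prompt and look at how much the rewards vary within each prompt. The sketch below is a heuristic introduced here for illustration, not a Nova Forge metric. The function name, the input layout, and the spread threshold are all assumptions.

```python
from statistics import pstdev


def reward_signal_summary(rewards_per_prompt: list[list[float]],
                          min_spread: float = 1e-6) -> dict:
    """Summarize within-prompt reward spread.

    rewards_per_prompt holds one list of candidate rewards per prompt.
    A prompt is "flat" when all its candidates score (almost) the same,
    which leaves nothing for the candidates to be ranked on.
    """
    if not rewards_per_prompt:
        raise ValueError("need rewards for at least one prompt")
    spreads = [pstdev(rewards) for rewards in rewards_per_prompt]
    flat = sum(1 for s in spreads if s <= min_spread)
    return {
        "mean_spread": sum(spreads) / len(spreads),
        "flat_prompt_fraction": flat / len(spreads),
    }


if __name__ == "__main__":
    # Illustrative rewards: the first prompt is saturated, the second is not.
    print(reward_signal_summary([[1, 1, 1, 1], [0, 1, 0, 1]]))
```

A high fraction of flat prompts suggests the reward function, or the prompt set, is not separating better responses from worse ones in the range where the model currently operates.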

The same validation discipline applies to LLM-as-judge selection. Your judge model must reliably distinguish quality differences within the model’s output range. Validate judge agreement with at least two independent evaluators before committing to a training run.
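Agreement between two evaluators can be quantified with Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. The source does not prescribe a specific agreement metric, so this choice is an assumption, and the labels in the example are made up.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two evaluators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both evaluators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both evaluators used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    judge_1 = ["good", "good", "bad", "bad", "good", "bad"]
    judge_2 = ["good", "bad", "bad", "bad", "good", "bad"]
    print(cohens_kappa(judge_1, judge_2))
```

A kappa near zero means the evaluators agree no more often than chance, which is a signal to revisit the rubric or the judge model before training against it.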

Be aware that training environment stability mechanisms differ between platforms. SMTJ applies a continuous KL penalty as a soft constraint, while SageMaker HyperPod uses gradient clipping as a hard cap per step. Both achieve comparable accuracy, but they require different tuning intuitions. Don’t assume parameters transfer directly between environments.

Throughout all of this, prioritize data quality over quantity. Filtering aggressively and making sure training examples accurately represent the target behavior will outperform simply scaling up low-quality data.

Measuring success

When you apply proper hyperparameter tuning, the results can be substantial. The AWS China Applied Science team demonstrated this in their evaluation of Amazon Nova Forge, achieving a 17 percent F1 score improvement on a complex Voice of Customer classification task while maintaining near-baseline MMLU scores.

Key metrics to monitor

Training loss should decrease steadily without sudden spikes. Spikes often indicate learning rate issues or data quality problems.

Validation loss reveals overfitting. If validation loss increases while training loss decreases, you’re overfitting. Reduce epochs, increase regularization, or add more diverse data.

KL divergence (for RFT) shows how far the policy has drifted. Sudden spikes suggest the model is making large, potentially unstable updates. Increase the KL loss coefficient if this occurs.

Reward metrics (for RFT) should improve steadily. If reward improves rapidly then plateaus or drops, the model may be gaming the reward function. Revisit your reward design.

Conclusion

Optimizing model customization with Amazon Nova Forge requires balancing art and science. The art involves understanding trade-offs: checkpoint selection, data mixing strategy, and training mode decisions shape your outcome more than any single hyperparameter. The science involves systematic tuning: learning rate, batch size, and technique-specific parameters require careful configuration based on your data and goals.

Data and reward quality matter more than any hyperparameter. Before tuning training parameters, optimize your data pipeline and reward function. Start with service defaults, especially for learning rate and data mixing, as these defaults exist because they work across a wide range of use cases.

For most production scenarios, the strongest pipeline is SFT followed by RFT. RFT refines existing capability but cannot recover from a low baseline, so supervised fine-tuning needs to establish solid performance first. Data mixing should be treated as essential for production workloads, not optional. It prevents catastrophic forgetting and provides the optimization stability needed for reliable results.

When working with continued pre-training, checkpoint selection is the most impactful decision you’ll make. Match checkpoint flexibility to your data scale: earlier checkpoints for large-scale domain adaptation, later checkpoints for smaller datasets where preserving instruction-following behavior matters.

To get started with Amazon Nova Forge, explore the Amazon Nova documentation and the SageMaker HyperPod recipes repository on GitHub. For hands-on examples of data mixing in action, see the Nova Forge data mixing blog post. For a deeper dive into RFT with Nova Forge, see the Reinforcement fine-tuning for Amazon Nova: Teaching AI through feedback blog post.

Acknowledgements

The authors would like to thank Zheng Du, Bharathan Balaji, Anjie Fang, and Mengnong Xu from the AWS AGI Customization Science team for their technical guidance.