{"id":11842,"date":"2026-02-16T02:21:54","date_gmt":"2026-02-16T02:21:54","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=11842"},"modified":"2026-02-16T02:21:55","modified_gmt":"2026-02-16T02:21:55","slug":"accomplished-hyperparameter-switch-throughout-modules-width-depth-batch-and-length","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=11842","title":{"rendered":"Accomplished Hyperparameter Switch throughout Modules, Width, Depth, Batch and Length"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p>Hyperparameter tuning can dramatically impression coaching stability and closing efficiency of large-scale fashions. Current works on neural community parameterisations, comparable to \u03bcP, have enabled switch of optimum world hyperparameters throughout mannequin sizes. These works suggest an empirical observe of seek for optimum world base hyperparameters at a small mannequin dimension, and switch to a big dimension. We prolong these works in two key methods. To deal with scaling alongside most essential scaling axes, we suggest the Full(d) Parameterisation that unifies scaling in width and depth \u2014 utilizing an adaptation of CompleteP \u2014 in addition to in batch-size and coaching length. Secondly, with our parameterisation, we examine per-module hyperparameter optimisation and switch. We characterise the empirical challenges of navigating the high-dimensional hyperparameter panorama, and suggest sensible pointers for tackling this optimisation drawback. We reveal that, with the proper parameterisation, hyperparameter switch holds even within the per-module hyperparameter regime. Our research covers an intensive vary of optimisation hyperparameters of recent fashions: studying charges, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments reveal vital coaching pace enhancements in Giant Language Fashions with the transferred per-module hyperparameters.<\/p>\n<ul class=\"links-stacked\">\n<li>\u2020 College of Cambridge<\/li>\n<li>** Work performed whereas at Apple<\/li>\n<\/ul>\n<figure id=\"figure1\" class=\"\" aria-label=\"Figure 1\">\n<div class=\"bg-gray-light text-base rounded\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/mlr.cdn-apple.com\/media\/figure1_converted_59c5246a33.png\" aria-label=\"Diagram illustrating hyperparameter optimisation at the 50M parameter scale, comparing global and per-module strategies and highlighting transfer to a much larger FLOP budget using the Complete(d)P parameterisation.\" tabindex=\"-1\" target=\"_blank\" class=\"mt-0\"><img decoding=\"async\" src=\"https:\/\/mlr.cdn-apple.com\/media\/figure1_converted_59c5246a33.png\" alt=\"Diagram illustrating hyperparameter optimisation at the 50M parameter scale, comparing global and per-module strategies and highlighting transfer to a much larger FLOP budget using the Complete(d)P parameterisation.\" loading=\"lazy\" class=\"bg-gray-light\"\/><\/a><\/div><figcaption class=\"muted\" aria-hidden=\"true\">Determine 1: We optimise hyperparameters at a small 50M parameters\/1.6B tokens scale (studying price, initialisation scale, Adam \u03b5, momenta, and weight decay) with an evolutionary technique. These hyperparameters (HPs) might be optimised both globally with a shared worth throughout all the mannequin, or per-module (with 13 module varieties, some moreover tuned per depth). 
The per-module approach leads to better results at the 50M scale: optimal global HPs require 2.3× longer training to achieve the same performance. Crucially, our new parameterisation, Complete(d)P, enables direct transfer (without subsequent tuning) to a ~14,000× larger FLOP budget.
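To make the transfer recipe concrete, here is a minimal sketch of μP-style per-module learning-rate scaling with Adam, the kind of rule that lets base hyperparameters tuned at a small width be reused at a larger one. The module kinds, base values, and target width are illustrative assumptions, and the two-rule table below is the standard simplified μP prescription for Adam, not the paper's Complete(d)P rules (which additionally handle depth, batch size, and duration).

```python
# Sketch of muP-style hyperparameter transfer across width (assumed rules,
# not the Complete(d)P parameterisation described above).

BASE_WIDTH = 256  # width at which the base hyperparameters were tuned

def mup_adam_lr(module_kind: str, width: int, base_lr: float) -> float:
    """Adam learning rate for one module type at a target width.

    Under muP with Adam, 'hidden' (matrix-like) weights have their learning
    rate scaled by BASE_WIDTH / width, while 'embedding' (vector-like)
    parameters keep the base learning rate. This is the simplified standard
    muP table for Adam.
    """
    if module_kind == "hidden":       # e.g. attention and MLP weight matrices
        return base_lr * BASE_WIDTH / width
    if module_kind == "embedding":    # e.g. token embeddings, norms, biases
        return base_lr
    raise ValueError(f"unknown module kind: {module_kind}")

# Per-module transfer: each module type carries its own base learning rate
# (hypothetical values standing in for ones found by search at small scale),
# and all of them are rescaled to the target width by the same rules.
per_module_base_lrs = {"embedding": 1e-2, "hidden": 2e-3}
target_width = 4096

transferred = {
    kind: mup_adam_lr(kind, target_width, base_lr=lr)
    for kind, lr in per_module_base_lrs.items()
}
print(transferred)  # {'embedding': 0.01, 'hidden': 0.000125}
```

The point of such a parameterisation is that the optimum found at the base width stays (approximately) optimal after rescaling, so the small-scale search, whether global or per-module, does not have to be repeated at the large scale.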