{"id":15317,"date":"2026-06-01T18:23:46","date_gmt":"2026-06-01T18:23:46","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=15317"},"modified":"2026-06-01T18:23:46","modified_gmt":"2026-06-01T18:23:46","slug":"how-the-group-educated-gemma-to-suppose-with-tunix-and-tpus","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=15317","title":{"rendered":"How the community trained Gemma to &#8220;Think&#8221; with Tunix and TPUs"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p data-block-key=\"ambgv\">Large Language Models (LLMs) often benefit from &#8220;thinking&#8221; before they speak when handling complex tasks. Frontier LLMs like Gemini 3 and leading open weight models like Gemma 4 can produce explicit reasoning traces, commonly known as Chain-of-Thought, before answering user questions. But how this reasoning capability is trained is often not disclosed. While there are <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/google\/tunix\/blob\/main\/examples\/grpo_gemma.ipynb\">many reasoning tutorials<\/a> available on the Internet for training on simple verifiable tasks such as math or coding, accessible and easy-to-reproduce training recipes (including data, training method, runnable code and evaluations) for general reasoning remain scarce.<\/p>\n<p data-block-key=\"8cdgc\">This motivated us to hold the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\">Google Tunix Hack: Train a model to show its work<\/a> hackathon on Kaggle: we challenged developers to transform non-reasoning base models (Gemma-2-2B and Gemma-3-1B) into general reasoning models, using Tunix and Kaggle TPUs. 
The response was overwhelming: over 11,000 entrants and 300+ high-quality submissions proved that decent reasoning training can be achieved by the community even with a very limited compute budget (Kaggle TPU v5e-8 for 9 hours). In this post, we\u2019ll highlight the techniques used by the winners and share key recipes that allow models to reason across key vertical industries, so you can train your own reasoning models.<\/p>\n<h2 data-block-key=\"19yix\" id=\"highlighting-the-winners:-key-innovations\"><b>Highlighting the Winners: Key Innovations<\/b><\/h2>\n<p data-block-key=\"3brj1\">The winning submissions demonstrated a sophisticated understanding of post-training, combining supervised learning, preference optimization, and reinforcement learning in creative ways.<\/p>\n<h4 data-block-key=\"f2mxa\" id=\"1st-place:-g-rar-(rubric-based-reinforcement-learning)\">\ud83e\udd47 1st Place: <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/new-writeup-1768107128907\">G-RaR (Rubric-Based Reinforcement Learning)<\/a><\/h4>\n<p data-block-key=\"4pd0a\">G-RaR trains Gemma models to produce structured reasoning by combining Supervised Fine-Tuning (SFT) with GRPO, driven by a novel rubric-based LLM-as-judge reward system.<\/p>\n<ul>\n<li data-block-key=\"d3jpd\"><b>How It Improves Reasoning<\/b> The model&#8217;s reasoning strength is improved by explicitly training it to &#8220;show its work&#8221; within <code><reasoning\/><\/code> tags before outputting an answer. The underlying technique (for GRPO), G-RaR (Rubrics as Rewards), uses a larger judge model (Gemma-3-12B) to evaluate the quality of these intermediate logical steps based on task-specific rubrics. By converting discrete rubric scores into continuous, normalized reward signals, the approach provides dense, smooth feedback on the model&#8217;s logic. 
This allows the model to continuously improve its reasoning capabilities without relying solely on exact-match correctness, making it highly effective even for open-ended, non-verifiable tasks.<\/li>\n<li data-block-key=\"dtghe\"><b>Technical Solution<\/b> The team used a two-stage post-training pipeline:\n<ul>\n<li data-block-key=\"e468\"><b>Stage 1 (SFT):<\/b> The Gemma-2-2B-IT model is fine-tuned via LoRA on a ~33k sample dataset to establish a baseline. This &#8220;warm start&#8221; teaches the model to reliably output the <code><reasoning>...<\/reasoning><answer>...<\/answer><\/code> structure.<\/li>\n<li data-block-key=\"1koet\"><b>Stage 2 (GRPO):<\/b> The model is then refined using GRPO, based on a composite reward function (Format Reward + Exact Answer Reward + G-RaR Score). To overcome compute constraints, the team used a split-mesh architecture on a single Kaggle TPU v5e-8, placing the policy\/reference models on one mesh and the judge model on the other for true parallel execution.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h4 data-block-key=\"5nc9l\" id=\"2nd-place:-pinocchio-1b-(creating-a-reasoning-model-in-3-acts)\">\ud83e\udd48 2nd Place: <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/new-writeup-1767940008062\">Pinocchio-1B (Creating a Reasoning Model in 3 Acts)<\/a><\/h4>\n<p data-block-key=\"afvo1\">Evolving a 1B parameter model into a structured reasoning engine (&#8220;Pinocchio&#8221;) via a highly efficient, 9-hour TPU pipeline (SFT \u2192 SimPO \u2192 GRPO).<\/p>\n<ul>\n<li data-block-key=\"7qfp0\"><b>How It Improves Reasoning<\/b> The model learns to generate a structured <code><reasoning\/><\/code> trace before answering, shifting from basic pattern matching to logical deduction. 
This is built sequentially: SFT instills foundational Chain-of-Thought, SimPO locks in strict formatting (preventing verbosity hacks), and GRPO refines logic by using an LLM-as-a-Judge to reward coherence and heavily penalize hallucinations.<\/li>\n<li data-block-key=\"3l78r\"><b>Technical Solution<\/b> The pipeline consists of three stages:\n<ul>\n<li data-block-key=\"2lmv3\"><b>SFT (Distillation):<\/b> Trained on 70k prompts using an OSS-120B teacher model and a Gemini task-router.<\/li>\n<li data-block-key=\"9i5q6\"><b>SimPO (Alignment):<\/b> Replaced memory-heavy DPO to efficiently enforce strict XML formatting.<\/li>\n<li data-block-key=\"9bshb\"><b>GRPO (Refinement):<\/b> Used Gemini 2.0 Flash as an asynchronous judge to dynamically reward accuracy, logic, and format.<\/li>\n<\/ul>\n<\/li>\n<li data-block-key=\"7q91u\"><b>Customizing Tunix:<\/b> The team explicitly extended the Tunix library to support this workflow by:\n<ul>\n<li data-block-key=\"62pro\">Injecting a custom SimPO loss function (with length normalization) into the <code>DPOTrainer<\/code>.<\/li>\n<li data-block-key=\"63ch0\">Creating a high-throughput, asynchronous evaluation engine to process GRPO reward signals on the fly.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h4 data-block-key=\"bb9f8\" id=\"3rd-place:-idea-e-distillation-with-curriculum-guided-grpo-training\">\ud83e\udd49 3rd Place: <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/idea-e-distillation-with-curriculum-guided-grpo\">IDEA-E Distillation with Curriculum Guided GRPO Training<\/a><\/h4>\n<p data-block-key=\"9en8p\">Distilling the structured &#8220;<a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.eapon.ca\/wp-content\/uploads\/2017\/03\/12460_IDEA_Framework_Workbook__final___April_2013_.pdf\">IDEA-E<\/a>&#8221; ethical reasoning framework into a 2B model using curriculum-guided GRPO and a fast TF-IDF reward 
system.<\/p>\n<ul>\n<li data-block-key=\"4em8p\"><b>Why It Improves Reasoning<\/b> The IDEA-E scaffold forces the model through a step-by-step logical deduction before answering, preventing premature guessing. At the same time, the TF-IDF reward prevents verbose &#8220;yapping&#8221; by incentivizing the use of context-relevant vocabulary in the reasoning trace.<\/li>\n<li data-block-key=\"5obui\"><b>Technical Solution<\/b> The pipeline features two stages:\n<ul>\n<li data-block-key=\"8li9a\"><b>SFT:<\/b> Fine-tuning on teacher data to establish the IDEA-E format.<\/li>\n<li data-block-key=\"6qk7r\"><b>GRPO:<\/b> Reinforcement learning using curriculum guidance and a TF-IDF reward instead of a slow LLM judge.<\/li>\n<\/ul>\n<\/li>\n<li data-block-key=\"akr1i\"><b>Customizing Tunix:<\/b> The team extended Tunix by integrating their custom TF-IDF reward function into the Tunix GRPO pipeline, allowing for rapid, non-blocking reward calculations on the CPU.<\/li>\n<\/ul>\n<h2 data-block-key=\"tsmla\" id=\"honorable-mentions\"><b>Honorable Mentions<\/b><\/h2>\n<p data-block-key=\"9jods\">While the top three spots took the podium, several other submissions showcased strong creativity and technical depth:<\/p>\n<h4 data-block-key=\"2c4dd\" id=\"eliciting-reasoning-via-on-policy-distillation\">\ud83c\udf1f <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/eliciting-reasoning-via-on-policy-distillation\">Eliciting Reasoning via On-Policy Distillation<\/a><\/h4>\n<ul>\n<li data-block-key=\"69ao2\"><b>The Approach:<\/b> Instead of relying solely on static offline datasets, they implemented an <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/thinkingmachines.ai\/blog\/on-policy-distillation\/\">on-policy distillation<\/a> method from scratch within the Tunix framework. 
They used a larger, highly capable teacher model (trained in 3 stages) to generate reasoning traces <i>dynamically<\/i> in response to the student model&#8217;s generations during training, creating a tighter feedback loop.<\/li>\n<\/ul>\n<h4 data-block-key=\"roken\" id=\"gemma2-deep&quot;-incentivizing-gemma-to-reason-before-answering\">\ud83c\udf1f <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/theitcrow-gemma2-deep-1767955818518\">\u201cGemma2-Deep\u201d Incentivizing Gemma to Reason before Answering<\/a><\/h4>\n<ul>\n<li data-block-key=\"c6sat\"><b>The Approach:<\/b> Developed by participant <i>TheItCrow<\/i>, this project focused on custom dataset curation and structured reward modeling.\n<ul>\n<li data-block-key=\"afnl9\">They curated the Deep-CoRGI (Cognitive Reasoning Guided Interface) dataset, specifically designed to teach Chain of Thought.<\/li>\n<li data-block-key=\"304ek\">They trained a custom ThoughtTeacher reward model to evaluate not just the correctness of the final answer, but the logical flow of the reasoning steps themselves.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p data-block-key=\"3p6fv\">We&#8217;re also very impressed with several submissions that focus on reasoning training in specific domains, such as <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/new-writeup-1767963319979\">medical<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/introducing-gemmax\">chemistry<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/code\/rishirajacharya\/grpo-gemma3-1b-for-legal-data-with-tunix\">legal<\/a> and <a rel=\"nofollow\" target=\"_blank\" 
href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/tunix-robotics-reasoning-single-session\">robotics<\/a>.<\/p>\n<ul>\n<li data-block-key=\"8igsf\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/new-writeup-1767963319979\"><b>Medical<\/b><\/a>: GRPO training yields structured, step-by-step reasoning traces that improve the interpretability and reliability of the model&#8217;s complex clinical problem-solving outputs.<\/li>\n<li data-block-key=\"fgb78\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/introducing-gemmax\"><b>Chemistry<\/b><\/a>: Step-by-step reasoning traces benefited the chemistry use case by enabling a small language model to solve complex chemistry reasoning tasks.<\/li>\n<li data-block-key=\"2kof2\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/code\/rishirajacharya\/grpo-gemma3-1b-for-legal-data-with-tunix\"><b>Legal<\/b><\/a>: Post-training via GRPO reinforces structured, step-by-step reasoning, enabling the Gemma 3 1B model to accurately analyze complex legal data and produce reliable, logically sound interpretations.<\/li>\n<li data-block-key=\"1pggu\"><a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/competitions\/google-tunix-hackathon\/writeups\/tunix-robotics-reasoning-single-session\"><b>Robotics<\/b><\/a>: Step-by-step reasoning generation enables the model to solve multi-step robotics planning and decision-making tasks under single-session training constraints.<\/li>\n<\/ul>\n<h2 data-block-key=\"tzor5\" id=\"ready-to-build\"><b>Ready to build?<\/b><\/h2>\n<p data-block-key=\"3l6au\">The Tunix Hackathon democratizes the training of highly capable, structured reasoning models: it produced many impressive reasoning training recipes, all of which are now publicly available. 
With Tunix and free Kaggle TPUs, developers can now achieve strong results on accessible hardware.<\/p>\n<p data-block-key=\"66mvj\">If you&#8217;re ready to start post-training your own reasoning models, here are some resources to get started:<\/p>\n<ol>\n<li data-block-key=\"1ep6o\"><b>Explore Tunix on GitHub:<\/b> Check out the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/google\/tunix\">official Tunix repository<\/a> to access the code, documentation, and community examples.<\/li>\n<li data-block-key=\"65dr9\"><b>Try a Colab Tutorial:<\/b> Spin up a free TPU instance in Google Colab and try the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/google\/tunix\/tree\/main\/examples\">Tunix examples<\/a> to run your first SFT or RL loop.<\/li>\n<li data-block-key=\"6s5je\"><b>Learn More About Reinforcement Learning:<\/b> Read the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/tunix.readthedocs.io\/en\/latest\/design.html#rl\">RL documentation<\/a> in Tunix to understand how to leverage reinforcement learning to fine-tune your model.<\/li>\n<\/ol>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Large Language Models (LLMs) often benefit from &#8220;thinking&#8221; before they speak when handling complex tasks. Frontier LLMs like Gemini 3 and leading open weight models like Gemma 4 can produce explicit reasoning traces, commonly known as Chain-of-Thought, before answering user questions. But how this reasoning capability is trained is often not disclosed. 
Whereas [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":15319,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[56],"tags":[1582,1456,7308,6893,9285],"class_list":["post-15317","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","tag-community","tag-gemma","tag-tpus","tag-trained","tag-tunix"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15317","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15317"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15317\/revisions"}],"predecessor-version":[{"id":15318,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15317\/revisions\/15318"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/15319"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15317"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15317"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15317"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}