{"id":15721,"date":"2026-06-14T08:23:12","date_gmt":"2026-06-14T08:23:12","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=15721"},"modified":"2026-06-14T08:23:12","modified_gmt":"2026-06-14T08:23:12","slug":"gemma-4-12b-the-developer-information","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=15721","title":{"rendered":"Gemma 4 12B: The Developer Guide"},"content":{"rendered":"<div>\n<p data-block-key=\"dfccn\">Following the announcement in our <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/blog.google\/innovation-and-ai\/technology\/developers-tools\/introducing-gemma-4-12B\/\">launch blog<\/a>, we&#8217;re releasing <b>Gemma 4 12B<\/b>, a dense multimodal model with a <b>unified, encoder-free architecture<\/b>.<\/p>\n<p data-block-key=\"cagch\">Gemma 4 12B introduces several milestones for local AI:<\/p>\n<ol>\n<li data-block-key=\"9is74\"><b>A multimodal encoder-free architecture:<\/b> Bypassing heavy multi-stage vision and audio encoders entirely, multimodal data is fed directly into the LLM backbone, reducing multimodal latency.<\/li>\n<li data-block-key=\"1gjoe\"><b>Our first medium-sized model with audio input:<\/b> In the Gemma family, audio inputs were limited to small, lightweight edge architectures (e.g., E4B). Gemma 4 12B is the first medium-sized model capable of natively ingesting audio.<\/li>\n<li data-block-key=\"9gvh9\"><b>Developer-friendly size<\/b>: Small enough to run locally on dedicated-GPU laptops with 16 GB of VRAM or unified memory. 
To maximize local inference speed, we&#8217;re also releasing a dedicated <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/blog.google\/innovation-and-ai\/technology\/developers-tools\/multi-token-prediction-gemma-4\/\">multi-token prediction (MTP)<\/a> model.<\/li>\n<li data-block-key=\"bgl2\"><b>New macOS desktop experience<\/b>: For the first time, we&#8217;re releasing downloadable macOS desktop applications, letting developers experience fully local spoken and visual interaction directly on consumer-grade devices.<\/li>\n<\/ol>\n<h2 data-block-key=\"exs9i\" id=\"the-architecture\"><b>The Architecture<\/b><\/h2>\n<p data-block-key=\"eo1qq\">Traditional multimodal models rely on frozen, separate vision encoders (e.g., Gemma 4 uses a 150M-parameter vision model for edge sizes and 550M for medium-sized models) and audio encoders (300M parameters for Gemma 4 E2B and E4B). Processing multimodal inputs with multiple separate encoders before feeding them to the LLM leads to increased latency and fragmented memory footprints.<\/p>\n<p data-block-key=\"e8tes\">Gemma 4 12B addresses these issues by using a single decoder-only transformer with the same advanced decoder structure as the Gemma 4 31B Dense model.<\/p>\n<\/div>\n<div>\n<ul>\n<li data-block-key=\"dfccn\"><b>Vision embedder (35M parameters):<\/b> Replaces the 27 vision transformer layers of the other medium-sized Gemma 4 models. Raw 48&#215;48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input.<\/li>\n<li data-block-key=\"50o01\"><b>Audio wave projection:<\/b> Eliminates the separate audio encoder (skipping the 12 conformer layers used in Gemma 4 E2B and E4B). 
Raw 16 kHz audio signals are sliced into 40 ms frames (640 floats each) and projected linearly into the LLM input space.<\/li>\n<li data-block-key=\"dp6l\"><b>Unified fine-tuning advantage:<\/b> Because vision, audio, and text inputs share the exact same weights, <b>you no longer need to co-tune separate frozen encoders<\/b>. Downstream adapter tuning (e.g., LoRA) or full fine-tuning naturally updates the entire multimodal token loop in a single pass (via Hugging Face or Unsloth).<\/li>\n<\/ul>\n<p data-block-key=\"kd11\">For a more in-depth overview of how this encoder-free architecture works, check out <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/newsletter.maartengrootendorst.com\/p\/a-visual-guide-to-gemma-4-12b\">A Visual Guide to Gemma 4 12B<\/a>.<\/p>\n<h2 data-block-key=\"95znp\" id=\"capabilities\"><b>Capabilities<\/b><\/h2>\n<p data-block-key=\"dii46\">Gemma 4 12B achieves strong performance, with capabilities such as automatic speech recognition, agentic reasoning, diarization, video understanding, coding, and more.<\/p>\n<p data-block-key=\"18k6j\">The examples below illustrate the model&#8217;s agentic and multimodal capabilities:<\/p>\n<h2 data-block-key=\"e4fc8\" id=\"example-1:-gemma-4-12b-creates-a-local-image-processing-app-that-uses-gemma-4-12b\"><b>Example 1: Gemma 4 12B creates a local image processing app that uses Gemma 4 12B<\/b><\/h2>\n<\/div>\n<div>\n<p data-block-key=\"dfccn\">Thanks to its agentic and multimodal understanding capabilities, Gemma 4 12B can be easily used with existing agent harnesses like OpenCode. In this example, we served it locally using llama.cpp and used <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/google-gemma\/gemma-skills\">gemma-skills<\/a> to code a Gradio app that helps the user process images. 
This app was powered by the same Gemma 4 12B model that built it!<\/p>\n<h2 data-block-key=\"9o6z4\" id=\"example-2:-processing-5-minutes-of-video-at-1-fps-with-audio\"><b>Example 2: Processing 5 minutes of video at 1 FPS with audio<\/b><\/h2>\n<p data-block-key=\"al016\">We used Gemma 4 12B to analyze a segment of the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.youtube.com\/watch?v=wYSncx9zLIU&amp;start=932&amp;end=1245\">Google I\/O Keynote<\/a> from May 19, specifically the roughly five minutes between 00:15:32 and 00:20:45. To do that, we extracted all the frames of that segment (at 1 FPS), plus the prompt and the audio from the video:<\/p>\n<\/div>\n<div>\n<p data-block-key=\"dfccn\">Prompt:<\/p>\n<ol>\n<li data-block-key=\"4kue7\">313 frames (at 1 FPS, images resized to a <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/gemma\/docs\/capabilities\/vision\/image?_gl=1*jm0coc*_up*MQ..*_ga*MTI2MzkwODM0NS4xNzgwNDI5Mzg5*_ga_P1DBVKWT6V*czE3ODA0MjkzODkkbzEkZzAkdDE3ODA0MjkzODkkajYwJGwwJGgxOTEyOTEwNzc5#variable_resolution_token_budget\">visual token budget<\/a> of 70)<\/li>\n<li data-block-key=\"b8fbs\"><code>\"what happens when the person takes a selfie?\"<\/code><\/li>\n<li data-block-key=\"7a0pu\">the video&#8217;s audio + the prompt below<\/li>\n<\/ol>\n<blockquote data-block-key=\"dhhdt\"><p>When the person takes a &#8220;selfie&#8221; or is shown holding a smartphone in front of his face in these demonstration videos, it&#8217;s a clever way to visually represent how the AI models can take existing media, like a personal photo or a video clip, and &#8220;reimagine&#8221; it. 
In these specific clips, the model is taking a selfie and using it as a base to generate various scenarios (like a person in a space station, or walking through a forest).<\/p>\n<p>Essentially, the person isn&#8217;t actually taking a selfie; rather, he&#8217;s **acting out a visual metaphor for the AI&#8217;s capability to take one specific input (a &#8220;selfie&#8221;) and generate a whole world of new content based on it.** This is part of the &#8220;Swap&#8221; and &#8220;Build worlds&#8221; demonstrations of the Gemini Omni model, showing its ability to perform complex, multi-modal reasoning and creative generation.<\/p>\n<\/blockquote>\n<h2 data-block-key=\"rgh1e\" id=\"on-device-and-desktop-serving:-powered-by-litert-lm\"><b>On-Device &amp; Desktop Serving: Powered by LiteRT-LM<\/b><\/h2>\n<p data-block-key=\"8bicr\">Alongside the Gemma 4 12B launch, we&#8217;re officially introducing powerful on-device developer integrations powered by LiteRT-LM, bringing zero-latency local AI execution natively to standard desktop environments:<\/p>\n<p data-block-key=\"71i11\"><b>1. Native macOS Apps<\/b>: The mobile <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/developers.google.com\/edge\/gallery\"><b>Google AI Edge Gallery<\/b><\/a> is officially expanding to desktop platforms, running Gemma 4 12B offline, natively on Apple Silicon GPUs. It comes with a secure, sandboxed Python execution loop to write and execute code and plot scientific charts inside the chat bubble. In parallel, the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/edge\/eloquent\"><b>Google AI Edge Eloquent<\/b><\/a> app on Mac launches support for Gemma 4 12B to power Voice Edit conversational inputs.<\/p>\n<\/div>\n<div>\n<p data-block-key=\"dfccn\"><b>2. 
Drop-in Local API Servers (litert-lm serve):<\/b> Run Gemma 4 12B as a local, OpenAI-compatible API server using the new litert-lm serve <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/edge\/litert-lm\/cli\"><b>CLI command<\/b><\/a>. Seamlessly connect standard integrations (e.g., Continue, Aider, OpenClaw, Hermes, or OpenCode), leveraging stateless in-memory prefix caching to match context history and bypass prefill latency.<\/p>\n<\/div>\n<div>\n<pre><code class=\"language-shell\"># Import the model from the Hugging Face repo\nlitert-lm import --from-huggingface-repo=litert-community\/gemma-4-12B-it-litert-lm gemma-4-12B-it.litertlm gemma4-12b\n\n# Start the OpenAI-compatible server\nlitert-lm serve<\/code><\/pre>\n<\/div>\n<div>\n<p data-block-key=\"dfccn\">Find a deep dive on the Google AI Edge Gallery <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/developers.googleblog.com\/bringing-gemma-4-12b-to-your-laptop-unlocking-local-agentic-workflows-with-google-ai-edge\">blog<\/a>.<\/p>\n<h2 data-block-key=\"ld333\" id=\"getting-started-today\"><b>Getting Started Today<\/b><\/h2>\n<p data-block-key=\"2nt2v\">Ready to build local multimodal agents with the first encoder-free architecture of the Gemma family? 
Here is how you can jump in today:<\/p>\n<ul>\n<li data-block-key=\"142u3\"><b>Try it yourself<\/b>: Experiment with a couple of clicks in <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/lmstudio.ai\/models\/gemma-4\">LM Studio<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ollama.com\/library\/gemma4\">Ollama<\/a>, the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/developers.google.com\/edge\/gallery\">Google AI Edge Gallery app<\/a>, the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/edge\/eloquent\">Google AI Edge Eloquent<\/a> app, and the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/edge\/litert-lm\/cli\">LiteRT-LM CLI<\/a>.<\/li>\n<li data-block-key=\"1riop\"><b>Download the weights<\/b>: Download the pre-trained and instruction-tuned checkpoints directly from <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/collections\/google\/gemma-4\">Hugging Face<\/a> and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.kaggle.com\/models\/google\/gemma-4\">Kaggle<\/a>.<\/li>\n<li data-block-key=\"fv5um\"><b>Integrate &amp; learn:<\/b> Review the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/gemma\/docs\/core\">developer documentation<\/a> and the <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/ai.google.dev\/gemma\/docs\/capabilities\/text\/basic\">quick start notebook<\/a>.<\/li>\n<li data-block-key=\"5fjdj\"><b>Use your favorite development tools<\/b>: Implement local inference pipelines with <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/google\/gemma-4-12B-it\">Hugging Face Transformers<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/collections\/ggml-org\/gemma-4\">llama.cpp<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/huggingface.co\/collections\/mlx-community\/gemma-4\">MLX<\/a>, <a rel=\"nofollow\" target=\"_blank\" 
href=\"https:\/\/docs.sglang.io\/cookbook\/autoregressive\/Google\/Gemma4\">SGLang<\/a>, and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.vllm.ai\/projects\/recipes\/en\/latest\/Google\/Gemma4.html\">vLLM<\/a>, or fine-tune efficiently using <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/unsloth.ai\/docs\/models\/gemma-4\">Unsloth<\/a>.<\/li>\n<li data-block-key=\"65l5\"><b>Unlock Agentic Development with Gemma Skills:<\/b> To help agents build with the latest Gemma developments, we&#8217;re releasing our official <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/github.com\/google-gemma\/gemma-skills\">Skills Repository<\/a>. This is a library of skills designed specifically to enable agents to build with Gemma models.<\/li>\n<li data-block-key=\"8m2rg\"><b>Deploy your way:<\/b> Spin up production endpoints on Google Cloud via <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/console.cloud.google.com\/agent-platform\/publishers\/google\/model-garden\/gemma4;publisherModelVersion=gemma-4-12b-it\">Gemini Enterprise Agent Platform Model Garden<\/a>, <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/codelabs.developers.google.com\/codelabs\/cloud-run\/cloud-run-gpu-rtx-pro-6000-gemma4-vllm\">Cloud Run<\/a>, and <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/docs.cloud.google.com\/kubernetes-engine\/docs\/tutorials\/serve-gemma-gpu-vllm\">GKE<\/a>.<\/li>\n<\/ul>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>Following the announcement in our launch blog, we&#8217;re releasing Gemma 4 12B, a dense multimodal model with a unified, encoder-free architecture. 
Gemma 4 12B introduces several milestones for local AI: A multimodal encoder-free architecture: Bypassing heavy multi-stage vision and audio encoders entirely, multimodal data is fed directly into the LLM backbone, [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":15723,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[56],"tags":[9319,1217,1456,78],"class_list":["post-15721","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","tag-12b","tag-developer","tag-gemma","tag-guide"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15721","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15721"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15721\/revisions"}],"predecessor-version":[{"id":15722,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15721\/revisions\/15722"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/15723"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15721"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15721"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15721"}],"curies":[{"name":"
wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}