Following the announcement in our launch weblog<\/a>, we’re releasing Gemma 4 12B<\/b>, a dense multimodal mannequin with a unified, encoder-free structure<\/b>.<\/p>\n

Gemma 4 12B introduces a number of milestones for native AI:<\/p>\n

\n
A multimodal encoder-free structure:<\/b> Bypassing heavy multi-stage imaginative and prescient and audio encoders totally, multimodal knowledge is fed straight into the LLM spine, lowering multimodal latency.<\/li>\n
Our first medium-sized mannequin with audio enter:<\/b> Within the Gemma household, audio inputs had been restricted to small, light-weight edge architectures (e.g. E4B). Gemma 4 12B is the primary medium-sized mannequin able to natively ingesting audio.<\/li>\n

Developer-friendly dimension<\/b>: Sufficiently small to run domestically on devoted GPU laptops with 16GB VRAM or unified reminiscence. To maximise native inference speeds, we’re moreover releasing a devoted multi-token prediction (MTP)<\/a> mannequin.<\/li>\n
New MacOS desktop expertise<\/b>: For the primary time, we’re releasing downloadable macOS desktop purposes, letting builders expertise totally native spoken and visible interplay straight on consumer-grade gadgets.<\/li>\n<\/ol>\n
The Structure<\/b><\/h2>\n
Conventional multimodal fashions depend on frozen, separate imaginative and prescient encoders (e.g., Gemma 4 makes use of a 150M parameter imaginative and prescient mannequin for edge sizes and 550M for medium-sized fashions) and audio encoders (300M parameters for Gemma 4 E2B and E4B). Processing multimodal inputs with a number of separate encoders earlier than feeding them to the LLM results in elevated latency and fragmented reminiscence footprints.<\/p>\n
Gemma 4 12B solves these points by using a single decoder-only transformer containing the identical superior decoder construction because the Gemma 4 31B Dense mannequin.<\/p>\n<\/div>\n
\n
\n
Imaginative and prescient embedder (35M parameters):<\/b> Replaces the 27 imaginative and prescient transformer layers of the opposite medium-sized Gemma 4 fashions. Uncooked 48×48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location data on to the enter.<\/li>\n
Audio wave projection:<\/b> Eliminates the separate audio encoder (skipping the 12 conformer layers utilized in Gemma 4 E2B and E4B). Uncooked 16 kHz audio indicators are sliced into 40ms frames (640 floats every) and projected linearly to the LLM enter house.<\/li>\n
Unified fine-tuning benefit:<\/b> As a result of imaginative and prescient, audio, and textual content inputs share the very same weights, you not must co-tune separate frozen encoders<\/b>. Downstream adapter (e.g. LoRA) or full tuning naturally replace your entire multimodal token loop in a single move (by way of Hugging Face or Unsloth).<\/li>\n<\/ul>\n
For a extra in-depth overview of how this encoder-free structure works, try A Visible Information to Gemma 4 12B<\/a>.<\/p>\n
Capabilities<\/b><\/h2>\n
Gemma 4 12B achieves excellent efficiency, with capabilities equivalent to computerized speech recognition, agentic reasoning, diarization, video understanding, coding, and extra.<\/p>\n
See under examples for an illustration of the mannequin’s agentic and multimodal capabilities:<\/p>\n
Instance 1: Gemma 4 12B creates an area picture processing app that makes use of Gemma 4 12B<\/b><\/h2>\n<\/div>\n\n
Due to its agentic and multimodal understanding capabilities, Gemma 4 12B might be simply used with present agent harnesses like OpenCode. On this instance, we served it domestically utilizing llama.cpp utilizing the gemma-skills<\/a> to code a Gradio app that helped the consumer course of pictures. This app was powered by the identical Gemma 4 12B mannequin that constructed it!<\/p>\n
Instance 2: Processing 5 minutes of Video at 1 FPS with audio<\/b><\/h2>\nWe used Gemma 4 12B to analyse an element from the Google IO Keynote second<\/a> from Might 19, particularly the 5 minutes between 00:15:32 and 00:20:45. To try this, we extracted all of the frames of that phase (at 1 FPS), plus the immediate and the audio from the video:<\/p>\n<\/div>\n
\n
Immediate:<\/p>\n
\n
313 frames (at 1FPS, pictures resized to visible token finances<\/a> 70)<\/li>\n
\"what occurs when the person takes a selfie?\"<\/code><\/li>\n
the Video’s audio + the immediate under<\/li>\n<\/ol>\nWhen the person takes a “selfie” or is proven holding a smartphone in entrance of his face in these demonstration movies, it’s a intelligent method to visually signify how the AI fashions can take present media\u2014like a private photograph or a video clip\u2014and “reimagine” it. In these particular clips, the mannequin is taking a selfie and utilizing it as a base to generate varied situations (like an individual in an area station, or strolling by way of a forest).<\/p>\n Basically, the person isn’t really taking a selfie; relatively, he’s **performing out a visible metaphor for the AI’s functionality to take one particular enter (a “selfie”) and generate a complete world of latest content material primarily based on it.** That is a part of the “Swap” and “Construct worlds” demonstrations of the Gemini Omni mannequin, displaying its skill to carry out advanced, multi-modal reasoning and artistic era.<\/p>\n<\/blockquote>\nOn-Machine & Desktop Serving: Powered by LiteRT-LM<\/b><\/h2>\nIn tandem with the Gemma 4 12B launch, we’re formally introducing highly effective on-device developer integrations powered by LiteRT-LM, bringing zero-latency native AI execution natively to plain desktop environments:<\/p>\n1.Native MacOS Apps<\/b>: The cellular Google AI Edge Gallery<\/b><\/a> is formally increasing to desktop platforms, operating Gemma 4 12B offline, natively on Apple Silicon GPUs. It comes with a safe sandboxed Python execution loop to write down, execute, and plot scientific charts contained in the chat bubble. In parallel, the Google AI Edge Eloquent<\/b><\/a> app on Mac launches help for Gemma 12B to energy Voice Edit conversational inputs.<\/p>\n<\/div>\n \n2. Drop-in Native API Servers (litert-lm serve):<\/b> Run Gemma 4 12B as an area, OpenAI-compatible API server utilizing the brand new litert-lm serve CLI command<\/b><\/a>.<\/b> Seamlessly join normal integrations (e.g., Proceed, Aider, OpenClaw, Hermes or OpenCode), leveraging stateless prefix caching in reminiscence to match context historical past and immediately bypass prefill latency.<\/p>\n<\/div>\n \nlitert-lm import --from-huggingface-repo=litert-community\/gemma-4-12B-it-litert-lm gemma-4-12B-it.litertlm gemma4-12b \n \n# Begin the OpenAI-compatible server \nlitert-lm serve<\/code><\/pre>\n\n Shell\n <\/p>\n<\/div>\n\nDiscover a deep dive about it on the Google AI Edge Gallery weblog<\/a>.<\/p>\n Getting Began At the moment<\/b><\/h2>\nAble to construct native multimodal brokers with the primary encoder-free structure of the Gemma household? Right here is how one can bounce in at this time<\/p>\n\nStrive it your self<\/b>: Experiment with a few clicks in LM Studio<\/a>, Ollama<\/a>, Google AI Edge Gallery App<\/a>, the Google AI Edge Eloquent<\/a> app and the LiteRT-LM CLI<\/a><\/li>\n Obtain the weights<\/b>: Obtain the pre-trained and instruction-tuned checkpoints straight from Hugging Face<\/a> and Kaggle<\/a>.<\/li>\n Combine & be taught:<\/b> Assessment the developer documentation<\/a> and the fast begin pocket book<\/a>.<\/li>\n Use your favourite growth instruments<\/b>: Implement native inference pipelines with Hugging Face Transformers<\/a>, llama.cpp<\/a>, MLX<\/a>, SGLang<\/a>, and vLLM<\/a>, or fine-tune with effectivity utilizing Unsloth<\/a>.<\/li>\n Unlock Agentic Growth with Gemma Abilities:<\/b> To help brokers to construct with the most recent Gemma developments, we’re releasing our official Abilities Repository<\/a>. It is a library of abilities designed particularly to allow brokers to construct with Gemma fashions.<\/li>\n Deploy your method:<\/b> Spin up endpoints in manufacturing utilizing Google Cloud. Deploy your method by way of Gemini Enterprise Agent Platform Mannequin Backyard<\/a>, Cloud Run<\/a> and GKE<\/a>.<\/li>\n<\/ul>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":"Following the announcement in our launch weblog, we’re releasing Gemma 4 12B, a dense multimodal mannequin with a unified, encoder-free structure. Gemma 4 12B introduces a number of milestones for native AI: A multimodal encoder-free structure: Bypassing heavy multi-stage imaginative and prescient and audio encoders totally, multimodal knowledge is fed straight into the LLM spine, […]<\/p>\n","protected":false},"author":2,"featured_media":15723,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[56],"tags":[9319,1217,1456,78],"class_list":["post-15721","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","tag-12b","tag-developer","tag-gemma","tag-guide"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15721","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=15721"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15721\/revisions"}],"predecessor-version":[{"id":15722,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/15721\/revisions\/15722"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/15723"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=15721"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=15721"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=15721"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}