Gemma 4 12B: The Developer Information

Following the announcement in our launch weblog, we’re releasing Gemma 4 12B, a dense multimodal mannequin with a unified, encoder-free structure.

Gemma 4 12B introduces a number of milestones for native AI:

A multimodal encoder-free structure: Bypassing heavy multi-stage imaginative and prescient and audio encoders totally, multimodal knowledge is fed straight into the LLM spine, lowering multimodal latency.
Our first medium-sized mannequin with audio enter: Within the Gemma household, audio inputs had been restricted to small, light-weight edge architectures (e.g. E4B). Gemma 4 12B is the primary medium-sized mannequin able to natively ingesting audio.
Developer-friendly dimension: Sufficiently small to run domestically on devoted GPU laptops with 16GB VRAM or unified reminiscence. To maximise native inference speeds, we’re moreover releasing a devoted multi-token prediction (MTP) mannequin.
New MacOS desktop expertise: For the primary time, we’re releasing downloadable macOS desktop purposes, letting builders expertise totally native spoken and visible interplay straight on consumer-grade gadgets.

The Structure

Conventional multimodal fashions depend on frozen, separate imaginative and prescient encoders (e.g., Gemma 4 makes use of a 150M parameter imaginative and prescient mannequin for edge sizes and 550M for medium-sized fashions) and audio encoders (300M parameters for Gemma 4 E2B and E4B). Processing multimodal inputs with a number of separate encoders earlier than feeding them to the LLM results in elevated latency and fragmented reminiscence footprints.

Gemma 4 12B solves these points by using a single decoder-only transformer containing the identical superior decoder construction because the Gemma 4 31B Dense mannequin.

Imaginative and prescient embedder (35M parameters): Replaces the 27 imaginative and prescient transformer layers of the opposite medium-sized Gemma 4 fashions. Uncooked 48×48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location data on to the enter.
Audio wave projection: Eliminates the separate audio encoder (skipping the 12 conformer layers utilized in Gemma 4 E2B and E4B). Uncooked 16 kHz audio indicators are sliced into 40ms frames (640 floats every) and projected linearly to the LLM enter house.
Unified fine-tuning benefit: As a result of imaginative and prescient, audio, and textual content inputs share the very same weights, you not must co-tune separate frozen encoders. Downstream adapter (e.g. LoRA) or full tuning naturally replace your entire multimodal token loop in a single move (by way of Hugging Face or Unsloth).

For a extra in-depth overview of how this encoder-free structure works, try A Visible Information to Gemma 4 12B.

Capabilities

Gemma 4 12B achieves excellent efficiency, with capabilities equivalent to computerized speech recognition, agentic reasoning, diarization, video understanding, coding, and extra.

See under examples for an illustration of the mannequin’s agentic and multimodal capabilities:

Instance 1: Gemma 4 12B creates an area picture processing app that makes use of Gemma 4 12B

Due to its agentic and multimodal understanding capabilities, Gemma 4 12B might be simply used with present agent harnesses like OpenCode. On this instance, we served it domestically utilizing llama.cpp utilizing the gemma-skills to code a Gradio app that helped the consumer course of pictures. This app was powered by the identical Gemma 4 12B mannequin that constructed it!

Instance 2: Processing 5 minutes of Video at 1 FPS with audio

We used Gemma 4 12B to analyse an element from the Google IO Keynote second from Might 19, particularly the 5 minutes between 00:15:32 and 00:20:45. To try this, we extracted all of the frames of that phase (at 1 FPS), plus the immediate and the audio from the video:

Immediate:

313 frames (at 1FPS, pictures resized to visible token finances 70)
"what occurs when the person takes a selfie?"
the Video’s audio + the immediate under

When the person takes a “selfie” or is proven holding a smartphone in entrance of his face in these demonstration movies, it’s a intelligent method to visually signify how the AI fashions can take present media—like a private photograph or a video clip—and “reimagine” it. In these particular clips, the mannequin is taking a selfie and utilizing it as a base to generate varied situations (like an individual in an area station, or strolling by way of a forest).

Basically, the person isn’t really taking a selfie; relatively, he’s **performing out a visible metaphor for the AI’s functionality to take one particular enter (a “selfie”) and generate a complete world of latest content material primarily based on it.** That is a part of the “Swap” and “Construct worlds” demonstrations of the Gemini Omni mannequin, displaying its skill to carry out advanced, multi-modal reasoning and artistic era.

On-Machine & Desktop Serving: Powered by LiteRT-LM

In tandem with the Gemma 4 12B launch, we’re formally introducing highly effective on-device developer integrations powered by LiteRT-LM, bringing zero-latency native AI execution natively to plain desktop environments:

1.Native MacOS Apps: The cellular Google AI Edge Gallery is formally increasing to desktop platforms, operating Gemma 4 12B offline, natively on Apple Silicon GPUs. It comes with a safe sandboxed Python execution loop to write down, execute, and plot scientific charts contained in the chat bubble. In parallel, the Google AI Edge Eloquent app on Mac launches help for Gemma 12B to energy Voice Edit conversational inputs.

2. Drop-in Native API Servers (litert-lm serve): Run Gemma 4 12B as an area, OpenAI-compatible API server utilizing the brand new litert-lm serve CLI command. Seamlessly join normal integrations (e.g., Proceed, Aider, OpenClaw, Hermes or OpenCode), leveraging stateless prefix caching in reminiscence to match context historical past and immediately bypass prefill latency.

litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm  gemma-4-12B-it.litertlm gemma4-12b

# Begin the OpenAI-compatible server
litert-lm serve

Shell

Discover a deep dive about it on the Google AI Edge Gallery weblog.

Getting Began At the moment

Able to construct native multimodal brokers with the primary encoder-free structure of the Gemma household? Right here is how one can bounce in at this time

Strive it your self: Experiment with a few clicks in LM Studio, Ollama, Google AI Edge Gallery App, the Google AI Edge Eloquent app and the LiteRT-LM CLI
Obtain the weights: Obtain the pre-trained and instruction-tuned checkpoints straight from Hugging Face and Kaggle.
Combine & be taught: Assessment the developer documentation and the fast begin pocket book.
Use your favourite growth instruments: Implement native inference pipelines with Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM, or fine-tune with effectivity utilizing Unsloth.
Unlock Agentic Growth with Gemma Abilities: To help brokers to construct with the most recent Gemma developments, we’re releasing our official Abilities Repository. It is a library of abilities designed particularly to allow brokers to construct with Gemma fashions.
Deploy your method: Spin up endpoints in manufacturing utilizing Google Cloud. Deploy your method by way of Gemini Enterprise Agent Platform Mannequin Backyard, Cloud Run and GKE.