MediaTek NPU and LiteRT: Powering the next generation of on-device AI

January 26, 2026

The Neural Processing Unit (NPU) has become the critical enabler for the next generation of on-device AI. By delivering peak performance of tens of TOPS (tera operations per second) with minimal power consumption, NPUs enable devices to run sophisticated, computationally heavy generative AI models that were previously impossible on typical edge devices.

[Video] Smart grouping powered by an on-device large language model running on the MediaTek Kompanio Ultra NPU, on a Chromebook Plus 14.

These powerful NPUs are the engine behind a vast, diverse ecosystem of products, from flagship smartphones, laptops, and tablets to smart home hubs and IoT devices. However, deploying AI on NPUs has often been difficult, hindering broad adoption. The NPU space is highly fragmented, with hundreds of SoC variants targeting different device types, creating significant hurdles for developers who must manage compilers and distribute runtimes. Existing on-device ML infrastructure is largely tailored to CPUs and GPUs, and lacks deep integration with specialized NPU SDKs and their unique compilation needs. The result has been complex, ad-hoc deployment workflows. Moreover, running sophisticated GenAI models efficiently on NPUs requires advanced optimization and dedicated kernels, going far beyond simple operator delegation.

Together with MediaTek, we are excited to announce the new LiteRT NeuroPilot Accelerator to help developers overcome these challenges. It is a ground-up successor to the TFLite NeuroPilot delegate, bringing a seamless deployment experience, state-of-the-art LLM support, and superior performance to millions of devices worldwide.

Key features of the LiteRT NeuroPilot Accelerator

Moving well beyond basic acceleration, the LiteRT NeuroPilot Accelerator provides a unified development workflow and sophisticated features designed to productionize AI on MediaTek NPUs. Here are the highlights:

  • Seamless and unified deployment workflow: The accelerator provides easy access to various MediaTek NPUs via a unified API, abstracting away SDK complexities. You can choose between two distinct compilation workflows: offline (Ahead-of-Time, a.k.a. AOT) and online (on-device), giving you the flexibility to pick the best strategy for your application, whether that means minimizing first-run latency or enabling platform-agnostic model distribution.
  • Rich generative AI capabilities: Our collaboration with MediaTek unlocks the full potential of state-of-the-art models like the Gemma family. This enables building and deploying sophisticated generative AI features, from advanced text generation to new multimodal applications, directly on the NPU.
  • Efficient, cross-platform development: We have launched a new, simplified C++ API (an improvement on the previous C API) that makes building highly efficient ML pipelines easier. This new API works seamlessly with native hardware buffer interoperability, allowing zero-copy data passing from AHardwareBuffer directly to the NPU, as well as automatic conversion from OpenGL/OpenCL buffers. This is essential for building high-throughput, real-time camera and video applications.

Seamless and unified deployment workflow

Traditionally, developers needed to build for many combinations of SoC vendors and SoC versions, and had to manage the distribution of compiled models and runtimes for each combination. To solve this, we have created a simple, three-step workflow for getting your models running with NPU acceleration.

The full, detailed guide, with a Colab and a sample app, is available in our LiteRT NPU documentation. Here is the high-level process:

  • Step 1: AOT compilation for the target SoCs (optional). You simply use the LiteRT Python library to compile your .tflite model for the supported SoCs. See more details in the LiteRT AOT Compilation Tutorial. While optional, AOT compilation is highly recommended for larger models to reduce on-device initialization time. This step is not required for on-device compilation.
  • Step 2: Deploy with Google Play for On-device AI (PODAI) if on Android. Use LiteRT to export the model assets and required runtime libraries into an “AI Pack”, the format used by PODAI. Copy the AI Pack into your Android app project. When users install your app from Google Play, it analyzes the user’s device and automatically delivers the model and runtime to a compatible device.
  • Step 3: Inference using the LiteRT runtime. LiteRT abstracts away the complexity of hardware fragmentation. For both AOT and on-device compilation, you simply load the model and specify Accelerator.NPU in the options. LiteRT handles the rest and even includes a robust fallback mechanism: you can specify GPU or CPU as secondary options, and LiteRT will automatically use them if the NPU is unavailable. (A minimal C++ sketch follows this list.)
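
The sketch below illustrates Step 3 by mirroring the LiteRT C++ snippets later in this post; the header paths and the model filename are assumptions, and the Kotlin/Java API exposes the equivalent Accelerator.NPU option.

// Minimal sketch of Step 3 (header paths and model filename are assumptions):
// load a .tflite model and request NPU acceleration via LiteRT's CompiledModel API.
#include "litert/cc/litert_compiled_model.h"
#include "litert/cc/litert_environment.h"
#include "litert/cc/litert_model.h"

// Create a default LiteRT environment (no EGL or dispatch options needed here).
const std::vector<Environment::Option> environment_options = {};
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create(absl::MakeConstSpan(environment_options)));

// Works for both AOT-compiled and plain models; with on-device compilation, the NPU
// compile happens during this initialization step.
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("model.tflite"));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
                        CompiledModel::Create(env, model, HwAccelerator::kNpu));

// Allocate I/O buffers, fill the inputs with application data, then run on the NPU.
// Per the workflow above, GPU or CPU can also be configured as fallback accelerators.
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());
compiled_model.Run(input_buffers, output_buffers);

C++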

AOT and on-device compilation

With the new LiteRT NeuroPilot Accelerator, we have moved from a high-level wrapper to a direct, native integration with the NeuroPilot compiler and runtime. This enables a powerful Ahead-of-Time (AOT) compilation workflow that was previously out of reach, giving developers flexibility in their deployment strategy:

  • Offline (AOT) compilation: Best suited for large, complex models where the target SoC is known. Compiling ahead of time significantly reduces initialization cost and lowers memory usage when the user launches your app.
  • Online (on-device) compilation: Ideal for platform-agnostic distribution of small models. The model is compiled on the user's device during initialization, requiring no extra preparation step but incurring a higher first-run cost.

Here is how the two approaches compare for a large model (e.g., Gemma 3 270M). As shown, on-device compilation for such a large model can take over a minute, making AOT the more practical choice for production.

[Chart: AOT vs. on-device (JIT) compilation time for Gemma 3 270M]
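
If you want to check this trade-off on your own device, here is a minimal sketch that times initialization, which is where the on-device compilation cost appears. It reuses the CompiledModel API from the snippets later in this post and assumes an `env` created as in those snippets; the model filename is illustrative.

// Minimal sketch: time model initialization. With on-device (JIT) compilation the
// NPU compile runs inside CompiledModel::Create, so this interval grows with model
// size; with an AOT-compiled model it stays comparatively small.
#include <chrono>
#include <iostream>

const auto start = std::chrono::steady_clock::now();
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("gemma3_270m.tflite"));  // assumed filename
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
                        CompiledModel::Create(env, model, HwAccelerator::kNpu));
const auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - start);
std::cout << "NPU initialization took " << elapsed_ms.count() << " ms" << std::endl;

C++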

Rich generative AI capabilities with Gemma and other open-weight models

On supported Android devices you can use Gemini Nano through ML Kit. For markets where Gemini Nano is not supported, or if you have use cases that require deeper customization, we now unlock the full potential of open-weight models. This includes Google's Gemma model family, a set of lightweight, cutting-edge open models optimized specifically for on-device use cases.

As announced at MediaTek's recent Dimensity 9500 event, our collaboration brings optimized, production-ready support for the following models on their latest chipsets:

  • Qwen3 0.6B: A foundation model that powers new AI experiences in Mainland China from OEMs such as Xiaomi, Huawei, and Vivo.
  • Gemma 3 270M: A hyper-efficient, compact base model designed for task-specific fine-tuning, enabling high-speed, low-latency features like sentiment analysis or entity extraction in resource-constrained environments.
  • Gemma 3 1B: A lightweight, multilingual, text-only model that balances compact size with strong generative capabilities, making it ideal for a wide range of on-device reasoning, summarization, and content creation tasks.
  • Gemma 3n E2B: A mobile-first, powerful multimodal model that natively understands audio, vision, and text, purpose-built for low-latency applications like real-time speech translation and visual understanding.
  • EmbeddingGemma 300M: A state-of-the-art text embedding model that produces high-quality embeddings on-device, great for Retrieval-Augmented Generation (RAG), semantic search, and classification.

Powered by optimizations specifically targeting the MediaTek NPU, Gemma models are accelerated by up to 12x compared to CPU and 10x compared to GPU. This delivers impressively fast inference, as shown in the performance benchmarks for Gemma and Qwen on the latest MediaTek Dimensity 9500 in the Vivo X300 Pro:

[Chart: Prefill and decode benchmarks for Gemma and Qwen models on the MediaTek Dimensity 9500 NPU]

As the results show, the Gemma 3n E2B model achieves over 1600 tokens/sec for prefill and 28 tokens/sec for decode (with a 4K context) on the NPU. At those rates, a 2,048-token prompt is processed in roughly 1.3 seconds, and a 100-token reply streams out in about 3.5 seconds. This speed enables sophisticated multimodal use cases.

[Video] A real-time, on-device Chinese-language assistant with vision and audio multimodality, powered by Gemma 3n E2B, running on a Vivo X300 Pro with the MediaTek Dimensity 9500 NPU. (Left) Recognizing a dish and providing cooking instructions. (Middle) Identifying a plant and suggesting care tips. (Right) Generating a one-day itinerary for San Francisco.

How to deploy Gemma

To get started, you can find pre-compiled Gemma models for the MediaTek NPU in the LiteRT HuggingFace Community. We provide two primary integration paths, with pathways for both C/C++ and Kotlin/Java users.

1. For text generation (e.g., Gemma 3 270M), use LiteRT-LM: built on top of LiteRT, LiteRT-LM provides a high-level, stateful "text-in, text-out" API that simplifies inference with generative text models.

// 1. Define model assets and engine settings.
auto model_assets = ModelAssets::Create(model_path);
auto engine_settings = EngineSettings::CreateDefault(
    model_assets, litert::lm::Backend::NPU); // Specify inference on NPU.

// 2. Create the main Engine object. This loads the model.
absl::StatusOr<std::unique_ptr<Engine>> engine = Engine::CreateEngine(engine_settings);

// 3. Create a Session for a new conversation.
auto session_config = SessionConfig::CreateDefault();
absl::StatusOr<std::unique_ptr<Session>> session = (*engine)->CreateSession(session_config);

// 4. Generate content using a high-level API.
absl::StatusOr<Responses> responses = (*session)->GenerateContent(
    {InputText("What is the tallest building in the world?")});

// 5. Print the response.
std::cout << *responses << std::endl;

C++

See the instructions in the LiteRT-LM documentation for more details on setting up MediaTek NeuroPilot and on API usage for C++ and Kotlin.

2. For EmbeddingGemma, use LiteRT: EmbeddingGemma fits perfectly with LiteRT's "tensor-in, tensor-out" API.

// 1. Set up inference options.
auto env = Environment::Create({dispatch_options});
auto embedder_model_def = Model::CreateFromFile(embedder_path);
auto options = Options::Create();
options->SetHardwareAccelerators(kLiteRtHwAcceleratorNpu);

// 2. Create the LiteRT CompiledModel.
LITERT_ASSIGN_OR_RETURN(auto embedder_model,
    CompiledModel::Create(*env, *embedder_model_def, *options));
LITERT_ASSIGN_OR_RETURN(auto input_buffers, embedder_model->CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, embedder_model->CreateOutputBuffers());

// 3. Run inference on the inputs.
LITERT_RETURN_IF_ERROR(input_buffers[0].Write(token_ids));
LITERT_RETURN_IF_ERROR(
    embedder_model->Run(input_buffers, output_buffers));
LITERT_RETURN_IF_ERROR(output_buffers[0].Read(output_embeddings));

C++
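
Once you have embeddings, downstream tasks such as semantic search come down to vector comparisons. Here is a minimal, framework-free sketch of cosine similarity between two embedding vectors (for example, a query embedding and a document embedding read out as above):

// Cosine similarity between two embedding vectors; plain C++, no LiteRT calls.
#include <cmath>
#include <vector>

float CosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
  float dot = 0.f, norm_a = 0.f, norm_b = 0.f;
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
    dot += a[i] * b[i];      // accumulate the dot product
    norm_a += a[i] * a[i];   // accumulate squared norms
    norm_b += b[i] * b[i];
  }
  if (norm_a == 0.f || norm_b == 0.f) return 0.f;
  return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

C++

Scores near 1.0 indicate semantically similar texts, which is the basis for the RAG, semantic search, and classification use cases mentioned above.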

See also the full instructions for C++ and Kotlin development in the LiteRT documentation. An end-to-end example is available in the LiteRT Semantic Similarity demo app.

We will soon support converting custom Gemma models for the MediaTek NPU via LiteRT, and more NPU demos will be available in the AI Edge Gallery.

Efficient, cross-platform development

To make it easier to build rich, real-time applications across many kinds of platforms and devices, we have focused on improving the developer experience and data-pipeline efficiency. This starts with a new, simplified C++ API. It improves on the previous C API and makes it easier to build efficient, cross-platform ML applications.

Our new API is designed to work seamlessly with native hardware buffers. The accelerator now supports native hardware buffer interoperability, which enables two key efficiencies. First, it allows zero-copy data passing with AHardwareBuffer. Second, it provides zero-copy interop between AHardwareBuffer and OpenGL/OpenCL buffers, the common inputs and outputs of GPU image processing. Instead of converting input and output data to and from the CPU, you can pass camera frames or video directly from other ML pipeline components to the NPU via LiteRT. This is essential for building the high-throughput, real-time camera and video applications that are a key goal of this launch.

Here is an example of GPU pre-processing followed by NPU inference, using LiteRT's buffer interop support:

// Define a LiteRT environment that uses the existing EGL display and context.
const std::vector<Environment::Option> environment_options = {
   {OptionTag::EglDisplay, user_egl_display},
   {OptionTag::EglContext, user_egl_context}};
auto env = Environment::Create(absl::MakeConstSpan(environment_options));

// Load the model and initialize the NPU runtime.
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("model.tflite"));
LITERT_ASSIGN_OR_RETURN(auto compiled_model, CompiledModel::Create(env, model, HwAccelerator::kNpu));

// Prepare I/O buffers.
LITERT_ASSIGN_OR_RETURN(RankedTensorType tensor_type, model.GetInputTensorType("input_name0"));
// Create an input TensorBuffer directly from an OpenGL SSBO (GL buffer).
LITERT_ASSIGN_OR_RETURN(auto tensor_buffer_from_opengl, TensorBuffer::CreateFromGlBuffer(env, tensor_type, GL_SHADER_STORAGE_BUFFER, gl_buffer_id, size_bytes, offset));
std::vector<TensorBuffer> input_buffers;
input_buffers.push_back(std::move(tensor_buffer_from_opengl));

// Create output TensorBuffers for the model.
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());

// Run inference.
compiled_model.Run(input_buffers, output_buffers);

C++
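
For camera pipelines that produce AHardwareBuffer frames rather than GL buffers, the same zero-copy idea applies. The following is only a sketch under assumptions: the CreateFromAhwb factory name and argument order are modeled by analogy with CreateFromGlBuffer above, so check the LiteRT C++ API documentation for the exact signature.

// Hypothetical sketch: wrap an existing AHardwareBuffer (e.g. a camera frame) as an
// input TensorBuffer without copying. CreateFromAhwb and its arguments are assumptions
// modeled on CreateFromGlBuffer above; env, model, compiled_model, and output_buffers
// are reused from the previous example.
#include <android/hardware_buffer.h>

AHardwareBuffer* camera_frame = /* obtained from the camera or codec pipeline */ nullptr;
LITERT_ASSIGN_OR_RETURN(RankedTensorType input_type, model.GetInputTensorType("input_name0"));
LITERT_ASSIGN_OR_RETURN(auto tensor_buffer_from_ahwb,
    TensorBuffer::CreateFromAhwb(env, input_type, camera_frame, /*ahwb_offset=*/0));
std::vector<TensorBuffer> input_buffers;
input_buffers.push_back(std::move(tensor_buffer_from_ahwb));

// Run inference as before; LiteRT hands the buffer to the NPU without a CPU round trip.
compiled_model.Run(input_buffers, output_buffers);

C++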

See more instructions in the LiteRT C++ API documentation and the LiteRT Async Segmentation C++ demo app.

Looking ahead

LiteRT now makes it easy to bring NPU-accelerated ML to millions of MediaTek devices through the LiteRT NeuroPilot Accelerator, dramatically improving the user experience for a vast global audience.

LiteRT NPU support is now available to all developers. We encourage you to try it out today! Check out our example Colab, explore the sample app, and dive into the official LiteRT devsite for documentation and guides.

Acknowledgements

Special thanks to the Google ODML team and the MediaTek team for their significant contributions to this effort:

Google ODML team: Alice Zheng, Advait Jain, Andrew Zhang, Arian Arfaian, Chintan Parikh, Chunlei Niu, Cormac Brick, Gerardo Carranza, Gregory Karpiak, Jingjiang Li, Jing Jin, Julius Kammerl, Lu Wang, Luke Boyer, Marissa Ikonomidis, Maria Lyubimtseva, Matt Kreileder, Matthias Grundmann, Na Li, Ping Yu, Quentin Khan, Rishika Sinha, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Teng-Hui Zhu, Terry (Woncheol) Heo, Vitalii Dziuba, Weiyi Wang, Yu-Hui Chen, Zichuan Wei.

MediaTek team: Bo-Yan Lin, Chao-Yuan Lee, Cheng-Yen Lin, Chia-Lin Yu, Chiayu Sung, Christoph Kuo, Chuo-Ling Chang, Deep Yap, Hsienkai Kuo, HungChun Liu, Jush Lu, Kayden Yang, Lei Chen, Peng-Wen Chen, Poyuan Jeng, Tzu-hsuan Wei, Waimun Wong, Wen-Li Shih, YanRen Chang, Yi-Min Tsai, Yu-Chieh Lin, Yu-Ting Wan.
