Modern smartphones feature sophisticated SoCs (system on a chip), composed of CPU, GPU, and NPU, which can enable compelling, on-device GenAI experiences that are significantly more interactive and real-time than their server-only counterparts. The GPU is the most ubiquitous accelerator for AI tasks, with GPU compute available on roughly 90% of all Android devices. However, relying solely on it can create performance bottlenecks, especially when building complex, interactive GenAI experiences. Consider the following scenario: running a compute-intensive, text-to-image generation model on-device while simultaneously processing the live camera feed with ML-based segmentation. Even the most powerful mobile GPU will struggle under this combined load, resulting in jarring frame drops and a broken user experience.
Performance bottleneck with full GPU inference (left), and smooth user experience with NPU/GPU parallel processing (right). Captured on a Samsung Galaxy S25 Ultra powered by the Qualcomm Snapdragon 8 Elite.
This is where the NPU (Neural Processing Unit) comes in. It is a highly specialized processor that provides tens of TOPS (tera operations per second) of dedicated AI compute, far more than a modern mobile GPU can sustain. Crucially, it is significantly more power-efficient per TOP than both CPUs and GPUs, which is essential for battery-powered devices like phones. The NPU is no longer a niche feature; it is a standard component, with over 80% of recent Qualcomm SoCs now including one. The NPU runs in parallel with the GPU and CPU, handling the heavy AI processing. This concurrency frees the GPU to focus on rendering and the CPU on main-thread logic. This modern architecture unlocks the smooth, responsive, and fast performance that modern AI applications demand.
Introducing the LiteRT Qualcomm AI Engine Direct Accelerator
To bring this NPU power to LiteRT, Google's high-performance on-device ML framework, we're thrilled to announce a significant leap forward: the LiteRT Qualcomm AI Engine Direct (QNN) Accelerator, developed in close collaboration with Qualcomm, replacing the previous TFLite QNN delegate.
This update introduces two major advantages for developers:
1. A unified and simplified mobile deployment workflow that frees Android app developers from the biggest complexities of NPU acceleration. You no longer have to:
- Interact with low-level, vendor-specific SDKs: LiteRT integrates with SoC compilers and runtimes and exposes them through a unified, streamlined developer-facing API.
- Target individual SoC versions: LiteRT abstracts away fragmentation across SoCs, providing a unified workflow to scale deployment to multiple SoCs at the same time.
You can now deploy your model seamlessly across all supported devices, with either ahead-of-time (AOT) or on-device compilation. This makes integrating pre-trained .tflite models in production from sources like Qualcomm AI Hub easier than ever.
2. State-of-the-art on-device performance. The accelerator supports an extensive range of LiteRT ops, enabling maximum NPU utilization and full model delegation, a critical factor for achieving the best performance. Additionally, it is packed with the specialized kernels and optimizations required for sophisticated LLMs and GenAI models, achieving SOTA performance for models like Gemma and FastVLM.
Superior performance, real-world results
We benchmarked the new LiteRT QNN accelerator across 72 canonical ML models, spanning vision, audio, and NLP domains. The results show a massive jump in raw performance: NPU acceleration delivers up to a 100x speedup over CPU and a 10x speedup over GPU. Our new accelerator enables this by supporting 90 LiteRT ops, allowing 64 of the 72 models to delegate fully to the NPU.
This speed translates to real interactive performance. On Qualcomm's latest flagship SoC, the Snapdragon 8 Elite Gen 5, the performance benefit is substantial: over 56 models run in under 5 ms on the NPU, while only 13 models achieve that on the CPU. This unlocks a host of live AI experiences that were previously unreachable.
Here is a selection of 20 representative models from the benchmark:
Figure: LiteRT inference latency measured on the Snapdragon 8 Elite Gen 5 powering the Xiaomi 17 Pro Max. Values are normalized to the CPU baseline (100%), demonstrating significant speedups, with the GPU reducing latency to ~5–70% and the NPU reducing latency to ~1–20%.
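For readers cross-checking the figure against the headline numbers, a latency normalized to the CPU baseline converts directly into a speedup factor. A quick illustrative sketch (our arithmetic, not from the benchmark itself):

```python
def speedup_from_normalized(normalized_pct: float) -> float:
    """Convert a latency normalized to the CPU baseline (100%) into a speedup factor."""
    return 100.0 / normalized_pct

# NPU latency at ~1% of the CPU baseline -> ~100x faster than CPU
print(speedup_from_normalized(1.0))   # 100.0
# GPU latency at ~10% of the CPU baseline -> ~10x faster than CPU
print(speedup_from_normalized(10.0))  # 10.0
```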
Unlocking the full power of the NPU for LLM inference
The LiteRT QNN Accelerator shows cutting-edge performance with sophisticated LLMs. To demonstrate this, we benchmarked the FastVLM-0.5B research model, a state-of-the-art vision model for on-device AI, using LiteRT for both AOT compilation and on-device NPU inference.
The model is optimized with int8 weight quantization and int16 activation quantization. This is the key to unlocking the NPU's most powerful, high-speed int16 kernels. We also went beyond simple delegation and added special NPU kernels for performance-critical transformer layers to the LiteRT QNN Accelerator, notably for the attention mechanism, ensuring these layers run efficiently.
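To give a feel for why int16 activations matter, here is a minimal, generic sketch of symmetric affine quantization (illustrative only, not LiteRT's actual quantizer): the same values quantized to int16 reconstruct with far smaller error than int8, which is why keeping activations at int16 preserves model quality while still hitting the NPU's fast integer kernels.

```python
def quantize(values, bits):
    """Symmetric affine quantization to signed `bits`-bit integers (zero point 0)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

x = [-1.0, 0.25, 0.5, 1.0]
q8, s8 = quantize(x, 8)     # weights: int8
q16, s16 = quantize(x, 16)  # activations: int16

err8 = max(abs(a - b) for a, b in zip(dequantize(q8, s8), x))
err16 = max(abs(a - b) for a, b in zip(dequantize(q16, s16), x))
print(err8 > err16)  # int16 retains roughly 256x finer resolution than int8
```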
This delivers a level of performance that creates an AI experience rarely seen on mobile devices. Running on the Snapdragon 8 Elite Gen 5 NPU, our FastVLM integration delivers a time-to-first-token (TTFT) of just 0.12 seconds on high-resolution images (1024×1024). It achieves over 11,000 tokens/sec for prefill and over 100 tokens/sec for decode. This high throughput is what makes a smooth, real-time, interactive experience possible. To showcase this, we built a live scene understanding demo that processes and describes the world around you.
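A quick back-of-the-envelope check ties these numbers together. The prompt size below is our own assumption for illustration (a high-resolution image prompt on the order of 1,300 tokens), not a figure from the benchmark:

```python
prefill_tokens_per_s = 11_000  # quoted prefill throughput
decode_tokens_per_s = 100      # quoted decode throughput
prompt_tokens = 1_300          # assumed prompt size (illustrative only)

ttft_s = prompt_tokens / prefill_tokens_per_s
per_token_ms = 1000 / decode_tokens_per_s

print(f"TTFT ~ {ttft_s:.2f} s")                 # consistent with the quoted 0.12 s
print(f"decode ~ {per_token_ms:.0f} ms/token")  # 10 ms per generated token
```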
Scene understanding using the FastVLM vision modality running on the Snapdragon 8 Elite Gen 5 in the Xiaomi 17 Pro Max.
Getting started in 3 steps
Here's how simple it is to deploy a .tflite model on the NPU across different Qualcomm SoC versions using the unified workflow with LiteRT. Pre-trained, production-quality .tflite models can be downloaded from sources like Qualcomm AI Hub.
Step 1: (optional) AOT compilation for the target SoCs with LiteRT
While pre-compiling your .tflite model offline (AOT) is optional, we highly recommend it for large models, where on-device compilation can result in longer initialization times and higher peak memory consumption.
You can compile for all supported SoCs, or target specific SoC versions, using LiteRT on the host in a few lines of Python code:
from ai_edge_litert.aot import aot_compile as aot_lib
from ai_edge_litert.aot.vendors.qualcomm import target as qnn_target

# --- Compile for all available SoCs ---
compiled_models = aot_lib.aot_compile(tflite_model_path)

# --- Or, compile for specific Qualcomm SoC versions ---
# Example: targeting the Qualcomm Snapdragon 8 Elite Gen 5 Mobile Platform (SM8850)
sm8850_target = qnn_target.Target(qnn_target.SocModel.SM8850)
compiled_models = aot_lib.aot_compile(
    tflite_model_path,
    target=[sm8850_target]
)
After compilation, export your compiled models across target SoCs into a single Google Play AI Pack. You then upload this pack to Google Play, which uses Play for On-device AI (PODAI) to automatically deliver the correct compiled model to each user's device.
from ai_edge_litert.aot.ai_pack import export_lib as ai_pack_export

# --- Export the AI Pack ---
# This bundles model variants and metadata so Google Play can
# deliver the correct compiled model to the right device.
ai_pack_export.export(
    compiled_models,
    ai_pack_dir,
    ai_pack_name,
    litert_model_name
)
See a full example in the LiteRT AOT compilation notebook.
Step 2: Deploy to the target SoCs with Google Play for On-device AI
Add your model to the Android app project. You have two distinct options depending on your chosen workflow:
- For on-device compilation: Copy the original .tflite model file directly into your app's assets/ directory.
- For AOT compilation: Copy the entire AI Pack from Step 1 into your project's root directory. You must then add this AI Pack to your Gradle configuration, as shown below:
// my_app/settings.gradle.kts
...
include(":ai_pack:my_model")

// my_app/app/build.gradle.kts
android {
    ...
    assetPacks.add(":ai_pack:my_model")
}
Next, run the script to fetch the QNN libraries. This downloads the NPU runtime (for both AOT and on-device compilation) and the compiler library (required for on-device compilation).
# Download and unpack the NPU runtime libraries to the root directory.
# For AOT compilation, download litert_npu_runtime_libraries.zip.
# For on-device compilation, download litert_npu_runtime_libraries_jit.zip.
$ ./litert_npu_runtime_libraries/fetch_qualcomm_library.sh
Add the NPU runtime libraries as feature modules to the Gradle configuration:
// my_app/settings.gradle.kts
include(":litert_npu_runtime_libraries:runtime_strings")
include(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
...

// my_app/app/build.gradle.kts
android {
    dynamicFeatures.add(":litert_npu_runtime_libraries:qualcomm_runtime_v79")
    ...
}
dependencies {
    // Strings for the NPU runtime libraries
    implementation(project(":litert_npu_runtime_libraries:runtime_strings"))
    ...
}
For a complete guide on configuring your app for Play for On-device AI, please refer to this tutorial.
Step 3: Inference on the NPU using the LiteRT Runtime API
LiteRT abstracts away the complexity of developing against specific SoC versions, letting you run your model on the NPU with just a few lines of code. It also provides a robust, built-in fallback mechanism: you can specify CPU, GPU, or both as options, and LiteRT will automatically use them if the NPU is unavailable. Conveniently, AOT compilation also supports fallback. It provides partial delegation on the NPU, where unsupported subgraphs seamlessly run on the CPU or GPU as specified.
// 1. Load the model and initialize the runtime.
// If the NPU is unavailable, inference falls back to the GPU.
val model =
    CompiledModel.create(
        context.assets,
        "model/mymodel.tflite",
        CompiledModel.Options(Accelerator.NPU, Accelerator.GPU)
    )

// 2. Pre-allocate input/output buffers.
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()

// 3. Fill the first input.
inputBuffers[0].writeFloat(...)

// 4. Invoke.
model.run(inputBuffers, outputBuffers)

// 5. Read the output.
val outputFloatArray = outputBuffers[0].readFloat()
Check out our image segmentation sample app to see how to use all of these features.
What's next
The new LiteRT Qualcomm AI Engine Direct (QNN) Accelerator is a major achievement for LiteRT, closing the gap between raw hardware potential and real-world application performance. We're incredibly excited to see what you build with this power.
We encourage you to explore our LiteRT DevSite and our LiteRT GitHub repository. Happy building!
Acknowledgements
Special thanks to the Google ODML team and the Qualcomm team for their significant contributions to this effort:
Google ODML team: Alice Zheng, Advait Jain, Andrew Zhang, Arian Arfaian, Chintan Parikh, Chunlei Niu, Cormac Brick, Gerardo Carranza, Gregory Karpiak, Jingjiang Li, Jing Jin, Julius Kammerl, Lu Wang, Luke Boyer, Marissa Ikonomidis, Maria Lyubimtseva, Matt Kreileder, Matthias Grundmann, Na Li, Ping Yu, Quentin Khan, Rishika Sinha, Sachin Kotwani, Sebastian Schmidt, Steven Toribio, Teng-Hui Zhu, Terry (Woncheol) Heoi, Vitalii Dziuba, Weiyi Wang, Yu-Hui Chen, Zichuan We
Qualcomm LiteRT team: Alen Huang, Bastiaan Aarts, Brett Taylor, Chun-Hsueh Lee (Jack), Chun-Po Chang (Jerry), Chun-Ting Lin (Graham), Felix Baum, Jiun-Kai Yang (Kelvin), Krishna Sridhar, Ming-Che Lin (Vincent), William Lin






