Running Granite 4.0-1B Locally on Android

February 4, 2026


This began the way these things usually do: listening to a podcast instead of doing something productive (I ended up writing this blog, so maybe it was productive after all).

I was listening to a Neuron AI episode about IBM’s new Granite 4 model family, with IBM Research’s David Cox as the guest. During the discussion of model sizes and deployment targets, they mentioned Granite 4 Nano, models designed specifically for edge and on-device use cases. At some point, the conversation turned to running these models on your phone.

Not as a hypothetical. Not as a demo. Just as a thing you can do.

That was enough.

Because once someone says, “You can run this on your phone,” in that context, the only reasonable response is to stop listening and try it yourself.

Granite 4 Nano isn’t pitched as a toy model. What makes it interesting is that it’s been designed to be small on purpose. That constraint shows up in how it behaves: more direct answers, less wandering, and a general sense that it’s meant to be used as a tool rather than a conversational novelty.

So that’s what this is. Granite 4.0-1B. Fully offline. Running locally on an Android phone. No cloud. No GPU. No vendor magic. Just a slightly unhealthy level of curiosity.

The result was surprisingly boring. Which is exactly what you want.

I’ve kept this deliberately step-based so it’s easy to reproduce without guessing or filling in gaps.

What This Setup Gives You

You get two main ways to interact with Granite locally:

  • An interactive CLI for quick prompts and experimentation.
  • A local web interface backed by an HTTP server.

Both run fully offline. No accounts, no telemetry, no background calls to anything you didn’t ask for.

The CLI is exactly what you’d expect. It’s fast, direct, and good for testing prompts or sanity-checking behavior. Type a question, get an answer, move on.

Interacting with Granite locally

The web interface is where things start to get more interesting. By exposing the model through a local HTTP server, you’re not tied to a terminal. You get streaming responses, a browser-based chat UI, and the ability to interact with the model over simple HTTP requests.

Web interface

Once it’s reachable this way, Granite stops being “a chatbot on your phone” and starts behaving like a local service. Anything that can speak REST and send JSON can interact with it, including scripts, other apps, and automation tools like Tasker.

This is where the “tool, not a conversational novelty” idea actually comes into play. You’re not limited to typing prompts into a UI. You can wire the model into workflows, triggers, and background tasks, all without leaving the device or relying on a network connection.
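To make that concrete, here’s a minimal sketch of what a scripted request looks like once the server from Step 7 is up. It assumes the OpenAI-compatible /v1/chat/completions endpoint that current llama-server builds expose:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Reply with OK if you can hear me."}]}'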

The setup stays deliberately minimal. No UI frameworks, no wrappers, no attempt to make this look like a consumer app. Just a local model, a simple server, and interfaces that stay out of the way. A tool.

By the end of this guide, Granite isn’t running as a demo. It’s running as a local service.

Architecture (The Short Version)

At its core, this setup is very simple:

  • A transformer-based Granite 4.0-1B model.
  • Executed locally using llama.cpp.
  • Running on an ARM64 Android device via Termux.

There’s no acceleration layer hiding in the background. No GPU, no Vulkan, no NNAPI. Everything runs on the CPU.

The model itself is the standard transformer variant of Granite 4.0-1B. IBM also ships Granite 4.0-H models, which use a hybrid architecture with state space layers. Those are designed for different runtimes and aren’t compatible with llama.cpp.

On top of the runtime, there are two execution paths:

  • llama-cli for direct, interactive use.
  • llama-server for exposing the model over HTTP.

Both binaries use the same model file and the same execution backend. One model, two interfaces.

Quantization is where most of the practical trade-offs lie. In short, quantization reduces model size by storing weights at lower precision. This setup uses a Q5_K_M quantized model, which balances memory usage, speed, and reasoning quality.

Prerequisites

There are a few things you need in place before this works. None of them is unusual, but missing any of them will show up later in less obvious ways.

Android

  • An ARM64 Android device (I’m using a Galaxy S25 Ultra)
  • At least 8 GB of RAM recommended
  • Termux installed from F-Droid

The Play Store version of Termux is outdated and missing features required to build native code reliably. Download and install F-Droid, then search for Termux and install it.

PC (Model Download Only)

  • Python 3.10 or newer.
  • A Hugging Face account with a read token.

If you don’t want to use Python, you can also download the model directly from Hugging Face and skip token setup entirely.

Step 1: Install Termux

With the prerequisites out of the way, it’s time to set up the environment on the phone.

Once Termux is installed from F-Droid, open it and run:

pkg update
pkg upgrade -y
termux-setup-storage

This updates the base packages and sets up access to shared storage, which you’ll need later to place the model file somewhere outside Termux’s private directory.

You’ll be prompted to grant storage permissions. Accept them. There’s no workaround here that’s worth the effort.

After this completes, you should have a clean, up-to-date Termux environment ready to build native code.

Step 2: Install Build Tools

With Termux set up, the next step is installing the tools needed to build llama.cpp locally.

pkg install -y git cmake clang make ninja

Once installation finishes, it’s worth checking that the basics are actually available:

git --version
cmake --version
clang --version

If any of these commands fail, stop here and fix that first. The build step won’t succeed otherwise.

Step 3: Build llama.cpp

With the build tools installed, it’s time to compile llama.cpp on the device.

Start by cloning the repository and moving into it:

cd ~
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Then configure the build using CMake and Ninja:

cmake -S . -B build -G Ninja
cmake --build build -j $(nproc)

This builds llama.cpp using all available CPU cores. On a modern phone, it takes no more than a few minutes.

Once the build completes, verify that the binaries were produced:

ls build/bin | grep llama

You should see llama-cli and llama-server in the output. If you don’t, check the build output and fix whatever is missing.

This build uses the CPU backend only. No GPU, no Vulkan, no NNAPI. Nothing else is required for this setup.

Step 4: Choose and Download the Granite Model

IBM provides several pre-quantized versions of Granite 4.0-1B on Hugging Face. They all share the same base model but differ in how they store weights, which directly affects size, speed, and behavior.

The models live in this repository:

ibm-granite/granite-4.0-1b-GGUF

Why GGUF

llama.cpp doesn’t run models in their original training format. It expects weights in GGUF, a runtime-friendly format designed for efficient local inference.

GGUF bundles the model weights together with the metadata llama.cpp needs at runtime: tensor layouts, tokenizer information, and model parameters. That’s why these files can be loaded directly without extra configuration.
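If you’re curious what that metadata looks like for a given file, the gguf Python package maintained in the llama.cpp repository ships a small dump tool; a quick sketch, assuming the gguf-dump entry point it installs:

pip install gguf
gguf-dump granite-4.0-1b-Q5_K_M.gguf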

IBM provides Granite 4 Nano models that are already converted to GGUF, which eliminates an entire preparation step. There’s no need to export, quantize, or otherwise preprocess the model just to get it running.

If you want to, you still can.

The original Granite models can be converted to GGUF manually using llama.cpp’s conversion tools, and you can choose your own quantization settings in the process. That’s useful if you’re experimenting or targeting very specific constraints.
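For reference, that path looks roughly like this. Treat it as a sketch: it assumes llama.cpp’s convert_hf_to_gguf.py script and llama-quantize binary (names have shifted between llama.cpp versions), and that the original Hugging Face model is cloned locally as ./granite-4.0-1b:

# convert the original weights to an F16 GGUF
python convert_hf_to_gguf.py ./granite-4.0-1b --outfile granite-4.0-1b-F16.gguf

# then quantize down to Q5_K_M
./build/bin/llama-quantize granite-4.0-1b-F16.gguf granite-4.0-1b-Q5_K_M.gguf Q5_K_M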

For this setup, there’s no real upside. The provided GGUF files have already been tested and are ready to run. Using them keeps the focus on running the model rather than preparing it.

Quantization Choice

You’ll see a long list of files with names like Q2, Q4, Q5, Q8, and F16. These refer to different quantization levels.

At a high level:

  • Lower quantization means smaller files and faster inference, but weaker reasoning.
  • Higher quantization gives better output quality, but higher memory usage and slower performance.

As a rough rule of thumb, file size ≈ parameter count × bits per weight ÷ 8, which is why a Q5 file comes in at roughly a third the size of an F16 one.

On mobile, this is a balancing act. Very small models respond quickly but fall apart when faced with anything beyond simple prompts. Very large ones work, but offer diminishing returns and unnecessary memory pressure.

For this setup, Q5_K_M is a good middle ground. It’s small enough to run comfortably on a modern phone, but consistent enough to handle longer prompts and multi-step instructions without drifting.

That’s the version used throughout the rest of this guide.

Authentication and Download

Granite models require authentication to download.

In this setup, authentication is handled with a Hugging Face read token provided via an environment variable. This avoids interactive logins and keeps the process scriptable and reproducible.

Create a read token in the Hugging Face web UI, then export it on your PC:

$env:HUGGINGFACE_HUB_TOKEN="hf_..."
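That’s PowerShell syntax; on Linux or macOS, the equivalent would be:

export HUGGINGFACE_HUB_TOKEN="hf_..."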

With the token set, download the model using Python:

python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='ibm-granite/granite-4.0-1b-GGUF', filename='granite-4.0-1b-Q5_K_M.gguf', local_dir='granite-4.0-1b-gguf')"
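If you’d rather skip inline Python, the huggingface-cli tool that ships with huggingface_hub should do the same job; a sketch, assuming it’s on your PATH:

huggingface-cli download ibm-granite/granite-4.0-1b-GGUF granite-4.0-1b-Q5_K_M.gguf --local-dir granite-4.0-1b-gguf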

If you don’t want to use Python or don’t want to switch devices, you can also download the model directly from the Hugging Face website and skip the token setup entirely (you will need an account): https://huggingface.co/ibm-granite/granite-4.0-1b-GGUF.

Once the file is downloaded, you’re done with the PC. The next step is moving the model onto the phone.

Step 5: Copy the Model to Android

Once the model file is downloaded, it needs to be copied onto the phone.

Place the file at the following location:

/storage/emulated/0/models/granite-4.0-1b-Q5_K_M.gguf

On Android, /storage/emulated/0 is the base directory you see when opening your file manager. It’s typically labelled internal storage or phone storage. Creating a models folder there keeps things simple and easy to find.

The exact directory name doesn’t matter much, but keeping models outside Termux’s home directory makes them easier to manage and reuse later.
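How you copy the file over is up to you; a USB connection or a file manager works fine. If you happen to have adb set up, something like this does the same job (assuming the local_dir from the download step):

adb push granite-4.0-1b-gguf/granite-4.0-1b-Q5_K_M.gguf /storage/emulated/0/models/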

After copying the file, verify it from inside Termux:

ls -lh /storage/emulated/0/models/granite-4.0-1b-Q5_K_M.gguf

You should see the file listed at roughly 1.2 GB. If it’s there, Termux can access it, and you’re ready to move on.

Step 6: Manual Validation Run

Before wiring anything up or automating it, it’s worth making sure the model actually runs.

From inside the llama.cpp directory, run the following command:

./build/bin/llama-cli \
  -m /storage/emulated/0/models/granite-4.0-1b-Q5_K_M.gguf \
  -t 8 \
  -c 2048 \
  --temp 0.7 \
  --top-p 0.9 \
  -p "Explain DNS in simple terms."

On a Galaxy S25 Ultra, you should see something in the ballpark of:

  • prompt processing at around 45–50 tokens/sec
  • generation at around 20–22 tokens/sec

20 tokens per second on a Galaxy S25 Ultra

At around 20 tokens per second, generation is already faster than most people can read.

The context size is set to 2048 tokens as a stable default for mobile. Larger values increase memory usage and don’t buy you much for this kind of setup.

If you run into out-of-memory errors, unexpected process termination, or aggressive thermal throttling, reduce the thread count.

Reasonable fallbacks are to step down from -t 8 to -t 6 or, if needed, -t 4.

If this works, the hard part is over (not that hard, really).

Step 7: Startup Script (Server + CLI)

Now that the model runs manually, it’s time to make it slightly more useful. A web browser tends to be more user-friendly than a terminal session anyway.

The goal here is simple:

  • Start the HTTP server in the background.
  • Drop straight into an interactive CLI session (for the real techies among you).

Create a startup script in your home directory:

nano ~/granite-4.0-1b-start.sh

Add the following:

#!/data/data/com.termux/files/usr/bin/bash

MODEL="/storage/emulated/0/models/granite-4.0-1b-Q5_K_M.gguf"
BIN="$HOME/llama.cpp/build/bin"

# Start the HTTP server in the background, logging to a file
$BIN/llama-server \
  -m "$MODEL" \
  -t 8 \
  -c 2048 \
  --host 127.0.0.1 \
  --port 8080 \
  > ~/granite-server.log 2>&1 &

# Give the server a moment to load the model
sleep 3

# Drop into an interactive CLI session in the foreground
$BIN/llama-cli \
  -m "$MODEL" \
  -t 8 \
  -c 2048 \
  --temp 0.7 \
  --top-p 0.9

Make the script executable:

chmod +x ~/granite-4.0-1b-start.sh

Run it:

./granite-4.0-1b-start.sh

When you exit the CLI, the HTTP server keeps running.
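If you want to double-check that it’s still alive, recent llama-server builds expose a simple health endpoint you can poke from the shell:

curl http://127.0.0.1:8080/health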

Step 8: Web UI

With the server running, open a browser on the phone and navigate to:

http://127.0.0.1:8080

That’s it.

You’ll get a web-based chat interface backed by the local HTTP server. Prompts are sent to the model, responses stream back in real time, and everything stays on-device. It’s a bit slower than the CLI, but still very usable.

Web-based chat interface backed by the local HTTP server

The interface keeps things simple, but it’s not bare bones. You get proper chat behavior: conversation history is preserved, responses can be edited and regenerated, and you can work with multiple chats in parallel. In practice, it behaves much like the web interfaces people are already used to, just backed by a model running locally on the device.

Because the server binds to 127.0.0.1, it’s only accessible locally.

At this point, you can close the terminal if you like. As long as the server process is still running, the web UI will continue to work.

Step 9: Auto-Start on Termux Launch

At this point, everything works. The last step is making it stick.

The goal here is simple: when you open Termux, Granite starts automatically. No manual commands, no remembering which script to run. Ready to use, every time.

Edit your shell startup file (with Termux’s default bash shell, that’s ~/.bashrc):

nano ~/.bashrc

Append the following:

if [ -z "$GRANITE_STARTED" ]; then
  export GRANITE_STARTED=1
  ~/granite-4.0-1b-start.sh
fi

This ensures the startup script runs once per Termux session. The guard variable prevents accidental double starts, and closing Termux cleanly shuts everything down.

If Termux crashes or is force-stopped, the guard resets, and Granite will start again the next time you open it.

Stopping the Server

If you want to stop the HTTP server without closing Termux, kill the background process by name:

pkill llama-server

That’s it. From here on out, opening Termux is enough to bring Granite back online.

Notes

A few practical things worth keeping in mind after setting this up:

  • Granite 4.0-H models use a hybrid architecture with state space layers and are not compatible with llama.cpp. This setup only applies to the transformer-based Granite 4 Nano models.
  • Q5_K_M works well on modern phones. If you run into stability issues, lowering the thread count is usually the first step.
  • The CLI and HTTP server can run at the same time. Exiting the CLI doesn’t affect the server as long as the Termux session stays open.
  • Once the model is downloaded, everything runs fully offline. No network access is required for inference.
  • The HTTP server is bound to localhost by default. Exposing it to the network is possible, but deliberately not covered here.
  • Performance, thermals, and battery impact vary by device. Newer phones handle this comfortably; older ones may need more conservative settings.
  • This setup is not optimized for background execution or long battery life. It’s meant to be practical, not invisible.

Closing

At this point, Granite is running locally on the device, starts automatically with Termux, and is accessible both interactively and over HTTP.

I’ve said this already, but that’s what a closing is for, right?

There’s no cloud dependency, no account setup, and no special runtime beyond what’s shown above. Once the model is in place, everything else is just process management.

It’s not particularly impressive to look at. It’s just useful.

Which is exactly what you want from a local model.

Have fun!
