This began the best way these items often do \u2014 watching a podcast as a substitute of doing one thing productive (I ended up scripting this weblog, so possibly it was productive in spite of everything).<\/p>\n

What This Setup Provides You<\/h2>\n
You get two main methods to work together with Granite regionally:<\/p>\n
\n
An interactive CLI for fast prompts and experimentation.<\/li>\n
A neighborhood internet interface backed by an HTTP server.<\/li>\n<\/ul>\n
Each run totally offline. No accounts, no telemetry, no background calls to something you didn\u2019t ask for.<\/p>\n
The CLI is precisely what you\u2019d anticipate. It\u2019s quick, direct, and good for testing prompts or sanity-checking conduct. Kind a query, get a solution, transfer on.<\/p>\n
$\"Interacting$ <\/p>\n
The online interface is the place issues begin to get extra fascinating. By exposing the mannequin via a neighborhood HTTP server, you\u2019re not tied to a terminal. You get streaming responses, a browser-based chat UI, and the power to work together with the mannequin over easy HTTP requests.<\/p>\n
$\"Web$ <\/p>\n
As soon as it\u2019s reachable this manner, Granite stops being \u201ca chatbot in your telephone\u201d and begins behaving like a neighborhood service. Something that may converse REST and ship JSON can work together with it, together with scripts, different apps, and automation instruments like Tasker.<\/p>\n
That is the place the \u201cdevice, not a conversational novelty\u201d thought truly comes into play. You\u2019re not restricted to typing prompts right into a UI. You’ll be able to wire the mannequin into workflows, triggers, and background duties, all with out leaving the machine or counting on a community connection.<\/p>\n
The setup stays deliberately minimal. No UI frameworks, no wrappers, no try and make this appear like a shopper app. Only a native mannequin, a easy server, and interfaces that keep out of the best way. A device.<\/p>\n
By the top of the information, Granite isn\u2019t working as a demo. It\u2019s working as a neighborhood service.<\/p>\n
Structure (The Brief Model)<\/h2>\n
At its core, this setup may be very easy:<\/p>\n
\n
A transformer-based Granite 4.0-1B mannequin.<\/li>\n
Executed regionally utilizing llama.cpp.<\/li>\n
Working on an ARM64 Android machine through Termux.<\/li>\n<\/ul>\n
There\u2019s no acceleration layer hiding within the background. No GPU, no Vulkan, no NNAPI. All the pieces runs on the CPU.<\/p>\n
The mannequin itself is the usual transformer variant of Granite 4.0-1B. IBM additionally ships Granite 4.0-H fashions that use a hybrid structure with state house layers. These are designed for various runtimes and aren\u2019t suitable with llama.cpp.<\/p>\n
On high of the runtime, there are two execution paths:<\/p>\n
\n
`llama-cli<\/code> for direct, interactive use.<\/li>\n`
llama-server<\/code> for exposing the mannequin over HTTP.<\/li>\n<\/ul>\nEach binaries use the identical mannequin file and the identical execution backend. One mannequin, two interfaces.<\/p>\n Quantization is the place most sensible trade-offs lie. In brief, quantization reduces mannequin dimension by storing weights at decrease precision. This setup makes use of a Q5_K_M quantized mannequin that balances reminiscence utilization, pace, and reasoning high quality.<\/p>\n Stipulations<\/h2>\nThere are some things you want in place earlier than this works. None of them is uncommon, however lacking any of them will present up later in much less apparent methods.<\/p>\nAndroid<\/h3>\n\nAn ARM64 Android machine (I\u2019m utilizing a Galaxy S25 Extremely)<\/li>\n At the least 8\u00a0GB of RAM beneficial<\/li>\n Termux put in from F-Droid<\/li>\n<\/ul>\nThe Play Retailer model of Termux is outdated and lacking options required to construct native code reliably. Obtain and set up F-Droid, then seek for Termux and set up it.<\/p>\n PC (Mannequin Obtain Solely)<\/h3>\n\nPython 3.10 or newer.<\/li>\n A Hugging Face account with a learn token.<\/li>\n<\/ul>\nIn case you don\u2019t wish to use Python, you may as well obtain the mannequin immediately from Hugging Face and skip token setup fully.<\/p>\n Step 1: Set up Termux<\/h2>\nWith the conditions out of the best way,\u00a0it\u2019s time to arrange the setting on the telephone.<\/p>\n As soon as Termux is put in from F-Droid,\u00a0open it and run:<\/p>\n \n\n\npkg replace\npkg improve -y<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nThis updates the bottom packages and units up entry to shared storage,\u00a0which you\u2019ll want later to position the mannequin file someplace exterior Termux\u2019s non-public listing.<\/p>\n You\u2019ll be prompted to grant storage permissions.\u00a0Settle for them.\u00a0There\u2019s no workaround right here that\u2019s well worth the effort.<\/p>\n After this completes, it’s best to have a clear, up-to-date Termux setting able to construct native code.<\/p>\n Step 2: Set up Construct Instruments<\/h2>\nWith Termux arrange,\u00a0the subsequent step is putting in the instruments wanted to construct llama.cpp regionally.<\/p>\n \n\n\npkg set up -y git cmake clang make ninja\n<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nAs soon as set up finishes,\u00a0it\u2019s price checking that the fundamentals are literally out there:<\/p>\n \n\n\ngit --version\ncmake --version\nclang --version<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nIf any of those instructions fail, cease right here and repair that first. The construct step gained\u2019t succeed in any other case.<\/p>\n Step 3: Construct Llama.cpp<\/h2>\nWith the construct instruments put in,\u00a0it\u2019s time to compile llama.cpp on the machine.<\/p>\n Begin by cloning the repository and transferring into it:<\/p>\n \n\n\ncd ~\ngit clone https:\/\/github.com\/ggml-org\/llama.cpp\ncd llama.cpp<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nThen configure the construct utilizing CMake and Ninja:<\/p>\n \n\n\ncmake -S . -B construct -G Ninja\ncmake --build construct -j $(nproc)<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nThis builds llama.cpp utilizing all out there CPU cores.\u00a0On a contemporary telephone,\u00a0this takes no quite a lot of minutes.<\/p>\n As soon as the construct completes,\u00a0confirm that the binaries had been produced:<\/p>\n \n\n\nls construct\/bin | grep llama\n<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nIt’s best to see llama-cli<\/code> and llama-server<\/code> within the output. In case you don\u2019t see them, verify the construct output and see in the event you can repair no matter is lacking.<\/p>\n This construct makes use of the CPU backend solely. No GPU, no Vulkan, no NNAPI. Nothing else is required for this setup.<\/p>\n Step 4: Choose and Obtain the Granite Mannequin<\/h2>\nIBM supplies a number of pre-quantized variations of Granite 4.0-1B on Hugging Face.\u00a0All of them share the identical base mannequin however differ in how they retailer weights,\u00a0which immediately impacts dimension,\u00a0pace,\u00a0and conduct.<\/p>\n The fashions reside on this repository:<\/p>\n \n\n\nibm-granite\/granite-4.0-1b-GGUF<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nWhy GGUF<\/h3>\nllama.cpp doesn’t run fashions of their authentic coaching format. It expects weights in the GGUF format<\/strong>, a runtime-friendly format designed\u00a0for environment friendly native inference.<\/p>\n GGUF bundles the mannequin weights along with the metadata llama.cpp wants at runtime:\u00a0tensor layouts,\u00a0tokenizer info,\u00a0and mannequin parameters.\u00a0That\u2019s why these information may be loaded immediately with out further configuration.<\/p>\n IBM supplies Granite 4 Nano fashions which might be already transformed to GGUF, eliminating\u00a0a complete preparation step.\u00a0There\u2019s no must export,\u00a0quantize,\u00a0or in any other case preprocess the mannequin simply to get it working.<\/p>\n If you wish to,\u00a0you continue to can.<\/p>\n The unique Granite fashions may be transformed to GGUF manually utilizing llama.cpp\u2019s conversion instruments,\u00a0and you may select your individual quantization settings within the course of.\u00a0That\u2019s helpful in the event you\u2019re experimenting or focusing on very particular constraints.<\/p>\n For this setup,\u00a0there\u2019s no actual upside.\u00a0The offered GGUF information have already been examined and are able to run.\u00a0Utilizing them retains the concentrate on working the mannequin reasonably than\u00a0getting ready it.<\/p>\n Quantization Alternative<\/h3>\nYou\u2019ll see an extended checklist of information with names like Q2,\u00a0This fall,\u00a0Q5,\u00a0Q8,\u00a0and F16.\u00a0These consult with completely different quantization ranges.<\/p>\n At a excessive stage:<\/p>\n \nDecrease quantization means smaller information and quicker inference, however weaker reasoning.<\/li>\n Increased quantization ends in higher output high quality however increased reminiscence utilization and slower efficiency.<\/li>\n<\/ul>\nOn cellular,\u00a0this can be a balancing act.\u00a0Very small fashions reply rapidly however disintegrate when confronted with something past easy prompts.\u00a0Very giant ones work,\u00a0however supply diminishing returns and pointless reminiscence stress.<\/p>\n For this setup,\u00a0Q5_K_M is an effective center floor.\u00a0It\u2019s sufficiently small to run comfortably on a contemporary telephone,\u00a0however constant sufficient to deal with longer prompts and multi-step directions with out drifting.<\/p>\n That\u2019s the model used all through the remainder of this information.<\/p>\n Authentication and Obtain<\/h3>\nGranite fashions require authentication to obtain.<\/p>\n On this setup,\u00a0authentication is dealt with utilizing a Hugging Face learn token offered through an setting variable.\u00a0This avoids interactive logins and retains the method scriptable and reproducible.<\/p>\n Create a learn token through the Hugging Face internet UI, then export it in your PC:<\/p>\n \n\n\n$env:HUGGINGFACE_HUB_TOKEN=\"hf_...\"<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nWith the token set,\u00a0obtain the mannequin utilizing Python:<\/p>\n \n\n\npython -c \"from huggingface_hub import hf_hub_download;\nhf_hub_download(repo_id='ibm-granite\/granite-4.0-1b-GGUF',\nfilename=\"granite-4.0-1b-Q5_K_M.gguf\", local_dir=\"granite-4.0-1b-gguf\")\"<\/span><\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nIn case you don\u2019t wish to use Python or don\u2019t wish to swap gadgets, you may as well obtain the mannequin immediately from the Hugging Face web site and skip the token setup fully (you will want an account): https:\/\/huggingface.co\/ibm-granite\/granite-4.0-1b-GGUF<\/a>.<\/p>\n As soon as the file is downloaded, you\u2019re completed with the PC. The subsequent step is transferring the mannequin onto the telephone.<\/p>\n Step 5: Copy the Mannequin to Android<\/h2>\nAs soon as the mannequin file is downloaded,\u00a0it must be copied onto the telephone.<\/p>\n Place the file on the following location:<\/p>\n \n\n\n\/storage\/emulated\/0\/fashions\/granite-4.0-1b-Q5_K_M.gguf\n<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nOn Android, \/storage\/emulated\/0<\/code> is the bottom listing you see when opening your file supervisor. It\u2019s sometimes labelled as inner storage or telephone storage. Making a fashions<\/code> folder there retains issues easy and simple to search out.<\/p>\n The precise listing title doesn\u2019t matter a lot,\u00a0however holding fashions exterior Termux\u2019s dwelling listing makes them simpler to handle and reuse later.<\/p>\n After copying the file, confirm it from inside Termux:<\/p>\n \n\n\nls -lh \/storage\/emulated\/0\/fashions\/granite-4.0-1b-Q5_K_M.gguf<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nIt’s best to see the file listed at roughly 1.2 GB. If it\u2019s there, Termux can entry it, and also you\u2019re prepared to maneuver on.<\/p>\n Step 6: Handbook Validation Run<\/h2>\nEarlier than wiring something up or automating it,\u00a0it\u2019s price ensuring the mannequin truly runs.<\/p>\n From contained in the llama.cpp<\/code> listing, run the next command:<\/p>\n \n\n\n.\/construct\/bin\/llama-cli \n -m \/storage\/emulated\/0\/fashions\/granite-4.0-1b-Q5_K_M.gguf \n -t 8 \n -c 2048 \n --temp 0.7 \n --top-p 0.9 \n\u00a0 -p \"Clarify DNS in easy phrases.\"<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nOn a Galaxy S25 Extremely,\u00a0it’s best to see one thing within the ballpark of:<\/p>\n \nimmediate processing round ~45\u201350 tokens\/sec<\/li>\n technology pace round ~20\u201322 tokens\/sec<\/li>\n<\/ul>\n<\/p>\n At round 20 tokens per second, technology is already quicker than most individuals can learn.<\/p>\n The context dimension is ready to 2048 tokens as a steady default for cellular.\u00a0Bigger values improve reminiscence utilization and don\u2019t purchase you a lot for this type of setup.<\/p>\n In case you run into out-of-memory errors,\u00a0sudden course of termination,\u00a0or aggressive thermal throttling,\u00a0cut back the thread rely.<\/p>\n Cheap fallbacks are:<\/p>\n or, if wanted:<\/p>\n If this works, the laborious half is over (not that arduous, actually).<\/p>\n Step 7: Startup Script (Server + CLI)<\/h2>\nNow that the mannequin runs manually,\u00a0it\u2019s time to make it barely extra helpful.\u00a0An online browser tends to be extra user-friendly than a terminal session anyway.<\/p>\n The purpose right here is straightforward:<\/p>\n \nBegin the HTTP server within the background.<\/li>\n Drop straight into an interactive CLI session (for the actual techies amongst you).<\/li>\n<\/ul>\nCreate a startup script in your house listing:<\/p>\n \n\n\nnano ~\/granite-4.0-1b-start.sh<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nAdd the next:<\/p>\n \n\n\n#!\/information\/information\/com.termux\/information\/usr\/bin\/bash\n\nMODEL=\"\/storage\/emulated\/0\/fashions\/granite-4.0-1b-Q5_K_M.gguf\"\nBIN=\"$HOME\/llama.cpp\/construct\/bin\"\n\n$BIN\/llama-server \n -m \"$MODEL\" \n -t 8 \n -c 2048 \n --host 127.0.0.1 \n --port 8080 \n > ~\/granite-server.log 2>&1 &\n\nsleep 3\n\n$BIN\/llama-cli \n -m \"$MODEL\" \n -t 8 \n -c 2048 \n --temp 0.7 \n\u00a0 --top-p 0.9 <\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nMake the script executable:<\/p>\n \n\n\nchmod +x ~\/granite-4.0-1b-start.sh<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nRun it:<\/p>\n \n\n\n.\/granite-4.0-1b-start.sh\n<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nWhenever you exit the CLI, the HTTP server retains working.<\/p>\n Step 8: Net UI<\/h2>\nWith the server working, open a browser on the telephone and navigate to:<\/p>\n That\u2019s it.<\/p>\n You\u2019ll get a web-based chat interface backed by the native HTTP server. Prompts are despatched to the mannequin, responses stream again in actual time, and all the pieces stays on-device. It’s a bit slower than the CLI, however nonetheless very helpful.<\/p>\n <\/p>\n The interface retains issues easy, however it\u2019s not naked bones. You get correct chat conduct: dialog historical past is preserved, responses may be edited and regenerated, and you may work with a number of chats in parallel. In follow, it behaves very like the net interfaces persons are already used to, simply backed by a mannequin working regionally on the machine.<\/p>\n As a result of the server binds to127.0.0.1<\/code>,\u00a0it\u2019s solely accessible regionally.<\/p>\n At this level, you possibly can shut the terminal in the event you like. If the server course of continues to be working, the net UI will proceed to work.<\/p>\n Step 9: Auto-Begin on Termux Launch<\/h2>\nAt this level,\u00a0all the pieces works.\u00a0The final step is making it stick.<\/p>\n The purpose right here is straightforward:\u00a0once you open Termux,\u00a0Granite begins robotically.\u00a0No handbook instructions,\u00a0no remembering which script to run.\u00a0Prepared to make use of,\u00a0each time.<\/p>\n Edit your shell startup file:<\/p>\n Append the next:<\/p>\n \n\n\nif [ -z \"$GRANITE_STARTED\" ]; then\n export GRANITE_STARTED=1\n ~\/granite-4.0-1b-start.sh\nfi<\/code><\/pre>\n<\/p><\/div><\/div>\n<\/div>\nThis ensures the startup script runs as soon as per Termux session. The guard variable prevents unintended double begins, and shutting Termux cleanly shuts all the pieces down.<\/p>\n If Termux crashes or is force-stopped,\u00a0the guard resets, and Granite will begin once more the subsequent time you open it.<\/p>\n Stopping the Server<\/h3>\nIf you wish to cease the HTTP server with out closing Termux:<\/p>\n That\u2019s it. From right here on out, opening Termux is sufficient to carry Granite again on-line.<\/p>\n Notes<\/h2>\nJust a few sensible issues price holding in thoughts after setting this up:<\/p>\n \nGranite 4.0-H fashions use a hybrid structure with state house layers and usually are not suitable with llama.cpp. This setup solely applies to the transformer-based Granite 4 Nano fashions.<\/li>\n Q5_K_M works properly on trendy telephones. In case you run into stability points, decreasing the thread rely is often step one.<\/li>\n The CLI and HTTP server can run on the identical time. Exiting the CLI doesn’t have an effect on the server so long as the Termux session stays open.<\/li>\n As soon as the mannequin is downloaded, all the pieces runs totally offline. No community entry is required for inference.<\/li>\n The HTTP server is sure to localhost by default. Exposing it to the community is feasible, however deliberately not lined right here.<\/li>\n Efficiency, thermals, and battery affect range by machine. Newer telephones deal with this comfortably; older ones might have extra conservative settings.<\/li>\n This setup is just not optimized for background execution or for lengthy battery life. It\u2019s meant to be sensible, not invisible.<\/li>\n<\/ul>\nClosing<\/h2>\nAt this level, Granite is working regionally on the machine, begins robotically with Termux, and is accessible each interactively and over HTTP.<\/p>\n I\u2019ve stated this already, however that\u2019s what a closing is for, proper?<\/p>\n There\u2019s no cloud dependency, no account setup, and no particular runtime past what\u2019s proven above. As soon as the mannequin is in place, all the pieces else is simply course of administration.<\/p>\n It\u2019s not notably spectacular to have a look at. It\u2019s simply helpful.<\/p>\n Which is precisely what you need from a neighborhood mannequin.<\/p>\n Have enjoyable!<\/p>\n<\/div>\n\n","protected":false},"excerpt":{"rendered":" This began the best way these items often do \u2014 watching a podcast as a substitute of doing one thing productive (I ended up scripting this weblog, so possibly it was productive in spite of everything). I used to be listening to a Neuron AI episode about IBM\u2019s new Granite 4 mannequin household, with IBM […]<\/p>\n","protected":false},"author":2,"featured_media":11475,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[56],"tags":[7705,797,7704,7706,839],"class_list":["post-11473","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-software","tag-4-01b","tag-android","tag-granite","tag-locally","tag-running"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/11473","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=11473"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/11473\/revisions"}],"predecessor-version":[{"id":11474,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/11473\/revisions\/11474"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/media\/11475"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=11473"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=11473"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=11473"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}

Stipulations<\/h2>\n
There are some things you want in place earlier than this works. None of them is uncommon, however lacking any of them will present up later in much less apparent methods.<\/p>\n

What This Setup Provides You<\/h2>\nYou get two main methods to work together with Granite regionally:<\/p>\n\nAn interactive CLI for fast prompts and experimentation.<\/li>\n

Structure (The Brief Model)<\/h2>\nAt its core, this setup may be very easy:<\/p>\n

Stipulations<\/h2>\nThere are some things you want in place earlier than this works. None of them is uncommon, however lacking any of them will present up later in much less apparent methods.<\/p>\n

Stipulations<\/h2>\n
There are some things you want in place earlier than this works. None of them is uncommon, however lacking any of them will present up later in much less apparent methods.<\/p>\n