Earlier this year, we shared that we’re bringing computer use capabilities to developers through the Gemini API. Today, we’re releasing the Gemini 2.5 Computer Use model, our new specialized model built on Gemini 2.5 Pro’s visual understanding and reasoning capabilities, which powers agents that can interact with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities through the Gemini API in Google AI Studio and Vertex AI.
While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example, filling in and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a crucial next step in building powerful, general-purpose agents.
How it works
The model’s core capabilities are exposed through the new `computer_use` tool in the Gemini API, and the tool should be operated inside a loop. Inputs to the tool are the user request, a screenshot of the environment, and a history of recent actions. The input can also specify whether to exclude functions from the full list of supported UI actions, or specify additional custom functions to include.
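
The loop described above can be sketched as follows. This is a minimal illustration of the control flow, not the official client library: `request_action`, `take_screenshot` and `execute_action` are hypothetical stand-ins for the real Gemini API call (with the `computer_use` tool configured) and for your own environment driver.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A UI action proposed by the model, e.g. a click or keystroke."""
    name: str                      # e.g. "click_at", "type_text_at", "done"
    args: dict = field(default_factory=dict)

def request_action(user_request: str, screenshot: bytes, history: list) -> Action:
    # Hypothetical stand-in for a Gemini API call with the computer_use
    # tool: a real call would send the user request, the latest screenshot
    # and the recent action history, and return the model's next action.
    if any(a.name == "type_text_at" for a in history):
        return Action("done")
    if any(a.name == "click_at" for a in history):
        return Action("type_text_at", {"x": 120, "y": 240, "text": "hello"})
    return Action("click_at", {"x": 120, "y": 240})

def take_screenshot() -> bytes:
    return b"<png bytes>"          # placeholder for a real screen capture

def execute_action(action: Action) -> None:
    pass                           # a real client would drive the browser/OS here

def run_agent(user_request: str, max_steps: int = 10) -> list:
    """Agent loop: screenshot -> model -> execute, until done or step limit."""
    history: list[Action] = []
    for _ in range(max_steps):
        action = request_action(user_request, take_screenshot(), history)
        if action.name == "done":
            break
        execute_action(action)
        history.append(action)     # recent actions feed the next model call
    return history

steps = run_agent("Fill in the form and submit it")
```

The step limit and the explicit history list mirror the loop structure the tool expects; in a production client, the screenshot capture and action execution would be backed by a browser automation layer such as Playwright.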