Deploying and managing Llama 4 models involves several steps: navigating complex infrastructure setup, managing GPU availability, ensuring scalability, and handling ongoing operational overhead. What if you could skip these challenges and focus directly on building your applications? It's possible with Vertex AI.
We're thrilled to announce that Llama 4, the latest generation of Meta's open large language models, is now generally available (GA) as a fully managed API endpoint in Vertex AI! Alongside Llama 4, we're also announcing the general availability of the Llama 3.3 70B managed API in Vertex AI.
Llama 4 reaches new performance peaks compared to previous Llama models, with multimodal capabilities and a highly efficient Mixture-of-Experts (MoE) architecture. Llama 4 Scout is more powerful than all previous generations of Llama models while also delivering significant efficiency for multimodal tasks, and it is optimized to run in a single-GPU environment. Llama 4 Maverick is the most intelligent model option Meta offers today, designed for reasoning, complex image understanding, and demanding generative tasks.
With Llama 4 as a fully managed API endpoint, you can now leverage Llama 4's advanced reasoning, coding, and instruction-following capabilities with the ease, scalability, and reliability of Vertex AI to build more sophisticated and impactful AI-powered applications.
This post will guide you through getting started with Llama 4 as a Model-as-a-Service (MaaS), highlight the key benefits, show you how simple it is to use, and touch on cost considerations.
Discover Llama 4 MaaS in Vertex AI Model Garden
Vertex AI Model Garden is your central hub for discovering and deploying foundation models on Google Cloud through managed APIs. It offers a curated selection of Google's own models (like Gemini), open-source models, and third-party models, all accessible through simplified interfaces. The addition of Llama 4 (GA) as a managed service expands this selection, offering you more flexibility.
Accessing Llama 4 as a Model-as-a-Service (MaaS) on Vertex AI has the following advantages:
1. Zero infrastructure management: Google Cloud handles the underlying infrastructure, GPU provisioning, software dependencies, patching, and maintenance. You interact with a simple API endpoint.
2. Guaranteed performance with provisioned throughput: Reserve dedicated processing capacity for your models at a fixed rate, ensuring high availability and prioritized processing for your requests, even when the system is under heavy load.
3. Enterprise-grade security and compliance: Benefit from Google Cloud's robust security, data encryption, access controls, and compliance certifications.
Getting started with Llama 4 MaaS
Getting started with Llama 4 MaaS on Vertex AI only requires you to navigate to the Llama 4 model card within Vertex AI Model Garden and accept the Llama Community License Agreement; you cannot call the API without completing this step.
Once you have accepted the Llama Community License Agreement in the Model Garden, find the specific Llama 4 MaaS model you wish to use within Vertex AI Model Garden (e.g., "Llama 4 17B Instruct MaaS"). Take note of its unique Model ID (like meta/llama-4-scout-17b-16e-instruct-maas), as you will need this ID when calling the API.
You can then call the Llama 4 MaaS endpoint directly using the ChatCompletion API. There is no separate "deploy" step required for the MaaS offering; Google Cloud manages the endpoint provisioning. Below is an example of how to use Llama 4 Scout with the ChatCompletion API in Python.
import openai
from google.auth import default
import google.auth.transport.requests

# --- Configuration ---
PROJECT_ID = ""  # your Google Cloud project ID
LOCATION = "us-east5"
MODEL_ID = "meta/llama-4-scout-17b-16e-instruct-maas"

# Obtain an Application Default Credentials (ADC) token
credentials, _ = default()
auth_request = google.auth.transport.requests.Request()
credentials.refresh(auth_request)
gcp_token = credentials.token

# Construct the Vertex AI MaaS endpoint URL for the OpenAI library
vertex_ai_endpoint_url = (
    f"https://{LOCATION}-aiplatform.googleapis.com/v1beta1/"
    f"projects/{PROJECT_ID}/locations/{LOCATION}/endpoints/openapi"
)

# Initialize the client to use the ChatCompletion API pointing to Vertex AI MaaS
client = openai.OpenAI(
    base_url=vertex_ai_endpoint_url,
    api_key=gcp_token,  # Use the GCP token as the API key
)

# Example: Multimodal request (text + image from Cloud Storage)
prompt_text = "Describe this landmark and its significance."
image_gcs_uri = "gs://cloud-samples-data/vision/landmark/eiffel_tower.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": image_gcs_uri},
            },
            {"type": "text", "text": prompt_text},
        ],
    }
]

# Optional parameters (refer to the model card for specifics)
max_tokens_to_generate = 1024
request_temperature = 0.7
request_top_p = 1.0

# Call the ChatCompletion API
response = client.chat.completions.create(
    model=MODEL_ID,  # Specify the Llama 4 MaaS model ID
    messages=messages,
    max_tokens=max_tokens_to_generate,
    temperature=request_temperature,
    top_p=request_top_p,
    # stream=False  # Set to True for streaming responses
)

generated_text = response.choices[0].message.content
print(generated_text)
# The image contains...
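Setting stream=True in the request above returns tokens incrementally instead of a single response. A minimal sketch of consuming such a stream, following the OpenAI Python library's streaming interface (the collect_stream helper name is our own; the commented-out call assumes the client, MODEL_ID, and messages defined earlier):

```python
def collect_stream(stream):
    """Accumulate text deltas from an OpenAI-style ChatCompletion stream."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # role-only or final chunks may carry no text
            parts.append(delta)
    return "".join(parts)

# Usage (assumes `client`, MODEL_ID, and `messages` from the example above):
# stream = client.chat.completions.create(
#     model=MODEL_ID, messages=messages, max_tokens=1024, stream=True)
# print(collect_stream(stream))
```

Streaming is useful for chat-style UIs, where showing partial output as it arrives noticeably improves perceived latency.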
Important: Always consult the specific Llama 4 model card in Vertex AI Model Garden. It contains essential information about:
- The exact input/output schema expected by the model.
- Supported parameters (like temperature, top_p, max_tokens) and their valid ranges.
- Any special formatting requirements for prompts or multimodal inputs.
Cost and quota considerations
Using Llama 4 as Model-as-a-Service on Vertex AI operates on a predictable model combining pay-as-you-go pricing with usage quotas. Understanding both the pricing structure and your service quotas is essential for scaling your application and managing costs effectively when using Llama 4 MaaS on Vertex AI.
Regarding pricing, you pay only for the prediction requests you make. The underlying infrastructure, scaling, and management costs are built into the API usage price. Refer to the Vertex AI pricing page for details.
To ensure service stability and fair usage, your use of Llama 4 as Model-as-a-Service on Vertex AI is subject to quotas. These are limits on factors such as the number of requests per minute (RPM) your project can make to the specific model endpoint. Refer to our quota documentation for more details.
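When a project exceeds its RPM quota, requests are rejected with an HTTP 429 rate-limit error (surfaced as RateLimitError by the OpenAI Python library). A common pattern is to retry with exponential backoff and jitter; a minimal sketch (call_with_backoff is our own helper name, and call_model stands in for any of the API calls shown above):

```python
import random
import time

def call_with_backoff(call_model, max_retries=5, base_delay=1.0):
    """Retry a callable on rate-limit (429) errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return call_model()
        except Exception as exc:
            # Retry only on rate-limit errors; re-raise everything else.
            is_rate_limit = ("429" in str(exc)
                             or exc.__class__.__name__ == "RateLimitError")
            if not is_rate_limit:
                raise
            # Exponential backoff with a little jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"Still rate-limited after {max_retries} retries")

# Usage (assumes `client`, MODEL_ID, and `messages` from the earlier example):
# response = call_with_backoff(
#     lambda: client.chat.completions.create(model=MODEL_ID, messages=messages))
```

For sustained high-throughput workloads, provisioned throughput (described above) is the more robust option, since backoff only smooths over transient spikes.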
What's next
With Llama 4 now generally available as a Model-as-a-Service on Vertex AI, you can leverage one of the most advanced open LLMs without managing the required infrastructure.
We're excited to see what applications you'll build with Llama 4 on Vertex AI. Share your feedback and experiences through our Google Cloud community forum.