You can use ToolSimulator, an LLM-powered tool simulation framework inside Strands Evals, to thoroughly and safely test AI agents that depend on external tools, at scale. Instead of risking live API calls that expose personally identifiable information (PII) or trigger unintended actions, or settling for static mocks that break in multi-turn workflows, you can use ToolSimulator's large language model (LLM)-powered simulations to validate your agents. Available today as part of the Strands Evals Software Development Kit (SDK), ToolSimulator helps you catch integration bugs early, test edge cases comprehensively, and ship production-ready agents with confidence.
In this post, you'll learn how to register tools for simulation, steer simulated behavior with shared state and response schemas, and integrate ToolSimulator into Strands Evals evaluation pipelines.
Prerequisites
Before you begin, make sure that you have the following:
- Python 3.10 or later installed in your environment
- Strands Evals SDK installed:
pip install strands-evals
- Basic familiarity with Python, including decorators and type hints
- Familiarity with AI agents and tool-calling concepts (API calls, function schemas)
- Pydantic knowledge is helpful for the advanced schema examples, but isn't required to get started
- An AWS account isn't required to run ToolSimulator locally
Why tool testing challenges your development workflow
Modern AI agents don't just reason. They call APIs, query databases, invoke Model Context Protocol (MCP) servers, and interact with external systems to complete tasks. Your agent's behavior depends not only on its reasoning, but on what those tools return. When you test these agents against live APIs, you run into three challenges that slow you down and put your systems at risk.
Three challenges that live APIs create:
- External dependencies slow you down. Live APIs impose rate limits, experience downtime, and require network connectivity. When you're running hundreds of test cases, these constraints make comprehensive testing impractical.
- Test isolation becomes risky. Real tool calls trigger real side effects. You risk sending actual emails, modifying production databases, or booking actual flights during testing. Your agent tests shouldn't interact with the systems that they're testing against.
- Privacy and security create barriers. Many tools handle sensitive data, including user information, financial records, and PII. Running tests against live systems unnecessarily exposes that data and creates compliance risks.
Why static mocks fall short
You might consider static mocks as an alternative. Static mocks work for straightforward, predictable scenarios, but they require constant maintenance as your APIs evolve. More importantly, they break down in the multi-turn, stateful workflows that real agents perform.
Consider a flight booking agent. It searches for flights with one tool call, then checks booking status with another. The second response should depend on what the first call did. A hardcoded response can't reflect a database that changes state between calls. Static mocks can't capture this.
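To make the limitation concrete, here is a minimal, self-contained Python sketch (not ToolSimulator code; all names are illustrative). A static mock returns the same canned answer no matter what happened before, while even a tiny stateful fake correctly reflects an earlier write:

```python
def static_mock_booking_status(booking_id: str) -> dict:
    # A static mock: always the same canned response, regardless of prior calls.
    return {"booking_id": booking_id, "status": "not_found"}

def make_stateful_fake():
    # A stateful fake keeps a tiny in-memory "database" between calls.
    bookings: dict[str, str] = {}

    def book_flight(booking_id: str) -> dict:
        bookings[booking_id] = "confirmed"
        return {"booking_id": booking_id, "status": "confirmed"}

    def booking_status(booking_id: str) -> dict:
        return {"booking_id": booking_id,
                "status": bookings.get(booking_id, "not_found")}

    return book_flight, booking_status

book_flight, booking_status = make_stateful_fake()
book_flight("BK-1")
print(booking_status("BK-1")["status"])           # confirmed: reflects the write
print(static_mock_booking_status("BK-1")["status"])  # not_found: mock ignores it
```

The second tool call in a real workflow needs to see the effect of the first; a hardcoded response never can.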
What makes ToolSimulator different
ToolSimulator solves these challenges with three essential capabilities that work together to give you safe, scalable agent testing without sacrificing realism.
- Adaptive response generation. Tool outputs reflect what your agent actually requested, not a fixed template. When your agent calls to search for Seattle-to-New York flights, ToolSimulator returns plausible options with realistic prices and times, not a generic placeholder.
- Stateful workflow support. Many real-world tools maintain state across calls. A write operation should affect subsequent reads. ToolSimulator maintains consistent shared state across tool calls, making it safe to test database interactions, booking workflows, and multi-step processes without touching production systems.
- Schema enforcement. Developers typically add a post-processing layer that parses raw tool output into a structured format. When a tool returns a malformed response, this layer breaks. ToolSimulator validates responses against Pydantic schemas that you define, catching malformed responses before they reach your agent.
How ToolSimulator works
ToolSimulator intercepts calls to your registered tools and routes them to an LLM-based response generator. The generator uses the tool schema, your agent's input, and the current simulation state to produce a realistic, context-appropriate response. No handwritten fixtures required.
Your workflow follows three steps: decorate and register your tools, optionally steer the simulation with context, then let ToolSimulator mock the tool responses when your agent runs.
Figure 2: The three-step ToolSimulator (TS) workflow: Decorate & Register, Steer, Mock
Getting started with ToolSimulator
The following sections walk you through each step of the ToolSimulator workflow, from initial setup to running your first simulation.
Step 1: Decorate and register
Create a ToolSimulator instance, then wrap your tool function with the @tool_simulator.tool() decorator to register it for simulation. The real function body can stay empty. ToolSimulator intercepts calls before they reach the implementation:
from strands_evals.simulation.tool_simulator import ToolSimulator

tool_simulator = ToolSimulator()

@tool_simulator.tool()
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass  # The real implementation is never called during simulation
Step 2: Steer (optional configuration)
By default, ToolSimulator automatically infers how each tool should behave from its schema and docstring. No additional configuration is required to get started. When you need more control, you can use these three optional parameters to customize simulation behavior:
- share_state_id: Links tools that share the same backend under a common state key. State changes made by one tool (for example, a setter) are immediately visible to subsequent calls by another (for example, a getter).
- initial_state_description: Seeds the simulation with a natural language description of pre-existing state. Richer context produces more realistic and consistent responses.
- output_schema: A Pydantic model defining the expected response structure. ToolSimulator generates responses that conform strictly to this schema.
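Conceptually, share_state_id works like a registry that hands every linked tool the same state object. The following is a hypothetical, self-contained sketch of that idea in plain Python (the decorator and helper names here are illustrative, not ToolSimulator internals):

```python
# Shared registry keyed by state ID; tools registered under the same
# share_state_id read and write one shared dict.
state_registry: dict[str, dict] = {}

def register_tool(share_state_id: str):
    # Hypothetical decorator: inject the shared state dict into the tool.
    state = state_registry.setdefault(share_state_id, {})

    def decorator(fn):
        def wrapper(*args, **kwargs):
            return fn(state, *args, **kwargs)
        return wrapper

    return decorator

@register_tool(share_state_id="flight_booking")
def set_booking(state: dict, booking_id: str) -> dict:
    state[booking_id] = "confirmed"
    return {"booking_id": booking_id, "status": "confirmed"}

@register_tool(share_state_id="flight_booking")
def get_booking(state: dict, booking_id: str) -> dict:
    return {"booking_id": booking_id, "status": state.get(booking_id, "not_found")}

set_booking("BK-42")
print(get_booking("BK-42"))  # the getter sees the setter's write
```

Because both tools resolve to the same "flight_booking" state, a write by the setter is immediately visible to the getter, which is exactly the behavior a multi-turn booking workflow needs.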
Step 3: Mock
When your agent calls a registered tool, the ToolSimulator wrapper intercepts the call and routes it to the dynamic response generator. The generator validates the agent's parameters against the tool schema, produces a response that matches the output_schema, and updates the state registry so subsequent tool calls see a consistent world.
Figure 3: The ToolSimulator (TS) simulation flow when the agent calls a registered tool
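The interception pattern itself can be sketched in a few lines of plain Python. This is a simplified illustration, not ToolSimulator's implementation: a stub stands in for the LLM-backed generator, and the decorator name is hypothetical.

```python
import inspect

def generate_response(tool_name: str, params: dict) -> dict:
    # Stub generator: a real simulator would produce an LLM-generated,
    # schema-conformant response here.
    return {"tool": tool_name, "params": params, "status": "simulated"}

def simulated_tool(fn):
    signature = inspect.signature(fn)

    def wrapper(*args, **kwargs):
        # Validate the agent's parameters against the function signature;
        # bind() raises TypeError on missing or unexpected arguments.
        bound = signature.bind(*args, **kwargs)
        # The real implementation is never called; route to the generator.
        return generate_response(fn.__name__, dict(bound.arguments))

    return wrapper

@simulated_tool
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass

print(search_flights("SEA", "JFK", "2025-03-15"))
```

The key design point is that the decorated function body never runs: every call is checked against the declared signature and answered by the generator instead.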
The following example simulates a flight search tool attached to a flight search assistant:
from strands import Agent
from strands_evals.simulation.tool_simulator import ToolSimulator

# 1. Create a simulator instance
tool_simulator = ToolSimulator()

# 2. Register a tool for simulation with initial state context
@tool_simulator.tool(
    initial_state_description="Flight database: SEA->JFK flights available at 8am, 12pm, and 6pm. Prices range from $180 to $420.",
)
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass

# 3. Create an agent with the simulated tool and run it
flight_tool = tool_simulator.get_tool("search_flights")
agent = Agent(
    system_prompt="You are a flight search assistant.",
    tools=[flight_tool],
)
response = agent("Find me flights from Seattle to New York on March 15.")
print(response)
# Expected output: A structured list of simulated SEA->JFK flights with times
# and prices consistent with the initial_state_description you provided.
Advanced ToolSimulator usage
The following sections cover three advanced capabilities that give you more control over simulation behavior: running independent instances for parallel testing, configuring shared state for multi-turn workflows, and enforcing custom response schemas.
Run independent simulator instances
You can create multiple ToolSimulator instances side by side. Each instance maintains its own tool registry and state, so you can run parallel experiment configurations in the same codebase:
simulator_a = ToolSimulator()
simulator_b = ToolSimulator()
# Each instance has an independent tool registry and state --
# ideal for comparing agent behavior across different tool setups.
Configure shared state for multi-turn workflows
For stateful tools such as database getters and setters, ToolSimulator maintains consistent shared state across tool calls. Use share_state_id to link tools that operate on the same backend, and initial_state_description to seed the simulation with pre-existing context:
@tool_simulator.tool(
    share_state_id="flight_booking",
    initial_state_description="Flight booking system: SEA->JFK flights available at 8am, 12pm, and 6pm. No bookings currently active.",
)
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass

@tool_simulator.tool(
    share_state_id="flight_booking",
)
def get_booking_status(booking_id: str) -> dict:
    """Retrieve the current status of a flight booking by booking ID."""
    pass

# Both tools share "flight_booking" state.
# When search_flights is called, get_booking_status sees the same
# flight availability data in subsequent calls.
Inspect the state before and after agent execution to validate that tool interactions produced the expected changes:
initial_state = tool_simulator.get_state("flight_booking")
# ... run the agent ...
final_state = tool_simulator.get_state("flight_booking")
# Verify not just the final output, but the full sequence of tool interactions.
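One way to turn those two snapshots into assertions is a small diff helper. The sketch below uses plain dicts in place of the objects returned by get_state (whose exact shape may differ in the SDK), so it runs standalone; the values are illustrative.

```python
def diff_state(before: dict, after: dict) -> dict:
    """Return keys whose values changed between two state snapshots."""
    return {k: (before.get(k), after.get(k))
            for k in set(before) | set(after)
            if before.get(k) != after.get(k)}

# Hypothetical snapshots captured around an agent run that books one flight.
before = {"bookings": 0, "flights_available": 3}
after = {"bookings": 1, "flights_available": 3}

changes = diff_state(before, after)
print(changes)  # {'bookings': (0, 1)}
assert "bookings" in changes          # the booking was recorded
assert "flights_available" not in changes  # nothing else changed
```

Asserting on the diff, rather than the raw final state, makes the test fail loudly when a tool call mutates something it shouldn't have.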
Enforce a custom response schema
By default, ToolSimulator infers a response structure from the tool's docstring and type hints. For tools that follow strict specifications such as OpenAPI or MCP schemas, define the expected response as a Pydantic model and pass it using output_schema:
from pydantic import BaseModel, Field

class FlightSearchResponse(BaseModel):
    flights: list[dict] = Field(..., description="List of available flights with flight number, departure time, and price")
    origin: str = Field(..., description="Origin airport code")
    destination: str = Field(..., description="Destination airport code")
    status: str = Field(default="success", description="Search operation status")
    message: str = Field(default="", description="Additional status message")

@tool_simulator.tool(output_schema=FlightSearchResponse)
def search_flights(origin: str, destination: str, date: str) -> dict:
    """Search for available flights between two airports on a given date."""
    pass

# ToolSimulator validates parameters strictly and returns only valid JSON
# responses that conform to the FlightSearchResponse schema.
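The value of output_schema enforcement is that malformed responses fail fast, before they reach your agent's post-processing layer. The idea can be sketched with a plain-Python validator (standing in for what Pydantic does far more thoroughly); the field set and messages here are illustrative:

```python
# Required fields and their expected types for a flight search response.
REQUIRED_FIELDS = {"flights": list, "origin": str, "destination": str}

def validate_flight_response(response: dict) -> dict:
    """Reject responses with missing fields or wrong types."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in response:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(response[field], expected_type):
            raise ValueError(f"field {field} must be {expected_type.__name__}")
    return response

good = {"flights": [{"number": "DL123"}], "origin": "SEA", "destination": "JFK"}
validate_flight_response(good)  # passes unchanged

bad = {"flights": "DL123", "origin": "SEA"}  # wrong type, missing field
try:
    validate_flight_response(bad)
except ValueError as err:
    print(f"rejected: {err}")  # caught before it reaches the agent
```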
Integration with Strands Evals evaluation pipelines
ToolSimulator fits naturally into the Strands Evals evaluation framework. The following example shows a complete pipeline, from simulation setup to experiment report, using the GoalSuccessRateEvaluator to score agent performance on tool-calling tasks:
from typing import Any

from pydantic import BaseModel, Field
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import GoalSuccessRateEvaluator
from strands_evals.simulation.tool_simulator import ToolSimulator
from strands_evals.mappers import StrandsInMemorySessionMapper
from strands_evals.telemetry import StrandsEvalsTelemetry

# Set up telemetry and tool simulator
telemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()
memory_exporter = telemetry.in_memory_exporter
tool_simulator = ToolSimulator()

# Define the response schema
class FlightSearchResponse(BaseModel):
    flights: list[dict] = Field(..., description="Available flights with number, departure time, and price")
    origin: str = Field(..., description="Origin airport code")
    destination: str = Field(..., description="Destination airport code")
    status: str = Field(default="success", description="Search operation status")
    message: str = Field(default="", description="Additional status message")

# Register tools for simulation
@tool_simulator.tool(
    share_state_id="flight_booking",
    initial_state_description="Flight booking system: SEA->JFK flights at 8am, 12pm, and 6pm. No bookings currently active.",
    output_schema=FlightSearchResponse,
)
def search_flights(origin: str, destination: str, date: str) -> dict[str, Any]:
    """Search for available flights between two airports on a given date."""
    pass

@tool_simulator.tool(share_state_id="flight_booking")
def get_booking_status(booking_id: str) -> dict[str, Any]:
    """Retrieve the current status of a flight booking by booking ID."""
    pass

# Define the evaluation task
def user_task_function(case: Case) -> dict:
    initial_state = tool_simulator.get_state("flight_booking")
    print(f"[State before]: {initial_state.get('initial_state')}")

    search_tool = tool_simulator.get_tool("search_flights")
    status_tool = tool_simulator.get_tool("get_booking_status")
    agent = Agent(
        trace_attributes={"gen_ai.conversation.id": case.session_id, "session.id": case.session_id},
        system_prompt="You are a flight booking assistant.",
        tools=[search_tool, status_tool],
        callback_handler=None,
    )
    agent_response = agent(case.input)
    print(f"[User]: {case.input}")
    print(f"[Agent]: {agent_response}")

    final_state = tool_simulator.get_state("flight_booking")
    print(f"[State after]: {final_state.get('previous_calls', [])}")

    finished_spans = memory_exporter.get_finished_spans()
    mapper = StrandsInMemorySessionMapper()
    session = mapper.map_to_session(finished_spans, session_id=case.session_id)
    return {"output": str(agent_response), "trajectory": session}

# Define test cases, run the experiment, and display the report
test_cases = [
    Case(
        name="flight_search",
        input="Find me flights from Seattle to New York on March 15.",
        metadata={"category": "flight_booking"},
    ),
]
experiment = Experiment[str, str](
    cases=test_cases,
    evaluators=[GoalSuccessRateEvaluator()],
)
reports = experiment.run_evaluations(user_task_function)
reports[0].run_display()
The task function retrieves the simulated tools, creates an agent, runs the interaction, and returns both the agent's output and the full telemetry trajectory. The trajectory gives evaluators like GoalSuccessRateEvaluator access to the complete sequence of tool calls and model invocations, not just the final response.
Best practices for simulation-based evaluation
The following practices help you get the most out of ToolSimulator during development and evaluation workflows:
- Start with the default configuration for broad coverage. Add configuration overrides only for the specific tool environments that you want to control precisely. ToolSimulator's defaults are designed to produce realistic behavior without requiring setup.
- Provide rich initial_state_description values for stateful tools. The more context that you seed, the more realistic and consistent the simulated responses will be. Include data ranges, entity counts, and relationship context.
- Use share_state_id for tools that interact with the same backend, so write operations are visible to subsequent reads. This is essential for testing multi-turn workflows like booking, cart management, or database updates.
- Apply output_schema for tools that follow strict specifications, such as OpenAPI or MCP schemas. Schema enforcement catches malformed responses before they reach your agent and break your post-processing layer.
- Validate tool interaction sequences, not just final outputs. Compare state changes before and after agent execution to confirm that tool calls occurred in the right order and produced the right state transitions.
- Start small and grow. Begin with your most common tool interaction scenarios, then expand to edge cases as your evaluation practice matures. Supplement simulation-based testing with targeted live API tests for critical production paths.
Conclusion
ToolSimulator transforms how you test AI agents by replacing risky live API calls with intelligent, adaptive simulations. You can now safely validate complex, stateful workflows at scale, catching integration bugs early and shipping production-ready agents with confidence. Combining ToolSimulator with Strands Evals evaluation pipelines gives you full visibility into agent behavior without managing test infrastructure or risking real-world side effects.
Next steps
Start testing your AI agents safely today. Install ToolSimulator with the following command:
pip install strands-evals
To continue exploring ToolSimulator and Strands Evals, take these next steps:
- Read the Strands Evals documentation to explore all configuration options, including advanced state management and custom evaluators.
- Try the example to see ToolSimulator in action. Extend the example by adding more tools and testing multi-step agent workflows.
- Explore Amazon Bedrock for the LLM backend options that power ToolSimulator's response generation.
- Learn about AWS Lambda for serverless agent deployment strategies that pair well with ToolSimulator-based testing.
- Join the Strands community forums to ask questions, share your evaluation setups, and connect with other agent builders.
Share your feedback. We'd love to hear how you're using ToolSimulator. Share your feedback, report issues, and suggest features through the Strands Evals GitHub repository or community forums.