{"id":13968,"date":"2026-04-20T18:05:57","date_gmt":"2026-04-20T18:05:57","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=13968"},"modified":"2026-04-20T18:05:57","modified_gmt":"2026-04-20T18:05:57","slug":"toolsimulator-scalable-device-testing-for-ai-brokers","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=13968","title":{"rendered":"ToolSimulator: scalable tool testing for AI agents"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div id=\"\">\n<p>You can use<strong> ToolSimulator, <\/strong>an\u00a0LLM-powered tool simulation framework inside Strands Evals, to thoroughly and safely test AI agents that rely on external tools, at scale. Instead of risking live API calls that expose personally identifiable information (PII) or trigger unintended actions, or settling for static mocks that break in multi-turn workflows, you can use ToolSimulator\u2019s large language model (LLM)-powered simulations to validate your agents. Available today as part of the <a href=\"https:\/\/strandsagents.com\/docs\/user-guide\/evals-sdk\/quickstart\/\" target=\"_blank\" rel=\"nofollow noopener\">Strands Evals Software Development Kit (SDK)<\/a>, ToolSimulator helps you catch integration bugs early, test edge cases comprehensively, and ship production-ready agents with confidence.<\/p>\n<table class=\"styled-table\" style=\"height: 159px\" border=\"1px\" width=\"743\" cellpadding=\"10px\">\n<tbody>\n<tr>\n<td style=\"padding: 10px;border: 1px solid #dddddd\"><strong>In this post, you&#8217;ll learn how to:<\/strong><\/p>\n<ul>\n<li>Set up ToolSimulator and register tools for simulation<\/li>\n<li>Configure stateful tool simulations for multi-turn agent workflows<\/li>\n<li>Enforce response schemas with Pydantic models<\/li>\n<li>Integrate ToolSimulator into a complete Strands Evals evaluation pipeline<\/li>\n<li>Apply best practices for simulation-based agent evaluation<\/li>\n<\/ul>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Prerequisites<\/h2>\n<p>Before you begin, make sure that you have the following:<\/p>\n<ul>\n<li>Python 3.10 or later installed in your environment<\/li>\n<li>Strands Evals SDK installed: <code>pip install strands-evals<\/code><\/li>\n<li>Basic familiarity with Python, including decorators and type hints<\/li>\n<li>Familiarity with AI agents and tool-calling concepts (API calls, function schemas)<\/li>\n<li>Pydantic knowledge is helpful for the advanced schema examples, but isn&#8217;t required to get started<\/li>\n<li>An AWS account isn&#8217;t required to run ToolSimulator locally<\/li>\n<\/ul>\n<h2>Why tool testing challenges your development workflow<\/h2>\n<p>Modern AI agents don\u2019t just reason. They call APIs, query databases, invoke Model Context Protocol (MCP) servers, and interact with external systems to complete tasks. Your agent\u2019s behavior depends not only on its reasoning, but on what those tools return. When you test these agents against live APIs, you run into three challenges that slow you down and put your systems at risk.<\/p>\n<p>Three challenges that live APIs create:<\/p>\n<ul>\n<li><strong>External dependencies slow you down. <\/strong>Live APIs impose rate limits, experience downtime, and require network connectivity. When you\u2019re running hundreds of test cases, these constraints make comprehensive testing impractical.<\/li>\n<li><strong>Test isolation becomes risky. <\/strong>Real tool calls trigger real side effects. You risk sending actual emails, modifying production databases, or booking actual flights during testing. Your agent tests shouldn\u2019t interact with the systems that they\u2019re testing against.<\/li>\n<li><strong>Privacy and security create barriers. 
<\/strong>Many tools handle sensitive data, including user information, financial records, and PII. Running tests against live systems unnecessarily exposes that data and creates compliance risks.<\/li>\n<\/ul>\n<h2>Why static mocks fall short<\/h2>\n<p>You might consider static mocks as an alternative. Static mocks work for straightforward, predictable scenarios, but they require constant maintenance as your APIs evolve. More importantly, they break down in the multi-turn, stateful workflows that real agents perform.<\/p>\n<p>Consider a flight booking agent. It searches for flights with one tool call, then checks booking status with another. The second response should depend on what the first call did. A hardcoded response can\u2019t reflect a database that changes state between calls. Static mocks can\u2019t capture this.<\/p>\n<h2>What makes ToolSimulator different<\/h2>\n<p>ToolSimulator solves these challenges with three essential capabilities that work together to give you safe, scalable agent testing without sacrificing realism.<\/p>\n<ul>\n<li><strong>Adaptive response generation. <\/strong>Tool outputs reflect what your agent actually requested, not a fixed template. When your agent searches for Seattle-to-New York flights, ToolSimulator returns plausible options with realistic prices and times, not a generic placeholder.<\/li>\n<li><strong>Stateful workflow support. <\/strong>Many real-world tools maintain state across calls. A write operation should affect subsequent reads. ToolSimulator maintains consistent shared state across tool calls, making it safe to test database interactions, booking workflows, and multi-step processes without touching production systems.<\/li>\n<li><strong>Schema enforcement. <\/strong>Developers typically add a post-processing layer that parses raw tool output into a structured format. When a tool returns a malformed response, this layer breaks. ToolSimulator validates responses against Pydantic schemas that you define, catching malformed responses before they reach your agent.<\/li>\n<\/ul>\n<h2>How ToolSimulator works<\/h2>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-128784 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/04\/15\/ml-20730-image-1.png\" alt=\"ToolSimulator architecture diagram showing how tool calls are intercepted and routed to an LLM-based response generator\" width=\"1612\" height=\"1612\"\/>Figure 1: ToolSimulator (TS) intercepts tool calls and routes them to an LLM-based response generator<\/p>\n<p>ToolSimulator intercepts calls to your registered tools and routes them to an LLM-based response generator. The generator uses the tool schema, your agent\u2019s input, and the current simulation state to produce a realistic, context-appropriate response. No handwritten fixtures required.<\/p>\n<p>Your workflow follows three steps: decorate and register your tools, optionally steer the simulation with context, then let ToolSimulator mock the tool responses when your agent runs.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-128786 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/04\/15\/ml-20730-image-2.png\" alt=\"Process flow diagram showing the three-step ToolSimulator workflow: Decorate &amp; Register, Steer, and Mock, illustrating how tools are registered, configured, and provided to agents for simulation.\" width=\"2014\" height=\"1323\"\/>Figure 2: The three-step ToolSimulator (TS) workflow \u2014 Decorate &amp; Register, Steer, Mock<\/p>\n<h2>Getting started with ToolSimulator<\/h2>\n<p>The following sections walk you through each step of the ToolSimulator workflow, from initial setup to running your first simulation.<\/p>\n<h3>Step 1: Decorate and register<\/h3>\n<p>Create a ToolSimulator instance, then wrap your tool function with the <code>@tool_simulator.tool()<\/code> decorator to register it for simulation. The real function body can stay empty. ToolSimulator intercepts calls before they reach the implementation:<\/p>\n<pre><code class=\"lang-python\">from strands_evals.simulation.tool_simulator import ToolSimulator\n\ntool_simulator = ToolSimulator()\n\n@tool_simulator.tool()\ndef search_flights(origin: str, destination: str, date: str) -&gt; dict:\n    \"\"\"Search for available flights between two airports on a given date.\"\"\"\n    pass  # The real implementation is never called during simulation<\/code><\/pre>\n<h3>Step 2: Steer (optional configuration)<\/h3>\n<p>By default, ToolSimulator automatically infers how each tool should behave from its schema and docstring. No additional configuration is required to get started. 
When you need more control, you can use these three optional parameters to customize simulation behavior:<\/p>\n<ul>\n<li><code>share_state_id<\/code>: Links tools that share the same backend under a common state key. State changes made by one tool (for example, a setter) are immediately visible to subsequent calls by another (for example, a getter).<\/li>\n<li><code>initial_state_description<\/code>: Seeds the simulation with a natural language description of pre-existing state. Richer context produces more realistic and consistent responses.<\/li>\n<li><code>output_schema<\/code>: A Pydantic model defining the expected response structure. ToolSimulator generates responses that conform strictly to this schema.<\/li>\n<\/ul>\n<h3>Step 3: Mock<\/h3>\n<p>When your agent calls a registered tool, the ToolSimulator wrapper intercepts the call and routes it to the dynamic response generator. The generator validates the agent\u2019s parameters against the tool schema, produces a response that matches the <code>output_schema<\/code>, and updates the state registry so subsequent tool calls see a consistent world.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignnone wp-image-128785 size-full\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/04\/15\/ml-20730-image-3.png\" alt=\"Process flow diagram showing four sequential steps of ToolSimulator: Agent Calls Tool, Validate Parameters, Generate Response, and Update State, with arrows connecting each step and returning to agent.\" width=\"2596\" height=\"1023\"\/>Figure 3: The ToolSimulator (TS) simulation flow when the agent calls a registered tool<\/p>\n<p>The following example simulates a flight search tool attached to a flight search assistant:<\/p>\n<pre><code class=\"lang-python\">from strands import Agent\nfrom strands_evals.simulation.tool_simulator import ToolSimulator\n\n# 1. Create a simulator instance\ntool_simulator = ToolSimulator()\n\n# 2. Register a tool for simulation with initial state context\n@tool_simulator.tool(\n    initial_state_description=\"Flight database: SEA-&gt;JFK flights available at 8am, 12pm, and 6pm. Prices range from $180 to $420.\",\n)\ndef search_flights(origin: str, destination: str, date: str) -&gt; dict:\n    \"\"\"Search for available flights between two airports on a given date.\"\"\"\n    pass\n\n# 3. Create an agent with the simulated tool and run it\nflight_tool = tool_simulator.get_tool(\"search_flights\")\nagent = Agent(\n    system_prompt=\"You are a flight search assistant.\",\n    tools=[flight_tool],\n)\n\nresponse = agent(\"Find me flights from Seattle to New York on March 15.\")\nprint(response)\n# Expected output: a structured list of simulated SEA-&gt;JFK flights with times\n# and prices consistent with the initial_state_description you provided.<\/code><\/pre>\n<h2>Advanced ToolSimulator usage<\/h2>\n<p>The following sections cover three advanced capabilities that give you more control over simulation behavior: running independent instances for parallel testing, configuring shared state for multi-turn workflows, and enforcing custom response schemas.<\/p>\n<h3>Run independent simulator instances<\/h3>\n<p>You can create multiple ToolSimulator instances side by side. 
Each instance maintains its own tool registry and state, so you can run parallel experiment configurations in the same codebase:<\/p>\n<pre><code class=\"lang-python\">simulator_a = ToolSimulator()\nsimulator_b = ToolSimulator()\n# Each instance has an independent tool registry and state --\n# ideal for comparing agent behavior across different tool setups.<\/code><\/pre>\n<h3>Configure shared state for multi-turn workflows<\/h3>\n<p>For stateful tools such as database getters and setters, ToolSimulator maintains consistent shared state across tool calls. Use <code>share_state_id<\/code> to link tools that operate on the same backend, and <code>initial_state_description<\/code> to seed the simulation with pre-existing context:<\/p>\n<pre><code class=\"lang-python\">@tool_simulator.tool(\n    share_state_id=\"flight_booking\",\n    initial_state_description=\"Flight booking system: SEA-&gt;JFK flights available at 8am, 12pm, and 6pm. No bookings currently active.\",\n)\ndef search_flights(origin: str, destination: str, date: str) -&gt; dict:\n    \"\"\"Search for available flights between two airports on a given date.\"\"\"\n    pass\n\n@tool_simulator.tool(\n    share_state_id=\"flight_booking\",\n)\ndef get_booking_status(booking_id: str) -&gt; dict:\n    \"\"\"Retrieve the current status of a flight booking by booking ID.\"\"\"\n    pass\n\n# Both tools share \"flight_booking\" state.\n# When search_flights is called, get_booking_status sees the same\n# flight availability data in subsequent calls.<\/code><\/pre>\n<p>Inspect the state before and after agent execution to validate that tool interactions produced the expected changes:<\/p>\n<pre><code class=\"lang-python\">initial_state = tool_simulator.get_state(\"flight_booking\")\n# ... run the agent ...\nfinal_state = tool_simulator.get_state(\"flight_booking\")\n# Verify not just the final output, but the full sequence of tool interactions.<\/code><\/pre>\n<table class=\"styled-table\" border=\"1px\" cellpadding=\"10px\">\n<tbody>\n<tr>\n<td style=\"padding: 10px;border: 1px solid #dddddd\">\n<p><strong>Tip:\u00a0<\/strong><strong>Seeding state from real data<\/strong><\/p>\n<p>Because <code>initial_state_description<\/code> accepts natural language, you can get creative with how you seed context. For tools that interact with tabular data, use a <code>DataFrame.describe()<\/code> call to generate statistical summaries and pass those statistics directly as the state description. ToolSimulator will generate responses that reflect realistic data distributions, without ever accessing the actual data.<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Enforce a custom response schema<\/h3>\n<p>By default, ToolSimulator infers a response structure from the tool\u2019s docstring and type hints. 
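<\/p>\n<p>As a concrete illustration of the seeding tip above, the following sketch derives an <code>initial_state_description<\/code> string from a <code>DataFrame.describe()<\/code> summary. The <code>orders<\/code> table and its columns are hypothetical, and pandas is assumed to be installed; only the resulting string would be passed to ToolSimulator:<\/p>

```python
import pandas as pd

# Hypothetical table backing a tool; in practice, load your real data.
orders = pd.DataFrame({
    "price": [19.99, 5.49, 102.00, 33.25],
    "quantity": [1, 3, 2, 1],
})

# describe() yields count, mean, std, min, quartiles, and max per numeric column.
summary = orders.describe().to_string()
state_description = (
    "Orders table with columns price and quantity. "
    "Summary statistics:\n" + summary
)
# state_description can now be passed as initial_state_description when
# registering the tool -- the raw rows never leave your machine.
```

<p>Seeding from a summary like this keeps simulated responses statistically plausible without exposing individual records.<\/p>\n<p>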
For tools that follow strict specifications such as OpenAPI or MCP schemas, define the expected response as a Pydantic model and pass it using <code>output_schema<\/code>:<\/p>\n<pre><code class=\"lang-python\">from pydantic import BaseModel, Field\n\nclass FlightSearchResponse(BaseModel):\n    flights: list[dict] = Field(..., description=\"List of available flights with flight number, departure time, and price\")\n    origin: str = Field(..., description=\"Origin airport code\")\n    destination: str = Field(..., description=\"Destination airport code\")\n    status: str = Field(default=\"success\", description=\"Search operation status\")\n    message: str = Field(default=\"\", description=\"Additional status message\")\n\n@tool_simulator.tool(output_schema=FlightSearchResponse)\ndef search_flights(origin: str, destination: str, date: str) -&gt; dict:\n    \"\"\"Search for available flights between two airports on a given date.\"\"\"\n    pass\n\n# ToolSimulator validates parameters strictly and returns only valid JSON\n# responses that conform to the FlightSearchResponse schema.<\/code><\/pre>\n<h2>Integration with Strands evaluation pipelines<\/h2>\n<p>ToolSimulator fits naturally into the Strands Evals evaluation framework. The following example shows a complete pipeline, from simulation setup to experiment report, using the <code>GoalSuccessRateEvaluator<\/code> to score agent performance on tool-calling tasks:<\/p>\n<pre><code class=\"lang-python\">from typing import Any\nfrom pydantic import BaseModel, Field\nfrom strands import Agent\nfrom strands_evals import Case, Experiment\nfrom strands_evals.evaluators import GoalSuccessRateEvaluator\nfrom strands_evals.simulation.tool_simulator import ToolSimulator\nfrom strands_evals.mappers import StrandsInMemorySessionMapper\nfrom strands_evals.telemetry import StrandsEvalsTelemetry\n\n# Set up telemetry and the tool simulator\ntelemetry = StrandsEvalsTelemetry().setup_in_memory_exporter()\nmemory_exporter = telemetry.in_memory_exporter\ntool_simulator = ToolSimulator()\n\n# Define the response schema\nclass FlightSearchResponse(BaseModel):\n    flights: list[dict] = Field(..., description=\"Available flights with number, departure time, and price\")\n    origin: str = Field(..., description=\"Origin airport code\")\n    destination: str = Field(..., description=\"Destination airport code\")\n    status: str = Field(default=\"success\", description=\"Search operation status\")\n    message: str = Field(default=\"\", description=\"Additional status message\")\n\n# Register tools for simulation\n@tool_simulator.tool(\n    share_state_id=\"flight_booking\",\n    initial_state_description=\"Flight booking system: SEA-&gt;JFK flights at 8am, 12pm, and 6pm. No bookings currently active.\",\n    output_schema=FlightSearchResponse,\n)\ndef search_flights(origin: str, destination: str, date: str) -&gt; dict[str, Any]:\n    \"\"\"Search for available flights between two airports on a given date.\"\"\"\n    pass\n\n@tool_simulator.tool(share_state_id=\"flight_booking\")\ndef get_booking_status(booking_id: str) -&gt; dict[str, Any]:\n    \"\"\"Retrieve the current status of a flight booking by booking ID.\"\"\"\n    pass\n\n# Define the evaluation task\ndef user_task_function(case: Case) -&gt; dict:\n    initial_state = tool_simulator.get_state(\"flight_booking\")\n    print(f\"[State before]: {initial_state.get('initial_state')}\")\n\n    search_tool = tool_simulator.get_tool(\"search_flights\")\n    status_tool = tool_simulator.get_tool(\"get_booking_status\")\n    agent = Agent(\n        trace_attributes={\"gen_ai.conversation.id\": case.session_id, \"session.id\": case.session_id},\n        system_prompt=\"You are a flight booking assistant.\",\n        tools=[search_tool, status_tool],\n        callback_handler=None,\n    )\n\n    agent_response = agent(case.input)\n    print(f\"[User]: {case.input}\")\n    print(f\"[Agent]: {agent_response}\")\n\n    final_state = tool_simulator.get_state(\"flight_booking\")\n    print(f\"[State after]: {final_state.get('previous_calls', [])}\")\n\n    finished_spans = memory_exporter.get_finished_spans()\n    mapper = StrandsInMemorySessionMapper()\n    session = mapper.map_to_session(finished_spans, session_id=case.session_id)\n    return {\"output\": str(agent_response), \"trajectory\": session}\n\n# Define test cases, run the experiment, and display the report\ntest_cases = [\n    Case(\n        name=\"flight_search\",\n        input=\"Find me flights from Seattle to New York on March 15.\",\n        metadata={\"category\": \"flight_booking\"},\n    ),\n]\nexperiment = Experiment[str, str](\n    cases=test_cases,\n    evaluators=[GoalSuccessRateEvaluator()],\n)\n\nreports = experiment.run_evaluations(user_task_function)\nreports[0].run_display()<\/code><\/pre>\n<p>The task function retrieves the simulated tools, creates an agent, runs the interaction, and returns both the agent\u2019s output and the full telemetry trajectory. The trajectory gives evaluators like <code>GoalSuccessRateEvaluator<\/code> access to the complete sequence of tool calls and model invocations, not just the final response.<\/p>\n<h2>Best practices for simulation-based evaluation<\/h2>\n<p>The following practices help you get the most out of ToolSimulator across development and evaluation workflows:<\/p>\n<ol>\n<li><strong>Start with the default configuration for broad coverage.<\/strong> Add configuration overrides only for the specific tool environments that you want to control precisely. ToolSimulator\u2019s defaults are designed to produce realistic behavior without requiring setup.<\/li>\n<li><strong>Provide rich <\/strong><code>initial_state_description<\/code><strong> values for stateful tools.<\/strong> The more context that you seed, the more realistic and consistent the simulated responses will be. Include data ranges, entity counts, and relationship context.<\/li>\n<li><strong>Use <\/strong><code>share_state_id<\/code><strong> for tools that interact with the same backend,<\/strong> so write operations are visible to subsequent reads. This is essential for testing multi-turn workflows like booking, cart management, or database updates.<\/li>\n<li><strong>Apply <\/strong><code>output_schema<\/code><strong> for tools that follow strict specifications,<\/strong> such as OpenAPI or MCP schemas. 
Schema enforcement catches malformed responses before they reach your agent and break your post-processing layer.<\/li>\n<li><strong>Validate tool interaction sequences, not just final outputs.<\/strong> Inspect state changes before and after agent execution to confirm that tool calls occurred in the right order and produced the right state transitions.<\/li>\n<li><strong>Start small and expand.<\/strong> Begin with your most common tool interaction scenarios, then expand to edge cases as your evaluation practice matures. Supplement simulation-based testing with targeted live API tests for critical production paths.<\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>ToolSimulator transforms how you test AI agents by replacing risky live API calls with intelligent, adaptive simulations. You can now safely validate complex, stateful workflows at scale, catching integration bugs early and shipping production-ready agents with confidence. Combining ToolSimulator with Strands Evals evaluation pipelines gives you full visibility into agent behavior without managing test infrastructure or risking real-world side effects.<\/p>\n<h3>Next steps<\/h3>\n<p>Start testing your AI agents safely today. Install ToolSimulator with the following command:<\/p>\n<pre><code class=\"lang-bash\">pip install strands-evals<\/code><\/pre>\n<p>To continue exploring ToolSimulator and Strands Evals, take these next steps:<\/p>\n<ul>\n<li>Read the <a href=\"https:\/\/github.com\/strands-agents\/evals\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Strands Evals documentation<\/a> to explore all configuration options, including advanced state management and custom evaluators.<\/li>\n<li>Try the <a href=\"https:\/\/github.com\/strands-agents\/docs\/blob\/main\/docs\/examples\/evals-sdk\/tool_simulator.py\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">example<\/a> to see ToolSimulator in action. Extend the example by adding more tools and testing multi-step agent workflows.<\/li>\n<li>Explore <a href=\"https:\/\/aws.amazon.com\/bedrock\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">Amazon Bedrock<\/a> for the LLM backend options that power ToolSimulator\u2019s response generation.<\/li>\n<li>Learn about <a href=\"https:\/\/aws.amazon.com\/lambda\/\" target=\"_blank\" rel=\"nofollow noopener noreferrer\">AWS Lambda<\/a> for serverless agent deployment strategies that pair well with ToolSimulator-based testing.<\/li>\n<li>Join the Strands community forums to ask questions, share your evaluation setups, and connect with other agent developers.<\/li>\n<\/ul>\n<table class=\"styled-table\" border=\"1px\" cellpadding=\"10px\">\n<tbody>\n<tr>\n<td style=\"padding: 10px;border: 1px solid #dddddd\"><strong>Share your feedback. <\/strong>We\u2019d love to hear how you\u2019re using ToolSimulator. Share your feedback, report issues, and suggest features through the Strands Evals GitHub repository or community forums.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<hr\/>\n<h2>About The Authors<\/h2>\n<footer>\n<div class=\"blog-author-box\">\n<div class=\"blog-author-image\">\n          <img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-128270\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/04\/09\/1517408072307-copy.png\" alt=\"Darren Wang\" width=\"100\" height=\"100\"\/>\n         <\/div>\n<h3 class=\"lb-h4\">Darren Wang<\/h3>\n<p>Darren Wang is a Research Engineer at Amazon Web Services, where he bridges cutting-edge AI research and production systems. With a Ph.D. background in speech recognition and 5 years of experience in email anti-spam engineering, Darren transforms early-stage machine learning research into scalable, production-ready solutions that deliver measurable customer impact. Specializing in agent simulation and evaluation frameworks, he empowers developers to build more reliable, testable AI agents through robust testing infrastructure. Outside of work, he enjoys bouldering, playing violin, and anything related to cats.<\/p>\n<\/p><\/div>\n<div class=\"blog-author-box\">\n<div class=\"blog-author-image\">\n          <img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-128214\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/04\/09\/image-4-1-1.png\" alt=\"Xuan Qi\" width=\"100\" height=\"100\"\/>\n         <\/div>\n<h3 class=\"lb-h4\">Xuan Qi<\/h3>\n<p>Xuan Qi is an Applied Scientist at Amazon Web Services, where she applies her background in physics to tackle complex challenges in machine learning and artificial intelligence. 
Specializing in ML modeling and simulation, Xuan is passionate about translating scientific principles into practical applications that drive meaningful technological advancements. Her work focuses on creating more intuitive and efficient AI systems that can better understand and interact with the world. Outside of her professional pursuits, Xuan finds balance and creativity through dancing and playing the violin, bringing the precision and harmony of these arts into her scientific endeavors.<\/p>\n<\/p><\/div>\n<div class=\"blog-author-box\">\n<div class=\"blog-author-image\">\n          <img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-128283\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/04\/09\/smeetd-1.jpg\" alt=\"Smeet Dhakecha\" width=\"100\" height=\"133\"\/>\n         <\/div>\n<h3 class=\"lb-h4\">Smeet Dhakecha<\/h3>\n<p>Smeet Dhakecha is a Research Engineer at Amazon Web Services, working within the Agentic AI Science team. His work spans agent simulation and evaluation systems, as well as the design and deployment of data transformation pipelines that support fast-moving scientific research for model post-training and RL training.<\/p>\n<\/p><\/div>\n<div class=\"blog-author-box\">\n<div class=\"blog-author-image\">\n          <img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-128242\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/04\/09\/varannil-2-1.jpg\" alt=\"Vinayak Arannil\" width=\"100\" height=\"133\"\/>\n         <\/div>\n<h3 class=\"lb-h4\">Vinayak Arannil<\/h3>\n<p>Vinayak is a Sr. Applied Scientist at Amazon Web Services. With several years of experience, he has worked across various domains of AI, such as computer vision, natural language processing, and recommendation systems. Currently, Vinayak helps build new capabilities in AgentCore and Strands, enabling customers to evaluate their agentic applications with ease, accuracy, and efficiency.<\/p>\n<\/p><\/div>\n<\/footer>\n<p>       \n      <\/div>\n\n","protected":false},"excerpt":{"rendered":"<p>You can use ToolSimulator, an\u00a0LLM-powered tool simulation framework inside Strands Evals, to thoroughly and safely test AI agents that rely on external tools, at scale. Instead of risking live API calls that expose personally identifiable information (PII), trigger unintended actions, or settling for static mocks that break with multi-turn workflows, you can [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":13970,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55],"tags":[617,739,508,509,8748],"class_list":["post-13968","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-agents","tag-scalable","tag-testing","tag-tool","tag-toolsimulator"],"_links":{"self":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13968","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=13968"}],"version-history":[{"count":1,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13968\/revisions"}],"predecessor-version":[{"id":13969,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=\/wp\/v2\/posts\/13968\/revisions\/13969"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/techtrendfeed.co
m\/index.php?rest_route=\/wp\/v2\/media\/13970"}],"wp:attachment":[{"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=13968"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=13968"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/techtrendfeed.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=13968"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}