{"id":1565,"date":"2025-04-19T19:41:39","date_gmt":"2025-04-19T19:41:39","guid":{"rendered":"https:\/\/techtrendfeed.com\/?p=1565"},"modified":"2025-04-19T19:41:39","modified_gmt":"2025-04-19T19:41:39","slug":"load-testing-llms-utilizing-llmperf-in-direction-of-information-science","status":"publish","type":"post","link":"https:\/\/techtrendfeed.com\/?p=1565","title":{"rendered":"Load-Testing LLMs Utilizing LLMPerf | In direction of Information Science"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p class=\"wp-block-paragraph\"> Language Mannequin (LLM) just isn&#8217;t essentially the ultimate step in productionizing your Generative AI software. An usually forgotten, but essential a part of the MLOPs lifecycle is correctly <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/www.opentext.com\/what-is\/load-testing\">load testing<\/a> your LLM and guaranteeing it is able to stand up to your anticipated manufacturing site visitors. Load testing at a excessive degree is the observe of testing your software or on this case your mannequin with the site visitors it could expect in a manufacturing setting to make sure that it\u2019s performant.<\/p>\n<p class=\"wp-block-paragraph\">Previously we\u2019ve mentioned <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/towardsdatascience.com\/why-load-testing-is-essential-to-take-your-ml-app-to-production-faab0df1c4e1\/\">load testing conventional ML fashions<\/a> utilizing open supply Python instruments akin to <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/locust.io\/\">Locust<\/a>. Locust helps seize normal efficiency metrics akin to requests per second (RPS) and latency percentiles on a per request foundation. Whereas that is efficient with extra conventional APIs and ML fashions it doesn\u2019t seize the complete story for LLMs.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">LLMs historically have a a lot decrease RPS and better latency than conventional ML fashions on account of their measurement and bigger compute necessities. Typically the RPS metric does probably not present essentially the most correct image both as requests can enormously differ relying on the enter to the LLM. As an example you may need a question asking to summarize a big chunk of textual content and one other question that may require a one-word response.\u00a0<\/p>\n<p class=\"wp-block-paragraph\">For this reason <a rel=\"nofollow\" target=\"_blank\" href=\"https:\/\/learn.microsoft.com\/en-us\/dotnet\/ai\/conceptual\/understanding-tokens\">tokens<\/a> are seen as a way more correct illustration of an LLM\u2019s efficiency. At a excessive degree a token is a piece of textual content, at any time when an LLM is processing your enter it \u201ctokenizes\u201d the enter. A token differs relying particularly on the LLM you&#8217;re utilizing, however you&#8217;ll be able to think about it for example as a phrase, sequence of phrases, or characters in essence.<\/p>\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/image-58-1024x198.png\" alt=\"\" class=\"wp-image-601771\"\/><figcaption class=\"wp-element-caption\">Picture by Writer<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">What we\u2019ll do on this article is discover how we are able to generate token based mostly metrics so we are able to perceive how your LLM is acting from a serving\/deployment perspective. 
What we'll do in this article is explore how we can generate token-based metrics so we can understand how your LLM is performing from a serving/deployment perspective. After this article you'll have an idea of how you can set up a load-testing tool specifically to benchmark different LLMs, whether you are evaluating many models, different deployment configurations, or a combination of both.

Let's get hands on! If you are more of a video-based learner, feel free to follow my corresponding YouTube video: [2025 Guide to Load Testing LLMs | Claude Sonnet on Amazon Bedrock](https://www.youtube.com/embed/AbirlC9gLUE)

**NOTE**: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you are new to Amazon Bedrock, please refer to my starter guide [here](https://www.youtube.com/watch?v=8aMJUV0qhow&t=3s). If you want to learn more about SageMaker JumpStart LLM deployments, refer to the video [here](https://www.youtube.com/watch?v=c0ASHUm3BwA&t=636s).

**DISCLAIMER**: I am a Machine Learning Architect at AWS and my opinions are my own.

### Table of Contents

1. LLM-Specific Metrics
2. LLMPerf Intro
3. Applying LLMPerf to Amazon Bedrock
4. Additional Resources & Conclusion

## LLM-Specific Metrics

As we briefly discussed in the introduction with regard to LLM hosting, token-based metrics generally provide a much better representation of how your LLM responds to different payload sizes or types of queries (summarization vs. QnA).

Traditionally we have always tracked RPS and latency, which we will still see here, but more so at a token level. Here are some of the metrics to be aware of before we get started with load testing:
1. **Time to First Token**: The duration it takes for the first token to be generated. This is especially useful when streaming; for instance, when using ChatGPT we start processing information as soon as the first piece of text (token) appears.
2. **Total Output Tokens Per Second**: The total number of tokens generated per second. You can think of this as a more granular alternative to the requests per second we traditionally track.

These are the major metrics that we'll focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the parameters influencing these metrics include the expected input and output token sizes. We specifically play with these parameters to get an accurate understanding of how our LLM performs in response to different generation tasks.
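Before reaching for a full load-testing tool, it helps to see how these metrics can be measured by hand for a single request. The sketch below times a streaming call with LiteLLM (the library we'll configure properly in a later section). Note this is a rough approximation: it counts streamed chunks rather than true tokens, and it assumes your AWS credentials for Bedrock are already set up.

```python
import time
from litellm import completion

# Assumes AWS credentials for Bedrock are already configured
# (see the LiteLLM configuration section below).
start = time.time()
first_token_time = None
num_chunks = 0

response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Summarize the rules of tennis.", "role": "user"}],
    stream=True,
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_time is None:
            # Time to first token: delay until the first text arrives
            first_token_time = time.time() - start
        num_chunks += 1

total = time.time() - start
if first_token_time is not None:
    print(f"Time to first token: {first_token_time:.2f}s")
# Streamed chunks only approximate tokens; LLMPerf computes real token counts
print(f"~{num_chunks / max(total, 1e-9):.1f} chunks/s over {total:.2f}s total")
```

Doing this by hand for one request is instructive, but a proper benchmark needs distributed, concurrent traffic, which is exactly what LLMPerf automates.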
Now let's take a look at a tool that enables us to toggle these parameters and display the relevant metrics we need.

## LLMPerf Intro

LLMPerf is built on top of [Ray](https://github.com/ray-project/ray), a popular distributed computing Python framework. LLMPerf specifically leverages Ray to create distributed load tests where we can simulate real-time, production-level traffic.

Note that any load-testing tool is only going to be able to generate your expected amount of traffic if the client machine it runs on has enough compute power to match your expected load. For instance, as you scale the concurrency or throughput expected for your model, you'd also want to scale the client machine(s) where you are running your load test.

Now, specifically within [LLMPerf](https://github.com/ray-project/llmperf), there are a few exposed parameters that are tailored for LLM load testing, as we've discussed:

- **Model**: The model provider and the hosted model that you're working with. For our use case it'll be [Amazon Bedrock](https://aws.amazon.com/bedrock/) and [Claude 3 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) specifically.
- **LLM API**: The API format in which the payload should be structured. We use [LiteLLM](https://www.litellm.ai/), which provides a standardized payload structure across different model providers, simplifying the setup process for us, especially if we want to test different models hosted on different platforms.
- **Input Tokens**: The mean input token length; you can also specify a standard deviation for this number.
- **Output Tokens**: The mean output token length; you can also specify a standard deviation for this number.
- **Concurrent Requests**: The number of concurrent requests for the load test to simulate.
- **Test Duration**: You can control the duration of the test; this parameter is specified in seconds.

LLMPerf exposes all these parameters through its [token_benchmark_ray.py](https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py) script, which we configure with our specific values. Let's now look at how we can configure this specifically for Amazon Bedrock.

## Applying LLMPerf to Amazon Bedrock

### Setup

For this example we'll be working in a [SageMaker Classic Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) with a **conda_python3 kernel** and an **ml.g5.12xlarge** instance. Note that you want to select an instance that has enough compute to generate the traffic load that you want to simulate. Ensure that you also have your [AWS credentials](https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-files.html) set up for LLMPerf to access the hosted model, be it on Bedrock or SageMaker.
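The shell commands later in this article invoke the benchmark script from a local clone of the repository. As a sketch of a typical setup (follow the repository README for the authoritative steps), you can clone and install LLMPerf directly from the notebook:

```bash
%%sh
# Clone LLMPerf into the working directory and install it (along with its
# dependencies, including LiteLLM and Ray) into the notebook environment.
git clone https://github.com/ray-project/llmperf.git
pip install -e llmperf
```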
### LiteLLM Configuration

We first configure our LLM API structure of choice, which is LiteLLM in this case. LiteLLM supports various model providers; here we configure its [completion API](https://docs.litellm.ai/docs/completion) to work with Amazon Bedrock:

```python
import os
from litellm import completion

os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

response = completion(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
output = response.choices[0].message.content
print(output)
```

To work with Bedrock, we configure the model ID to point towards Claude 3 Sonnet and pass in our prompt. The neat part with LiteLLM is that the `messages` key has a consistent format across model providers.

Post-execution, we can focus on configuring LLMPerf for Bedrock specifically.

## LLMPerf Bedrock Integration

To execute a load test with LLMPerf we can simply use the provided [token_benchmark_ray.py](https://github.com/ray-project/llmperf/blob/main/token_benchmark_ray.py) script and pass in the parameters we discussed earlier:

- Input tokens mean & standard deviation
- Output tokens mean & standard deviation
- Max number of completed requests for the test
- Duration of the test
- Concurrent requests

In this case we also specify our API format to be LiteLLM, and we can execute the load test with a simple shell script like the following:

```bash
%%sh
python llmperf/token_benchmark_ray.py \
    --model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 200 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 200 \
    --max-num-completed-requests 30 \
    --num-concurrent-requests 1 \
    --timeout 300 \
    --llm-api litellm \
    --results-dir bedrock-outputs
```

In this case we keep the concurrency low, but feel free to toggle this number depending on what you're expecting in production. Our test will run for up to 300 seconds, and after it completes you should see an output directory with two files: one with statistics for each individual inference, and one with the mean metrics across all requests in the duration of the test.
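Once the run finishes, a quick sanity check is to list the results directory; the long, model-derived filenames you'll see are the ones referenced in the parsing snippet that follows:

```python
from pathlib import Path

# List the JSON result files LLMPerf wrote; the names are derived from the
# model ID and the mean input/output token settings.
for path in sorted(Path("bedrock-outputs").glob("*.json")):
    print(path.name)
```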
We can make this look a little neater by parsing the summary file with pandas:

```python
import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)

with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Build a DataFrame of per-request results and print the summary metrics
df = pd.DataFrame(individual_data)
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Completed Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate"),
}
print("Claude 3 Sonnet - Performance Summary:\n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")
```

The final load test results will look something like the following:

![Screenshot by Author](https://contributor.insightmediagroup.io/wp-content/uploads/2025/04/image-57.png)

As we can see, the output shows the input parameters we configured, followed by the corresponding results: time to first token (seconds) and throughput in terms of mean output tokens per second.
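The summary file only gives you means, but the per-request file captures the spread across requests. As a sketch, you can pull latency percentiles out of the per-request data; note that column names such as `ttft_s` and `end_to_end_latency_s` are assumptions based on recent LLMPerf output, so verify them against your own `individual_responses` file:

```python
import json
from pathlib import Path
import pandas as pd

# Re-load the per-request results (same file as in the previous snippet)
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
with open(individual_path, "r") as f:
    df = pd.DataFrame(json.load(f))

# Column names below are assumptions -- check df.columns for your version
for col in ["ttft_s", "end_to_end_latency_s"]:
    if col in df.columns:
        print(f"{col}: p50={df[col].quantile(0.5):.2f}s, p95={df[col].quantile(0.95):.2f}s")
```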
In a real-world use case you might use LLMPerf across many different model providers and run tests across those platforms. Used at scale, it can holistically help you identify the right model and deployment stack for your use case.

## Additional Resources & Conclusion

The entire code for this sample can be found in the associated [GitHub repository](https://github.com/RamVegiraju/load-testing-llms/blob/master/bedrock-claude-benchmark.ipynb). If you also want to work with SageMaker endpoints, you can find a Llama JumpStart deployment load-testing sample [here](https://github.com/RamVegiraju/load-testing-llms/blob/master/sagemaker-llama-benchmark.ipynb).

All in all, load testing and evaluation are both crucial to ensuring that your LLM is performant against your expected traffic before pushing to production. In future articles we'll cover not just the evaluation portion, but how we can create a holistic test with both components.

As always, thank you for reading and feel free to leave any feedback and connect with me on [LinkedIn](https://www.linkedin.com/in/ram-vegiraju-81272b162/) and [X](https://x.com/RamVegiraju).