Stop hand-tuning kernels: How Neuron Agentic Development accelerates AWS Trainium optimizations

As frontier AI models grow in scale and complexity, developers face a common challenge across every hardware platform: how do you extract the maximum performance and efficiency from the silicon your models run on? Whether the goal is delivering real-time experiences for world models, supporting deeper reasoning in agentic workflows, or reducing inference costs at scale, the gap between what hardware can theoretically deliver and what most teams achieve remains significant. Custom kernel development has historically been the path to closing that gap, but it demands deep architectural expertise, manual profiling workflows, and iterative optimization cycles that few teams can afford.

This doesn’t need to be the case. What if every machine learning (ML) engineer could operate as a performance engineer, writing hardware-aware kernels, diagnosing bottlenecks, and shipping optimized models, without years of chip-level experience? What if developers already proficient on one architecture could ramp up on another in days instead of months?

Today, we’re announcing the Neuron Agentic Development capabilities: a collection of AI agents and skills that make this possible for developers building on AWS Trainium and AWS Inferentia. The first capabilities equip coding agents in Kiro and Claude to author, debug, and profile Neuron Kernel Interface (NKI) kernels, extending ML performance engineering to every developer on the team. Kernel developers coming from other architectures can ramp up quickly on Trainium, teams can shorten the time from idea to hardware-optimized implementation, and the deep architectural knowledge that once gatekept kernel development is now accessible through agentic tooling that guides developers at every step.

In this post, we explain how the Neuron Agentic Development capabilities accelerate the kernel development workflow.

The Neuron Agentic Development skills

The Neuron Agentic Development package provides five specialized skills that follow the natural kernel development pipeline: write → debug → profile → analyze. You can invoke skills individually for targeted tasks, or chain them together with the neuron-nki-agent, which auto-selects the right workflow based on your request. To use them, add the skills to your agentic IDE’s skills directory. For example, in an IDE such as VS Code, Cursor, or Kiro, add the skills to the .kiro/skills or .claude/skills directory and make them available to your agents. Skills must run on a Trainium-based Amazon Elastic Compute Cloud (Amazon EC2) instance.

Kernel authoring

The neuron-nki-writing skill is your starting point for creating NKI kernels. It translates PyTorch, NumPy, or natural language descriptions into correct NKI code. For example, it covers tiling strategies that respect hardware constraints (such as the 128 partition dimension and the 512/4096 PSUM free dimension), memory access patterns, compute operations with explicit dst parameters, and efficiency guidelines for DMA sizing and SBUF reuse. The skill classifies your task by complexity and loads only the references needed.
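
To make the tiling constraint concrete, here is a minimal sketch in plain Python (not NKI) of how a 2-D tensor might be split into tiles that stay within the limits quoted above. The helper name plan_tiles is hypothetical, and using 512 as the free-dimension cap is an assumption for illustration; a real kernel would express the same loop bounds with NKI constructs.

```python
from typing import Iterator, Tuple

P_MAX = 128  # partition-dimension limit quoted in the text
F_MAX = 512  # assumed free-dimension cap (the text quotes 512/4096 for PSUM)


def plan_tiles(
    rows: int, cols: int, p_max: int = P_MAX, f_max: int = F_MAX
) -> Iterator[Tuple[slice, slice]]:
    """Yield (row_slice, col_slice) pairs that together cover a rows x cols tensor.

    Edge tiles are clipped, so no slice extends past the tensor bounds.
    """
    for r0 in range(0, rows, p_max):
        for c0 in range(0, cols, f_max):
            yield slice(r0, min(r0 + p_max, rows)), slice(c0, min(c0 + f_max, cols))


# A 300 x 768 tensor needs 3 row tiles and 2 column tiles.
tiles = list(plan_tiles(300, 768))
print(len(tiles))   # 6
print(tiles[-1])    # the clipped edge tile: rows 256..300, cols 512..768
```

The clipped final tile is the case that most often needs explicit masking or bounds handling in a hand-written kernel.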

Debugging

The neuron-nki-debugging skill provides a systematic workflow for resolving NKI compilation and execution errors on Trainium and Inferentia hardware. For example, it covers environment setup with the correct --target flags, compiler error resolution with a categorized index of all 28 NCC error codes, and numerical validation against CPU-computed references.

Profiling and analysis

The neuron-nki-profiling skill captures execution profiles on hardware. It configures runtime inspection environment variables, runs the kernel, identifies the correct Neuron Executable File Format (NEFF) file, and captures the trace with neuron-explorer, including DGE (DMA Graph Engine) notifications for DMA-level detail. It extracts JSON metrics and produces the NEFF files that neuron-nki-profile-querying consumes.

The neuron-nki-profile-querying skill ingests NEFF and NTFF files and runs SQL queries to compute performance bounds, identify bottleneck engines, and localize inefficiencies to specific NKI source lines. It supports three analysis approaches: the neuron-explorer API server, DuckDB directly on parquet, or pandas for custom computation.
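
As a rough illustration of what "run SQL to identify the bottleneck engine" can look like, the sketch below aggregates busy time per engine. Everything specific here is an assumption: the instructions table, its columns, the engine names, and the timings are invented for the example, and Python's built-in sqlite3 stands in for DuckDB so the snippet runs without extra dependencies. The real profile schema produced by neuron-explorer may differ.

```python
import sqlite3

# Toy data with a hypothetical schema: (engine, start_ns, end_ns) per instruction.
ROWS = [
    ("tensor", 0, 400),
    ("tensor", 900, 1300),
    ("dma", 0, 150),
    ("dma", 400, 550),
    ("vector", 1300, 1400),
]

QUERY = """
    SELECT engine,
           SUM(end_ns - start_ns)      AS busy_ns,
           MAX(end_ns) - MIN(start_ns) AS span_ns
    FROM instructions
    GROUP BY engine
    ORDER BY busy_ns DESC
"""


def busiest_engines(rows):
    """Return (engine, busy_ns, span_ns) tuples, busiest engine first."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE instructions (engine TEXT, start_ns INTEGER, end_ns INTEGER)"
    )
    con.executemany("INSERT INTO instructions VALUES (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for engine, busy_ns, span_ns in busiest_engines(ROWS):
    # Idle time is the part of the engine's active span with no instruction running.
    print(f"{engine}: busy={busy_ns} ns, idle={span_ns - busy_ns} ns")
```

The same GROUP BY shape carries over to a DuckDB query on parquet files once the actual column names are known.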

Documentation

The neuron-nki-docs skill is used throughout development. During authoring, it provides API signatures and tutorials. During debugging, it explains error codes. During profiling, it clarifies hardware architecture details. Ask about a specific nisa.* or nl.* API, look up error codes, find tutorials, or browse architecture guides for Trainium 1, 2, and 3.

The agents

While skills provide building blocks for individual tasks, agents combine multiple skills into autonomous workflows. Each agent is a specialized persona that handles multi-step development scenarios end-to-end.

The neuron-nki-agent is the unified entry point for NKI development. It automatically selects the right workflow based on your request (writing, debugging, profiling, or documentation lookup) and orchestrates the appropriate skills. This is the default starting point.
The neuron-nki-writing-agent focuses exclusively on kernel authoring. It translates PyTorch, NumPy, or natural language descriptions into NKI code and handles modifications to existing kernels.
The neuron-nki-debugging-agent autonomously resolves compiler errors by analyzing the error, searching documentation for fixes, and applying corrections. It tracks iterations (up to 10) and progressively simplifies when stuck.
The neuron-nki-docs-agent is a lightweight documentation navigator for API signatures, error code explanations, tutorials, and architecture details.
The neuron-nki-profile-analysis-agent runs two separate skills to identify performance bottlenecks. It uses the neuron-nki-profiling skill to capture execution profiles on hardware: it sets environment variables, runs the kernel, identifies NEFFs, and runs neuron-explorer capture to produce profile parquet files. It then uses the neuron-nki-profile-querying skill to run SQL queries against those parquet files to compute performance bounds, identify bottleneck engines, and localize inefficiencies to specific NKI source lines.

Putting it into practice: Optimizing a custom softmax kernel

The following walkthrough shows how these agentic capabilities work together in practice. You explore two kernels: a softmax kernel (Steps 1 and 2) and a SwiGLU kernel (Step 3), which demonstrates profiling on a real-world workload.

Suppose you have a PyTorch softmax operation that’s a bottleneck in your inference pipeline, and you want to write a custom NKI kernel to fuse it with a preceding scale operation.

Step 0: Set up your instance and environment

To get up and running:

Launch a trn2.3xlarge instance through AWS MLCBs using the AWS Neuron Deep Learning AMI (DLAMI). São Paulo (sa-east-1) and Melbourne (ap-southeast-4) are used as example AWS Regions here. See the full Trainium availability list for other supported Regions.
Connect to the instance using SSH.

Install Kiro:

curl -fsSL https://cli.kiro.dev/install | bash

Install the Neuron Agentic Development skills by following the instructions at the neuron-agentic-development repository.

Note: trn2.3xlarge instances incur hourly charges while running. Remember to terminate the instance when you finish this walkthrough to avoid ongoing costs.

For more detailed instance setup and configuration instructions, see the Neuron DLAMI Setup Guide.

From the remote terminal, verify that the Neuron devices are visible:

# Confirm Neuron devices are visible
neuron-ls

# Confirm neuron-explorer is available
which neuron-explorer && neuron-explorer --version

The DLAMI comes with a pre-installed virtual environment at:

/opt/aws_neuronx_venv_pytorch_2_9

Activate it with:

source /opt/aws_neuronx_venv_pytorch_2_9/bin/activate

With the environment set up, you can start developing kernels by running:

kiro-cli --agent neuron-nki-agent

Step 1: Write the kernel

In the interactive Kiro CLI session, enter the following prompt: “Write an NKI kernel that computes scaled softmax: softmax(x * scale) along the last dimension, for input shape [batch, seq_len, hidden_dim] in bfloat16.”

The agent produces a complete three-pass kernel (row max, sum-of-exp, normalize) using nisa.activation(np.exp, ...) for hardware-accelerated exp, float32 accumulation for numerical stability, and proper tiling across the free dimension. It explains its design decisions: one program instance per row, P_MAX=128 (matching the 128-partition hardware limit), F_MAX=2048 (matching the 2048-element free dimension limit on Trainium), and a bfloat16 output cast.

Figure 1: NKI agent authoring a kernel.
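
For readers who want to see the three-pass structure without NKI in the way, here is a minimal reference sketch in plain Python. It is not the kernel the agent generated: it uses Python floats (64-bit) rather than float32 accumulation, and the function name scaled_softmax_row is made up for this example. It does mirror the same three passes and the same tiling of the free dimension in chunks of F_MAX.

```python
import math
from typing import List

F_MAX = 2048  # free-dimension tile width quoted in the text


def scaled_softmax_row(row: List[float], scale: float, f_max: int = F_MAX) -> List[float]:
    """Compute softmax(row * scale) in three tiled passes over one row."""
    starts = range(0, len(row), f_max)

    # Pass 1: running maximum of the scaled values, one tile at a time.
    row_max = -math.inf
    for s in starts:
        row_max = max(row_max, max(x * scale for x in row[s:s + f_max]))

    # Pass 2: sum of exponentials, shifted by the row max for stability.
    total = 0.0
    for s in starts:
        total += sum(math.exp(x * scale - row_max) for x in row[s:s + f_max])

    # Pass 3: normalize each tile by the accumulated sum.
    out: List[float] = []
    for s in starts:
        out.extend(math.exp(x * scale - row_max) / total for x in row[s:s + f_max])
    return out


probs = scaled_softmax_row([1.0, 2.0, 3.0], scale=1.0)
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
```

Subtracting the row max before exponentiating keeps every exponent at or below zero, which is what makes the reduced-precision version on hardware numerically safe.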

Step 2: Debug on hardware

Ask the agent to run the kernel and verify numerical parity against a PyTorch reference.

The agent hits an immediate snag: nisa.tensor_tensor doesn’t auto-broadcast reduction results, so the per-row max and sum values can’t be directly applied across the full hidden dimension. The agent consults the NKI reference patterns, identifies the correct broadcast mechanism (stride-0 access views through .ap()), and rewrites the kernel accordingly.
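
The idea behind a stride-0 view can be shown without any hardware. The sketch below is a conceptual model in plain Python, not the NKI .ap() API: the StridedView class and its (stride, size) pattern format are assumptions made for illustration. With a column stride of 0, every column index maps to the same stored element, so one per-row value appears to span the whole hidden dimension without being copied.

```python
from typing import List, Sequence, Tuple


class StridedView:
    """Read-only 2-D view over a flat buffer, addressed by (stride, size) pairs."""

    def __init__(self, buffer: Sequence[float], offset: int,
                 pattern: List[Tuple[int, int]]):
        self.buffer = buffer
        self.offset = offset
        self.pattern = pattern  # [(row_stride, rows), (col_stride, cols)]

    def __getitem__(self, index: Tuple[int, int]) -> float:
        i, j = index
        (row_stride, rows), (col_stride, cols) = self.pattern
        if not (0 <= i < rows and 0 <= j < cols):
            raise IndexError(index)
        return self.buffer[self.offset + i * row_stride + j * col_stride]

    def to_rows(self) -> List[List[float]]:
        (_, rows), (_, cols) = self.pattern
        return [[self[i, j] for j in range(cols)] for i in range(rows)]


# One reduction result per row (for example, the row max).
row_max = [3.0, 7.0]

# Column stride 0: a 2 x 4 view backed by only two stored values.
broadcast = StridedView(row_max, offset=0, pattern=[(1, 2), (0, 4)])
print(broadcast.to_rows())  # [[3.0, 3.0, 3.0, 3.0], [7.0, 7.0, 7.0, 7.0]]

# The elementwise operation now sees operands of matching shape.
x = [[1.0, 2.0, 3.0, 0.0], [7.0, 5.0, 6.0, 4.0]]
shifted = [[x[i][j] - broadcast[i, j] for j in range(4)] for i in range(2)]
print(shifted)  # [[-2.0, -1.0, 0.0, -3.0], [0.0, -2.0, -1.0, -3.0]]
```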

After syncing the corrected kernel to the instance and running on-device:

PASS: shape=(2, 128, 512), max_diff=0.000008
PASS: shape=(4, 256, 1024), max_diff=0.000004
PASS: shape=(1, 1, 64), max_diff=0.000061
PASS: shape=(2, 300, 768), max_diff=0.000007

All tests passed.

All four cases pass with max error well within bfloat16 tolerance, confirming the kernel is numerically correct on real Trainium hardware.

Figure 2: NKI agent debugging its errors.
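
A parity check of this kind can be sketched in a few lines. The version below is an illustration, not the agent's test harness: it has no access to the device, so it simulates a bfloat16 result by rounding the reference output, and the tolerance ATOL, the input range, and the helper names are all assumptions. It shows where a max_diff figure comes from and why small nonzero differences are expected for bfloat16 outputs.

```python
import math
import random
import struct

ATOL = 2 ** -8  # assumed absolute tolerance for bfloat16 outputs in [0, 1]


def to_bfloat16(x: float) -> float:
    """Round to the nearest bfloat16 value (via float32, round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias on the discarded 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reference_softmax(row, scale):
    """CPU reference: numerically stable softmax(row * scale)."""
    shifted = [x * scale for x in row]
    m = max(shifted)
    exps = [math.exp(v - m) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def check(shape, scale=0.125, seed=0):
    """Return (status, max_diff) for one input shape."""
    batch, seq_len, hidden = shape
    rng = random.Random(seed)
    max_diff = 0.0
    for _ in range(batch * seq_len):
        row = [rng.uniform(-4.0, 4.0) for _ in range(hidden)]
        want = reference_softmax(row, scale)
        got = [to_bfloat16(v) for v in want]  # stand-in for the device output
        max_diff = max(max_diff, max(abs(g - w) for g, w in zip(got, want)))
    return ("PASS" if max_diff <= ATOL else "FAIL"), max_diff


for shape in [(1, 1, 64), (2, 8, 128)]:
    status, max_diff = check(shape)
    print(f"{status}: shape={shape}, max_diff={max_diff:.6f}")
```

In a real run, the line marked as a stand-in would be replaced by the tensor read back from the Trainium device.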

Step 3: Profile and analyze kernel execution

After the kernel compiles and produces numerically correct results, the next step is to profile execution on hardware to identify performance bottlenecks and guide optimizations.

To demonstrate profiling and analysis on a real-world workload, this step uses a SwiGLU MLP kernel, a common module in large language models (LLMs).

Point the agent at the SwiGLU kernel and ask it to analyze the profile. The agent first compiles the kernel to a NEFF and captures an NTFF trace through neuron-explorer. Then it runs a two-part investigation into the kernel, looking first at kernel-level statistics and performance bounds, and then deep into specific inefficiencies by querying the profile at the instruction execution level.

First, the agent runs a full bounds analysis on the captured profile and finds several gaps worth investigating:

Figure 3: NKI agent extracts summary statistics and calculates performance bounds on the kernel.

The Tensor Engine (TE) dominates execution and is inefficient. It also has large idle gaps, which suggests it is worth investigating its most likely dependency, the DMA engine, where the work turns out to be both redundant and inefficient.

Figure 4: NKI agent investigates inefficiencies in the profile and provides an analysis.

The investigations help us audit the gaps and prioritize actionable optimization directions. While the bottleneck engine’s (Tensor Engine) inefficiency would have been the top target for optimization, the agent finds that the NKI matmul instructions are already performing near their peak efficiency. In contrast, we find that DMA instructions are well below their target size (inefficient) and that we’re also reloading all inputs eight times (redundant). We even find the three exact lines of NKI code responsible for the suboptimal transfers. Addressing these lines might in turn reduce the TE’s idle gap and improve kernel latency.
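
To put rough numbers on "redundant" and "inefficient", the following back-of-the-envelope sketch computes how much DMA traffic is wasted when every input is loaded eight times, and how many transfers result from a given transfer size. The function name dma_summary and all byte sizes are hypothetical; only the reload factor of eight comes from the analysis above.

```python
import math


def dma_summary(input_bytes: int, loads_per_input: int, bytes_per_dma: int) -> dict:
    """Summarize DMA traffic for inputs that are loaded several times."""
    total = input_bytes * loads_per_input
    return {
        "total_bytes": total,
        "redundant_bytes": total - input_bytes,
        "redundant_fraction": (loads_per_input - 1) / loads_per_input,
        "dma_count": math.ceil(total / bytes_per_dma),
    }


# Hypothetical sizes: 64 MiB of inputs moved in 2 KiB transfers.
observed = dma_summary(input_bytes=64 * 2**20, loads_per_input=8, bytes_per_dma=2048)

# Same inputs, loaded once and moved in 64 KiB transfers (also hypothetical).
improved = dma_summary(input_bytes=64 * 2**20, loads_per_input=1, bytes_per_dma=65536)

print(observed["redundant_fraction"])                # 0.875
print(observed["dma_count"], improved["dma_count"])  # 262144 1024
```

Whatever the actual sizes are, reloading inputs eight times means seven eighths of the input traffic is avoidable, which is why the agent ranks the DMA fixes ahead of further matmul tuning.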

Things to know

Keep the following considerations in mind when working with Neuron Agentic Development skills and agents.

Profiling and debugging skills require execution on actual Trainium or Inferentia-based instances.
The writing and docs skills work anywhere.
All skills target the current NKI Beta 3 API. Skills support Trainium1 (gen2), Trainium2 (gen3), and Trainium3 (gen4) with the appropriate --target flags.
The skills and agents are designed to work together. The top-level agent automatically invokes profiling and debugging skills as needed.

Cleanup

To avoid ongoing charges, terminate the trn2.3xlarge instance you created in Step 0. You can do this through the AWS Management Console (EC2 > Instances, select the instance, and choose Instance state > Terminate), or run:

aws ec2 terminate-instances --instance-ids <your-instance-id>

Confirm that the instance state shows “terminated” before closing the console.

What’s next

The kernel authoring and profiling skills lower the barrier to writing high-performance kernels on Trainium, but they’re only the first part of a broader vision.

Today, developers use profiling insights to guide their next round of kernel edits. This iterative cycle (profile, diagnose, refactor, re-profile) is where the most time is spent. We want to make this loop fully agentic: for example, agents that autonomously iterate on a kernel until it meets its performance target, without requiring the developer to interpret each profiling result and hand-craft the next fix.

We also hear from performance developers that custom kernels are only one part of a larger challenge. Developers want their models to run on Trainium without having to worry about porting model code and syntax, resolving operator gaps, applying model-level optimizations, and validating correctness at scale. We want to bring the same agentic approach to this broader problem.

In summary, our vision is to support the next wave of innovations for frontier models using Trainium and the Neuron SDK, and to use the suite of Neuron Agentic Development capabilities to achieve leading cost-performance for use cases ranging from experimentation with new model architectures to running production models at scale.

We’ll share more as these capabilities mature. To get started with what’s available today, visit the Neuron Agentic Development GitHub repository.

Come build with us

The Neuron Agentic Development capabilities are available today. Get started now: clone the neuron-agentic-development repository and write your first NKI kernel in minutes.

Here’s how to dive in:

Start with the neuron-nki-agent. It selects the right workflow based on your request, giving you the full autonomous experience end-to-end.
Run the skill examples. Invoke individual skills directly (for example, /neuron-nki-writing) for targeted tasks, or chain /neuron-nki-profiling and /neuron-nki-profile-querying once your kernel is producing correct results.
Open a GitHub issue if you run into a problem or have an idea. We’re actively developing alongside the community and will get back to you.
Contribute back. Submit PRs, share kernels you’ve built, and help us make these tools better for everyone.

We’re building these capabilities in the open because the best developer tools are shaped by the developers who use them. Come build with us.