JAX on Cloud TPUs offers powerful acceleration for machine learning workflows. When working in distributed cloud environments, you need specialized tools to debug your workloads, including access to logs, hardware metrics, and more. This blog post serves as a practical guide to various debugging and profiling techniques.
Choosing the right tool: Core Components and Dependencies
At the heart of the system are two main components that most debugging tools depend on:
- libtpu (which contains libtpu.so, the TPU runtime): This is the most fundamental piece of software. It is a shared library on every Cloud TPU VM that contains the XLA compiler, the TPU driver, and the logic for communicating with the hardware. Almost every debugging tool interacts with or is configured through libtpu.
- JAX and jaxlib (the framework): JAX is the Python library where you write your model code. jaxlib is its C++ backend, which acts as the bridge to libtpu.so.
The relationship between these components and the debugging tools is illustrated in the diagram below.
Here is a breakdown of the specific tools, their dependencies, and how they relate to one another.
In summary, libtpu is the central pillar that most debugging tools rely on, either for configuration (logging, HLO dumps) or for querying real-time data (monitoring, profiling). Other tools, like XProf, also operate at the Python level to inspect the state of your JAX program directly. By understanding these relationships, you can more effectively choose the right tool for the specific issue you are facing.
Essential Logging and Diagnostic Flags for Every Workload
Verbose Logging
The most essential step for debugging is to enable verbose logging. Without it, you are flying blind. These flags should be set on every worker of your TPU slice, to log everything from TPU runtime setup to program execution steps with timestamps.
To enable the above default flags on every TPU worker node, run the following command:
gcloud alpha compute tpus queued-resources ssh ${QUEUED_RESOURCE_ID} --project ${PROJECT_ID} \
--zone ${ZONE} --worker=all --node=all \
--command="TPU_VMODULE=slice_configuration=1,real_program_continuator=1 TPU_MIN_LOG_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=0 TPU_STDERR_LOG_LEVEL=0 python3 -c 'import jax; print(f\"Host {jax.process_index()}: Global devices: {jax.device_count()}, Local devices: {jax.local_device_count()}\")'"
Libtpu logs are automatically generated in /tmp/tpu_logs/tpu_driver.INFO on each TPU VM. This file is your ground truth for what the TPU runtime is doing. To fetch the logs from all TPU VMs, you can run the following bash script:
#!/bin/bash
TPU_NAME="your TPU name"
PROJECT="project for your TPU"
ZONE="zone for your TPU"
BASE_LOG_DIR="path where you want the logs to be downloaded"
NUM_WORKERS=$(gcloud compute tpus tpu-vm describe $TPU_NAME --zone=$ZONE --project=$PROJECT | grep tpuVmSelflink | awk -F'[:/]' '{print $13}' | uniq | wc -l)
echo "Number of workers = $NUM_WORKERS"
for ((i=0; i<$NUM_WORKERS; i++))
do
  mkdir -p ${BASE_LOG_DIR}/$i
  echo "gcloud compute tpus tpu-vm scp ${TPU_NAME}:/tmp/tpu_logs/* ${BASE_LOG_DIR}/$i/ --zone=${ZONE} --project=${PROJECT} --worker=$i"
  echo "Downloading logs from worker=$i"
  gcloud compute tpus tpu-vm scp ${TPU_NAME}:/tmp/tpu_logs/* ${BASE_LOG_DIR}/$i/ --zone=${ZONE} --project=${PROJECT} --worker=$i
done
On Google Colab, you can set the above environment variables using os.environ, and access the logs in the "Files" section in the left sidebar.
Here are some example snippets from a log file:
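The flags must be in the environment before JAX initializes the TPU runtime, so set them at the very top of your notebook. A minimal sketch (the flag values mirror the gcloud command above; import JAX only after setting them):

```python
import os

# Verbose-logging flags from the gcloud command above. They must be set
# before jax is imported, because libtpu reads them at initialization.
os.environ["TPU_VMODULE"] = "slice_configuration=1,real_program_continuator=1"
os.environ["TPU_MIN_LOG_LEVEL"] = "0"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"
os.environ["TPU_STDERR_LOG_LEVEL"] = "0"

# import jax  # import JAX only after the flags above are set
```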
...
I1031 19:02:51.863599 669 b295d63588a.cc:843] Process id 669
I1031 19:02:51.863609 669 b295d63588a.cc:848] Current working directory /content
...
I1031 19:02:51.863621 669 b295d63588a.cc:866] Build tool: Bazel, release r4rca-2025.05.26-2 (mainline @763214608)
I1031 19:02:51.863621 669 b295d63588a.cc:867] Build target:
I1031 19:02:51.863624 669 b295d63588a.cc:874] Command line arguments:
I1031 19:02:51.863624 669 b295d63588a.cc:876] argv[0]: './tpu_driver'
...
I1031 19:02:51.863784 669 init.cc:78] Remote crash gathering hook installed.
I1031 19:02:51.863807 669 tpu_runtime_type_flags.cc:79] --tpu_use_tfrt not specified. Using default value: true
I1031 19:02:51.873759 669 tpu_hal.cc:448] Registered plugin from module: breakpoint_debugger_server
...
I1031 19:02:51.879890 669 pending_event_logger.cc:896] Enabling PjRt/TPU event dependency logging
I1031 19:02:51.880524 843 device_util.cc:124] Found 1 TPU v5 lite chips.
...
I1031 19:02:53.471830 851 2a886c8_compiler_base.cc:3677] CODE_GENERATION stage duration: 3.610218ms
I1031 19:02:53.471885 851 isa_program_util_common.cc:486] (HLO module jit_add): Executable fingerprint:0cae8d08bd660ddbee7ef03654ae249ae4122b40da162a3b0ca2cd4bb4b3a19c
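When you are hunting through logs from many workers, a few lines of Python can pull out the interesting fields. A minimal sketch (a hypothetical helper, not part of any tool) that extracts the HLO module name and executable fingerprint from a line like the last one above:

```python
import re

# A log line copied from the tpu_driver.INFO excerpt above.
line = ("I1031 19:02:53.471885 851 isa_program_util_common.cc:486] "
        "(HLO module jit_add): Executable fingerprint:"
        "0cae8d08bd660ddbee7ef03654ae249ae4122b40da162a3b0ca2cd4bb4b3a19c")

# Extract the module name and its fingerprint from the log line.
match = re.search(r"\(HLO module (\w+)\): Executable fingerprint:([0-9a-f]+)", line)
if match:
    module_name, fingerprint = match.groups()
    print(f"{module_name}: {fingerprint}")
```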
TPU Monitoring Library
The TPU Monitoring Library is a way to programmatically gain insights into workload performance on TPU hardware (utilization, capacity, latency, and more). It is part of the libtpu package, which is automatically installed (as a dependency) with jax[tpu], so you can start using the monitoring API right away.
# Explicit installation
pip install "jax[tpu]" libtpu
You can view all supported metrics with tpumonitoring.list_supported_metrics() and fetch specific metrics with tpumonitoring.get_metric. For example, the following snippet prints the duty_cycle data and description:
from libtpu.sdk import tpumonitoring

duty_cycle_metric = tpumonitoring.get_metric("duty_cycle_pct")
duty_cycle_data = duty_cycle_metric.data
print("TPU Duty Cycle Data:")
print(f"  Description: {duty_cycle_metric.description}")
print(f"  Data: {duty_cycle_data}")
You'd typically integrate tpumonitoring directly in your JAX programs, during model training, before inference, and so on. Learn more about the Monitoring Library in the Cloud TPU documentation.
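For example, you might sample a metric every few hundred steps of a training loop. A minimal sketch (the training step is a placeholder, and the import is guarded so the script also runs off-TPU, where libtpu is unavailable):

```python
try:
    from libtpu.sdk import tpumonitoring  # only available on Cloud TPU VMs
except ImportError:
    tpumonitoring = None

def sample_duty_cycle():
    """Return duty_cycle_pct data, or None when running off-TPU."""
    if tpumonitoring is None:
        return None
    return tpumonitoring.get_metric("duty_cycle_pct").data

for step in range(300):
    # train_step(...)  # your JAX training step would go here
    if step % 100 == 0:
        print(f"step {step}: duty_cycle_pct={sample_duty_cycle()}")
```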
tpu-info
The tpu-info command-line tool is a simple way to get a real-time view of TPU memory and other utilization metrics, similar to nvidia-smi for GPUs.
Install on all workers and nodes
gcloud alpha compute tpus queued-resources ssh ${QUEUED_RESOURCE_ID} --project ${PROJECT_ID} \
--zone ${ZONE} --worker=all --node=all \
--command='pip install tpu-info'
SSH into one worker and node to check chip utilization metrics
gcloud alpha compute tpus queued-resources ssh ${QUEUED_RESOURCE_ID} --project ${PROJECT_ID} \
--zone ${ZONE} --worker=0 --node=0
tpu-info
When chips are in use, process IDs, memory usage, and duty cycle % are displayed.
When no chips are in use, the TPU VM will show no activity.
Learn more about other metrics and streaming mode in the documentation.
In this post, we discussed some TPU logging and monitoring options. Next in this series, we'll explore how to debug your JAX programs, starting with generating HLO dumps, and profiling your code with XProf.