JAX on Cloud TPUs offers powerful acceleration for machine learning workflows. When working in distributed cloud environments, you need specialized tools to debug your workloads, including access to logs, hardware metrics, and more. This blog post serves as a practical guide to various debugging and profiling techniques.
Choosing the right tool: Core Components and Dependencies
At the heart of the system are two main components that most debugging tools depend on:
- libtpu (which contains libtpu.so, the TPU runtime): This is the most fundamental piece of software. It is a shared library on every Cloud TPU VM that contains the XLA compiler, the TPU driver, and the logic for communicating with the hardware. Almost every debugging tool interacts with or is configured through libtpu.
- JAX and jaxlib (the framework): JAX is the Python library where you write your model code. jaxlib is its C++ backend, which acts as the bridge to libtpu.so.
The relationship between these components and the debugging tools is illustrated in the diagram below.
Here is a breakdown of the specific tools, their dependencies, and how they relate to one another.
In summary, libtpu is the central pillar that most debugging tools rely on, either for configuration (logging, HLO dumps) or for querying real-time data (monitoring, profiling). Other tools, like XProf, also operate at the Python level to inspect the state of your JAX program directly. By understanding these relationships, you can more effectively choose the right tool for the specific issue you are facing.
Essential Logging and Diagnostic Flags for Every Workload
Verbose Logging
The most essential step for debugging is to enable verbose logging. Without it, you are flying blind. These flags should be set on every worker of your TPU slice, to log everything from TPU runtime setup to program execution steps with timestamps.
To enable the above default flags on every TPU worker node, run the following command:
gcloud alpha compute tpus queued-resources ssh ${QUEUED_RESOURCE_ID} --project ${PROJECT_ID} \
--zone ${ZONE} --worker=all --node=all \
--command="TPU_VMODULE=slice_configuration=1,real_program_continuator=1 TPU_MIN_LOG_LEVEL=0 TF_CPP_MIN_LOG_LEVEL=0 TPU_STDERR_LOG_LEVEL=0 python3 -c 'import jax; print(f\"Host {jax.process_index()}: Global devices: {jax.device_count()}, Local devices: {jax.local_device_count()}\")'"
Libtpu logs are automatically generated in /tmp/tpu_logs/tpu_driver.INFO on each TPU VM. This file is your ground truth for what the TPU runtime is doing. To fetch the logs from all TPU VMs, you can run the following bash script:
#!/bin/bash
TPU_NAME="your TPU name"
PROJECT="project for your TPU"
ZONE="zone for your TPU"
BASE_LOG_DIR="path where you want the logs to be downloaded"
NUM_WORKERS=$(gcloud compute tpus tpu-vm describe $TPU_NAME --zone=$ZONE --project=$PROJECT | grep tpuVmSelflink | awk -F'[:/]' '{print $13}' | uniq | wc -l)
echo "Number of workers = $NUM_WORKERS"
for ((i=0; i<$NUM_WORKERS; i++))
do
  mkdir -p ${BASE_LOG_DIR}/$i
  echo "gcloud compute tpus tpu-vm scp ${TPU_NAME}:/tmp/tpu_logs/* ${BASE_LOG_DIR}/$i/ --zone=${ZONE} --project=${PROJECT} --worker=$i"
  echo "Downloading logs from worker=$i"
  gcloud compute tpus tpu-vm scp ${TPU_NAME}:/tmp/tpu_logs/* ${BASE_LOG_DIR}/$i/ --zone=${ZONE} --project=${PROJECT} --worker=$i
done
On Google Colab, you can set the above environment variables using os.environ, and access the logs in the "Files" section in the left sidebar.
Here are some example snippets from a log file:
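The flags must be in the environment before JAX initializes the TPU runtime, so set them at the very top of your notebook. A minimal sketch (the flag values mirror the gcloud command above; import JAX only after setting them):

```python
import os

# Verbose-logging flags from the gcloud command above. They must be set
# before jax is imported, because libtpu reads them at initialization.
os.environ["TPU_VMODULE"] = "slice_configuration=1,real_program_continuator=1"
os.environ["TPU_MIN_LOG_LEVEL"] = "0"
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"
os.environ["TPU_STDERR_LOG_LEVEL"] = "0"

# import jax  # import JAX only after the flags above are set
```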
...
I1031 19:02:51.863599 669 b295d63588a.cc:843] Process id 669
I1031 19:02:51.863609 669 b295d63588a.cc:848] Current working directory /content
...
I1031 19:02:51.863621 669 b295d63588a.cc:866] Build tool: Bazel, release r4rca-2025.05.26-2 (mainline @763214608)
I1031 19:02:51.863621 669 b295d63588a.cc:867] Build target:
I1031 19:02:51.863624 669 b295d63588a.cc:874] Command line arguments:
I1031 19:02:51.863624 669 b295d63588a.cc:876] argv[0]: './tpu_driver'
...
I1031 19:02:51.863784 669 init.cc:78] Remote crash gathering hook installed.
I1031 19:02:51.863807 669 tpu_runtime_type_flags.cc:79] --tpu_use_tfrt not specified. Using default value: true
I1031 19:02:51.873759 669 tpu_hal.cc:448] Registered plugin from module: breakpoint_debugger_server
...
I1031 19:02:51.879890 669 pending_event_logger.cc:896] Enabling PjRt/TPU event dependency logging
I1031 19:02:51.880524 843 device_util.cc:124] Found 1 TPU v5 lite chips.
...
I1031 19:02:53.471830 851 2a886c8_compiler_base.cc:3677] CODE_GENERATION stage duration: 3.610218ms
I1031 19:02:53.471885 851 isa_program_util_common.cc:486] (HLO module jit_add): Executable fingerprint:0cae8d08bd660ddbee7ef03654ae249ae4122b40da162a3b0ca2cd4bb4b3a19c
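When you are hunting through logs from many workers, a few lines of Python can pull out the interesting fields. A minimal sketch (a hypothetical helper, not part of any tool) that extracts the HLO module name and executable fingerprint from a line like the last one above:

```python
import re

# A log line copied from the tpu_driver.INFO excerpt above.
line = ("I1031 19:02:53.471885 851 isa_program_util_common.cc:486] "
        "(HLO module jit_add): Executable fingerprint:"
        "0cae8d08bd660ddbee7ef03654ae249ae4122b40da162a3b0ca2cd4bb4b3a19c")

# Extract the module name and its fingerprint from the log line.
match = re.search(r"\(HLO module (\w+)\): Executable fingerprint:([0-9a-f]+)", line)
if match:
    module_name, fingerprint = match.groups()
    print(f"{module_name}: {fingerprint}")
```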
TPU Monitoring Library
The TPU Monitoring Library is a way to programmatically gain insights into workload performance on TPU hardware (utilization, capacity, latency, and more). It is part of the libtpu package, which is automatically installed (as a dependency) with jax[tpu], so you can start using the monitoring API right away.
# Explicit installation
pip install "jax[tpu]" libtpu
You can view all supported metrics with tpumonitoring.list_supported_metrics() and fetch specific metrics with tpumonitoring.get_metric. For example, the following snippet prints the duty_cycle data and description:
from libtpu.sdk import tpumonitoring

duty_cycle_metric = tpumonitoring.get_metric("duty_cycle_pct")
duty_cycle_data = duty_cycle_metric.data
print("TPU Duty Cycle Data:")
print(f"  Description: {duty_cycle_metric.description}")
print(f"  Data: {duty_cycle_data}")
You'd typically integrate tpumonitoring directly in your JAX programs, during model training, before inference, and so on. Learn more about the Monitoring Library in the Cloud TPU documentation.
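For example, you might sample a metric every few hundred steps of a training loop. A minimal sketch (the training step is a placeholder, and the import is guarded so the script also runs off-TPU, where libtpu is unavailable):

```python
try:
    from libtpu.sdk import tpumonitoring  # only available on Cloud TPU VMs
except ImportError:
    tpumonitoring = None

def sample_duty_cycle():
    """Return duty_cycle_pct data, or None when running off-TPU."""
    if tpumonitoring is None:
        return None
    return tpumonitoring.get_metric("duty_cycle_pct").data

for step in range(300):
    # train_step(...)  # your JAX training step would go here
    if step % 100 == 0:
        print(f"step {step}: duty_cycle_pct={sample_duty_cycle()}")
```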
tpu-info
The tpu-info command-line tool is a simple way to get a real-time view of TPU memory and other utilization metrics, similar to nvidia-smi for GPUs.
Install on all workers and nodes
gcloud alpha compute tpus queued-resources ssh ${QUEUED_RESOURCE_ID} --project ${PROJECT_ID} \
--zone ${ZONE} --worker=all --node=all \
--command='pip install tpu-info'
SSH into one worker and node to check chip utilization metrics
gcloud alpha compute tpus queued-resources ssh ${QUEUED_RESOURCE_ID} --project ${PROJECT_ID} \
--zone ${ZONE} --worker=0 --node=0
tpu-info
When chips are in use, process IDs, memory usage, and duty cycle % are displayed.
When no chips are in use, the TPU VM will show no activity.
Learn more about other metrics and streaming mode in the documentation.
In this post, we discussed some TPU logging and monitoring options. Next in this series, we'll explore how to debug your JAX programs, starting with generating HLO dumps, and profiling your code with XProf.