Google launches TPU monitoring library to boost AI infrastructure efficiency
Monday, July 21, 2025, 01:05 PM, from InfoWorld
Google has introduced a new monitoring library to enhance Tensor Processing Unit (TPU) resource efficiency as enterprises scale AI workloads to meet growing internal and customer demand while managing costs. The TPU Monitoring Library is integrated within LibTPU — the foundational library that enables machine learning frameworks such as JAX, PyTorch, and TensorFlow to run models on Google Cloud TPUs.
“The TPU Monitoring Library gives you [enterprise users] detailed information on how machine learning workloads are performing on TPU hardware. It’s designed to help you understand your TPU utilization, identify bottlenecks, and debug performance issues,” Google explained in its documentation.

The library uses a telemetry API and a metrics suite to deliver detailed insights into the operational performance and behavior of TPUs. It also provides a software development kit (SDK) and a command-line interface (CLI) as a diagnostic toolkit, allowing enterprises to perform in-depth performance analysis of TPU resources and carry out debugging.

Observability and insight into AI infrastructure performance are critical for enterprises scaling their AI workloads, said Charlie Dai, vice president and principal analyst at Forrester. “According to Forrester’s Tech Pulse Survey in Q4 2024, 85% of IT decision-makers are focusing on observability and AIOps in general,” Dai added.

Google’s new TPU Monitoring Library offers at least seven indicators that enterprises can use to gauge TPU utilization and efficiency. These include Tensor Core Utilization, which measures how effectively the TPU’s specialized cores are used during operations, and Duty Cycle Percentage, which reveals how busy each TPU chip is over time. It also offers the HBM Capacity Total and HBM Capacity Usage indicators, which track the total and active use of high-bandwidth memory, respectively.

For network performance, the Buffer Transfer Latency metric captures latency distributions for large-scale data transfers, helping identify communication bottlenecks, Google said in its documentation. Additionally, the library includes High Level Operation (HLO) Execution Time Distribution Metrics, which provide detailed timing breakdowns of compiled operations, and HLO Queue Size, which monitors execution pipeline congestion.
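Of these indicators, Duty Cycle Percentage is conceptually the simplest: the fraction of a sampling window during which the chip was busy. As a minimal sketch of the idea (this is not the LibTPU API — the function name and sample data below are invented purely for illustration), the metric reduces to:

```python
def duty_cycle_pct(busy_intervals, window_start, window_end):
    """Percentage of the sampling window during which a chip was busy.

    busy_intervals: list of (start, end) timestamps, assumed to be
    non-overlapping and already clipped to [window_start, window_end].
    """
    window = window_end - window_start
    busy = sum(end - start for start, end in busy_intervals)
    return 100.0 * busy / window

# A chip busy for 6 ms of a 10 ms window has a 60% duty cycle.
pct = duty_cycle_pct([(0.000, 0.004), (0.005, 0.007)], 0.000, 0.010)
print(round(pct, 1))  # 60.0
```

A chip with a low duty cycle but high HBM usage, for example, may be stalled on memory or data transfers rather than compute — which is exactly the kind of bottleneck these metrics are meant to surface.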
AWS and Microsoft, too, have similar tools

Google isn’t the only AI infrastructure provider releasing tools to optimize the performance and usage of compute resources (CPUs, GPUs, and other accelerators).

Rival hyperscaler AWS offers several ways for enterprises to optimize the cost of running AI workloads while maximizing use of their resources. To begin with, it provides Amazon CloudWatch, a service capable of delivering end-to-end observability into training workloads running on Trainium and Inferentia, including metrics such as GPU/accelerator utilization, latency, throughput, and resource availability.

AWS services such as SageMaker, via offerings like SageMaker HyperPod, also enable more efficient use of resources while reducing training time. In contrast to the manual model-training process, which is prone to delays, unnecessary expenditure, and other complications, HyperPod removes the heavy lifting involved in building and optimizing machine learning infrastructure for training models, reducing training time by up to 40%, according to AWS.

Similar to the TPU Monitoring Library, Microsoft offers the Maia SDK as the core toolkit for optimizing model execution on its Azure Maia chipsets, along with developer tools such as the Maia Debugger and Profiler for debugging and tracking, Dai said.

Although rivals offer similar tools, Dai noted that the new monitoring library is expected to help Google Cloud further expand its footprint in the AI-native infrastructure cloud market.
https://www.infoworld.com/article/4025687/google-launches-tpu-monitoring-library-to-boost-ai-infrast...