
NVIDIA Unveils NCCL 2.27: Enhancing AI Training and Inference Efficiency
July 15, 2025 – NVIDIA has released version 2.27 of its Collective Communications Library (NCCL). The update is designed to make AI workloads more efficient by improving communication between GPUs.
The NCCL 2.27 upgrade aims to meet the rising demands of both training and inference, supporting fast and reliable operation at scale, according to NVIDIA’s official blog. Key enhancements include lower latency and higher bandwidth efficiency across GPUs.
One of the key performance enhancements is a set of low-latency kernels built on symmetric memory, which optimize collective operations by using buffers that have identical virtual addresses across GPUs. The result is a latency reduction of up to 7.6x for small message sizes, which makes the feature well suited to real-time inference pipelines.
Another notable feature is Direct NIC support, which lets GPU scale-out communication use the full network bandwidth. This is particularly beneficial for high-throughput inference and training workloads, because it keeps networking efficient without saturating CPU-GPU bandwidth.
In addition to these enhancements, NCCL 2.27 introduces support for SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) on NVLink and InfiniBand fabrics. The protocol offloads aggregation and reduction operations to the network fabric, which lowers the computational demand on GPUs and improves scalability and performance, particularly for large language model (LLM) training.
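To illustrate why moving reductions off the GPU matters, the Python sketch below counts the element-wise reduction operations each GPU performs during an all-reduce. The ring all-reduce baseline, the function names, and the idealized "zero GPU reductions" figure for in-network reduction are assumptions made for illustration; the announcement does not specify them, and the code does not call NCCL.

```python
def ring_allreduce_gpu_reductions(num_gpus: int, num_elements: int) -> int:
    """Element-wise reductions one GPU performs in a ring all-reduce.

    Assumes num_elements divides evenly into num_gpus chunks. The
    reduce-scatter phase has (num_gpus - 1) steps, and in each step a GPU
    reduces one chunk of num_elements / num_gpus elements.
    """
    if num_gpus < 1 or num_elements % num_gpus != 0:
        raise ValueError("num_elements must be a multiple of num_gpus >= 1")
    chunk = num_elements // num_gpus
    return (num_gpus - 1) * chunk


def in_network_gpu_reductions(num_gpus: int, num_elements: int) -> int:
    """Idealized in-network reduction: the fabric does the arithmetic,
    so the GPU itself performs no reduction operations (an assumption)."""
    return 0


if __name__ == "__main__":
    gpus, elements = 8, 1024
    print("ring:", ring_allreduce_gpu_reductions(gpus, elements))
    print("in-network:", in_network_gpu_reductions(gpus, elements))
```

In this toy model the per-GPU reduction work approaches one operation per element as the GPU count grows, while the idealized in-network case stays at zero.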
To address the challenges of large-scale distributed training, NCCL 2.27 adds the Communicator Shrink function. It allows failed or unneeded GPUs to be excluded dynamically, so that training can continue without a full restart. The function supports two modes: a default mode for planned reconfigurations and an error mode for unexpected device failures.
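The effect of excluding GPUs from a communicator can be pictured with a small Python model. This is only a sketch and does not call NCCL: the function name `shrink_ranks` is hypothetical, and the assumption that surviving ranks are renumbered contiguously in their original order is ours, not a detail stated in the announcement.

```python
def shrink_ranks(world_size: int, excluded: list[int]) -> dict[int, int]:
    """Map each surviving old rank to its rank in the shrunken communicator.

    Assumption (illustrative): survivors keep their relative order and are
    renumbered contiguously from 0.
    """
    excluded_set = set(excluded)
    for rank in excluded_set:
        if not 0 <= rank < world_size:
            raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    survivors = [r for r in range(world_size) if r not in excluded_set]
    if not survivors:
        raise ValueError("cannot exclude every rank")
    return {old: new for new, old in enumerate(survivors)}


if __name__ == "__main__":
    # Eight GPUs; ranks 2 and 5 are dropped (failed or no longer needed).
    mapping = shrink_ranks(8, [2, 5])
    for old, new in mapping.items():
        print(f"old rank {old} -> new rank {new}")
```

With eight ranks and ranks 2 and 5 excluded, six ranks remain, and old rank 3 becomes new rank 2 under this model.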
The update also brings new features for developers, including symmetric memory APIs and enhanced profiling tools. The profiling improvements give developers more precise instrumentation for diagnosing communication performance and optimizing AI workloads.
For more information on NCCL 2.27 and its capabilities, see the NVIDIA/nccl GitHub repository.
Source: Blockchain.News