Zheng Xiaolong, chief researcher of Data Center Network (DCN) at Huawei Canada Research Center, delivered a keynote speech titled "Zero-Packet-Loss Ethernet Helps Release 100% Computing Power" at the MPLS, SD & AI Net World Congress. In the keynote, Zheng explained how Huawei's CloudFabric 3.0 Hyper-Converged DCN solution tackles the packet loss problem on DCNs and builds Ethernet networks with low latency, high throughput, and large scale to unleash 100% of computing power.
Efficiently improving computing power is crucial in the data-centric computing power era
"Insufficient computing power is the biggest challenge in the data-centric computing power era," said Zheng Xiaolong. "To implement real-time data processing and value monetization, robust computing power is required.
Today, big data is used everywhere, from the metaverse and AI-powered drug research to intelligent advertisement recommendation based on user habits. Such big data applications depend on robust computing power, yet the scale of AI computing models is growing exponentially. For example, Megatron-Turing NLG, the industry's latest language model, supports 530 billion parameters, whereas even the most complex model in 2017 supported a mere 61 million. In other words, computing pressure has increased by almost 10,000 times in the past five years. Evidently, efficiently improving computing power and unleashing 100% of it has become the top priority in the computing power era.
DCNs become the core bottleneck for improving cluster computing power
Completing the exascale (E-level) floating-point operations required to train an AI model, such as the GPT-3 language model, requires a large number of computing servers working together as a cluster. However, every AI training cluster has a performance threshold: once it is reached, adding more server nodes no longer improves performance and may even degrade it. This is because computing nodes in a cluster must collaborate with each other, and any packet loss on the network prolongs the time nodes spend waiting for one another, inflating overhead. Even a 0.1% packet loss rate can slash computing power in half, making a lossless DCN vital to improving computing power. A simple model of this effect is sketched below.
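To make the intuition concrete, here is a minimal back-of-the-envelope model, not Huawei's measurement methodology, of why a tiny loss rate can dominate step time in a tightly synchronized cluster: if any packet in a synchronization step is dropped, every node waits for a retransmission timeout before the next step can start. All parameters below are illustrative assumptions.

```python
# Illustrative model only: how a small per-packet loss rate inflates the
# time of one synchronization step in a tightly coupled cluster.
# All numbers are assumptions for the sketch, not figures from the article.

def effective_step_time(loss_rate, pkts_per_step=6400,
                        base_step_us=1000.0, rto_us=1000.0):
    """Expected duration of one synchronization step.

    Every node must hear from its peers before the next step, so a single
    dropped packet anywhere stalls the step for roughly one retransmission
    timeout (RTO). p_step is the probability that at least one of the
    packets sent during the step is lost.
    """
    p_step = 1.0 - (1.0 - loss_rate) ** pkts_per_step
    return base_step_us + p_step * rto_us

baseline = effective_step_time(0.0)
for loss in (0.0, 1e-5, 1e-3):
    t = effective_step_time(loss)
    print(f"loss={loss:.0e}  step={t:7.1f} us  slowdown={t / baseline:.2f}x")
```

With these assumed parameters the model lands near a 2x slowdown at 0.1% loss, in line with the "computing power slashed in half" figure quoted above; the real impact depends on the transport, the RTO, and the step size.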
Lossless Ethernet built on Huawei's CloudFabric 3.0 Hyper-Converged DCN solution unleashes 100% of computing power
Huawei's CloudFabric 3.0 Hyper-Converged DCN solution leverages iLossless, Huawei's unique intelligent lossless algorithm, to eliminate the packet loss that has hampered Ethernet for more than four decades. The solution features high throughput, low latency, and zero packet loss, unleashing 100% of computing power in all scenarios.
- High throughput: Traditional traffic scheduling relies on manual configuration and therefore cannot adapt to dynamic network changes. Huawei's Automatic ECN (ACC), an intelligent lossless technology, accurately predicts network congestion and achieves nearly 100% throughput while eliminating packet loss on congested links (a simplified sketch of this kind of ECN threshold tuning appears as the first example after this list). As verified by Tolly Group, a global provider of testing, third-party validation, and certification services, Huawei's CloudFabric 3.0 Hyper-Converged DCN solution improves all-flash IOPS performance by 93%. In August 2021, the paper “ACC: Automatic ECN tuning for high-speed datacenter networks”, which explores Huawei's intelligent lossless hyper-converged DCN innovations, was accepted at SIGCOMM 2021, the flagship annual conference of the Association for Computing Machinery (ACM) Special Interest Group on Data Communication. This demonstrates industry experts' high regard for Huawei's innovations and the far-reaching, worldwide impact of this work.
- Low latency: In high-performance computing (HPC) scenarios, application latency is the product of the number of calculation steps and the latency of each step, so for latency-sensitive applications, reducing the number of steps effectively reduces overall application latency. Powered by in-network computing and topology-aware computing, Huawei's integrated network and computing (INC) technology makes the network and computing resources collaborate: the network participates in aggregating and synchronizing computing information, cutting the number of synchronization rounds, while computing tasks are placed under the same top-of-rack (ToR) switch, cutting the number of communication hops and, in turn, the application latency. Take MPI_Allreduce as an example (see the second example after this list): compared with a traditional network that only forwards data without participating in computing, the CloudFabric 3.0 Hyper-Converged DCN solution drastically reduces latency and improves computing efficiency by 27%.
- Large scale: The traditional three-layer Clos network architecture of a data center supports a maximum of roughly 65,000 nodes, far short of what large-scale data centers require (the third example after this list shows where this ceiling comes from). Huawei's CloudFabric 3.0 Hyper-Converged DCN solution adopts a next-generation direct-connection topology and innovative distributed adaptive routing protocols. It not only builds a lossless computing network, but also supports large-scale networking of up to 270,000 nodes, four times the industry norm, making it ideal for E-level and 10E-level large and ultra-large computing hubs.
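The idea behind automatic ECN tuning can be illustrated with a simplified feedback loop that adjusts a switch's ECN marking threshold from observed queue depth and link utilization instead of a static manual setting. This is only a sketch of the general technique, not Huawei's ACC algorithm, and every constant in it is an assumption.

```python
# Simplified illustration of automatic ECN threshold tuning (not Huawei's
# ACC algorithm): raise the marking threshold when the link is
# under-utilized (to recover throughput) and lower it when the queue builds
# up (to mark earlier and avoid buffer overflow, i.e. packet loss).

def tune_ecn_threshold(threshold_kb, queue_kb, utilization,
                       target_util=0.95, max_queue_kb=512,
                       step_kb=8, min_kb=16, max_kb=400):
    """Return an updated ECN marking threshold for one control interval."""
    if queue_kb > 0.8 * max_queue_kb:
        # Queue close to overflowing: mark much earlier to prevent drops.
        threshold_kb -= 2 * step_kb
    elif utilization < target_util:
        # Link under-utilized: marking is too aggressive, back off.
        threshold_kb += step_kb
    elif queue_kb > threshold_kb:
        # A standing queue is forming: mark slightly earlier.
        threshold_kb -= step_kb
    return max(min_kb, min(max_kb, threshold_kb))

# Example control loop over fabricated telemetry samples.
threshold = 128
for queue_kb, util in [(40, 0.99), (200, 0.97), (480, 0.92), (60, 0.80)]:
    threshold = tune_ecn_threshold(threshold, queue_kb, util)
    print(f"queue={queue_kb:3d} KB  util={util:.2f}  -> threshold={threshold} KB")
```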
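The latency argument for in-network computing can be shown with a back-of-the-envelope comparison of an allreduce performed over a network that only forwards data versus one in which the ToR switch aggregates partial results. The step counts and the 10 microsecond per-step latency below are illustrative assumptions; the 27% figure above is the article's result, not an output of this sketch.

```python
# Back-of-the-envelope comparison of allreduce latency with and without
# in-network aggregation. All numbers are illustrative assumptions.

def ring_allreduce_latency_us(nodes, per_step_us):
    # Classic ring allreduce: 2 * (N - 1) communication steps, each of
    # which every node must wait for before proceeding.
    return 2 * (nodes - 1) * per_step_us

def in_network_allreduce_latency_us(per_step_us, hops=2):
    # The ToR switch aggregates partial sums: roughly one up/down
    # traversal, independent of the number of nodes under that switch.
    return hops * per_step_us

nodes, per_step_us = 16, 10.0
print("forward-only ring allreduce:",
      ring_allreduce_latency_us(nodes, per_step_us), "us")
print("in-network aggregation     :",
      in_network_allreduce_latency_us(per_step_us), "us")
```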
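The roughly 65,000-node ceiling of a three-layer Clos fabric follows directly from switch radix: a three-layer fat-tree built from k-port switches can attach at most k^3/4 end hosts, so 64-port switches give 65,536. The 64-port radix below is an assumption chosen to show where that figure comes from, not a statement about any particular hardware.

```python
# Where the ~65,000-node Clos ceiling comes from: a three-layer fat-tree
# built from k-port switches supports at most k**3 / 4 end hosts.
# The radix values below are assumptions for illustration.

def fat_tree_hosts(k):
    """Maximum end hosts in a three-layer fat-tree of k-port switches."""
    return k ** 3 // 4

for k in (32, 64, 128):
    print(f"{k:>3}-port switches -> up to {fat_tree_hosts(k):,} hosts")
# 64-port switches give 65,536 hosts, matching the ~65,000-node ceiling
# attributed to the traditional three-layer Clos design.
```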
Zero packet loss and continuous performance evolution are of great significance to the data-centric computing power era. Huawei has carried out full-scale joint tests with customers across the finance, manufacturing, and HPC sectors, and the results show that Huawei's CloudFabric 3.0 Hyper-Converged DCN solution delivers significant performance advantages in scenarios such as all-flash storage, distributed storage, HPC, and AI computing. Going forward, Huawei will continue to invest in intelligent lossless technology research to further improve lossless network capabilities, fully unleash computing power, and enable the intelligent upgrade of enterprises.