AI Needs a New Network

Marc Austin : Feb 22, 2024 3:25:27 PM

Most investors associate NVIDIA with GPUs and recognize GPUs as the key ingredient for AI cloud infrastructure. But that’s only part of the AI infrastructure picture. Investors also need to understand the role of networking for AI, and smart investors should pay attention to some of the finer details from yesterday’s NVIDIA earnings call.

Networking is a third of the cost of an AI data center

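NVIDIA delivered another great quarter, reporting $18.4 billion in data center revenue, up 27% from the prior quarter and 409% y/y (total company revenue was up 265% y/y). They also reported a $13 billion networking annualized revenue run rate (ARR), up from $10 billion reported last quarter and 3x y/y. On a like-for-like basis, $13 billion a year is about $3.25 billion a quarter, or roughly 18% of the quarter’s data center revenue.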
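A run rate is an annual figure while the data center number is quarterly, so comparing them means putting both on the same basis. The arithmetic can be sketched as follows, assuming the run rate is simply four times the quarter's networking revenue (an assumption for illustration, not a breakdown NVIDIA published):

```python
# Figures quoted in this post, in USD billions.
NETWORKING_ARR = 13.0          # annualized networking run rate
DATA_CENTER_QUARTERLY = 18.4   # data center revenue for the quarter


def networking_share(arr: float, quarterly_revenue: float) -> float:
    """Fraction of quarterly revenue implied by an annualized run rate.

    Assumes the run rate is exactly 4x the quarter's networking revenue.
    """
    quarterly_networking = arr / 4
    return quarterly_networking / quarterly_revenue


share = networking_share(NETWORKING_ARR, DATA_CENTER_QUARTERLY)
print(f"Networking share of data center revenue: {share:.1%}")
```

Under that assumption the share comes out a little under one fifth of data center revenue.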

Why does AI need a new network?

Jensen stated several times that AI needs a new network. We have heard the same thing from countless customers building AI cloud infrastructure. New AI cloud service providers, enterprise AI clouds and sovereign AI clouds all need new networks that maximize utilization of expensive GPU resources. This comes down to three performance requirements, plus several front-end requirements for a cloud user experience.

AI needs a new network. Hedgehog is the AI Network.

1. High effective bandwidth

AI needs high effective bandwidth. If you deploy a 400G or 800G network, you expect to get 400G or 800G bandwidth 95% of the time. AI training and fine-tuning workloads create congestion, congestion reduces effective bandwidth, and a slower network means longer training and fine-tuning runs. Time is money, especially with expensive GPU time.
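To make the link between effective bandwidth and job duration concrete, here is a minimal sketch. The payload size and the 60% efficiency figure are hypothetical values chosen for illustration, not measurements from any real cluster:

```python
def transfer_seconds(gigabytes: float, link_gbps: float, efficiency: float) -> float:
    """Seconds to move `gigabytes` over a link that achieves
    `efficiency` (0 < efficiency <= 1) of its nominal line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    gigabits = gigabytes * 8
    return gigabits / (link_gbps * efficiency)


# Hypothetical workload: 100 GB exchanged per synchronization step on a 400G link.
payload_gb = 100.0
ideal = transfer_seconds(payload_gb, link_gbps=400, efficiency=1.0)
congested = transfer_seconds(payload_gb, link_gbps=400, efficiency=0.6)
slowdown = congested / ideal

print(f"ideal: {ideal:.2f}s, congested: {congested:.2f}s, slowdown: {slowdown:.2f}x")
```

In this simple model the slowdown is just the inverse of the efficiency, so every point of lost effective bandwidth shows up directly as extra time that the GPUs spend waiting on the network.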

2. Zero packet loss

AI needs 100% of packets to reach their intended destination. Traditional TCP/IP networks signal congestion with packet loss. Packet loss causes AI workloads to pause or fail. Restarting a workload at the last checkpoint is of course expensive. Time is money, especially with expensive GPU time.
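A rough way to put a number on a restart is to count the GPU time spent since the last checkpoint, all of which has to be redone. The sketch below assumes a flat hourly GPU price and ignores the time needed to reload the checkpoint; the cluster size, interval, and price are hypothetical:

```python
def restart_cost(num_gpus: int, minutes_since_checkpoint: float,
                 gpu_hourly_rate: float) -> float:
    """Cost of the work lost when a job restarts from its last checkpoint.

    Simplifying assumptions: every GPU in the job loses the same amount
    of work, and checkpoint reload time is not counted.
    """
    lost_gpu_hours = num_gpus * minutes_since_checkpoint / 60
    return lost_gpu_hours * gpu_hourly_rate


# Hypothetical job: 1,024 GPUs, failure 30 minutes after the last checkpoint,
# priced at $2.00 per GPU-hour.
cost = restart_cost(num_gpus=1024, minutes_since_checkpoint=30, gpu_hourly_rate=2.0)
print(f"Work lost to one restart: ${cost:,.2f}")
```

The cost scales linearly with both cluster size and checkpoint interval, which is why a failure caused by packet loss hurts more as clusters grow.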

3. Low latency

The goal of training and fine-tuning is AI inference. To deliver a good user experience, like talking with a co-pilot, AI needs ultra-low latency. Most consumers find the user experience acceptable when network latency is less than 40 ms.

InfiniBand meets training needs at a premium price

Roughly 50% of NVIDIA’s Q4 revenue was booked to tier 1 cloud service providers and consumer internet companies. These are really the same accounts competing in two different markets. Consumer internet companies have multi-billion dollar budgets to train new generative AI models in an arms race to define the next generation user experience. They consume AI cloud infrastructure from their CSP departments, which can afford to spend billions of dollars on NVIDIA’s InfiniBand networking products. InfiniBand meets the high performance needs of this customer segment at a high cost.

Enterprise and sovereign AI clouds do not have unlimited budgets. They need a new network that meets their needs for fine-tuning generative AI models with their own data. Data is the gold in the AI gold rush. Data privacy and protection are paramount for enterprise and sovereign AI cloud customers. This drives many companies and governments to build their own AI infrastructure or rent it from new cloud service providers that do not compete with them as consumer internet companies. They need a new AI network that delivers the performance of InfiniBand with the features of Ethernet. We can generally summarize Ethernet features as a cloud user experience.

AI inference is becoming the dominant workload

Ethernet and TCP/IP are the networking standards that run everything on the internet, in our homes, and in our places of work. Ethernet is the network you are using right now to read this blog post. When you use generative AI like ChatGPT, you are using Ethernet.

NVIDIA estimates that 40% of their Q4 data center revenue was for AI inference. This surprised smart analysts like Joe Moore from Morgan Stanley, who asked for color on this estimate. The implication is that the market will need more Ethernet and less InfiniBand as AI workloads shift from training to fine-tuning and inference. And this is happening faster than many investors expected.

AI fine-tuning and inference need high performance Ethernet

NVIDIA announced that Spectrum-X is their reference architecture for AI Ethernet. It uses a combination of NVIDIA BlueField-3 DPU SmartNICs, NVIDIA Spectrum switches and software to deliver a high-performance Ethernet network. Smart investors should expect Spectrum-X to account for a larger share of future NVIDIA networking ARR.

Enterprise and Sovereign AI need a cloud user experience

AI needs a network with a cloud user experience. Most enterprise and sovereign AI projects will have multiple tenants. These tenants are different development teams or applications or user groups for GPU cloud infrastructure. Emerging AI cloud networks need to offer the same cloud user experience that everyone enjoys with the Big 3. Multiple tenants need Virtual Private Cloud services for privacy and security. They need gateway services for VPC communication across tenants and locations combined with load balancing and security services.
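The tenant isolation described above can be pictured as a default-deny model: VPCs cannot reach each other unless a peering is explicitly created. This is a toy sketch of that idea only. The `VPC` and `Fabric` classes and their methods are hypothetical names for illustration and are not Hedgehog's or any vendor's API:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class VPC:
    name: str
    tenant: str


@dataclass
class Fabric:
    """Toy model: VPCs are isolated unless explicitly peered."""
    vpcs: dict = field(default_factory=dict)
    peerings: set = field(default_factory=set)

    def add_vpc(self, vpc: VPC) -> None:
        self.vpcs[vpc.name] = vpc

    def peer(self, a: str, b: str) -> None:
        """Allow traffic between two existing VPCs (the gateway's role)."""
        for name in (a, b):
            if name not in self.vpcs:
                raise KeyError(f"unknown VPC: {name}")
        self.peerings.add(frozenset((a, b)))

    def can_communicate(self, a: str, b: str) -> bool:
        # Traffic inside a VPC is always allowed; across VPCs only when peered.
        return a == b or frozenset((a, b)) in self.peerings


fabric = Fabric()
fabric.add_vpc(VPC("training", tenant="research"))
fabric.add_vpc(VPC("inference", tenant="product"))

isolated_by_default = not fabric.can_communicate("training", "inference")
fabric.peer("training", "inference")
reachable_after_peering = fabric.can_communicate("training", "inference")
```

Storing each peering as an unordered pair keeps the relationship symmetric, so a single peering permits traffic in both directions.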

Hedgehog is the AI Network

Hedgehog is the AI Network for cloud builders serving AI workloads. We deliver high-performance network software that works with the NVIDIA Spectrum-X reference architecture for high effective bandwidth, zero packet loss and low latency. Hedgehog offers a cloud user experience that makes it easy to operate and use AI cloud networks. Our software is open and automated, so our customers can acquire equipment at lower capital expense and run it with lower operating expense. Hedgehog open source software gives customers the freedom to choose their hardware vendor and control their software destiny. Our customers can choose NVIDIA hardware, but they can also choose AMD, Intel, Marvell, Supermicro, Celestica, Dell, or Edgecore equipment for their AI network. With fully automated network operations, our customers can network like hyperscalers with low operating cost and dynamic cloud capacity.
