Marc Austin | Feb 27, 2024 6:03:15 PM | 3 min read
In my “AI Needs a New Network” post last week, I noted that NVIDIA reported $13 billion in networking ARR on $18.4 billion of annual data center revenue. This week, we are digging a little deeper into the unit economics of GPUs, the Hedgehog AI Network, and LLMs offered by Hedgehog customers.
Morgan Stanley estimates that NVIDIA sold 608,000 GPUs in the quarter ended Jan. 28, bringing the total number of units sold since 2021 to more than 3.3 million. We conservatively assume all 3.3 million GPUs are actively used in the field (Total Active GPUs); counting every GPU as active pushes the per-GPU networking figure down, so the estimate errs low.
If you simply divide $13 billion of networking ARR by Total Active GPUs, you get a quotient of roughly $3,900 in networking ARR per GPU. That networking ARR is mostly for InfiniBand subscriptions, which connect GPUs peer-to-peer in back-end training networks.
The $3,900-per-year figure sounds expensive — until you consider the cost and value of a GPU. Morgan Stanley estimates the average sales price of an H100 GPU to be $30,000, and the blended ASP across all GPUs sold last quarter was $21,700. When you consider how much you are spending on a GPU, paying another 13-to-18% of its price to ensure maximum utilization of that GPU is a no-brainer, especially when the math (see below) shows an incredible return on investment (ROI).
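A quick sanity check of the arithmetic above, using the Morgan Stanley and NVIDIA figures quoted in the post (a back-of-the-envelope sketch, not financial data from either company):

```python
# Back-of-the-envelope check of the networking-ARR-per-GPU figures above.
# Inputs are the Morgan Stanley / NVIDIA numbers quoted in the post.

networking_arr = 13e9          # NVIDIA networking ARR ($)
total_active_gpus = 3.3e6      # GPUs sold since 2021, assumed active

arr_per_gpu = networking_arr / total_active_gpus
print(f"Networking ARR per GPU: ${arr_per_gpu:,.0f}")  # ~$3,939, rounded to $3,900

h100_asp = 30_000              # estimated H100 average sales price ($)
blended_asp = 21_700           # blended ASP across all GPUs last quarter ($)

print(f"As % of H100 ASP:    {arr_per_gpu / h100_asp:.0%}")     # ~13%
print(f"As % of blended ASP: {arr_per_gpu / blended_asp:.0%}")  # ~18%
```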
Dell’Oro Group notes that “800 Gbps [Ethernet] is expected to comprise the majority of the ports in AI back-end networks by 2025.” This means $13 billion of NVIDIA InfiniBand ARR will migrate to Ethernet. The market shift will happen as Hedgehog customers deploy our high-performance AI network.
Hedgehog doesn’t do this alone. We build AI network software that works together with hardware from partners like Broadcom (Ram Velaga’s team) and the Spectrum X team at NVIDIA. We can deliver better performance than traditional Ethernet for AI workloads at much lower TCO than InfiniBand. In fact, we predict that we can deliver better performance than InfiniBand, too.
Industry observers like Dell’Oro acknowledge that “One could argue that Ethernet [hardware] is one speed generation ahead of InfiniBand. Network speed, however, is not the only factor. Congestion control and adaptive routing mechanisms are also important.” These congestion control and adaptive routing mechanisms require software from Hedgehog to deliver a complete AI Network solution.
NVIDIA knows this shift is inevitable. That’s why the tech giant announced plans to launch Spectrum X this quarter, with the goal of broadly improving Ethernet effective bandwidth by 35%. NVIDIA says AI workloads create congestion that limits traditional Ethernet networks to 60% effective bandwidth; Spectrum X has a design goal of raising that to 95%. Hedgehog shares this performance goal, with congestion control and adaptive routing software that uses Spectrum X hardware to deliver 95% effective bandwidth for Hedgehog AI Ethernet. This means that if you invest in NVIDIA, Broadcom, or AMD hardware with 800 Gbps Ethernet ports, you effectively get 760 Gbps with a Hedgehog AI Network, compared to 480 Gbps with a traditional Ethernet network running AI workloads.
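The port-speed math is simple enough to sketch, using the 60% and 95% effective-bandwidth figures as given:

```python
# Effective bandwidth of an 800 Gbps Ethernet port under AI workloads,
# using the 60% (traditional Ethernet) and 95% (Spectrum X / Hedgehog
# design goal) figures quoted above.

port_speed_gbps = 800

traditional = port_speed_gbps * 0.60  # congestion-limited traditional Ethernet
hedgehog = port_speed_gbps * 0.95     # design goal with congestion control

print(f"Traditional Ethernet: {traditional:.0f} Gbps")  # 480 Gbps
print(f"Hedgehog AI Network:  {hedgehog:.0f} Gbps")     # 760 Gbps
```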
So what is 95% effective bandwidth worth? To answer that question, look at market prices for GPU time, then correlate effective AI network bandwidth with Job Completion Time. (Here’s one source of data on market prices for LLM inference models.) As I am writing this post, DeepInfra is the price leader at $0.27 per minute for mixtral-8x7b, while OpenAI charges $30/min for GPT4. Fully utilized, a single DeepInfra GPU has a theoretical annual market value of $142,000. That theoretical maximum is not achievable in practice, since Job Completion Time is constrained by the effective bandwidth of the AI network. With 60% effective bandwidth from traditional Ethernet, a DeepInfra GPU generates only $85,000 annually. With a Hedgehog AI network, it will generate roughly $135K, a $50K annual gain.
These numbers, of course, get a lot bigger for a customer like Together.AI, which prices llama2-70b-chat at $0.90 per minute (3x the gain, or roughly $150,000). If a Hedgehog customer pays the InfiniBand price of $3,900 per GPU per year, the ROI is 13x for DeepInfra or 38x for Together.AI. I mentioned before that we can deliver comparable performance at a better price, so the percentage ROI is actually a lot higher for Hedgehog customers.
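The revenue and ROI math can be sketched as follows. The per-minute prices are the ones quoted in the post, and treating per-GPU revenue as linearly proportional to effective bandwidth is the post’s simplifying assumption:

```python
# Per-GPU revenue and ROI under the post's assumption that revenue scales
# linearly with effective AI network bandwidth (60% vs 95%).

MINUTES_PER_YEAR = 60 * 24 * 365  # 525,600

def gpu_economics(price_per_minute, network_cost=3_900):
    """Return (theoretical max revenue, Hedgehog gain, ROI multiple)."""
    max_revenue = price_per_minute * MINUTES_PER_YEAR
    traditional = max_revenue * 0.60   # 60% effective bandwidth
    hedgehog = max_revenue * 0.95      # 95% effective bandwidth
    gain = hedgehog - traditional
    return max_revenue, gain, gain / network_cost

for name, price in [("DeepInfra mixtral-8x7b", 0.27),
                    ("Together.AI llama2-70b-chat", 0.90)]:
    max_rev, gain, roi = gpu_economics(price)
    print(f"{name}: max ${max_rev:,.0f}/yr, gain ${gain:,.0f}, ROI {roi:.0f}x")
    # DeepInfra: max $141,912/yr, gain $49,669, ROI 13x
    # Together.AI: max $473,040/yr, gain $165,564, ROI 42x
```

Note that exact arithmetic gives a Together.AI gain of about $165K rather than the rounded 3x/$150K figure, so the 38x ROI quoted above is on the conservative side.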