In case you missed it, Juniper produced a web event last week called AI-Native NOW. I thought it might introduce Juniper as a Hedgehog competitor for owning the AI Network category. It turns out we don’t have to worry about Juniper as an AI Network competitor, at least for now.
This was a well-produced event, but the news was mostly Juniper catching up to Cisco in AI Ops capability. Juniper also introduced a natural language interface for network operations, which in our opinion is a nice-to-have feature that customers will adopt after they have covered the AI Network bases.
Our customers define the AI Network as a cloud network that:
- delivers high performance for AI workloads;
- provides a cloud user experience (UX); and
- runs as a fully automated solution.
The Juniper AI Native NOW announcement addresses the fully automated solution requirement with AI Ops, but they didn’t announce anything that addresses the primary AI Network requirements for AI workload performance or cloud UX.
The lead message at AI Native NOW was about AI Ops. AI Ops is the practice of using data collection and machine learning to baseline network performance metrics, identify performance anomalies, correlate network faults to incidents, and correlate incidents to performance anomalies. The benefits of AI Ops are measured not by AI job completion times, but by mean time to identify issues, determine root cause, and resolve tickets.
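The core of that baselining loop can be sketched in a few lines. This is an illustrative toy, not any vendor's implementation: it baselines a metric from historical samples, then flags new samples that deviate beyond a standard-deviation threshold. The metric, values, and threshold are all assumptions for the example.

```python
# Toy AI Ops baselining: learn a mean/stddev baseline for a metric,
# then flag samples that deviate too far from it. Illustrative only.
from statistics import mean, stdev

def baseline(samples):
    """Compute a simple mean/stddev baseline from historical samples."""
    return mean(samples), stdev(samples)

def find_anomalies(samples, mu, sigma, z=3.0):
    """Flag samples more than z standard deviations from the baseline."""
    return [s for s in samples if abs(s - mu) > z * sigma]

# Historical port latency samples (microseconds) establish the baseline...
history = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]
mu, sigma = baseline(history)

# ...and a new window of samples is checked against it.
window = [10.0, 10.2, 48.5, 9.9]
print(find_anomalies(window, mu, sigma))  # the 48.5 µs spike is flagged
```

A production AI Ops system replaces the static baseline with models that track time-of-day and workload seasonality, but the identify-anomaly step reduces to this shape.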
Juniper does this with their Mist AI product. If you examine that product page, you’ll see that Juniper does a great job of marketing Mist as an assurance solution for Wi-Fi, wired, WAN, IoT and access networks. What is sorely missing for this to be an AI Network product is any mention of flow orchestration or congestion management for AI workloads. This, of course, is what Hedgehog tackles with our AI Network.
Juniper Mist AI is catching up to the Eagle Eyes assurance solution architecture that I helped define and build at Cisco several years ago. We inherited a product called Crosswork Situation Manager, which was a Cisco OEM of Moogsoft. Juniper is correct that collection and enrichment of data from network devices is a big piece of the AI Ops puzzle. We required Moogsoft to integrate with Crosswork Data Gateway for data collection, which they really didn’t want to do. Chris Menier was much more eager to do this with his VIA AI Ops product, so we gave him the opportunity to carry the project forward. Later we added Accedian to the solution for probing that tests the network and generates more data for AI Ops ingestion. Juniper’s Mist AI looks like a tighter version of Cisco’s AI Ops solution.
While AI Ops is indeed useful for efficient cloud data center operations, Hedgehog is focusing on a high performance data plane that prevents congestion bottlenecks created by AI workloads.
Our goals for a high performance AI Ethernet fabric are to sustain roughly 95% effective bandwidth, keep GPU usage fair, and minimize job completion time.

At 95% load, AI networks require orchestrated and controlled traffic management. Without it, AI Ops solutions like Mist AI will indeed report network congestion and performance anomalies. Networks operating at high effective bandwidth will suffer unfair GPU usage due to incast loads, misaligned load distribution, and a wide spectrum of GPU-to-GPU latencies. The result is a long tail of delayed computations, which in turn extends job completion time. Hedgehog’s high performance AI data plane includes orchestrated and controlled traffic management to deliver 95% efficient GPU usage.
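Why does one congested link drag out the whole job? In synchronous AI training, every GPU waits at each step for the slowest collective transfer, so step time is the maximum of the per-GPU transfer times, not the average. A minimal sketch, with made-up numbers:

```python
# Illustrative sketch (not Hedgehog's implementation): in a synchronous
# training step, every GPU waits for the slowest transfer, so the step
# time is the MAX of per-GPU network times, not the mean.
def step_time(per_gpu_transfer_ms):
    return max(per_gpu_transfer_ms)

# Eight GPUs with well-balanced flows...
balanced = [10.0] * 8
# ...versus one GPU hit by an incast hotspot on a congested link.
congested = [10.0] * 7 + [40.0]

print(step_time(balanced))   # 10.0 ms per step
print(step_time(congested))  # 40.0 ms: one slow flow quadruples the step
```

Multiplied over thousands of steps, that single tail flow quadruples job completion time even though seven of eight GPUs see no congestion at all, which is why preventing the tail beats reporting on it.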
While we prevent AI performance issues from happening in the first place, we also collect and enrich data from all the network devices we support. Today this includes switches from Supermicro, Celestica, Micas Networks, Dell and Edgecore. In the very near future it will also include DPU SmartNICs from NVIDIA, AMD, Marvell and Intel. We then integrate that data with cloud-native observability tools.
Most of our customers already use a common set of cloud-native tools for observability in their cloud operations. We could take the same approach as Arista, Cisco and Juniper: build proprietary dashboards and charge our customers to use them. Instead, we are choosing a more customer-friendly approach to operating a cloud network at low cost. We simply push the data we collect into the open-source, cloud-native tools our customers already use for their cloud operations. These tools include the Prometheus systems monitoring and alerting toolkit, and the Grafana Loki logging stack.
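What "push into the tools they already use" looks like in practice: Prometheus scrapes metrics in a simple text exposition format, so collected switch counters just need to be rendered as metric lines with labels. A minimal sketch below; the metric and label names are illustrative, not Hedgehog's actual schema.

```python
# Minimal sketch: render a collected switch counter in the Prometheus
# text exposition format that a Prometheus server can scrape.
# Metric and label names here are illustrative, not a real schema.
def to_prometheus(metric, labels, value):
    """Format one sample as a Prometheus exposition-format line."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{label_str}}} {value}"

sample = to_prometheus(
    "switch_port_rx_bytes_total",
    {"switch": "leaf-01", "port": "Ethernet4"},
    123456789,
)
print(sample)
# switch_port_rx_bytes_total{port="Ethernet4",switch="leaf-01"} 123456789
```

Because the format is an open standard, the same data lands in customers' existing Prometheus and Grafana stacks with no proprietary dashboard in between.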