Juniper AI-Native NOW… or LATER?

Written by Marc Austin | Mar 12, 2024 5:27:34 PM

In case you missed it, Juniper produced a web event last week called AI-Native NOW.  I thought it might introduce Juniper as a Hedgehog competitor for owning the AI Network category.  It turns out we don’t have to worry about Juniper as an AI Network competitor, at least for now.  

This was a well-produced event, but the news was mostly Juniper catching up to Cisco in AI Ops capability.  Juniper also introduced a natural language interface for network operations, which in our opinion is a nice-to-have feature that customers will adopt after they have covered the AI Network bases.  

Hedgehog AI Network

Our customers define the AI Network as a cloud network that:

  1. Meets the unique requirements of AI workloads with a high-performance Ethernet fabric.
  2. Makes it easy to operate AI cloud infrastructure with a familiar cloud user experience.  This UX consists of foundational cloud services comparable to those offered by AWS, Azure and GCP.
  3. Makes it possible to invest in private AI cloud infrastructure with open-source software that reduces capex and a fully automated solution that reduces opex.  

The Juniper AI-Native NOW announcement addresses the fully automated solution requirement with AI Ops, but they didn’t announce anything that addresses the primary AI Network requirements for AI workload performance or cloud UX.

AI Ops

The lead message at AI-Native NOW was about AI Ops.  AI Ops is the practice of using data collection and machine learning to baseline network performance metrics, identify performance anomalies, correlate network faults to incidents, and correlate incidents to performance anomalies.  The benefits of AI Ops are measured not by AI job completion times, but by mean time to identify issues, determine root cause, and resolve tickets.
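To make the baselining idea concrete, here is a toy sketch (not Juniper's or Hedgehog's implementation) of the first step of AI Ops: baseline a performance metric from historical samples, then flag new samples that deviate too far from that baseline. The metric, sample values, and sigma threshold are all illustrative assumptions.

```python
import statistics

def baseline(samples):
    """Baseline a performance metric as its mean and sample standard deviation."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomaly(value, mean, stdev, threshold=3.0):
    """Flag a sample deviating more than `threshold` sigmas from the baseline."""
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold

# Hypothetical p99 link-latency samples (microseconds) used as the baseline window.
history = [102, 98, 101, 99, 100, 103, 97]
mu, sigma = baseline(history)
print(is_anomaly(100, mu, sigma))  # in-profile sample → False
print(is_anomaly(180, mu, sigma))  # latency spike → True
```

Real AI Ops pipelines layer fault-to-incident and incident-to-anomaly correlation on top of this kind of per-metric anomaly signal.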

Juniper AI Ops

Juniper does this with their Mist AI product.  If you examine that product page, you’ll see that Juniper does a great job of marketing Mist as an assurance solution for Wi-Fi, wired, WAN, IoT and access networks.  What is sorely missing for this to be an AI Network product is anything that mentions flow orchestration or congestion management for AI workloads.  This, of course, is what Hedgehog tackles with our AI Network.

Cisco AI Ops

Juniper Mist AI is catching up to the Eagle Eyes assurance solution architecture that I helped define and build at Cisco several years ago.  We inherited a product called Crosswork Situation Manager, which was a Cisco OEM of Moogsoft.  Juniper is correct that collection and enrichment of data from network devices is a big piece of the AI Ops puzzle.  We required Moogsoft to integrate with Crosswork Data Gateway for data collection, which they really didn’t want to do.  Chris Menier was much more eager to do this with his VIA AI Ops product, so we gave him the opportunity to carry the project forward.  Later we added Accedian to the solution for probing that tests the network and generates more data for AI Ops ingestion.  Juniper’s Mist AI looks like a tighter version of Cisco’s AI Ops solution.  

Hedgehog Assurance Strategy

While AI Ops is indeed useful for efficient cloud data center operations, Hedgehog is focusing on a high performance data plane that prevents congestion bottlenecks created by AI workloads.

Congestion Management for AI Workloads

Our goals for a high-performance AI Ethernet fabric are:

  1. High effective bandwidth
  2. Zero packet loss
  3. Low latency

At 95% load, AI networks require orchestrated, controlled traffic management.  Without it, AI Ops solutions like Mist AI will indeed report network congestion and performance anomalies, but reporting is not prevention.  Networks operating at high effective bandwidth without traffic management suffer unfair GPU usage caused by incast loads, misaligned load distribution, and a wide spectrum of GPU-to-GPU latencies.  The result is a prolonged tail of delayed computations, which in turn extends job completion time.  Hedgehog’s high-performance AI data plane includes orchestrated, controlled traffic management to deliver 95% efficient GPU usage.
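Why does a latency tail stretch job completion time so badly? Because collective operations finish only when the slowest GPU reports in. A toy illustration (the timings are made up, not measured data):

```python
def job_completion_time(per_gpu_times):
    """A collective step completes only when the slowest GPU finishes."""
    return max(per_gpu_times)

balanced = [10.0] * 8              # even load distribution across 8 GPUs
straggler = [10.0] * 7 + [25.0]    # one congested link delays one GPU

print(job_completion_time(balanced))   # → 10.0
print(job_completion_time(straggler))  # → 25.0: one straggler stalls all 8 GPUs
```

In the straggler case, seven GPUs sit idle for most of the step, which is exactly the unfair GPU usage that congestion management is meant to prevent.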

Data Collection and Enrichment

Hedgehog observability stack with data collection, enrichment and integration to Prometheus and Grafana Loki

While we prevent AI performance issues from happening in the first place, we also collect and enrich data from all the network devices we support.  This includes switches from Supermicro, Celestica, Micas Networks, Dell and Edgecore today.  In the very near future it will also include DPU SmartNICs from NVIDIA, AMD, Marvell and Intel.  We then integrate that data with cloud-native observability tools.

Cloud-Native Observability Tools

Most of our customers already use a common set of cloud-native tools for observability in their cloud operations.  We could take the same approach as Arista, Cisco and Juniper: build proprietary dashboards and charge our customers to use them.  Instead, we are choosing a more customer-friendly approach to operating a cloud network at low cost.  We simply push the data we collect into the open-source, cloud-native tools our customers already use for their cloud operations.  These tools include the Prometheus systems monitoring and alerting toolkit, and the Grafana Loki logging stack.
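As a minimal sketch of what "pushing data into the tools customers already use" can look like, here is a sample rendered in the Prometheus text exposition format. The metric and label names are hypothetical, not Hedgehog's actual schema:

```python
def prometheus_line(metric, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{label_str}}} {value}"

# Hypothetical counter collected from a switch port, enriched with topology labels.
line = prometheus_line(
    "hedgehog_port_rx_bytes_total",
    {"switch": "leaf1", "port": "Ethernet4", "rack": "r1"},
    123456789,
)
print(line)
# → hedgehog_port_rx_bytes_total{port="Ethernet4",rack="r1",switch="leaf1"} 123456789
```

Because Prometheus scrapes this plain-text format over HTTP, any collector that can emit it plugs straight into dashboards and alerting the customer already runs; logs flow to Grafana Loki the same way, as labeled streams.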