Building Data Center Infrastructure for the AI Revolution

This is part two of a multi-part blog series on AI. Part one, Why 2024 is the Year of AI for Networking, discussed Cisco’s AI networking vision and strategy. This blog will focus on evolving data center network infrastructure for supporting AI/ML workloads, while the next blog will discuss the Cisco compute strategy and innovations for mainstreaming AI.

As discussed in part one of the blog series, artificial intelligence (AI) and machine learning (ML) have experienced a steep investment trajectory in recent years, catapulted by generative AI. This has opened up new opportunities to deliver actionable insights and real-world problem-solving capabilities.

Generative AI requires a significant amount of processing power and higher networking performance to deliver results rapidly. Hyperscalers have led the AI revolution with mass-scale infrastructure using thousands of graphics processing units (GPUs) to process petabytes of data for AI workloads, such as training models. Many organizations, including enterprise, public sector, service providers, and Tier 2 web-scalers, are exploring or starting to use generative AI with training and inference models.

To process AI/ML workloads or jobs that involve large data sets, it’s necessary to distribute them across multiple GPUs in an AI/ML cluster. This helps balance the load through parallel processing and deliver high-quality results quickly. To achieve this, it’s essential to have a high-performance network that supports a non-blocking, low-latency, lossless fabric. Without such a network, latency or packet drops can cause learning jobs to take much longer to complete, or to not complete at all. Similarly, when running AI inferencing in edge data centers, it’s vital to have a robust network to deliver real-time insights to a large number of end-users.

Why Ethernet?

The foundation for most networks today is Ethernet, which has evolved from use in 10Mbps LANs to WANs with 400GbE ports. Ethernet’s adaptability has allowed it to scale and evolve to meet new demands, including those of AI. It has successfully overcome challenges such as scaling past DS1, DS3, and SONET speeds, while maintaining the quality of service for voice and video traffic. This adaptability and resilience have allowed Ethernet to outlast alternatives such as Token Ring, ATM, and frame relay.

To help improve throughput and lower compute and storage traffic latency, the remote direct memory access (RDMA) over Converged Ethernet (RoCE) network protocol is used to support access to memory on a remote host without CPU involvement. Ethernet fabrics with RoCEv2 protocol support are optimized for AI/ML clusters: they offer widely adopted standards-based technology, easier migration for Ethernet-based data centers, proven scalability at lower cost-per-bit, and advanced congestion management to help intelligently control latency and loss.

According to the Dell’Oro Group, AI networks will act as a catalyst to accelerate the transition to higher speeds. Market demand from “Tier 2/3 and large enterprises are forecast to be significant, approaching $10 B over the next five years,” and they are expected to prefer Ethernet.

Why Cisco AI infrastructure?

We have made significant investments in our data center networking portfolio for AI infrastructure across platforms, software, silicon, and optics. This includes Cisco Nexus 9000 Series switches, Cisco 8000 Series Routers, Cisco Silicon One, network operating systems (NOSs), management, and Cisco Optics (see Figure 1).

Figure 1. Cisco AI/ML data center infrastructure solutions

This portfolio is designed for data center Ethernet networks transporting AI/ML workloads, such as running inference models on Cisco Unified Computing System (UCS) servers. Customers need choices, which is why we’re providing flexibility with different options.

Cisco Nexus 9000 Series switches are integrated solutions that deliver high throughput and provide congestion management to help reduce latency and traffic drops across AI/ML clusters. Cisco Nexus Dashboard helps view and analyze telemetry, and can help quickly configure AI/ML networks with automation, including congestion parameters, ports, and adding leaf/spine switches. This solution provides AI/ML-ready networks for customers to meet the key requirements, with a blueprint for network infrastructure and operations.

Cisco 8000 Series Routers support disaggregation for data center use cases requiring high-capacity open platforms using Ethernet, such as AI/ML clusters in the hyperscaler segment. For these use cases, the NOS on the Cisco 8000 Series Routers can be third-party or Software for Open Networking in the Cloud (SONiC), which is community-supported and designed for customers needing an open-source solution. Cisco 8000 Series Routers also support IOS XR software for other data center routing use cases, including super-spine, data center interconnect, and WAN.

Our solutions portfolio leverages Cisco Silicon One, which is Cisco chip innovation based on a unified architecture that delivers high performance with resource efficiency. Cisco Silicon One is optimized for latency control with AI/ML clusters using Ethernet, telemetry-assisted Ethernet, or fully scheduled fabric. Cisco Optics enable high throughput on Cisco routers and switches, scaling up to 800G per port to help meet the demands of AI infrastructure.

We’re also helping customers with their budgetary and sustainability goals through hardware and software innovation. For example, system scalability and Cisco Silicon One power efficiency help reduce the amount of resources required for AI/ML interconnects. Customers can gain network visibility into actual power utilization and carbon footprint, such as kWh, cost, and CO2 emissions, via Cisco Nexus Dashboard Insights.
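As a rough illustration of how the three figures mentioned above (kWh, cost, and CO2 emissions) relate to one another, the sketch below derives them from an average power draw. This is generic arithmetic, not a description of how Cisco Nexus Dashboard Insights computes its values. The function name, the electricity price, and the grid emission factor are all hypothetical and would need to be replaced with site-specific numbers.

```python
def energy_footprint(avg_power_watts: float, hours: float,
                     price_per_kwh: float, kg_co2_per_kwh: float) -> dict:
    """Convert an average power draw into energy, cost, and emissions.

    All inputs are assumptions supplied by the caller:
      avg_power_watts  - average draw of a device over the period
      hours            - length of the period
      price_per_kwh    - local electricity price (hypothetical)
      kg_co2_per_kwh   - grid emission factor (hypothetical)
    """
    kwh = avg_power_watts / 1000.0 * hours      # W -> kW, then kW * h
    return {
        "kwh": kwh,
        "cost": kwh * price_per_kwh,            # currency units
        "kg_co2": kwh * kg_co2_per_kwh,         # kilograms of CO2
    }


if __name__ == "__main__":
    # Hypothetical switch drawing 1.5 kW on average for one day.
    print(energy_footprint(1500, 24, price_per_kwh=0.12, kg_co2_per_kwh=0.4))
```

Because cost and emissions both scale linearly with kWh, any reduction in average power draw lowers all three figures by the same proportion.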

With this AI/ML infrastructure solutions portfolio, Cisco is helping customers deliver high-quality experiences for their end-users with fast insights, through sustainable, high-performance AI/ML Ethernet fabrics that are intelligent and operationally efficient.

Is my data center ready to support AI/ML applications?

Data center architectures need to be designed properly to support AI/ML workloads. To help customers accomplish this goal, we applied our extensive data center networking experience to create a data center networking blueprint for AI/ML applications (see Figure 2), which discusses how to:

Build automated, scalable, low-latency Ethernet networks with support for lossless transport, using congestion management mechanisms such as explicit congestion notification (ECN) and priority flow control (PFC) to support RoCEv2 transport for GPU memory-to-memory transfer of information.
Design a non-blocking network to further improve performance and enable faster completion rates of AI/ML jobs.
Quickly automate configuration of the AI/ML network fabric, including congestion management parameters for quality-of-service (QoS) control.
Achieve different levels of visibility into the network through telemetry to help quickly troubleshoot issues and improve transport performance, such as real-time congestion statistics that can help identify ways to tune the network.
Leverage the Cisco Validated Design for Data Center Networking Blueprint for AI/ML, which includes configuration examples as best practices on building AI/ML infrastructure.
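Two of the design points above can be made concrete with a small sketch. The first function checks whether a leaf switch is non-blocking by comparing host-facing bandwidth with spine-facing bandwidth. The second models ECN marking with a generic WRED-style curve (no marking below a minimum queue depth, linear ramp up to a maximum, marking everything beyond it). This is a textbook model under stated assumptions, not the blueprint’s configuration or the behavior of any particular switch. All function names, port counts, and threshold values are hypothetical.

```python
from typing import Sequence


def oversubscription_ratio(downlink_gbps: Sequence[float],
                           uplink_gbps: Sequence[float]) -> float:
    """Host-facing bandwidth divided by spine-facing bandwidth on one leaf."""
    if not uplink_gbps or sum(uplink_gbps) <= 0:
        raise ValueError("a leaf needs at least one uplink with bandwidth")
    return sum(downlink_gbps) / sum(uplink_gbps)


def is_non_blocking(downlink_gbps: Sequence[float],
                    uplink_gbps: Sequence[float]) -> bool:
    """A leaf is non-blocking when uplinks can carry all downlink traffic."""
    return oversubscription_ratio(downlink_gbps, uplink_gbps) <= 1.0


def ecn_mark_probability(queue_kb: float, min_kb: float,
                         max_kb: float, max_prob: float) -> float:
    """Generic WRED-style ECN marking probability for a given queue depth.

    Below min_kb nothing is marked; between min_kb and max_kb the
    probability rises linearly to max_prob; at or above max_kb every
    ECN-capable packet is marked.
    """
    if not 0 <= min_kb < max_kb:
        raise ValueError("thresholds must satisfy 0 <= min_kb < max_kb")
    if queue_kb < min_kb:
        return 0.0
    if queue_kb >= max_kb:
        return 1.0
    return max_prob * (queue_kb - min_kb) / (max_kb - min_kb)


if __name__ == "__main__":
    # Hypothetical leaf: 32 x 400G to GPU servers, 32 x 400G to spines.
    print(is_non_blocking([400] * 32, [400] * 32))
    # Hypothetical thresholds: 150 KB min, 3000 KB max, 7% max probability.
    print(ecn_mark_probability(1575, 150, 3000, 0.07))
```

In a lossless RoCEv2 design the ECN thresholds are typically set so that senders are signaled to slow down before queues grow deep enough to trigger PFC pause frames, which is why the two mechanisms are tuned together.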

Figure 2. Cisco AI data center networking blueprint

How do I get started?

Evolving to a next-gen data center may not be simple for all customers, which is why Cisco is collaborating with NVIDIA® to deliver AI infrastructure solutions for the data center that are easy to deploy and manage by enterprises, public sector organizations, and service providers (see Figure 3).

Figure 3. Cisco/NVIDIA partnership

By combining industry-leading technologies from Cisco and NVIDIA, integrated solutions include:

Cisco data center Ethernet infrastructure: Cisco Nexus 9000 Series switches and Cisco 8000 Series Routers, including Cisco Optics and Cisco Silicon One, for high-performance AI/ML data center network fabrics that control latency and loss to enable better experiences with timely results for AI/ML workloads
Cisco Compute: M7 generation of UCS rack and blade servers enable optimal compute performance across a broad array of AI and data-intensive workloads in the data center and at the edge
Infrastructure management and operations: Cisco Networking Cloud with Cisco Nexus Dashboard and Cisco Intersight, digital experience monitoring with Cisco ThousandEyes, and cross-domain telemetry analytics with the Cisco Observability Platform
NVIDIA Tensor Core GPUs: Latest-generation processors optimized for AI/ML workloads, used in UCS rack and blade servers
NVIDIA BlueField-3 SuperNICs: Purpose-built network accelerators for modern AI workloads, providing high-performance network connectivity between GPU servers
NVIDIA BlueField-3 data processing units (DPUs): Cloud infrastructure processors for offloading, accelerating, and isolating software-defined networking, storage, security, and management functions, significantly enhancing data center performance, efficiency, and security
NVIDIA AI Enterprise: Software frameworks, pretrained models, and development tools, as well as new NVIDIA NIM microservices, for secure, stable, and supported production AI
Cisco Validated Designs: Validated reference architectures designed to help simplify deployment and management of AI clusters at any scale in a wide range of use cases spanning virtualized and containerized environments, with both converged and hyperconverged options
Partners: Cisco’s global ecosystem of partners can help advise, support, and guide customers in evolving their data centers to support AI/ML applications

Leading the way

Cisco’s collaboration with NVIDIA goes beyond selling existing solutions through Cisco sellers/partners, as more technological integrations are planned. Through these innovations and working with NVIDIA, we’re helping enterprise, public sector, service provider, and web-scale customers on their data center journeys to fully enabled AI/ML infrastructures, including for training and inference models.

We’ll be at NVIDIA GTC, a global AI conference running March 18–21, so visit us at Booth #1535 to learn more.

In the next blog of this series, Jeremy Foster, SVP/GM, Cisco Compute, will discuss the Cisco Compute strategy and innovations for mainstreaming AI.