Benchmarking datacenter KPIs for SONiC

By Kamal Sahu

Modern data centers are simplifying their design by using an IP Clos leaf-spine architecture for their switch fabrics. Border Gateway Protocol (BGP) is the key protocol that keeps the network running behind the scenes. Over the years, BGP has been adapted to many applications, but each application needs to be benchmarked against the KPIs of the industry it serves. Cloud titans and hyperscalers have very strict failover-time requirements in order to maintain the SLAs they commit to. Those requirements are pushed down to the network operators and eventually through the whole supply chain to the NEMs and ASIC manufacturers, so that checks at each level ensure the end goal is met.

SONiC is one of the leading network operating systems (NOS) and is becoming established in today’s datacenters as more and more enterprises embrace it. In our recent interactions with the SONiC community, we came across a common ask: benchmark tests for BGP and ECMP convergence scenarios. We then ran pilot tests with different ASIC varieties in the Keysight lab. Unsurprisingly, we saw a lot of variance from one ASIC to another. See the following sample data from one of the BGP ECMP convergence tests across different ASICs:

Figure 1: Comparison of convergence times across ASICs

This led to a lot of interest in defining a benchmark test that the community can use to make sure switches meet the KPIs of the applications where they will be deployed. We ran different iterations and combinations of this test and presented a comprehensive test plan, along with the preliminary convergence benchmark results, to the community at the test working group meeting. It garnered a lot of interest from community members. The test plan is now part of the community test cases and can be found in the SONiC GitHub repository. The links to this test plan and the meeting are as follows:

Here is a matrix of the tests in summary:

DUT as TOR: Convergence and Resiliency
  - Convergence performance when local links fail
  - Convergence performance when remote links fail (routes are withdrawn)
  - Convergence performance during node failure or maintenance
  - Convergence performance with various local preference values and attributes
  - Convergence performance with LACP and BFD
  - RIB-IN convergence: how quickly the DUT installs routes and forwards traffic

DUT as TOR: Scale and Performance
  - Convergence performance with 32-way ECMP
  - Maximum IPv6 and IPv4 RIB capacity test

DUT as leaf: Convergence and Resiliency
  - RIB-IN convergence: how quickly the DUT installs routes and forwards traffic
  - Convergence performance during local or remote link failure, node failure, or maintenance

DUT as leaf: Scale and Performance
  - Maximum IPv6 and IPv4 RIB capacity test
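
For the convergence tests in this matrix, one common way to estimate data plane convergence time is a loss-derived calculation: with traffic offered at a constant rate, the frames lost during the failure event divided by that rate approximates how long forwarding was interrupted. The sketch below is only an illustration of that arithmetic and is not taken from the SONiC test plan or from any Keysight API. The function name and its parameters are hypothetical, and it assumes a constant transmit rate and that all loss is caused by the convergence event.

```python
def loss_derived_convergence_ms(tx_frames: int, rx_frames: int, tx_rate_fps: float) -> float:
    """Estimate data plane convergence time in milliseconds from frame loss.

    Hypothetical helper for illustration. Assumes traffic was sent at a
    constant rate (tx_rate_fps, frames per second) and that every lost
    frame was lost because of the convergence event under test.
    """
    if tx_rate_fps <= 0:
        raise ValueError("tx_rate_fps must be positive")
    if tx_frames < 0 or rx_frames < 0:
        raise ValueError("frame counts must be non-negative")
    if rx_frames > tx_frames:
        raise ValueError("rx_frames cannot exceed tx_frames")

    lost_frames = tx_frames - rx_frames
    # lost frames / (frames per second) = seconds of outage; scale to ms.
    return lost_frames * 1000.0 / tx_rate_fps


if __name__ == "__main__":
    # Example: 1,000 frames lost at 100,000 frames per second.
    outage_ms = loss_derived_convergence_ms(1_000_000, 999_000, 100_000.0)
    print(f"Estimated convergence time: {outage_ms:.1f} ms")
```

With the example values, 1,000 lost frames at 100,000 frames per second correspond to an estimated outage of 10 ms. Running the same calculation per failure scenario and per ASIC gives comparable numbers, provided the offered rate is the same in each run.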

This will feel like déjà vu to experienced network engineers, and it underlines the need for high-precision test tools that can measure control plane and data plane convergence to give visibility into system performance. The SONiC community tests lacked such high-precision measurement, which the Keysight tools now add. Browse the following links to find more information on how Keysight is accelerating the adoption of SONiC by hardening its quality:
