
SONiC – Works Good, Lasts a Long Time

2021-05-13  |  7 min read 

Back in the day, the valley (Silicon Valley) was a somewhat different place. Mountain View still bore traces of a mixed rough/military crowd, including gems like St. James Infirmary – a wonderful bar with 100+ beers on tap, featuring a 25’ tall Wonder Woman and other eclectic décor, that was popular with Moffett Field fliers and tech geeks until it burned down in the late ’90s – apparently a victim of the sort of flammability that not very profitable businesses tend to suffer. Now Mountain View is a vastly different place. Moffett Field is still there, but on Castro Street downtown you are much more likely to run into typical tech brogrammers than anything else.

Back then, some of the most exalted in the world of tech were not the brogrammers but rather the Unix Beards. They came in different somatotypes, usually rail thin or 3XL+ gamer plump, but almost invariably graced with great bushy beards. Indeed it was rumored that the bigger and more unkempt the beard, the stronger the /root. I recall having struck up a TGIF conversation with one such fellow, generous of both beard and girth, who vouched his great love for the OS in a way that only one well on the spectrum could: “Works good, lasts a long time,” – an invocation he repeated like a mantra for just about anything he held in high esteem. UNIX – works good, lasts a long time. Classic air-cooled BMW motorcycles – works good, lasts a long time. Old HP calculators – works good, lasts a long time.

This brings us to SONiC. I’ll be up front: we have no horse in this race. Keysight doesn’t make or sell network operating systems, nor do we sell aggregated network switches or routers containing a network operating system. We do, however, sell products and solutions that help NEMs, hyperscalers and folks running data centers test and validate network gear.

Recently we have seen a lot of excitement in the industry around the open network operating system, SONiC. There are some good reasons for this. One is the price, which at $FREE is viewed by many as being “right.” But the charm of SONiC goes beyond its price. Another is SAI, the Switch Abstraction Interface. This is the piece that lets you take the switch OS, in this case SONiC, and port it to just about any switch hardware that supports SAI, because SAI provides the abstraction layer that bridges the OS, which doesn’t change, to the switch hardware, which often does. A third is the modular, containerized architecture, in which network services like BGP, SNMP, DHCP and IPv6 support are split into modules and run in containers, with obvious benefits to maintainability and supportability.
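
For readers who like to see the idea in code, here is a minimal sketch of what an abstraction layer of this kind buys you. It is not the real SAI, which is a C API; the class and function names (`SwitchAbstraction`, `VendorAChip`, `VendorBChip`, `install_default_route`) and the addresses are made up purely for illustration.

```python
from abc import ABC, abstractmethod


class SwitchAbstraction(ABC):
    """Hypothetical stand-in for the role SAI plays; not the real SAI API."""

    @abstractmethod
    def create_route(self, prefix: str, next_hop: str) -> str:
        """Program one route into the forwarding hardware."""


class VendorAChip(SwitchAbstraction):
    # Made-up vendor driver: translates the generic call to chip A's terms.
    def create_route(self, prefix: str, next_hop: str) -> str:
        return f"vendor-a: program {prefix} -> {next_hop}"


class VendorBChip(SwitchAbstraction):
    # A second made-up driver with a different hardware-specific action.
    def create_route(self, prefix: str, next_hop: str) -> str:
        return f"vendor-b: write LPM entry {prefix} via {next_hop}"


def install_default_route(asic: SwitchAbstraction) -> str:
    """The 'OS' side: the same code no matter which chip sits underneath."""
    return asic.create_route("0.0.0.0/0", "10.0.0.1")


if __name__ == "__main__":
    for chip in (VendorAChip(), VendorBChip()):
        print(install_default_route(chip))
```

The point of the sketch is that `install_default_route` never changes when the hardware does; only the vendor-specific class behind the interface is swapped.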

So on the surface, SONiC seems like something that should be of considerable interest to NEMs, hyperscalers and even large enterprises that view the data center as a strategic advantage. And it is. Now we also have data that helps explain why.

Recently, in the April 2021 edition of ACM SIGCOMM Computer Communication Review, Singh et al. wrote about Microsoft’s experiences with 180,000 switches in 130 locations over three months, in “Surviving switch failures in cloud data centers”: http://www.racheesingh.com/papers/sigcomm-ccr-final465-with-open-review.pdf

Some interesting conclusions:

  1. Who makes your switch matters – across three vendors in the study, the vendor most prone to failure was twice as likely to fail as the others.
  2. The majority of failures (59%) were due to hardware (32%) or power failure (27%).
  3. Here’s the money shot – “SONiC switches have a significantly lower likelihood to fail” and “replacing vendor switch OSes with SONiC has been beneficial in improving the resilience of data center switches.”

[Figure: SONiC vs. non-SONiC switch reliability. Caption: SONiC – works good, lasts a long time]

Of course, one of the first questions for most would be something to the effect of “what makes SONiC switches more robust?” One possible explanation is that when you run an open source solution such as SONiC with a robust in-house team of experts who wrote the NOS, the develop-test-deploy cycle is greatly accelerated and can take place over days, while vendor upgrades and patches roll out on a timescale closer to several months. This implies that even when the cause of an issue with a vendor OS is known, the problem will continue to occur until the patch or update is finally available.
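
To make that reasoning concrete, here is a back-of-the-envelope sketch. The numbers are invented for illustration and do not come from the study; the function name `expected_repeat_failures` is hypothetical, and the model simply assumes a known bug keeps triggering at a steady rate until the fix is deployed.

```python
def expected_repeat_failures(failures_per_day: float, days_to_patch: float) -> float:
    """Expected recurrences of a known bug while waiting for the fix.

    Assumes a constant failure rate across the fleet, which is a
    simplification chosen for illustration only.
    """
    return failures_per_day * days_to_patch


if __name__ == "__main__":
    rate = 0.5  # made-up fleet-wide failures per day caused by one known bug
    print(expected_repeat_failures(rate, 3))   # fix deployed in days
    print(expected_repeat_failures(rate, 90))  # fix deployed in months
```

Under these made-up inputs, the only thing that differs between the two cases is how long the fleet stays exposed, which is the argument in a nutshell.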

So, to wrap things up: while Microsoft’s experience with SONiC powering its Azure data centers may in some ways be unique, it has also been very positive so far, with SONiC switches proving more reliable than vendor NOS switches.

Like the UNIX Beard said – SONiC – works good, lasts a long time.

Thanks for reading.
