How to make sure SONiC is failsafe

By Kamal Sahu | Modern data centers rely on IP Clos architectures that involve a few hundred to thousands of switches. Maintenance and upgrades of these switches are inevitable. A recent Microsoft study examined switch reboots as one of the main points of failure. A reboot is also often used to recover from failures, including hardware and software issues (see Microsoft paper). Many switch and ASIC vendors now support features that let operators upgrade or reboot a particular service, or the system as a whole, with minimal or no impact on services.

Various types of reboot are available, depending on which areas are being updated or maintained. SONiC, as the leading open-source Network Operating System (NOS), supports four reboot options: warm-reboot, fast-reboot, cold-reboot, and soft-reboot. The following table lists example use cases for each reboot type:

[Table: example use cases for each reboot type]
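For orientation, each of the four reboot types is usually triggered by a command run on the switch itself. The sketch below assumes the command names `reboot`, `fast-reboot`, `warm-reboot`, and `soft-reboot` as shipped with the SONiC utilities. The helper `reboot_command` is hypothetical and only builds the command line. Available options and platform support differ by release, so check the target image before relying on it.

```python
from typing import List

# Assumed mapping from reboot type to the SONiC CLI command that triggers it.
# Verify against the release in use; not every platform supports every type.
REBOOT_COMMANDS = {
    "cold": "reboot",
    "fast": "fast-reboot",
    "warm": "warm-reboot",
    "soft": "soft-reboot",
}


def reboot_command(reboot_type: str, use_sudo: bool = True) -> List[str]:
    """Return the argv list for a reboot type (hypothetical helper).

    Accepts either the short form ("warm") or the long form ("warm-reboot").
    """
    key = reboot_type.strip().lower()
    if key.endswith("-reboot"):
        key = key[: -len("-reboot")]
    if key not in REBOOT_COMMANDS:
        raise ValueError(f"unknown reboot type: {reboot_type!r}")
    argv = [REBOOT_COMMANDS[key]]
    return ["sudo"] + argv if use_sudo else argv
```

Returning an argv list rather than a shell string keeps the helper safe to pass to `subprocess.run` or to an SSH wrapper without extra quoting.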

The following design documents can be found on GitHub:

Warm-reboot is one of the most common disruptions in data centers; it is used to upgrade devices or to install a patch for an existing issue in the code. These reloads are triggered not only for routine maintenance but also when a failure is detected, so this is an area of concern that needs to be tested thoroughly. The most important KPI for a data center is minimal service interruption time, and testing these reboot scenarios lets one gauge that KPI. It is one of the most important criteria that network operators should measure when selecting devices.
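One common way to quantify interruption time with a traffic generator is to send traffic at a constant rate through the reboot and convert the number of lost frames into time. The following is a minimal sketch under that assumption. The function name and parameters are illustrative and are not taken from the test plan or from any Keysight API.

```python
def interruption_seconds(tx_frames: int, rx_frames: int, tx_rate_fps: float) -> float:
    """Estimate service interruption from frame loss.

    Assumes traffic was sent at a constant rate of tx_rate_fps frames per
    second for the whole test, so every lost frame represents 1 / tx_rate_fps
    seconds during which the data plane was not forwarding.
    """
    if tx_rate_fps <= 0:
        raise ValueError("tx_rate_fps must be positive")
    lost = tx_frames - rx_frames
    if lost < 0:
        raise ValueError("rx_frames cannot exceed tx_frames")
    return lost / tx_rate_fps


# Illustrative numbers only: 250,000 frames lost at 1,000,000 frames/s.
downtime = interruption_seconds(10_000_000, 9_750_000, 1_000_000.0)
print(f"estimated interruption: {downtime:.3f} s")  # prints 0.250 s
```

The estimate is only as good as the constant-rate assumption. If loss occurs in several separate bursts, this gives the total loss duration, not the longest single outage.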

Thanks to high-precision test and measurement tools from Keysight, we can measure this performance to facilitate vendor selection. Here is a link to the test plan that we reviewed with the SONiC community:

Test plan

As part of the SONiC PlugFest 2021 testing, we ran multiple iterations of the various reboot options and gathered the results in our Keysight lab. These tests were run across various vendor switches and SONiC versions. The results can be used as a metric to qualify and compare SONiC platforms (each a combination of hardware and software). In our conversations with the SONiC end user group (the customer advisory board), comprising hyperscalers, enterprises, and others, as well as with switch and ASIC vendors, we came across the requirement to baseline this performance metric. You can read more in our SONiC PlugFest report. The following is a sample test result for various reboot options:

[Table: sample test results by reboot type]

[Figure: Comparison chart]

Conclusion

The preceding results show considerable variance across the release branches. Interestingly, when we run the same tests with different combinations of ASICs and switch platforms, we also see differences in performance. This indicates that such a criterion can be critical for vendor selection and day operations testing, both to have a successful deployment and to be ready for failures.
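To compare releases or platforms on this metric, the per-iteration measurements can be reduced to a few summary statistics. The sketch below is one possible way to do that. The labels and numbers are placeholders, not PlugFest results, and `summarize` is a hypothetical helper.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize(samples: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Summarize interruption times (seconds) per label.

    `samples` maps a label, such as a release branch or platform name, to the
    interruption time measured in each test iteration.
    """
    summary = {}
    for label, values in samples.items():
        if not values:
            raise ValueError(f"no samples for {label!r}")
        summary[label] = {
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
            "stdev": pstdev(values),  # spread across iterations
        }
    return summary


# Placeholder data for illustration only; these are not measured results.
example = {
    "release-A": [0.10, 0.12, 0.11],
    "release-B": [0.40, 0.90, 0.20],
}
for label, stats in summarize(example).items():
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

Reporting the maximum and the spread alongside the mean matters here, because a platform that is usually fast but occasionally slow can still violate an interruption budget.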

The following are additional links for more information:
