Do you really know how long a network failure takes to recover?
2020-10-28 | 7 min read
Customers routinely engage our Professional Services team here at Keysight to test new infrastructure and configurations. We test to validate that the network equipment performs as designed and is ready to move into a production environment. During these engagements it is common for some test results to reveal sub-optimal performance or configuration issues. We then work with the customer to change the configuration and retest until the results are acceptable, at which point testing moves on to the next phase.
Testing and validation is an essential stage of any new network deployment or transformation. Testing validates the performance characteristics of a new design or equipment deployment, confirms that future capacity requirements can be met, uncovers configuration errors, and verifies the effectiveness of the security devices protecting your organization against sophisticated attacks.
During a recent engagement with a retail bank, the design and infrastructure that was being tested had a Web Application Firewall (WAF) deployed in the traffic path at the edge of the network. It was deployed in a High Availability (HA) configuration with redundant Active/Standby network paths, so traffic could not enter or leave the network without inspection by the WAF.
The time quoted by the WAF vendor to fail over between the Active and Standby nodes was approximately 500 milliseconds (ms). The customer deemed this acceptable during the vendor selection process.
One key item that wasn’t considered during selection was how long it takes for the Standby node to detect a failure of the Active node and take over its role.
To detect a failure, the WAF sends out heartbeats and waits for their return. A heartbeat takes approximately 4 milliseconds to return under normal operation, and heartbeats can be sent no more often than once per second. This means the actual minimum failover time is just under 1.5 seconds (the minimum heartbeat interval plus the failover time), which assumes the failure occurs just before a heartbeat is sent. With the failure criterion set to a single unreturned heartbeat, the outage actually lasts between 1.496 and 2.496 seconds. Declaring a failure on a single missed heartbeat would be deemed an aggressive configuration, and in practice the times will be longer.
A less aggressive and more typical configuration would wait for three heartbeats not to be returned before declaring a failure. In this case, the outage would last between 3.496 and 4.496 seconds.
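The arithmetic behind these figures can be sketched in a few lines of Python. The formula is inferred from the numbers quoted above rather than taken from the WAF vendor: the best case is the number of missed heartbeats multiplied by the heartbeat interval, plus the quoted failover time, less the roughly 4 ms heartbeat round trip, and the worst case is one further interval. The function and parameter names are illustrative only.

```python
def outage_window_ms(missed_heartbeats, interval_ms=1000, failover_ms=500, rtt_ms=4):
    """Return (best_case, worst_case) outage in milliseconds.

    Assumed model, inferred from the figures quoted in the article:
    best case  = missed * interval + failover - heartbeat round trip
    worst case = best case + one further heartbeat interval
    """
    best = missed_heartbeats * interval_ms + failover_ms - rtt_ms
    worst = best + interval_ms
    return best, worst


for missed in (1, 3):
    best, worst = outage_window_ms(missed)
    print(f"{missed} missed heartbeat(s): {best / 1000:.3f} s to {worst / 1000:.3f} s")
```

Working in whole milliseconds keeps the sums exact and avoids floating-point rounding in the results.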
Every 4.5 seconds, approximately 12,000 logons to the bank’s Online Banking website or phone app and 450 credit and debit card transactions occur. A failover of this length would cause every one of those transactions to fail, resulting in significant lost revenue and a negative customer experience.
When we discussed this with the customer, we advised that an alternative topology was available. It would reduce failover times significantly, to approximately 200 ms, and improve the scalability, performance and resilience of their edge network security toolset.
This alternative topology was to deploy an Ixia Bypass Switch combined with an Ixia Vision Network Packet Broker (NPB), with the WAF and any other edge security and monitoring tools connected to the NPB.
The heartbeat/health check interval on the Vision NPB can be configured as low as 50 ms. Even with three failed health checks required, the failover time is reduced to under 200 ms, a huge improvement on the 4.5 seconds before the NPB was deployed.
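To put the two outage lengths side by side, the sketch below scales the transaction volumes quoted earlier (about 12,000 logons and 450 card transactions per 4.5 seconds) to a 200 ms outage. It assumes traffic arrives at a uniform rate, which real banking traffic will not do, so treat the output as an order-of-magnitude illustration rather than a measurement. The function names are made up for this example.

```python
# Volumes quoted in the article for a 4.5 second window.
WINDOW_S = 4.5
LOGONS_PER_WINDOW = 12_000
CARD_TXNS_PER_WINDOW = 450


def detection_time_ms(interval_ms, failed_checks):
    """Time for the configured number of consecutive health checks to fail."""
    return interval_ms * failed_checks


def affected(events_per_window, outage_s, window_s=WINDOW_S):
    """Events falling inside an outage, assuming a uniform arrival rate."""
    return events_per_window * outage_s / window_s


print(detection_time_ms(50, 3))  # 150 (ms spent detecting the failure)
for outage_s in (4.5, 0.2):
    logons = affected(LOGONS_PER_WINDOW, outage_s)
    cards = affected(CARD_TXNS_PER_WINDOW, outage_s)
    print(f"{outage_s} s outage: ~{logons:.0f} logons, ~{cards:.0f} card transactions")
```

Under that uniform-rate assumption, a 200 ms outage touches roughly 533 logons and 20 card transactions, compared with 12,000 and 450 for the 4.5 second case.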
The Vision NPB can also send Negative Health Checks. A negative health check is one that the tool should block. If it is returned, the NPB knows there is something wrong with the tool and the failure policy can be actioned.
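The decision logic for combining the two kinds of health check can be sketched as follows. This illustrates the idea only and is not the Vision NPB's implementation or API. The function names, arguments and the threshold of three consecutive failures are assumptions made for the example.

```python
def tool_is_healthy(positive_returned, negative_returned):
    """Evaluate one round of health checks for an inline tool.

    positive_returned: the normal health check came back through the tool.
    negative_returned: a check the tool should have blocked came back anyway.
    """
    return positive_returned and not negative_returned


def should_fail_over(rounds, threshold=3):
    """Trigger the failure policy after `threshold` consecutive unhealthy rounds.

    `rounds` is an iterable of (positive_returned, negative_returned) tuples,
    oldest first. The threshold of 3 is an assumed setting.
    """
    consecutive = 0
    for positive_returned, negative_returned in rounds:
        if tool_is_healthy(positive_returned, negative_returned):
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= threshold:
                return True
    return False


# The tool still passes traffic but has stopped blocking: negative checks leak through.
print(should_fail_over([(True, False), (True, True), (True, True), (True, True)]))  # True
```

The case shown is the one a positive-only health check would miss: the tool forwards everything, so it looks alive, yet it is no longer inspecting traffic.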
The WAF and other tools can be deployed in Active/Active configurations even when the network links are configured as Active/Standby. This allows for increased processing capacity, reduced failover times, and more efficient use of your assets.
Multiple Bypass Switches can be connected to a single High Availability NPB fabric, which allows traffic from different network segments to be inspected on the same instance of the tool regardless of the segment it arrives on.
Session Aware Load Balancing allows network traffic to be spread across multiple instances of a tool, improving toolset resilience and adding processing capacity. This is hard to achieve with an inline tool deployed directly on the wire without buying large, expensive versions of the tool, which often still cannot process traffic at wire speed.
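One common way to keep a session on a single tool instance is to hash a direction-independent flow key, so that both directions of a connection map to the same instance. The sketch below illustrates that general technique. It does not describe how the Vision NPB implements session-aware load balancing, and the function name and example addresses are made up for illustration.

```python
import hashlib


def pick_tool(src_ip, src_port, dst_ip, dst_port, protocol, tool_count):
    """Map a flow to a tool index so both directions land on the same tool."""
    # Sort the two endpoints so A->B and B->A produce the same key.
    endpoints = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{protocol}|{endpoints[0]}|{endpoints[1]}".encode()
    # Use a stable hash; Python's built-in hash() of strings is salted per process.
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % tool_count


outbound = pick_tool("198.51.100.7", 51514, "192.0.2.10", 443, "tcp", tool_count=4)
inbound = pick_tool("192.0.2.10", 443, "198.51.100.7", 51514, "tcp", tool_count=4)
print(outbound == inbound)  # True: both directions reach the same tool instance
```

A production load balancer would also need to cope with a tool instance failing without reshuffling every existing session, which simple modulo hashing does not do.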
One last benefit of the NPB deployment was that it included Active-SSL decryption, a function previously carried out on the WAF. Decrypting SSL/TLS traffic on the NPB reduced the load on the WAF, allowing its resources to be focused on its core functionality, and allowed a decrypted flow to be sent to any other tools connected to the Packet Broker.
I am keen to discuss how Keysight can improve edge and/or inline security posture. You can reach out to me at email@example.com