Critical Differences Between Deploying SD-LAN and SD-WAN in the SDN World
As we saw in part 1 of this series, country-scale SD-WANs face challenges beyond those of their more data-centre-focused SD-LAN counterparts. This post covers how two common maintenance paradigms from the server infrastructure automation space, Repave and Update, are being used in both of these SDN scenarios.
First, a quick overview of the two widely used methods of maintaining the state of larger IT systems. The aim is to provide organizations a way of installing and updating their infrastructure easily without adversely affecting users in the process.
The Repave method of system maintenance treats every intended change as a trigger to reboot and reinstall operating systems and software. Updates are implemented in the same way as fresh installs, and follow these general steps:
- Migrate all user services (VMs, networks, HA configurations) seamlessly away from the host in question
- Reboot and reinstall the server with a new known image, including software changes
- Migrate user services back to the host after it joins the pool of resources again
In environments at scale, where redundancy and high availability are built in, this method can be extremely efficient and pose less risk than constant small updates. The infrastructure supporting the servers to be updated (load balancers, shared storage) renders the hosts themselves a fungible resource.
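The three repave steps above can be sketched in a few lines of Python. This is a toy model rather than any real orchestration tool: the `Host` class, the `repave` function and the image names are all hypothetical, and "migration" is just moving strings between lists.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a server in a resource pool."""
    name: str
    image: str
    services: list = field(default_factory=list)
    in_pool: bool = True


def repave(host, pool, new_image):
    """Drain, reinstall and refill one host; returns the services moved."""
    others = [h for h in pool if h is not host and h.in_pool]
    if not others:
        raise RuntimeError("no spare capacity to drain onto")

    # Step 1: migrate user services away from the host.
    moved = list(host.services)
    host.in_pool = False
    for i, svc in enumerate(moved):
        others[i % len(others)].services.append(svc)
    host.services.clear()

    # Step 2: reinstall from a known image (a simple assignment here).
    host.image = new_image

    # Step 3: rejoin the pool and migrate the services back.
    host.in_pool = True
    for svc in moved:
        for h in others:
            if svc in h.services:
                h.services.remove(svc)
                break
        host.services.append(svc)
    return moved
```

Note that the sketch refuses to start when there is nowhere to drain to, which is the redundancy assumption the method relies on.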
In contrast, the Update method rolls out small changes to hosts over time without completely rebuilding them, which generally means that user services don’t need to go through a migration step before or after updates. There are obviously circumstances where a repave or migration is necessary, but updates more commonly take the following form:
- System state is checked against new baseline configuration
- Updates to services and software are applied to live hosts
This approach can work extremely well, but it also increases the risk of systems diverging from the known baseline over time. For this reason, in systems at scale, this method is used less often once the appropriate support services are in place.
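As a rough illustration of the update steps and of the drift risk, the sketch below compares a host's state with a baseline and applies only the differences. The function names and the package names are made up for the example, and state is modelled as a plain dictionary of package to version.

```python
def plan_update(current, baseline):
    """Compare live state with the baseline.

    Returns (changes, drift): `changes` maps each key to the value the
    baseline wants, `drift` lists keys on the host the baseline omits.
    """
    changes = {}
    for key, wanted in baseline.items():
        if current.get(key) != wanted:
            changes[key] = wanted
    drift = sorted(k for k in current if k not in baseline)
    return changes, drift


def apply_update(current, changes):
    """Apply the planned changes to the live host state in place."""
    current.update(changes)
    return current


# Hypothetical host state and baseline, for illustration only.
host = {"pkg-a": "1.0", "pkg-b": "2.4", "pkg-debug": "0.3"}
baseline = {"pkg-a": "1.1", "pkg-b": "2.4"}

changes, drift = plan_update(host, baseline)
apply_update(host, changes)
```

After the update `pkg-a` matches the baseline, but `pkg-debug` is still installed. Nothing in the update path removes it, which is the kind of divergence a repave would have wiped out.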
These practices are well-known in the server automation spaces, but how are they applied in SDN environments?
If we look once again at OpenFlow-based SDNs, similar concepts are being applied to the way flows are handled when a switch connects or reconnects to a controller.
OpenDaylight views these connection events as a trigger to remove the existing flow rules and repave the flows from its in-memory model. Any existing flows are deleted, affecting network traffic until new ones are laid down. In a data-centre-centric installation, this is not so much of a problem: the chance of losing connectivity between controllers and switches is low. Between negligible latency and robust redundancy at a local scale, instances of falsely flagging a switch as offline and starting a repave are rare. The resulting outages can be transparently handled by redundant links from the connected server infrastructure.
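To make the repave-on-connect behaviour concrete, here is a minimal sketch of the pattern. It is not OpenDaylight code or its API: the function name is hypothetical, flow tables are plain dictionaries of match to action, and the tuples stand in for the delete and add flow-mod messages a controller would send over OpenFlow.

```python
def repave_on_connect(switch_flows, model_flows):
    """Return the control messages a repave-style controller would send.

    `switch_flows` is mutated to mirror what happens on the switch.
    """
    messages = [("delete_all", None)]
    switch_flows.clear()  # from here on, traffic has no matching rules...
    for match, action in model_flows.items():
        messages.append(("add", (match, action)))
        switch_flows[match] = action  # ...until each rule is laid back down
    return messages
```

Even when the switch already holds exactly the flows in the model, this sends one delete plus one add per flow, so the message count grows with the size of the table rather than with the amount of actual change.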
In an SD-WAN, however, the outcome can be catastrophically different.
Controllers are generally more centrally located, with child switches distributed over large management networks. Multiple co-located switches sit in clusters in these management networks, receiving updates from their remote SDN controllers. Customer services connecting to these SD-WAN sites are typically minimally redundant edge connections using a local POP to hook into the larger infrastructure.
Now imagine that the SDN management connection to these remote sites is broken or congested. The switches will happily continue using their in-memory flow tables up until the point when the management connection is restored. At this point, the controller views the switches as having reconnected, and starts repaving the entire remote site in a series of parallel updates to all the switches. Not only does this kill all traffic flows, it can easily lead to management network congestion and to full network failure cascades as thousands of flows are laid in.
OpenKilda and similar controllers take the update approach in this scenario.
As switches reconnect after an outage (real or perceived by the controller), an inventory of the current flow table is created, while leaving the current table intact. This allows for a much subtler approach to manipulating the flows in the event of an outage. Firstly, the controller can take its time to reconcile the actual outage state. Extra telemetry from neighboring switches or the management network itself can provide critical context, instead of a binary action taken on a reconnection.
If updates are required, they can be managed individually, allowing much smaller sets of changes with a smaller blast radius.
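The update approach can be sketched as a diff between the table a switch reports and the table the controller wants. This is an illustration of the general idea, not OpenKilda's implementation: `reconcile` and `batches` are hypothetical helpers, and flow tables are again plain dictionaries of match to action.

```python
def reconcile(switch_flows, model_flows):
    """Diff the reported table against the model without touching the switch.

    Returns the minimal list of (operation, match, action) tuples.
    """
    ops = []
    for match, action in model_flows.items():
        if match not in switch_flows:
            ops.append(("add", match, action))
        elif switch_flows[match] != action:
            ops.append(("modify", match, action))
    for match in switch_flows:
        if match not in model_flows:
            ops.append(("delete", match, None))
    return ops


def batches(ops, size):
    """Split the operations into small groups that can be sent one at a time."""
    for i in range(0, len(ops), size):
        yield ops[i:i + size]
```

A switch whose table already matches the model produces an empty list, so a false reconnection costs nothing on the data plane. When changes are needed, sending them in small batches is one way to keep the blast radius and the load on the management network down.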
The differences between SD-LAN and SD-WAN are subtle but crucial, and context is important.
Join us next week when we discuss the advantages of separating the OpenFlow control channel from the Path Computation Engine.