Allowing Path Computation Engine (PCE) scalability without impact to the switch control plane.
As we saw in parts 1 and 2 of this series, large-scale global network implementations are complex because of their size and geographical distribution. They also tend to have less reliable network infrastructure for control traffic, and most SDN controllers struggle with this.
A key strength of OpenKilda is the decoupling of the Path Computation Engine from the switch control functionality. This decoupling minimises its vulnerability to high-latency, unreliable control planes that cause other SDN controllers to fall into a constant state of churn from trying to re-converge the network.
Although SDN controllers are complex, their two most important and computationally expensive parts are the Path Computation Engine (PCE) and the South-Bound Interface (SBI).
The PCE is tasked with calculating routes for different traffic through the infrastructure, given both hard connection requirements and the current state of the links between switches, while also considering constraints such as bandwidth and quality of service (QoS).
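To make the PCE's job concrete, here is a minimal sketch of constrained path computation. It is not OpenKilda's actual algorithm: it assumes a hypothetical link record with the fields `a`, `b`, `cost`, `available_bw` and `up`, prunes links that are down or lack the requested bandwidth, and then runs a plain Dijkstra search over what remains.

```python
import heapq


def compute_path(links, src, dst, min_bandwidth):
    """Return (cost, path) for the cheapest usable path, or None if there is none.

    `links` is a list of dicts with hypothetical keys:
    a, b (switch names), cost, available_bw, up.
    """
    # Build an adjacency list, dropping links that fail the constraints.
    graph = {}
    for link in links:
        if link["up"] and link["available_bw"] >= min_bandwidth:
            graph.setdefault(link["a"], []).append((link["b"], link["cost"]))
            graph.setdefault(link["b"], []).append((link["a"], link["cost"]))

    # Standard Dijkstra search over the pruned graph.
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, link_cost in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbour, path + [neighbour]))
    return None


if __name__ == "__main__":
    example_links = [
        {"a": "s1", "b": "s2", "cost": 1, "available_bw": 100, "up": True},
        {"a": "s2", "b": "s3", "cost": 1, "available_bw": 10, "up": True},
        {"a": "s1", "b": "s3", "cost": 5, "available_bw": 100, "up": True},
    ]
    # A flow needing 50 units cannot use the s2-s3 link, so it takes the direct link.
    print(compute_path(example_links, "s1", "s3", 50))
```

Pruning before searching keeps the hard requirements (link up, enough bandwidth) separate from the cost being minimised, which is one common way to handle constraints; a production PCE would typically also weigh QoS and path diversity.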
The SBI handles connecting to the deployed switches and updating their flow tables in accordance with the configuration calculated by the PCE. Additionally, the SBI is responsible for collecting telemetry from the switches, which is passed as an input to the PCE process.
As discussed in earlier posts, when an SDN-controlled WAN footprint grows, the overhead of maintaining the switches’ flow tables and returning telemetry feeds means that separating the PCE from the core modules of the SDN Controller becomes necessary due to load alone.
There are, however, a number of other reasons to separate the functions that are not as immediately obvious.
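The separation can be pictured as two components that only exchange messages. The sketch below is an illustration under stated assumptions, not OpenKilda code: Python's in-process `queue.Queue` stands in for a real message bus, and the function names and message shapes (`pce_drain`, `sbi_drain`, the `reroute`/`restore` actions) are hypothetical.

```python
import queue


def pce_drain(telemetry_q, link_state, updates_q):
    """Consume pending telemetry; publish one update per link whose state changed."""
    changed = 0
    while True:
        try:
            link_id, is_up = telemetry_q.get_nowait()
        except queue.Empty:
            break
        # Ignore repeated reports so a flapping feed does not trigger recomputation.
        if link_state.get(link_id) != is_up:
            link_state[link_id] = is_up
            action = "restore" if is_up else "reroute"
            updates_q.put({"link": link_id, "action": action})
            changed += 1
    return changed


def sbi_drain(updates_q, applied):
    """Consume pending path updates; a real SBI would program switches here."""
    while True:
        try:
            applied.append(updates_q.get_nowait())
        except queue.Empty:
            break
    return applied


if __name__ == "__main__":
    telemetry_q, updates_q = queue.Queue(), queue.Queue()
    link_state = {"s1-s2": True}
    # The SBI side reports the same failure twice; the PCE side acts on it once.
    telemetry_q.put(("s1-s2", False))
    telemetry_q.put(("s1-s2", False))
    print(pce_drain(telemetry_q, link_state, updates_q))
    print(sbi_drain(updates_q, []))
```

Because neither side calls the other directly, each can be scaled or relocated independently, and a slow or lossy control channel delays messages rather than blocking path computation.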
With geographically diverse or high-latency WAN implementations, there is often extra information to take into account when calculating paths, such as WAN link cost and congestion. This information is not available through the SBI telemetry feed and must be fed into the PCE some other way. How do we ingest the extra information from external systems without needing to plumb it deep into our production network?
Decoupling the PCE allows it to be deployed further from the switches than the SBI, logically closer to external systems (portals, databases, users) we don’t want near the switch layer, and physically closer to enterprise storage solutions. The former means we can gather information without compromising our data plane environment, sending only path updates to the SBI. The latter gives us access to storage for telemetry data, useful for capacity management, visualization, and operational support tools.
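As a rough illustration of combining the two sources of information, the sketch below merges per-link state from the SBI telemetry feed with cost and congestion data from an external system before path computation. The function name, field names and default cost are all assumptions made for the example, not part of OpenKilda.

```python
def merge_link_inputs(telemetry, external_costs, default_cost=1):
    """Combine SBI telemetry with externally supplied link attributes.

    telemetry:      {link_id: {"up": bool, "available_bw": int}}   (from the SBI)
    external_costs: {link_id: {"cost": int, "congested": bool}}    (from outside)
    Links with no external entry fall back to default_cost and not congested.
    """
    merged = []
    for link_id, state in telemetry.items():
        extra = external_costs.get(link_id, {})
        merged.append({
            "id": link_id,
            "up": state["up"],
            "available_bw": state["available_bw"],
            "cost": extra.get("cost", default_cost),
            "congested": extra.get("congested", False),
        })
    return merged


if __name__ == "__main__":
    telemetry = {
        "lon-nyc": {"up": True, "available_bw": 400},
        "nyc-sfo": {"up": False, "available_bw": 0},
    }
    # Hypothetical feed from a commercial or capacity-planning system.
    external = {"lon-nyc": {"cost": 25, "congested": True}}
    for link in merge_link_inputs(telemetry, external):
        print(link)
```

The merge happens entirely on the PCE side, so the external feed never has to reach the switch layer; only the resulting path updates do.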
As an additional bonus, the ability to horizontally scale the PCE infrastructure, coupled with access to historical data, opens pathways to tools from Big Data and Machine Learning disciplines. Statistical modelling, long-term trend analysis and policy enforcement can all be added over time without affecting our underlying control of the switches themselves.
With ever more complex networks being deployed, harnessing the best available tools is critical. By separating SBI and PCE functionality, we allow a much more flexible approach to adopting tools without affecting our ability to control the data plane. This vital functionality, together with the ability to manage and process large amounts of data, whether OpenFlow messages or telemetry, using web-scale packages like Kafka, makes OpenKilda the only SDN controller that can be classed as truly web-scale.