One of our telecommunications customers operated an existing platform and had identified a high load on its operations team, which was detecting and correcting a large number of events occurring within the platform. These events were captured in Zenoss, the platform monitoring system.
Correcting these problems required manual intervention by the operations team, which resulted in higher-than-desired downtime on the platform. Thanks to Zenoss, the customer observed that many of the events fell into a relatively small number of recurring problem types, each with a common rectification process.
To decrease downtime on the platform and reduce the load on the operations team, the customer wanted to automate issue detection and rectification using existing and/or standard tools. This process is called Auto Remediation.
The technical context for the Auto Remediation solution is complex, and the customer's platform includes many hardware and software components.
The solution design needed to cover the full component scope of their platform, as well as address the highly technical context required for Auto Remediation.
The Aptira Solution
Aptira proposed an Auto Remediation solution based on Ansible Tower. Ansible Tower is a web-based solution designed to make Ansible easier for operations teams to use, and it integrates readily with existing systems.
Having analysed the problem and correction patterns, Aptira configured triggers on the Zenoss platform to match these patterns. When an error occurs on the customer's platform, Zenoss generates an error event and executes the matching trigger. Each trigger action is configured to execute a specific workflow on Ansible Tower, which runs Ansible playbooks to perform the corrective action that returns the platform to a normal state.
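As an illustration of the trigger-to-workflow hand-off described above, the following is a minimal Python sketch of a script that a trigger action might call. It is not the customer's actual configuration. It assumes Tower's REST API endpoint for launching a workflow job template and a bearer token for authentication; the event classes, template IDs, the variable names passed in `extra_vars`, and the host name `tower.example.com` are all hypothetical.

```python
import json
import urllib.request

# Hypothetical mapping from a monitoring event class to the ID of the
# Ansible Tower workflow job template that remediates it.
WORKFLOW_BY_EVENT_CLASS = {
    "/Status/Http": 12,
    "/App/Failed": 17,
}


def build_launch_request(tower_url, token, event_class, device):
    """Build (but do not send) the HTTP request that launches a workflow.

    Returns None when no workflow is registered for the event class, so
    unknown events are left for manual handling.
    """
    template_id = WORKFLOW_BY_EVENT_CLASS.get(event_class)
    if template_id is None:
        return None

    base = tower_url.rstrip("/")
    url = f"{base}/api/v2/workflow_job_templates/{template_id}/launch/"

    # extra_vars are handed to the playbooks; these names are assumptions.
    body = json.dumps(
        {"extra_vars": {"target_host": device, "event_class": event_class}}
    ).encode("utf-8")

    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_launch_request(
        "https://tower.example.com", "example-token", "/Status/Http", "web01"
    )
    print(req.get_method(), req.full_url)
    # Sending is left out of the sketch: urllib.request.urlopen(req)
```

Keeping the event-to-workflow mapping in one table mirrors the observation that most events fall into a small number of recurring types, each with its own rectification process.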
To ensure the operations team is kept informed of these correction processes, the error events and corrective actions are sent to the appropriate contacts in a Slack channel.
Aptira successfully configured Zenoss triggers, playbook execution and notification on the platform. Aptira tested the solution extensively by triggering failures in many of the services on the platform. By manually monitoring the results and verifying that the correct Auto Remediation actions were taking place, Aptira was able to validate the operation of the new solution.
Auto Remediation now functions as the customer intended, and performs the following functions:
- Zenoss continuously monitors platform performance and detects issues
- When an issue is detected on the platform, Zenoss triggers Ansible Tower to run resolution scripts in the form of Ansible playbooks
- The operations team is notified via a Slack message that includes the issue details, the resolution steps taken and the current issue status
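
The notification step in the last item could look something like the sketch below. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the message layout, the function names and the webhook URL are hypothetical and only illustrate how the three pieces of information might be combined.

```python
import json
import urllib.request


def build_slack_message(issue, steps, status):
    """Format the issue details, steps taken and status as one Slack message."""
    lines = [f"*Issue:* {issue}", "*Resolution steps taken:*"]
    lines += [f"- {step}" for step in steps]
    lines.append(f"*Current status:* {status}")
    return {"text": "\n".join(lines)}


def build_webhook_request(webhook_url, message):
    """Build (but do not send) the POST request for a Slack incoming webhook."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    message = build_slack_message(
        issue="httpd not responding on web01",
        steps=["restarted httpd", "verified port 80 is listening"],
        status="resolved",
    )
    # Placeholder URL; a real webhook URL is issued by Slack per channel.
    req = build_webhook_request("https://hooks.slack.example/services/T000", message)
    print(message["text"])
    # Sending is left out of the sketch: urllib.request.urlopen(req)
```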