One of our customers faced a problem efficiently provisioning bare metal servers for their High Performance Computing (HPC) cluster.
The Challenge
The challenge involved provisioning bare metal servers for a High Performance Computing (HPC) cluster, including installing the operating system and configuring networking. At the time, the customer provisioned servers using Cobbler. Although Cobbler can automate OS installation and network configuration, it was not well integrated with their other existing platforms, such as NeCTAR and Object Storage. In particular, some information had to be maintained manually.
The Aptira Solution
OpenStack is highly capable of provisioning flexible infrastructure, including high performance computing clusters, and could solve the problems this customer was facing, so we planned to move them to OpenStack Ironic. We designed and deployed a minimal Highly Available OpenStack Cloud with Ironic installed. The solution was deployed with Kolla-Ansible, a tool for deploying a production-ready containerised OpenStack Cloud. The OpenStack instance was integrated with their existing object storage cluster (Ceph Radosgw) to store bare metal images.
One issue that arose early was the lack of direct Internet access, which is required to download images from Docker Hub. The customer had Internet access only via an HTTP proxy. The corresponding Ansible code for the glance-api and ironic-conductor containers did not support passing HTTP proxy environment variables to the containers, so these two containers, as deployed by Kolla-Ansible, could not access the Internet. Fortunately, Aptira was able to patch the Ansible code to enable HTTP proxy support, and we were able to continue.
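To illustrate the idea of passing proxy settings into a container's environment, here is a minimal Python sketch. It is not the Kolla-Ansible patch itself, which is Ansible code not shown here. The function name `proxy_environment`, the settings keys, and the proxy address are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch: merge proxy settings into a container's environment.
# The key names below are assumptions, not taken from Kolla-Ansible.
PROXY_KEYS = ("http_proxy", "https_proxy", "no_proxy")


def proxy_environment(settings, base_env=None):
    """Return a copy of base_env with any configured proxy variables added."""
    env = dict(base_env or {})
    for key in PROXY_KEYS:
        value = settings.get(key)
        if value:
            # Set both spellings, since tools differ in which one they read.
            env[key] = value
            env[key.upper()] = value
    return env


if __name__ == "__main__":
    # Example proxy address is made up for this sketch.
    settings = {"http_proxy": "http://proxy.example:3128"}
    print(proxy_environment(settings, {"SERVICE_NAME": "glance-api"}))
```

Variables that are unset or empty are skipped, so a container deployed without a proxy keeps its original environment unchanged.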
Another challenge was using an external Radosgw as the storage backend. Our engineers identified a bug that appeared when Ironic used Ceph Radosgw as its backend. The endpoint URL was incorrectly formatted, specifically in this file: ironic/common/glance_service/v2/image_service.py
We were able to identify and correct this defect so that Ironic was able to download images from the Ceph Radosgw backed image service.
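The text does not say exactly what was wrong with the URL, so the following Python sketch only illustrates why endpoint formatting matters here, and is not the actual fix. It assumes Radosgw's default configuration, in which its Swift-compatible API is served under a "/swift" prefix and object paths omit the account segment that native Swift includes. The function name `object_url`, the host names, and the container and object names are all hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def object_url(endpoint, container, object_id, backend, account=None):
    """Build an object URL for a Swift-style or Radosgw-style endpoint."""
    parts = urlsplit(endpoint)
    base = parts.path.rstrip("/")
    if backend == "radosgw":
        # Assumption: default Radosgw setup, Swift API under "/swift",
        # with no account segment in the path.
        if not base.endswith("/swift"):
            base += "/swift"
        path = f"{base}/v1/{container}/{object_id}"
    elif backend == "swift":
        if account is None:
            raise ValueError("native Swift paths need an account")
        path = f"{base}/v1/{account}/{container}/{object_id}"
    else:
        raise ValueError(f"unknown backend: {backend}")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(object_url("http://rgw.example:7480", "glance", "image-id", "radosgw"))
    print(object_url("http://swift.example:8080", "glance", "image-id",
                     "swift", account="AUTH_demo"))
```

A client that builds the native Swift form against a Radosgw endpoint, or the reverse, ends up requesting a path the gateway does not serve, which is the kind of failure an endpoint formatting bug can produce.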
The Result
We successfully delivered an efficient mechanism for provisioning the customer’s HPC bare metal servers, using the Highly Available OpenStack Cloud with the Ironic service that we implemented.
Aptira supported the customer post-deployment during their product test phases and catered to new requirements not specified in their original requirements documents. This is the nature of agile, flexible development, so it was neither unexpected nor a problem. For example, the customer realized that they needed to set an MTU of 9000 using fixed IP addresses. The customer also needed to build GPT images for their bare metal nodes.
After first passing extensive testing in the customer’s test environment, the solution was deployed in production and the customer is now completing their final testing of the bare metal High Performance Computing cluster.