
RSSD Cloud Disk Mounting Problems

In general, when mounting a cloud disk, it must be in the same availability zone as the host. For RSSD cloud disks, however, it must additionally be in the same RDMA cluster. An RDMA cluster is a finer-grained concept than an availability zone; it is hidden from the console interface and can only be specified and queried through the API.

Overall, an RSSD cloud disk has the following requirements for the host:

  • The host must be O-type, i.e., a Kuaijie-type host.
  • The host and the cloud disk are in the same availability zone.
  • The host and the cloud disk are in the same RDMA cluster.

If the first two points are met but the third is not, mounting will fail. The typical error message (note error code 17218) is:

[17218] Command failed: add_udisk.py failed

Another issue with RDMA clusters is that they may change at any time: the underlying platform may migrate the cloud disk or the host to a different RDMA cluster, and downstream services are not notified of the change.

When CSI deals with Pods using RSSD cloud disks, it needs to solve the following two problems:

  • When creating a new RSSD cloud disk, ensure that its RDMA cluster matches that of the node where the Pod is located.
  • When re-scheduling a Pod that uses an RSSD cloud disk, ensure that the target node is O-type and that its RDMA cluster matches the cloud disk's.

Generally, users do not need to worry about these issues. However, due to historical design decisions, CSI versions prior to 22.09.1 run into serious mount failures when RDMA migration occurs for RSSD cloud disks. The following sections explain CSI's scheduling mechanism for RSSD cloud disks in detail, to help you understand this issue and why it is essential to upgrade CSI to version 22.09.1 or above when using RSSD cloud disks.

Static Scheduling

This is the approach used by CSI versions below 22.09.1.

In CSI versions before 22.09.1, scheduling is constrained by adding nodeAffinity to the PV. For more information about node affinity, see the official documentation: Assign Pods to Nodes using Node Affinity.

For Kuaijie-type cloud hosts, a topology label carrying the RDMA cluster ID is stored on the node, for example:

topology.udisk.csi.ucloud.cn/rdma-cluster-id: 9002_25GE_D_R006

This indicates that the node is in RDMA cluster 006.
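
You can confirm which RDMA cluster a node is in by reading this label with kubectl (the node name below is a placeholder):

kubectl get node <node-name> -o jsonpath='{.metadata.labels.topology\.udisk\.csi\.ucloud\.cn/rdma-cluster-id}'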

For the PV of an RSSD cloud disk, the RDMA cluster is recorded in its nodeAffinity:

  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.udisk.csi.ucloud.cn/rdma-cluster-id
              operator: In
              values:
                - 9002_25GE_D_R006

These two fields are written by CSI and cannot be changed once written. At scheduling time, the node affinity mechanism ensures that Pods using RSSD cloud disks are only scheduled to nodes whose RDMA cluster matches.
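
To see whether an existing PV carries this static constraint, you can print its nodeAffinity directly (the PV name is a placeholder):

kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'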

As long as the RDMA cluster never changes, this works fine. But once an RDMA migration occurs, UK8S can neither detect the migration nor update the information above, so the stored data becomes inconsistent with reality.

Even if UK8S could detect the migration, nodeAffinity is immutable, so the stale information could not be fixed by updating the field.

Such an inconsistency causes serious problems. Suppose the actual RDMA cluster of a cloud disk is 005, but the one recorded in UK8S is 006. Node affinity will schedule the Pod using the disk to a node in RDMA cluster 006, which does not match the disk's actual cluster, and the mount will ultimately fail.

If your CSI version is below 22.09.1, an RSSD cloud disk fails to mount, and error code 17218 appears in the CSI log, then the problem is most likely caused by RDMA migration.
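
One way to look for this error code is to grep the CSI controller logs; the label selector below is illustrative and may differ in your cluster:

kubectl logs -n kube-system -l app=csi-udisk-controller --tail=1000 | grep 17218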

The lesson is that RDMA information should not be stored in the UK8S cluster in any form, because it can become stale at any time; instead, it must be obtained dynamically at scheduling time.

Dynamic Scheduling

This is the approach adopted by CSI version 22.09.1 and above (this version requires Kubernetes 1.19 or later).

In CSI version 22.09.1 and above, node affinity is no longer used to schedule RSSD cloud disks; all RDMA information is obtained dynamically at scheduling time. For clusters without stock data, dynamic scheduling fully solves the RDMA migration problem.

To implement dynamic scheduling, dynamic logic needs to be inserted in at least two places:

  • When creating an RSSD cloud disk, dynamically obtain the RDMA cluster of the node.
  • When scheduling a Pod that uses an RSSD cloud disk, dynamically obtain the RDMA clusters of the cloud disk and the candidate nodes for matching.

The following two sections will introduce how CSI solves the above problems.

Creating an RSSD Cloud Disk

Creating an RSSD cloud disk is implemented inside CSI, so only the disk-creation logic in CSI needs to change:

  • Remove the topology label topology.udisk.csi.ucloud.cn/rdma-cluster-id.
  • When creating a new PV, stop writing the topology.udisk.csi.ucloud.cn/rdma-cluster-id key into nodeAffinity.
  • When creating an RSSD cloud disk, call the API to get the RDMA cluster of the node and pass it to the disk-creation interface.

In this way, newly created RSSD cloud disks will no longer rely on node affinity for scheduling.
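
For the node's RDMA cluster to be known at disk-creation time, the Pod must already be placed on a node, which is what delayed volume binding provides. A minimal sketch of such a StorageClass, assuming the udisk.csi.ucloud.cn provisioner; the parameters shown are illustrative, so consult the UDisk CSI documentation for the exact values:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: udisk-rssd
provisioner: udisk.csi.ucloud.cn
parameters:
  type: "rssd"    # illustrative parameter name/value
volumeBindingMode: WaitForFirstConsumer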

Scheduling RSSD Cloud Disks

When scheduling a Pod that uses an RSSD cloud disk, the scheduler needs to obtain the disk's RDMA cluster dynamically, which requires calling the UCloud Global API. The native kube-scheduler cannot do this, so we complete it with the extension mechanisms provided by kube-scheduler.

kube-scheduler provides two mechanisms for extending scheduling: the Extender mechanism and the Framework mechanism. Briefly, their differences are:

  • scheduler extender: the extension is deployed as a service on the master node, and during scheduling the scheduler calls it over HTTP at the various extension points.
  • scheduler framework: you compile a complete standalone scheduler with your own scheduling logic inserted and deploy it separately in the cluster as a Deployment; Pods must set schedulerName to use it.

The Kubernetes community generally recommends the second approach, but we do not want to require Pods to change schedulerName, and the framework approach would mean maintaining a separate scheduler build for each Kubernetes version, which is more troublesome in the long run, so we chose the Extender mechanism.

This requires deploying a separate HTTP service on the master nodes of the cluster to implement the custom scheduling logic, and adding an extenders section to the scheduler configuration:

extenders:
  - urlPrefix: http://127.0.0.1:6678/
    filterVerb: filter
    httpTimeout: 60s

This registers a filter extension: at the filter extension point, the scheduler calls the http://127.0.0.1:6678/filter interface.

Special note: the extenders configuration shown above requires a scheduler configuration API version that is only available from Kubernetes 1.19 onward, so this mechanism cannot be used on lower versions. If you are using a lower version of Kubernetes, upgrade the cluster first.
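
For reference, a minimal complete scheduler configuration with the extender registered might look like this; the API version and kubeconfig path are assumptions that depend on your cluster:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
  - urlPrefix: http://127.0.0.1:6678/
    filterVerb: filter
    httpTimeout: 60s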

This way, we can call the UCloud Global API at the extension point to dynamically obtain the RDMA cluster information of the RSSD cloud disk and the nodes, and filter nodes based on it:

The scheduler-extender checks whether the Pod uses an RSSD PV. If so, it calls the UCloud Global API to get the RDMA clusters of the PV and of each candidate node, and filters out the nodes that do not match.
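
To make the protocol concrete, here is a minimal sketch of such a filter service in Go, using the extender API types from k8s.io/kube-scheduler. This is not the actual scheduler-extender-uk8s implementation; the rdmaMatches helper is a placeholder for the real logic that queries the UCloud Global API:

package main

import (
	"encoding/json"
	"log"
	"net/http"

	corev1 "k8s.io/api/core/v1"
	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// rdmaMatches is a placeholder: the real scheduler-extender would call the
// UCloud Global API here to compare the RDMA cluster of the Pod's RSSD PV
// with the RDMA cluster of the candidate node.
func rdmaMatches(pod *corev1.Pod, node *corev1.Node) bool {
	return true // assumption: replace with a real API lookup
}

// filter implements the extender "filter" verb: it receives the Pod and the
// candidate nodes, and returns the nodes that pass the RDMA check.
func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	result := extenderv1.ExtenderFilterResult{
		Nodes:       &corev1.NodeList{},
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	if args.Nodes != nil {
		for i := range args.Nodes.Items {
			node := args.Nodes.Items[i]
			if rdmaMatches(args.Pod, &node) {
				result.Nodes.Items = append(result.Nodes.Items, node)
			} else {
				result.FailedNodes[node.Name] = "RDMA cluster mismatch"
			}
		}
	}
	json.NewEncoder(w).Encode(&result)
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe("127.0.0.1:6678", nil))
}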

When deploying, a new systemd service named scheduler-extender-uk8s is added on the master nodes of your cluster. You can check the service health with the following command:

systemctl status scheduler-extender-uk8s

If there is a problem with the scheduling, you can check the log with the following command:

journalctl -u scheduler-extender-uk8s.service -f

When a cluster of version 1.19 or above is created, UK8S automatically installs scheduler-extender-uk8s on all master nodes. For how to handle existing clusters, see below.

Upgrading CSI and Installing scheduler-extender via Console

Note that the new version of CSI must be used together with the scheduler-extender. So when upgrading to CSI 22.09.1, perform the operation in the console instead of modifying the image manually.

New CSI releases embed the scheduler-extender version in a combined version string formatted as {csi-version}-se{scheduler-extender-version}. For example, 22.09.1-se22.08.3 means the CSI version is 22.09.1 and the scheduler-extender version is 22.08.3. If the scheduler-extender is not installed in the cluster, the format is {csi-version}-se-unknown, such as 21.09.1-se-unknown.

This integrated version number binds CSI and scheduler-extender together, so there is no need for a separate scheduler-extender plugin management page.

Query Version

You can query the CSI version directly by checking the image in the StatefulSet. Querying the scheduler-extender version is more complex: since the scheduler-extender is deployed through systemd, the query requires logging in to the master node of the cluster and running a command.
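
For example, the CSI version can be read from the controller StatefulSet's image tag; the StatefulSet name below is a placeholder and may differ in your cluster:

kubectl -n kube-system get statefulset <csi-controller-statefulset> -o jsonpath='{.spec.template.spec.containers[*].image}'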

This is where asynchronous tasks come in: Jobs are deployed on the master node to query the scheduler-extender version. The original CSI version query was synchronous; it has been changed to an asynchronous call, similar to the CNI version query.

Version Upgrade Inconsistency Issue

When upgrading CSI, the scheduler-extender must be upgraded or installed first, because the new CSI depends on the scheduler-extender to function. If the scheduler-extender upgrade or installation fails, the entire CSI upgrade process is halted.

The possible inconsistency is that the scheduler-extender is installed successfully but the CSI upgrade fails. In that state, two scheduling constraints for RSSD PVs coexist in the cluster: the dynamic one from the scheduler-extender and the static one from nodeAffinity. Both ensure that an RSSD PV is scheduled to a node in the matching RDMA cluster, so their coexistence does not conflict.

In summary:

  • A successful scheduler-extender installation followed by a failed CSI upgrade is tolerable, because it simply leaves both constraints in place; the customer can retry the CSI upgrade later.
  • Upgrading CSI without the scheduler-extender installed is not tolerable, because it would leave no constraint at all on RSSD PVs in the cluster.

Therefore, there is no rollback on upgrade failure; we simply ensure that the scheduler-extender is upgraded first, and CSI afterwards.

In addition, when upgrading CSI, we also added the following constraints:

  • If the cluster version is below 1.19, the CSI upgrade is not allowed; customers need to upgrade the cluster version first.
  • The CSI upgrade is not allowed if the cluster contains PVs with RDMA nodeAffinity. Such PVs require manual data intervention (see the section on dealing with historical stock data below). You can check for them as shown below.
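
A quick way to list PVs that still carry the static RDMA constraint, assuming kubectl and jq are available:

kubectl get pv -o json | jq -r '.items[] | select(.spec.nodeAffinity.required.nodeSelectorTerms[]?.matchExpressions[]?.key == "topology.udisk.csi.ucloud.cn/rdma-cluster-id") | .metadata.name'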

Dealing with Historical Stock Data

The above solves all problems for newly created RSSD cloud disks, but does not handle stock (pre-existing) data.

The obvious idea is to remove the nodeAffinity data from the PV after the upgrade completes. However, Kubernetes has a rather troublesome design here: nodeAffinity is immutable and cannot be modified directly. As long as the nodeAffinity exists, kube-scheduler will keep constraining the PV by node affinity, and problems will arise whenever a disk migration occurs.

Special means are therefore required to remove the nodeAffinity from the PV: the data must be modified directly in etcd. This is a very hacky and risky operation that cannot be integrated into automated tools; it requires manual intervention.

If your cluster contains such stock RSSD cloud disk data, please contact our technical support, and we will manually fix the data for you.