CNI ipamd VPC IP Preallocation: Implementation Principle and Deployment Architecture
Background and Principle
Due to limitations of the underlying network technology at UCloud Global, after a new VPC IP is allocated for a Pod, the CNI must first perform arping to confirm that the flow table has been issued and that there are no address conflicts. This check takes at least 5s and up to 15s, so creating a Pod takes at least 5s; adding image pulls and various initialization steps, average Pod creation time stretches beyond 10s. This is almost unacceptable for workloads that need Pods to be launched and destroyed rapidly.
Moreover, because every Pod creation and destruction requires a call to the VPC service, if the VPC service becomes unreachable for any reason, Pods can be neither created nor destroyed, the cluster is effectively unavailable, and releases are completely blocked.
To solve these problems, the CNI can pre-allocate a batch of VPC IPs and maintain a VPC IP pool, assigning IPs from the pool to newly created Pods. Because the IPs in the pool have already been requested from the UNetwork API and have completed the arping check, they can be handed to Pods immediately, cutting 5-15s off Pod creation time.
In addition, when the VPC service is down, the pool can bypass the VPC service and handle Pod IP allocation and reclamation on its own (as long as the pool has enough IPs), improving cluster availability.
Detailed Explanation of the CNI IP Preallocation Solution
The new version of CNI is divided into two parts:
- cnivpc binary. The entry point that communicates with Kubelet: Kubelet creates and deletes Pod networks by invoking this executable.
- ipamd service. A resident daemon responsible for requesting, maintaining, and releasing VPC IPs. It exposes a gRPC API over a Unix Domain Socket through which cnivpc allocates and releases VPC IPs, and it is deployed in the UK8S cluster as a DaemonSet. You can think of it as a component similar to Calico IPAM. (A sketch of the cnivpc-to-ipamd interaction appears below.)
The overall architecture is shown in the figure below.
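To make the division of labor concrete, below is a minimal sketch (not the actual cnivpc source) of how cnivpc decides whether to use ipamd: it probes the Unix Domain Socket that ipamd listens on and falls back to calling the UNetwork API directly when ipamd is unreachable. The real cnivpc performs this check with the Ping gRPC call described in the core process below; the raw socket probe, helper names, and log messages here are illustrative assumptions.

package main

import (
	"fmt"
	"net"
	"time"
)

// Path of the Unix Domain Socket exposed by ipamd (from this document).
const ipamdSocket = "/run/cni-vpc-ipamd.sock"

// ipamdAvailable approximates cnivpc's Ping check by probing the socket.
func ipamdAvailable() bool {
	conn, err := net.DialTimeout("unix", ipamdSocket, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	if ipamdAvailable() {
		// Normal path: ask ipamd for a pre-allocated IP (AddPodNetwork).
		fmt.Println("ipamd reachable: request Pod IP from the pre-allocated pool")
	} else {
		// Degraded path: allocate directly from the UNetwork API and wait
		// for the arping check to finish (5-15s), as in the original solution.
		fmt.Println("ipamd unreachable: fall back to direct UNetwork allocation")
	}
}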
The core process includes:
- ipamd runs an internal control loop that periodically checks BoltDB (a local database) to see whether the number of currently available VPC IPs has dropped below the pool's low watermark. If so, it calls the UNetwork AllocateSecondaryIp API to allocate IPs for the node and stores them in BoltDB. If the number of VPC IPs exceeds the high watermark, it calls UNetwork DeleteSecondaryIp to release the surplus IPs from BoltDB. (A minimal sketch of this loop follows this list.)
- ipamd provides the following three gRPC interfaces to cnivpc through unix:/run/cni-vpc-ipamd.sock:
  - Ping: availability probe for the ipamd service. cnivpc calls this interface before every IP allocation and release; if it fails, cnivpc's workflow degrades to the original solution.
  - AddPodNetwork: assigns an IP to a Pod. If the BoltDB IP pool has an available IP, it is handed to the Pod directly from the pool; otherwise an IP is requested from the UNetwork API, in which case starting the Pod still takes 5-15 seconds.
  - DelPodNetwork: releases a Pod's IP. After the Pod is destroyed, the IP enters a cooldown state and is put back into the IP pool after 30 seconds of cooling.
- When the ipamd service is terminated, it responds to the SIGTERM signal sent by Kubelet, stops the gRPC service, and deletes the corresponding Unix Domain Socket file.
- The ipamd service is an optional component: even if it terminates abnormally, cnivpc keeps working, only without the pre-allocation capability.
- If ipamd finds that the subnet's VPC IPs have all been allocated, it tries to borrow an IP from the pools of other ipamd instances in the same subnet. If no other ipamd has an available IP, Pod creation fails with an error.
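The watermark logic in the control loop can be illustrated with a minimal Go sketch. This is not the real ipamd code: the pool and vpcClient types are hypothetical stand-ins, and persistence to BoltDB is reduced to an in-memory slice; only the AllocateSecondaryIp/DeleteSecondaryIp names and the low/high watermark behavior come from the description above.

package main

import (
	"fmt"
	"time"
)

// pool stands in for the set of free VPC IPs that the real ipamd keeps in BoltDB.
type pool struct {
	available []string
}

// vpcClient stands in for the UNetwork API client.
type vpcClient struct {
	next int
}

// AllocateSecondaryIp mimics requesting one more secondary IP for the node.
func (c *vpcClient) AllocateSecondaryIp() (string, error) {
	c.next++
	return fmt.Sprintf("10.0.0.%d", c.next), nil // placeholder address
}

// DeleteSecondaryIp mimics releasing a surplus secondary IP.
func (c *vpcClient) DeleteSecondaryIp(ip string) error {
	return nil
}

// reconcile keeps the pool size between the two watermarks:
// refill when below the low watermark, release when above the high watermark.
func reconcile(p *pool, c *vpcClient, low, high int) {
	for len(p.available) < low {
		ip, err := c.AllocateSecondaryIp()
		if err != nil {
			fmt.Println("allocate failed:", err)
			return
		}
		p.available = append(p.available, ip) // persisted to BoltDB in the real ipamd
	}
	for len(p.available) > high {
		last := p.available[len(p.available)-1]
		if err := c.DeleteSecondaryIp(last); err != nil {
			fmt.Println("release failed:", err)
			return
		}
		p.available = p.available[:len(p.available)-1]
	}
}

func main() {
	p := &pool{}
	c := &vpcClient{}
	// The real control loop runs for the lifetime of the daemon; three
	// iterations are enough to show the refill behavior here.
	for i := 0; i < 3; i++ {
		reconcile(p, c, 3, 50) // defaults: low watermark 3, high watermark 50
		fmt.Println("pool size:", len(p.available))
		time.Sleep(time.Second)
	}
}

In the real daemon this reconciliation repeats periodically, and the resulting pool is what AddPodNetwork hands out to new Pods.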
Related Startup Parameters
- --availablePodIPLowWatermark=3: low watermark for VPC IP pre-allocation, in number of IPs. Default: 3.
- --availablePodIPHighWatermark=50: high watermark for VPC IP pre-allocation, in number of IPs. Default: 50.
- --cooldownPeriodSeconds=30: VPC IP cooldown time, in seconds. After a Pod IP is returned, it must cool down before being put back into the pool; this ensures the corresponding route has been torn down. Default: 30s.
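As a rough illustration of the cooldown behavior controlled by --cooldownPeriodSeconds, the hypothetical sketch below holds a returned IP together with a timestamp and only moves it back into the available pool once the cooldown period has elapsed. The types and function names are assumptions for illustration, not the real ipamd implementation.

package main

import (
	"fmt"
	"time"
)

// cooling records an IP returned by a destroyed Pod and when it was returned.
type cooling struct {
	ip         string
	returnedAt time.Time
}

// pool holds available IPs plus IPs still in their cooldown period.
type pool struct {
	available []string
	cooldown  []cooling
}

// release is called when a Pod is destroyed: its IP enters the cooling state.
func (p *pool) release(ip string) {
	p.cooldown = append(p.cooldown, cooling{ip: ip, returnedAt: time.Now()})
}

// recycle moves IPs whose cooldown period has elapsed back into the pool.
func (p *pool) recycle(period time.Duration) {
	var still []cooling
	for _, c := range p.cooldown {
		if time.Since(c.returnedAt) >= period {
			p.available = append(p.available, c.ip)
		} else {
			still = append(still, c)
		}
	}
	p.cooldown = still
}

func main() {
	p := &pool{}
	p.release("10.0.0.7")

	period := 2 * time.Second // the real default is 30s; shortened for the demo
	p.recycle(period)
	fmt.Println("available right after release:", p.available) // still empty

	time.Sleep(period)
	p.recycle(period)
	fmt.Println("available after cooldown:", p.available) // now contains the IP
}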
Deployment Method
Directly deploy cni-vpc-ipamd.yml in the cluster.
Check if ipamd is started:
# kubectl get pod -o wide -n kube-system -l app=cni-vpc-ipamd
NAME                  READY   STATUS    RESTARTS   AGE   IP              NODE            NOMINATED NODE   READINESS GATES
cni-vpc-ipamd-6v6x8   1/1     Running   0          59s   10.10.135.117   10.10.135.117   <none>           <none>
cni-vpc-ipamd-tcc5b   1/1     Running   0          59s   10.10.7.12      10.10.7.12      <none>           <none>
cni-vpc-ipamd-zsspc   1/1     Running   0          59s   10.10.183.70    10.10.183.70    <none>           <none>
Notice: ipamd is installed by default in clusters of version 1.20 and above.
Debug
In a cluster where ipamd is installed, you can use the cnivpctl command to inspect the IP pools in the cluster.
Log in to any node (master or worker); the following command lists the nodes in the cluster that use ipamd, together with the number of IPs in each node's pool:
$ cnivpctl get node
NODE             SUBNET               POOL
192.168.45.101   subnet-dsck39bnlhu   9
192.168.47.103   subnet-dsck39bnlhu   4
You can see that there are currently two nodes in my cluster, with 9 and 4 IPs in the pool, respectively.
With cnivpctl get pool you can further view all IPs in a node's pool:
$ cnivpctl -n 192.168.45.101 get pool
IP               RECYCLED   COOLDOWN   AGE
192.168.32.35    21h        false      21h
192.168.34.138   21h        false      21h
192.168.35.38    21h        false      21h
192.168.36.86    21h        false      21h
192.168.43.106   21h        false      21h
192.168.43.227   <none>     false      21h
192.168.45.207   21h        false      21h
192.168.45.229   <none>     false      21h
192.168.45.59    <none>     false      21h
Without the -n parameter you can see the pool IPs of all nodes, and with -o wide you can also list the node each IP belongs to:
$ cnivpctl get pool -owide
IP               RECYCLED   COOLDOWN   AGE   NODE
192.168.32.35    21h        false      21h   192.168.45.101
192.168.34.138   21h        false      21h   192.168.45.101
192.168.35.38    21h        false      21h   192.168.45.101
192.168.36.86    21h        false      21h   192.168.45.101
192.168.43.106   21h        false      21h   192.168.45.101
192.168.43.227   <none>     false      21h   192.168.45.101
192.168.45.207   21h        false      21h   192.168.45.101
192.168.45.229   <none>     false      21h   192.168.45.101
192.168.45.59    <none>     false      21h   192.168.45.101
192.168.40.121   21h        false      21h   192.168.47.103
192.168.43.73    <none>     false      21h   192.168.47.103
192.168.44.59    <none>     false      21h   192.168.47.103
192.168.45.19    <none>     false      21h   192.168.47.103
cnivpctl get pod shows which IPs the Pods on a node occupy:
$ cnivpctl -n 192.168.47.103 get pod
NAMESPACE     NAME                                IP               AGE
kube-system   coredns-798fcc8f9d-jzccm            192.168.42.69    21h
kube-system   csi-udisk-controller-0              192.168.43.110   21h
kube-system   metrics-server-fff8f8668-rgfmr      192.168.36.42    21h
default       nginx-deployment-66f49c7846-gtpbk   192.168.40.14    21h
default       nginx-deployment-66f49c7846-mfgp4   192.168.34.55    21h
default       nginx-deployment-66f49c7846-s54jj   192.168.40.126   21h
kube-system   uk8s-kubectl-68bb767f87-tpzng       192.168.42.53    21h
The output of this command is similar to kubectl get pod, but it only lists Pods that have a VPC IP; HostNetwork Pods and Pods using other network plug-ins are not shown.
Similarly, without the -n parameter you can list the Pods on all nodes:
$ cnivpctl get pod -owide
NAMESPACE     NAME                                IP               AGE   NODE
kube-system   coredns-798fcc8f9d-gdrbq            192.168.41.246   21h   192.168.45.101
kube-system   coredns-798fcc8f9d-jzccm            192.168.42.69    21h   192.168.47.103
kube-system   csi-udisk-controller-0              192.168.43.110   21h   192.168.47.103
kube-system   metrics-server-fff8f8668-rgfmr      192.168.36.42    21h   192.168.47.103
default       nginx-deployment-66f49c7846-gtpbk   192.168.40.14    21h   192.168.47.103
default       nginx-deployment-66f49c7846-mfgp4   192.168.34.55    21h   192.168.47.103
default       nginx-deployment-66f49c7846-s54jj   192.168.40.126   21h   192.168.47.103
kube-system   uk8s-kubectl-68bb767f87-tpzng       192.168.42.53    21h   192.168.47.103
In addition to the commonly used cnivpctl commands above, there are some advanced subcommands, for example:
- cnivpctl get unuse: List leaked IPs.
- cnivpctl pop <node> [ip]: Pop an IP from the specified node's pool.
- cnivpctl push <node> [ip]: Assign a new IP to the specified node's pool.
- cnivpctl release <node> [ip]: Release the leaked IPs on the specified node (dangerous operation, execute with caution).
For more detailed usage, refer to cnivpctl -h.
Frequently Asked Questions
Q: How do I configure the VPC IP pool watermarks?
A: They can be configured through the ipamd startup parameters --availablePodIPLowWatermark and --availablePodIPHighWatermark, for example:
containers:
  - name: cni-vpc-ipamd
    image: uhub.ucloud-global.com/uk8s/cni-vpc-ipamd:1.2.3
    args:
      - "--availablePodIPLowWatermark=3"
      - "--availablePodIPHighWatermark=50"
      - "--calicoPolicyFlag=true"
      - "--cooldownPeriodSeconds=30"
Notice: If the watermarks are set low and a large number of Pods are suddenly scheduled to a node, the pool's available IPs may be exhausted, and newly created Pods will fall back to IPs freshly requested from the UNetwork API; such Pods still need several extra seconds before their network is ready. Conversely, if the high watermark is set high and the cluster has many nodes, the subnet's IP space may be used up and new VPC IPs cannot be allocated.
In addition, please ensure that availablePodIPLowWatermark is less than or equal to availablePodIPHighWatermark, otherwise ipamd will report an error when starting!
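As a small illustration of that startup constraint, a hypothetical validation might look like the sketch below. The flag names match the parameters above; everything else is an assumption, not the real ipamd source.

package main

import (
	"flag"
	"log"
)

func main() {
	low := flag.Int("availablePodIPLowWatermark", 3, "VPC IP pre-allocation low watermark")
	high := flag.Int("availablePodIPHighWatermark", 50, "VPC IP pre-allocation high watermark")
	flag.Parse()

	// Refuse to start when the watermarks are inverted, mirroring the
	// behavior described above.
	if *low > *high {
		log.Fatalf("availablePodIPLowWatermark (%d) must be <= availablePodIPHighWatermark (%d)", *low, *high)
	}
	log.Printf("watermarks ok: low=%d high=%d", *low, *high)
}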
Q: Pod creation and destruction are affected after a VPC service failure. How do I configure ipamd to eliminate this impact?
A: Without ipamd, once the VPC service runs into a problem, Pod creation and destruction in your cluster cannot proceed. Part of the reason ipamd was designed is to solve this problem.
However, if ipamd is configured unreasonably and too few IPs reside in the pool, ipamd will still be unable to take over Pod IP allocation after the VPC goes down.
If you have high availability requirements for your cluster and want it to keep working even when the UCloud Global VPC backend is unreachable, you can raise ipamd's low watermark parameter availablePodIPLowWatermark to the maximum number of Pods on your node, for example 110. That way, ipamd pre-allocates enough IPs to handle all Pod creation and destruction on the current node. For how to adjust the watermarks, refer to the previous section.
Although ipamd allocates many IPs up front, it can then manage Pod IPs entirely on its own.
Notice: Before you do this, make sure the subnet the node belongs to is large enough; otherwise ipamd will not be able to pre-allocate the expected number of IPs.
Q: ipamd occupies too many IPs!
A: Thanks to the borrowing mechanism, even if your subnet's IPs have been consumed by ipamd, the ipamd instances can shuffle IPs among themselves, so there is no need to worry about IP scheduling problems.
If you really don't want ipamd to occupy so many IPs, you can lower the two watermark parameters. In the most extreme case you can set them to 0, and ipamd will not pre-allocate any IPs.
Q: If the node BoltDB file (/opt/cni/networkstorage.db) is damaged, will it cause VPC IP leakage?
A: Yes. If this happens, you can log in to any node and use the following command to scan and list the leaked IPs on a node:
cnivpctl get unuse -n xx.xx.xx.xx
After listing and confirming that these IPs are not used, use the following command to clean up and release the leaked IPs:
cnivpctl release xx.xx.xx.xx
The command lists the IPs to be released and asks for a second confirmation; make sure they are not used by any Pod. Once you confirm, ipamd automatically releases these IPs.
Alternatively, you can delete the node directly, and the VPC IPs bound to it will be released automatically.
Q: ipamd seems to be misbehaving. How do I diagnose it?
A: The cnivpc invocation log is /var/log/cnivpc.log on the node. ipamd's logs can be viewed with kubectl logs and can also be found under /var/log/ucloud/ on the node.
In addition, the kubelet log is often indispensable. Log in to the node and run
# journalctl -u kubelet --since="12:00"
to observe kubelet's runtime log.
kubectl get events is also a good helper for troubleshooting and diagnosis.
If you still can’t locate and solve the problem, please contact the UK8S technical team.