K8S cluster problem 1: controller node sharding update errors in Kuboard

Posted May 25, 2020 (2 min read)

Our K8S cluster has 3 master nodes and uses Kuboard as its graphical interface (it suits a beginner like me). Recently I saw many controller node sharding update errors in Kuboard. Screenshot below:

image.png

The gist of the error: while a node was being updated, a node for the service could not be found. I searched the Internet and found no similar problem or solution, so here is a summary of what I did.

Analysis and solution:

  1. Note the keyword controller: it most likely refers to the control-plane component controller-manager. To check on the controller-manager, run:

    kubectl get pods -n kube-system -o wide

image.png

The controller-manager has restarted a total of 23 times. Could it be that the node cannot be found due to the restart?
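
To spot restarting pods without reading the whole table, the RESTARTS column can be filtered. This is only a minimal sketch: the helper name high_restarts and the threshold of 5 are my own choices for illustration, and it assumes the default column layout of kubectl get pods, where RESTARTS is the fourth column.

```shell
# high_restarts MIN: read `kubectl get pods` table output on stdin and print
# the name and restart count of every pod restarted at least MIN times.
# Assumes the default column order: NAME READY STATUS RESTARTS ...
# (high_restarts is a hypothetical helper, not part of kubectl.)
high_restarts() {
  awk -v min="$1" 'NR > 1 && $4 + 0 >= min + 0 { print $1, $4 }'
}

# Usage against a live cluster (the threshold 5 is an arbitrary example):
#   kubectl get pods -n kube-system -o wide | high_restarts 5
```
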
  2. That raises the next question: why does it keep restarting? Run:

    kubectl describe pods/kube-controller-manager-k8s-master3 -n kube-system

image.png

Reading the output carefully, describe records detailed information about the pod, including the number of restarts. My goal here was to find the time of the last restart and then look at the log from around that time to learn why the pod restarted. Because the pod had restarted automatically and returned to normal, no specific error was visible. I also noticed that the controller-manager has a liveness check that runs every 10s. From this you can probably guess why the "node not found" update error shows up so often in Kuboard (alongside many normal node-update log entries): the frequent node updates are most likely the periodic health checks.
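
The describe output is long, so it can help to cut it down to the lines discussed above: the previous container state, the restart count and the liveness probe. The following is a rough sketch; restart_summary is a hypothetical helper of mine, and it assumes the usual field labels printed by kubectl describe (Last State, Reason, Exit Code, Started, Finished, Restart Count, Liveness).

```shell
# restart_summary: filter `kubectl describe pod` output on stdin down to the
# previous container state, the restart count and the liveness probe line.
# (restart_summary is a hypothetical helper, not part of kubectl.)
restart_summary() {
  awk '
    /^ *Last State:/ { in_last = 1; print; next }
    in_last && /^ *(Reason|Exit Code|Started|Finished):/ { print; next }
    { in_last = 0 }
    /^ *(Restart Count|Liveness):/ { print }
  '
}

# Usage against a live cluster:
#   kubectl describe pods/kube-controller-manager-k8s-master3 -n kube-system | restart_summary
# The same reason can also be read with jsonpath, assuming the pod has a
# single container (index 0):
#   kubectl get pod kube-controller-manager-k8s-master3 -n kube-system \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

The Finished time of the last state is the timestamp to look for in the logs.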

  3. Next, I wanted to see the actual log, so I ran:

    kubectl logs -f kube-controller-manager-k8s-master3 -n kube-system --tail 100

image.png

However, this is already the log from after the restart. I picked out some of the entries for the node-update error, but I really could not tell what the problem was; if I had to guess, it might be a timeout on requests to the apiserver. This log is hard to read: I could not see a log level, and how can I view the log from before the last restart? I could not take the investigation any further here.
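
Both open questions have a likely answer, sketched here under assumptions rather than verified on this cluster. kubectl logs has a --previous flag that prints the log of the container instance that ran before the last restart, provided that container is still kept on the node. And if the controller-manager writes its default klog text format, each line starts with a severity letter (I, W, E or F) followed by the month and day, so the lines can be filtered by level. The helper name only_problems is my own.

```shell
# only_problems: keep only warning, error and fatal lines from klog-formatted
# logs on stdin. klog lines start with a severity letter (I, W, E or F)
# followed by the month and day, for example "E0525 ...".
# (only_problems is a hypothetical helper, not part of kubectl.)
only_problems() {
  grep -E '^[WEF][0-9]{4} '
}

# Usage against a live cluster. --previous selects the container instance
# that ran before the last restart:
#   kubectl logs kube-controller-manager-k8s-master3 -n kube-system --previous | only_problems
```
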

  4. Then I checked whether the other nodes all reported this error, and happened to find one node that did not. Comparing system and kernel versions, the node without the error was on CentOS 7.7 with kernel 4.4, while the node with the error was on CentOS 7.3 with kernel 4.4. The Kuboard documentation recommends CentOS 7.6 or above. So I deleted the node and updated its system to the latest version, 7.8. (Note that after the system update the kernel is 3.10 by default, so upgrade the kernel to 4.4 again; otherwise you will run into more problems and pitfalls.) The "Error updating Endpoint" errors for node updates are gone for now, so the update error appears to have been caused by the system and kernel version.
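
The node-by-node comparison in step 4 can be done from one place, since each node reports its OS image and kernel version to the cluster. A small sketch follows; old_kernels is a hypothetical helper of mine, and the 4.4 threshold is the kernel version recommended in this post. It assumes the kernel version is the last column of each row.

```shell
# old_kernels MAJOR MINOR: read rows with the node name in the first column
# and the kernel version in the last column (header on line 1), and print
# the nodes whose kernel is older than MAJOR.MINOR.
# (old_kernels is a hypothetical helper, not part of kubectl.)
old_kernels() {
  awk -v maj="$1" -v min="$2" 'NR > 1 {
    split($NF, v, /[.-]/)   # split the kernel string on dots and dashes
    if (v[1] + 0 < maj + 0 || (v[1] + 0 == maj + 0 && v[2] + 0 < min + 0))
      print $1, $NF
  }'
}

# Usage against a live cluster:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage,KERNEL:.status.nodeInfo.kernelVersion | old_kernels 4 4
# The CentOS minor release usually does not appear in the OS image string,
# so check it on the node itself:
#   cat /etc/centos-release; uname -r
```
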

Summary:

For K8S nodes, upgrading CentOS to 7.6 or higher is recommended, and kernel 4.4 is of course recommended as well.
I still cannot tell from the logs what caused the restarts. If you have run into the same thing, please share your findings and questions.
I will keep summarizing and sharing problems as I run into them. Thank you ~