Symptoms
A k8s cluster that had not been touched for a long time turned out to be unusable when I went back to it today. Running kubectl failed:
kubectl get nodes -o wide
The error was:
The connection to the server 192.168.10.xxx:6443 was refused - did you specify the right host or port?
This looked like an apiserver problem.
Check the detailed kubelet logs:
journalctl -xefu kubelet
The following errors kept repeating:
node "kmaster" not found
Unable to register node "kmaster" with API server: Post https://192.168.10.24...tion refused
eviction manager: failed to get summary stats: failed to get node info: node "kmaster" not found
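At first I could not work out the root cause. All I knew was that the apiserver had failed to start, so I looked at the apiserver container logs
and found the following error: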
addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
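The message points to an expired certificate. A quick search confirmed it: the certificates that kubeadm issues for the cluster components (the kubelet's included) are valid for 1 year by default. Once the cluster has been running for more than a year, components start reporting "certificate has expired or is not yet valid", and the nodes can no longer communicate with the master. After a restart, k8s does not come back up at all.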
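The apiserver container logs mentioned above can be pulled roughly as follows. This is a minimal sketch that assumes the Docker runtime used elsewhere in this post; apiserver_container_id is a hypothetical helper name, not part of any tool. It filters out the pause container the same way the restart command later in this post does.

```shell
# Hypothetical helper: read `docker ps -a` output on stdin and print the ID
# of the first kube-apiserver container that is not a pause container.
apiserver_container_id() {
    awk '/kube-apiserver/ && !/pause/ { print $1; exit }'
}

# Usage on the master node (assumes Docker is the container runtime):
#   docker logs --tail 50 "$(docker ps -a | apiserver_container_id)"
```

`docker ps -a` is used rather than `docker ps` because the apiserver container may already have exited when the certificate check fails.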
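Solution
Regenerate the certificates
Check whether the certificate has expired (look at the Not After field under Validity):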
openssl x509 -noout -text -in /etc/kubernetes/pki/apiserver.crt
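Reading the full text dump for every certificate gets tedious, so the check can be scripted. The sketch below relies on openssl's `-checkend` and `-enddate` options; the helper names cert_expires_within and report_certs are made up for illustration, and the directory to scan is passed in as an argument.

```shell
# Hypothetical helper: succeeds (exit 0) if the certificate in $1 expires
# within the next $2 seconds; a window of 0 means "already expired".
cert_expires_within() {
    if openssl x509 -checkend "$2" -noout -in "$1" >/dev/null 2>&1; then
        return 1   # still valid at the end of the window
    fi
    return 0       # expires within the window (or could not be read)
}

# Hypothetical helper: print the notAfter date of every *.crt under $1
# and flag the ones that have already expired.
report_certs() {
    find "$1" -name '*.crt' | sort | while read -r crt; do
        end=$(openssl x509 -enddate -noout -in "$crt" | cut -d= -f2)
        if cert_expires_within "$crt" 0; then
            echo "EXPIRED  $crt  ($end)"
        else
            echo "ok       $crt  ($end)"
        fi
    done
}

# Usage: report_certs /etc/kubernetes/pki
```

Passing a larger window, for example `cert_expires_within /etc/kubernetes/pki/apiserver.crt 2592000`, gives an early warning about 30 days before expiry.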
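On versions later than 1.14, the expiry dates can be listed with the command below (from 1.20 on, the same subcommand is available as kubeadm certs check-expiration, without alpha):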
kubeadm alpha certs check-expiration
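Certificates installed by kubeadm are valid for 1 year by default. Note that the original certificate files must still be present on the server for the renewal to work. Otherwise they are generated from scratch and the cluster may not be recoverable.
Back up the existing configuration and certificates first: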
cp -rp /etc/kubernetes /etc/kubernetes.bak
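If /etc/kubernetes.bak already exists from an earlier attempt, the same cp command copies into it as a nested directory instead of replacing it. A timestamped target avoids that. The following is only a sketch; backup_dir is a hypothetical helper name.

```shell
# Hypothetical helper: copy directory $1 to "$1.bak.<timestamp>", preserving
# mode, ownership and timestamps, and print the backup path.
backup_dir() {
    src=${1%/}
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"
    cp -rp "$src" "$dest" && echo "$dest"
}

# Usage: backup_dir /etc/kubernetes
```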
If the kubeadm configuration file can no longer be found, generate a default one and edit it by hand:
kubeadm config print init-defaults > /data/kubeadm.yaml
Then adjust it to match your own environment.
The main fields to change are kubernetesVersion, advertiseAddress, imageRepository, and serviceSubnet:
apiVersion: kubeadm.k8s.io/v1beta2
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 192.168.10.xxx
  bindPort: 6443
nodeRegistration:
  criSocket: /var/run/dockershim.sock
  name: kmaster
  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
---
apiServer:
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta2
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns:
  type: CoreDNS
etcd:
  local:
    dataDir: /var/lib/etcd
imageRepository: registry.aliyuncs.com/google_containers
kind: ClusterConfiguration
kubernetesVersion: v1.18.1
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.244.0.0/16
scheduler: {}
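Before running anything against the edited file, it can help to print just the four fields that were changed and compare them with the real cluster. A small sketch, where show_key_fields is a hypothetical helper name:

```shell
# Hypothetical helper: print the lines of kubeadm config file $1 that set
# kubernetesVersion, advertiseAddress, imageRepository or serviceSubnet.
show_key_fields() {
    grep -E '^[[:space:]]*(kubernetesVersion|advertiseAddress|imageRepository|serviceSubnet):' "$1"
}

# Usage: show_key_fields /data/kubeadm.yaml
```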
Once the file is edited, regenerate the certificates with this configuration:
kubeadm alpha certs renew all --config=/data/kubeadm.yaml
After renewing the certificates, the kubeconfig files need to be regenerated as well:
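# Note: before regenerating, move the existing kubeconfig files out of the way as a backup (or delete them)
mkdir -p /data/kubeconfback
mv /etc/kubernetes/*.conf /data/kubeconfback/
kubeadm init phase kubeconfig all --config=/data/kubeadm.yaml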
Then restart the kube-apiserver, etcd, scheduler, and controller-manager containers:
docker ps | grep -v pause | grep -E "etcd|scheduler|controller|apiserver" | awk '{print $1}' | xargs -r docker restart
Alternatively, restart kubelet:
systemctl restart kubelet
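The control plane needs some time to come back after the restart, so instead of retrying by hand the check can be polled. A minimal sketch follows; retry is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary choices.

```shell
# Hypothetical helper: run the command given after the first two arguments
# up to $1 times, sleeping $2 seconds between attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage: retry 30 10 kubectl get nodes -o wide
```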
A few minutes later the cluster was reachable again and the node information showed up normally.
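If kubectl still reports a certificate error at this point, one possible cause is a stale kubeconfig. This assumes kubectl reads its credentials from ~/.kube/config, as in the usual kubeadm setup where admin.conf is copied there: that copy still embeds the old client certificate, while the regenerated /etc/kubernetes/admin.conf carries the new one. A sketch of refreshing it, with install_kubeconfig as a hypothetical helper name:

```shell
# Hypothetical helper: install kubeconfig $1 as $2, keeping any existing
# file at $2 as "$2.old".
install_kubeconfig() {
    src=$1
    dest=$2
    mkdir -p "$(dirname "$dest")"
    if [ -f "$dest" ]; then
        cp -p "$dest" "$dest.old"
    fi
    cp "$src" "$dest" && chmod 600 "$dest"
}

# Usage: install_kubeconfig /etc/kubernetes/admin.conf "$HOME/.kube/config"
```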
Reference: https://zhuanlan.zhihu.com/p/133654215