When the Client Reports That Production Is Down at Midnight
It was Sunday night. On Saturday night I had upgraded the production environment to the latest version, and on Sunday morning the client verified that the system was working properly. I spent the day freely and went to bed around 11 p.m.
At that very moment, I got a call from one of our developers: the system could not be accessed. I quickly checked the environment and saw that the system was completely down! It is a finance company's system, and it had to be up before Monday morning.
I was confused, because it had been only 3 months since I started my job and I still could not figure out architectural issues on my own. The environment runs on Kubernetes, and the following was the error.
kubectl get pods -o wide
The connection to the server <apiserver_advertise_ip>:6443 was refused - did you specify the right host or port?
I opened a Google Meet with two other DevOps engineers and we tried to figure out the cause. First, we thought the error was caused by closed ports. We also tried restarting the kubelet service, but we couldn't come to a proper solution until 1.00 a.m.
At 1.00 a.m. we finally checked the kubelet logs and realized it had all happened because the Kubernetes certificate had expired. Kubernetes should have told us this in a clearer way. We were using v1.14.10, so the certificate renewal procedure was pretty long.
It was almost 2.00 a.m. when we found the steps to update the certificates. There were four of us, and we didn't forget to keep in contact with the client, who had started testing production at 11.00 p.m. :).
The following blog post, written by one of our early developers, was very useful in finishing the task. At 3.30 a.m. we were able to bring the production environment back up.
By then I was hungry and sleepy. The best lesson: I made a calendar reminder to update the Kubernetes certificate before it expires next year.
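A calendar reminder works, but a small script can also warn before the date arrives. The following is only a minimal sketch, not the procedure we used that night. It assumes the certificate's end date is read with `openssl x509 -enddate -noout`, which prints a line such as `notAfter=Mar  3 12:00:00 2021 GMT`. The path `/etc/kubernetes/pki/apiserver.crt` is an assumption that applies to kubeadm-style clusters, and the function names and the 30-day threshold are made up for illustration.

```python
import ssl
import subprocess
from datetime import datetime, timezone


def days_until_expiry(enddate_line, now=None):
    """Return whole days left before the certificate expires.

    enddate_line is the text printed by `openssl x509 -enddate -noout`,
    for example "notAfter=Mar  3 12:00:00 2021 GMT".
    A negative result means the certificate has already expired.
    """
    cert_time = enddate_line.strip().split("=", 1)[1]
    # ssl.cert_time_to_seconds parses the OpenSSL date format as UTC.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert_time), tz=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


def read_enddate(cert_path):
    """Ask openssl for the notAfter line of a PEM certificate file."""
    result = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", cert_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Assumed path (kubeadm-style layout); adjust for your cluster.
    path = "/etc/kubernetes/pki/apiserver.crt"
    remaining = days_until_expiry(read_enddate(path))
    # Hypothetical threshold: warn one month ahead.
    if remaining < 30:
        print(f"WARNING: {path} expires in {remaining} days")
    else:
        print(f"OK: {path} is valid for {remaining} more days")
```

Run from a daily cron job on a control-plane node, something like this would have given us a month of warning instead of a midnight phone call.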