The DNS is not the root of all problems after all! It’s been a few months now since Scality Release Engineering started noticing weird networking issues in Kubernetes. Our CI workloads were seeing unexplained connection-reset errors on HTTP communication between pods inside the cluster. We were left wondering whether the issue was triggered by the tests themselves (that is, our software) or by the CI components.
Without a clear understanding of the cause, we fell back on good old patience and trial and error. As a first step, we configured bandwidth limits on the HTTP server, which seemed to have some effect: the issue no longer happened often enough to justify investing more time in a deeper investigation.
A few weeks ago, we integrated an HTTP caching proxy to reduce repetitive downloads of software dependencies and system packages, and we also found a way to cache more of the content. At this point, the nasty bug started to hit back, and this time it was angry. Out of patience, we had to take out the microscope and really understand it. I put my head down and created a tool to reproduce connection resets in Kubernetes, available on my GitHub repository.
Pulling in more brainpower, we reached out to Google Cloud support, and that’s where we learned we had uncovered a bug in Kubernetes itself. Following our ticket, a Google engineer opened an issue on the official Kubernetes repository, as well as a PR with a proper explanation:
“Network services with heavy load will cause ‘connection reset’ from time to time. Especially those with big payloads. When packets with sequence number out-of-window arrived k8s node, conntrack marked them as INVALID. kube-proxy will ignore them, without rewriting DNAT. The packet goes back to the original pod, who doesn’t recognize the packet because of the wrong source ip, end up RSTing the connection.”
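If you want to see whether conntrack on one of your nodes is classifying packets as INVALID, one quick diagnostic is a temporary iptables logging rule. This is a sketch, not part of the fix described above; the chain (INPUT vs. FORWARD) depends on how traffic reaches your pods, and the rule should be removed once you have your answer:

```shell
# Probe: log packets that conntrack classifies as INVALID (run as root on a node).
# Diagnostic only -- delete the rule when you are done.
iptables -I FORWARD -m conntrack --ctstate INVALID -j LOG --log-prefix "ct-invalid: "

# Watch the kernel log for hits:
dmesg | grep ct-invalid

# Remove the probe rule afterwards:
iptables -D FORWARD -m conntrack --ctstate INVALID -j LOG --log-prefix "ct-invalid: "
```

The `invalid` counter reported by `conntrack -S` can give a similar signal without adding any rules.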
So, beware of connection-reset errors, whether in your dev environments or, worse, reported by a customer. It might be worth checking that you’re not hitting this issue in your own environment; I wrote this code to help you with that task.
Reproducing the issue is easy: essentially any pod-to-pod communication inside the cluster that moves a fair amount of data can trigger it. I suspect we hit it because of the kind of workload we run in Zenko. Explaining why it happens is the tricky part.
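The general shape of a reproducer is simple: serve a large payload over HTTP, fetch it in a loop from another pod, and count how often the connection is reset. The sketch below illustrates that pattern; the names (`BigHandler`, `run_probe`) and the payload size are my own, not taken from the actual tool, and on a single machine you should of course see zero resets. To exercise the bug, run the server and the probe in separate pods behind a Kubernetes Service:

```python
import http.server
import threading
import urllib.request

# A large response body; big transfers are what surface the bug in-cluster.
PAYLOAD = b"x" * (8 * 1024 * 1024)  # 8 MiB

class BigHandler(http.server.BaseHTTPRequestHandler):
    """Serve the same large payload for every GET request."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):
        pass  # silence per-request logging

def run_probe(url, attempts=5):
    """Download `url` repeatedly and count connection resets."""
    resets = 0
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                resp.read()
        except ConnectionResetError:
            resets += 1
    return resets

if __name__ == "__main__":
    # Local smoke test; in a real repro, point run_probe at a Service URL.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), BigHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    print(f"connection resets: {run_probe(f'http://127.0.0.1:{port}/', attempts=3)}")
    server.shutdown()
```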
Unfortunately, the PR is still in review, but until the fix lands for good, there’s a simple workaround that can be found in Docker’s libnetwork project. You can remove it once the proper fix is included in Kubernetes itself.
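The idea behind the workaround is to drop packets that conntrack has marked INVALID before they can reach a pod with an un-rewritten source address and trigger the RST. A sketch of what that looks like on a node (it would typically be rolled out to every node, for example via a privileged DaemonSet; check the libnetwork change for the exact rule used there):

```shell
# Drop conntrack-INVALID packets instead of letting them reach a pod
# and provoke a connection reset. Run as root on each node.
iptables -I FORWARD -m conntrack --ctstate INVALID -j DROP
```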
Now we can go back to blaming the DNS!