Twice in the past couple of years I’ve encountered an unusual issue: after a failover in a Redis Sentinel cluster, HAProxy appears to continue routing phantom connections to a node that is no longer the Primary/writeable node. Our clusters use HAProxy together with a Keepalived floating IP/VIP hostname through which all traffic is routed; HAProxy in turn monitors the back ends and routes all traffic to whichever node currently reports itself as the Primary/writeable node.
When this has happened, as far as the GUI, logs and stats info are concerned the failover was detected successfully: every log entry indicates that all traffic is being routed to the new Primary, and there is zero evidence of connections being sent to other back ends. We have only been alerted to the problem by other application logs intermittently generating errors indicating that write requests are hitting read-only nodes in the Redis cluster:
No connection (requires writable - not eligible for replica) is active/available to service this operation
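To see which role a given connection actually lands on, one approach is to send the same probe the HAProxy health check uses (`info replication`) through the VIP and parse the reply. A minimal sketch in Python (host and port are placeholders, not real values; a single `recv` is assumed to be enough for the short reply):

```python
import socket

def fetch_info_replication(host, port, timeout=2.0):
    """Send the same probe the health check uses ('info replication')
    and return the raw reply text. Host/port are placeholders."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"info replication\r\n")
        return s.recv(65536).decode("utf-8", "replace")

def parse_role(info_text):
    """Extract the 'role:' field from an INFO replication payload."""
    for line in info_text.splitlines():
        if line.startswith("role:"):
            return line.split(":", 1)[1].strip()
    return None

# Example payload shaped like a replica's INFO replication output:
sample = "# Replication\r\nrole:slave\r\nmaster_host:10.0.0.3\r\n"
print(parse_role(sample))  # -> slave
```

Polling this through the VIP while the error condition is live would confirm whether some fraction of connections really do reach a replica.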
A restart of HAProxy alone does not cure the issue; it takes a full stop and start of the service to clean-slate it. This is something I only discovered by trial and error the first time I encountered the issue.
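One mechanism that would fit the "restart doesn't help, stop/start does" pattern is old worker processes lingering after a seamless reload, still holding established client connections to the old Primary. If that turns out to be the cause, the `hard-stop-after` global directive (available in 2.4) caps how long old workers may survive. A sketch, with an arbitrary example value:

```
global
    # Force old worker processes to exit no more than 30s after a
    # reload/soft-stop, closing any connections they still hold.
    hard-stop-after 30s
```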
We have a number of HAProxy/Redis clusters in our environment, so we probably have upwards of two to three dozen failovers of this type during patching each month, each of which theoretically has the potential to create the error condition (otherwise the services are stable, and failovers outside of patching are rare), so it’s a bit of a needle in a haystack trying to replicate the issue or catch debugging info.
The config is relatively boilerplate for Redis Sentinel (we currently run standard ports in parallel to TLS ports as our Dev Teams are in the process of transitioning across, and the boxes are internal to our network).
frontend REDACTED_prd_redis_frontend
    description REDACTED Service Redis Prod
    bind *:6379
    mode tcp
    option tcplog
    default_backend REDACTED_prd_redis_backend

backend REDACTED_prd_redis_backend
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    server REDACTED03 REDACTED03.local:6379 check inter 1s
    server REDACTED04 REDACTED04.local:6379 check inter 1s
    server REDACTED05 REDACTED05.local:6379 check inter 1s
frontend REDACTED_prd_redis_tls_frontend
    description REDACTED Service Redis over TLS Prod
    bind *:16379
    mode tcp
    option tcplog
    default_backend REDACTED_prd_redis_tls_backend

backend REDACTED_prd_redis_tls_backend
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    server REDACTED03 REDACTED03.local:16379 check check-ssl verify none inter 1s
    server REDACTED04 REDACTED04.local:16379 check check-ssl verify none inter 1s
    server REDACTED05 REDACTED05.local:16379 check check-ssl verify none inter 1s
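When the condition next occurs, it may help to compare what HAProxy believes with reality via the runtime API's `show servers state` command. A rough sketch, assuming a `stats socket` line exists in the config and that the path below matches it (the path is an assumption):

```python
import socket

def show_servers_state(socket_path="/var/run/haproxy.sock"):
    """Query the HAProxy runtime API for its view of server states.
    The socket path is an assumption; it must match a 'stats socket'
    line in the HAProxy config."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(b"show servers state\n")
        chunks = []
        while data := s.recv(4096):
            chunks.append(data)
        return b"".join(chunks).decode()

def up_servers(state_text):
    """Return (backend, server) pairs whose srv_op_state is 2 (RUNNING)."""
    up = []
    for line in state_text.splitlines():
        fields = line.split()
        # Skip the version line, the '#' header line, and blanks.
        if len(fields) < 6 or fields[0].startswith("#"):
            continue
        if fields[5] == "2":
            up.append((fields[1], fields[3]))
    return up

# Example shaped like 'show servers state' output:
sample = """1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state
3 REDACTED_prd_redis_backend 1 REDACTED03 10.0.0.3 2 0
3 REDACTED_prd_redis_backend 2 REDACTED04 10.0.0.4 0 0
"""
print(up_servers(sample))  # -> [('REDACTED_prd_redis_backend', 'REDACTED03')]
```

Cross-checking this against each node's own `role:` field would show whether HAProxy's state or its session routing is what has gone stale.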
These HAProxy instances run on standard Alma 9/RHEL repo versions and are patched regularly; over the lifetime of the issue they have, I believe, gone through three package versions:
2.4.22-3.el9_3
2.4.22-3.el9_5.1
2.4.22-4.el9
To mention in passing: a few months ago I also had a single instance of a similarly unusual issue on a three-node RabbitMQ cluster with HAProxy configured to send traffic round-robin. After sequential patching and rebooting of the nodes, HAProxy was failing to route traffic to them after they came back up, despite the logs showing the nodes being detected as back in service; this again needed a full stop and start of HAProxy to resolve.
I’ve had a look at the HAProxy bugs page for 2.4.22 and haven’t spotted anything which looks related (and anything of high importance from later 2.4.x versions theoretically should have been backported by Red Hat), and I haven’t been able to find any other reports of a similar issue from searching the net generally.
As a next step in trying to mitigate, I intend to add the following parameters to my config to try to tighten up any potential lingering connections to nodes in the DOWN state:
on-marked-down shutdown-sessions
on-marked-up shutdown-backup-sessions
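For reference, both are per-server keywords, so they can go on each server line or once via `default-server`. A sketch against the non-TLS backend above (note that `shutdown-backup-sessions` only affects servers marked `backup`, so it may be a no-op unless backup servers are added):

```
backend REDACTED_prd_redis_backend
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    # Kill established sessions the moment a server's check fails
    default-server on-marked-down shutdown-sessions on-marked-up shutdown-backup-sessions inter 1s
    server REDACTED03 REDACTED03.local:6379 check
    server REDACTED04 REDACTED04.local:6379 check
    server REDACTED05 REDACTED05.local:6379 check
```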
But is this an issue anybody else has encountered before?
Thanks in advance!