Twice in the past couple of years I’ve encountered an unusual issue: after a failover in a Redis Sentinel cluster, HAProxy appears to continue routing phantom connections to a node that is no longer the Primary/writeable node. Our clusters use HAProxy together with a Keepalived floating IP/VIP hostname through which all traffic is routed; HAProxy in turn monitors the back ends and routes all traffic to whichever node currently reports itself as the Primary/writeable node.
When this has happened, as far as the GUI, logs and stats info are concerned the failover was detected successfully: every log entry indicates that all traffic is being routed to the new Primary, and there is zero evidence of connections being sent to other back ends. We have only been alerted to the problem by other application logs intermittently generating errors indicating that write requests are hitting read-only nodes in the Redis cluster:
No connection (requires writable - not eligible for replica) is active/available to service this operation
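To see which role a given connection actually lands on, one approach is to send the same probe the HAProxy health check uses (`info replication`) through the VIP and parse the reply. A minimal sketch in Python (host and port are placeholders, not real values; a single `recv` is assumed to be enough for the short reply):

```python
import socket

def fetch_info_replication(host, port, timeout=2.0):
    """Send the same probe the health check uses ('info replication')
    and return the raw reply text. Host/port are placeholders."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(b"info replication\r\n")
        return s.recv(65536).decode("utf-8", "replace")

def parse_role(info_text):
    """Extract the 'role:' field from an INFO replication payload."""
    for line in info_text.splitlines():
        if line.startswith("role:"):
            return line.split(":", 1)[1].strip()
    return None

# Example payload shaped like a replica's INFO replication output:
sample = "# Replication\r\nrole:slave\r\nmaster_host:10.0.0.3\r\n"
print(parse_role(sample))  # -> slave
```

Polling this through the VIP while the error condition is live would confirm whether some fraction of connections really do reach a replica.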
A restart of HAProxy alone does not cure the issue; it takes a full stop and start of the service to clean-slate it. This is something I only discovered by trial and error the first time I encountered the issue.
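One mechanism that would fit the "restart doesn't help, stop/start does" pattern is old worker processes lingering after a seamless reload, still holding established client connections to the old Primary. If that turns out to be the cause, the `hard-stop-after` global directive (available in 2.4) caps how long old workers may survive. A sketch, with an arbitrary example value:

```
global
    # Force old worker processes to exit no more than 30s after a
    # reload/soft-stop, closing any connections they still hold.
    hard-stop-after 30s
```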
We have a number of HAProxy/Redis clusters in our environment, so we probably have upwards of two to three dozen failovers of this type during patching each month, each of which theoretically has the potential to create the error condition (otherwise the services are stable, and failovers outside of patching are rare), so it’s a bit of a needle in a haystack trying to replicate the issue or catch debugging info.
The config is relatively boilerplate for Redis Sentinel (we currently run standard ports in parallel to TLS ports as our Dev Teams are in the process of transitioning across, and the boxes are internal to our network).
frontend REDACTED_prd_redis_frontend
    description REDACTED Service Redis Prod
    bind *:6379
    mode tcp
    option tcplog
    default_backend REDACTED_prd_redis_backend

backend REDACTED_prd_redis_backend
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    server REDACTED03 REDACTED03.local:6379 check inter 1s
    server REDACTED04 REDACTED04.local:6379 check inter 1s
    server REDACTED05 REDACTED05.local:6379 check inter 1s
frontend REDACTED_prd_redis_tls_frontend
    description REDACTED Service Redis over TLS Prod
    bind *:16379
    mode tcp
    option tcplog
    default_backend REDACTED_prd_redis_tls_backend

backend REDACTED_prd_redis_tls_backend
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    server REDACTED03 REDACTED03.local:16379 check check-ssl verify none inter 1s
    server REDACTED04 REDACTED04.local:16379 check check-ssl verify none inter 1s
    server REDACTED05 REDACTED05.local:16379 check check-ssl verify none inter 1s
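When the condition next occurs, it may help to compare what HAProxy believes with reality via the runtime API's `show servers state` command. A rough sketch, assuming a `stats socket` line exists in the config and that the path below matches it (the path is an assumption):

```python
import socket

def show_servers_state(socket_path="/var/run/haproxy.sock"):
    """Query the HAProxy runtime API for its view of server states.
    The socket path is an assumption; it must match a 'stats socket'
    line in the HAProxy config."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendall(b"show servers state\n")
        chunks = []
        while data := s.recv(4096):
            chunks.append(data)
        return b"".join(chunks).decode()

def up_servers(state_text):
    """Return (backend, server) pairs whose srv_op_state is 2 (RUNNING)."""
    up = []
    for line in state_text.splitlines():
        fields = line.split()
        # Skip the version line, the '#' header line, and blanks.
        if len(fields) < 6 or fields[0].startswith("#"):
            continue
        if fields[5] == "2":
            up.append((fields[1], fields[3]))
    return up

# Example shaped like 'show servers state' output:
sample = """1
# be_id be_name srv_id srv_name srv_addr srv_op_state srv_admin_state
3 REDACTED_prd_redis_backend 1 REDACTED03 10.0.0.3 2 0
3 REDACTED_prd_redis_backend 2 REDACTED04 10.0.0.4 0 0
"""
print(up_servers(sample))  # -> [('REDACTED_prd_redis_backend', 'REDACTED03')]
```

Cross-checking this against each node's own `role:` field would show whether HAProxy's state or its session routing is what has gone stale.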
These HAProxy instances run on standard Alma 9/RHEL repo versions and are patched regularly; over the lifetime of the issue they have, I believe, gone through three package versions:
2.4.22-3.el9_3
2.4.22-3.el9_5.1
2.4.22-4.el9
To mention in passing: a few months ago I also had a single instance of a similarly unusual issue on a three-node RabbitMQ cluster with HAProxy configured to send traffic round-robin. After sequential patching and rebooting of the nodes, HAProxy was failing to route traffic to them after they came back up, despite the logs showing the nodes being detected as back in service; this again needed a full stop and start of HAProxy to resolve.
I’ve had a look at the HAProxy bugs page for 2.4.22 and haven’t spotted anything which looks related (and anything of high importance from later 2.4.x versions theoretically should have been backported by Red Hat), and I haven’t been able to find any other reports of a similar issue from searching the net generally.
As a next step in trying to mitigate, I intend to add the following parameters to my config to try to tighten up any potential lingering connections to nodes in the DOWN state:
on-marked-down shutdown-sessions
on-marked-up shutdown-backup-sessions
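For reference, both are per-server keywords, so they can go on each server line or once via `default-server`. A sketch against the non-TLS backend above (note that `shutdown-backup-sessions` only affects servers marked `backup`, so it may be a no-op unless backup servers are added):

```
backend REDACTED_prd_redis_backend
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    # Kill established sessions the moment a server's check fails
    default-server on-marked-down shutdown-sessions on-marked-up shutdown-backup-sessions inter 1s
    server REDACTED03 REDACTED03.local:6379 check
    server REDACTED04 REDACTED04.local:6379 check
    server REDACTED05 REDACTED05.local:6379 check
```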
But is this an issue anybody else has encountered before?
Thanks in advance!