Hello,
I am having an issue that I can’t seem to figure out. I’m not convinced it is HAProxy - but I need to eliminate all possibilities.
I’ve used this configuration for more than 4 years now. In the past month I changed data centers, but the hardware configuration is mostly identical. I’ll try to walk through everything in as much detail as possible.
Configuration:
global
    log /dev/log local0 debug
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user ion
    group ion
    daemon
    ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
    ssl-default-bind-options no-sslv3

frontend http-in
    bind *:80
    mode http
    acl http ssl_fc,not
    http-request redirect scheme https if http
    log global

frontend app
    bind *:443 ssl crt <REDACTED>
    option forwardfor
    log global
    option http-keep-alive
    timeout http-keep-alive 1000
    mode http
    acl app3 var(txn.txnhost) -m str -i app3.ion-k12.com
    acl aclcrt_APPFrontEnd var(txn.txnhost) -m reg -i ^([^\.]*)\.website\.com(:([0-9]){1,5})?$
    acl aclcrt_APPFrontEnd var(txn.txnhost) -m reg -i ^website\.com(:([0-9]){1,5})?$
    acl api var(txn.txnhost) -m str -i api.ion-k12.com
    acl aclcrt_APIFrontEnd var(txn.txnhost) -m reg -i ^api\.website\.com(:([0-9]){1,5})?$
    acl public var(txn.txnhost) -m beg -i public.website.com
    acl app var(txn.txnhost) -m str -i testing.website.com
    acl aclcrt_TestingFrontEnd var(txn.txnhost) -m reg -i ^([^\.]*)\.website\.com(:([0-9]){1,5})?$
    acl aclcrt_TestingFrontEnd var(txn.txnhost) -m reg -i ^website\.com(:([0-9]){1,5})?$
    http-request set-var(txn.txnhost) hdr(host)
    default_backend appservers
    option httplog
    option logasap
    http-request capture req.hdr(Content-Length) len 15

backend appservers
    mode http
    log /dev/log local0 debug
    balance roundrobin
    timeout connect 300s
    timeout server 300s
    retries 3
    server APP01 192.168.10.50:80 id 10101 check port 80 inter 1000
    server APP02 192.168.10.51:80 id 10102 check port 80 inter 1000

defaults
    log global
    mode http
    option tcplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
APP01:
IIS
Server 2022
128 GB Memory
Intel X550 (Dual) NIC
APP02:
IIS Configured Identically
Server 2022
128 GB Memory
Mellanox ConnectX-4 LX NIC
When APP01 and APP02 are both active, we get intermittent 504 errors. When I isolate the problem, APP02 is the source of the trouble. However, when I browse to the site locally on APP02, it performs normally. When I browse to it from APP01, it also performs normally. It only returns 504s when APP02 is in the LB cluster.
I’m wondering if there is something about the Mellanox NIC that HAProxy doesn’t like. I can’t find any notes or documentation about it, so I have no way of confirming. It should just work, right? A NIC is a NIC.
Here are a few of the logs. I see that requests are timing out after 30 seconds with an sH termination state, but they shouldn’t be, especially when the server is responsive locally.
Nov 14 14:01:42 k17-ru21 haproxy[506882]: 69.135.X.X:28175 [14/Nov/2023:14:01:12.825] app~ appservers/APP02 0/1/30029 198 sH 24/22/3/3/0 0/0
Nov 14 14:01:42 k17-ru21 haproxy[506882]: 69.135.X.X:16805 [14/Nov/2023:14:01:12.928] app~ appservers/APP02 0/0/30040 198 sH 24/22/2/2/0 0/0
Nov 14 14:01:43 k17-ru21 haproxy[506882]: 69.135.X.X:37404 [14/Nov/2023:14:01:12.865] app~ appservers/APP02 0/1/30163 198 sH 24/22/1/1/0 0/0
Nov 14 14:01:43 k17-ru21 haproxy[506882]: 69.135.X.X:59170 [14/Nov/2023:14:01:13.211] app~ appservers/APP02 0/0/30098 198 sH 24/22/0/0/0 0/0
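To make the timers in these lines easier to compare, here is a minimal Python sketch that pulls them apart. It assumes the lines follow HAProxy's TCP log layout (`Tw/Tc/Tt bytes_read termination_state ...`), which is what the field count suggests. The names `parse_line`, `TcpLogEntry` and `describe_state` are made up for illustration, and the flag descriptions cover only the two flags that appear here.

```python
import re
from typing import NamedTuple, Optional

# Assumption: syslog prefix followed by HAProxy's TCP log format.
LOG_RE = re.compile(
    r"haproxy\[\d+\]: "
    r"(?P<client>\S+) "
    r"\[(?P<accept_date>[^\]]+)\] "
    r"(?P<frontend>\S+) "
    r"(?P<backend>[^/\s]+)/(?P<server>\S+) "
    r"(?P<tw>-?\d+)/(?P<tc>-?\d+)/(?P<tt>\+?\d+) "  # '+' appears with logasap
    r"(?P<bytes_read>\+?\d+) "
    r"(?P<state>\S{2}) "
)

# Only the flags seen in the sample lines; the HAProxy manual lists the rest.
FIRST_FLAG = {"s": "server-side timeout expired"}
SECOND_FLAG = {"H": "proxy was waiting for response headers from the server"}


class TcpLogEntry(NamedTuple):
    server: str
    tw_ms: int   # time spent waiting in queues
    tc_ms: int   # time to establish the TCP connection to the server
    tt_ms: int   # total session duration
    state: str   # two-character termination state


def parse_line(line: str) -> Optional[TcpLogEntry]:
    """Return the timers and termination state, or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return TcpLogEntry(m["server"], int(m["tw"]), int(m["tc"]), int(m["tt"]), m["state"])


def describe_state(state: str) -> str:
    """Translate the two termination flags, falling back to a pointer to the manual."""
    first = FIRST_FLAG.get(state[0], "see the HAProxy manual")
    second = SECOND_FLAG.get(state[1], "see the HAProxy manual")
    return f"{first}; {second}"


SAMPLE_LINES = [
    "Nov 14 14:01:42 k17-ru21 haproxy[506882]: 69.135.X.X:28175 [14/Nov/2023:14:01:12.825] app~ appservers/APP02 0/1/30029 198 sH 24/22/3/3/0 0/0",
    "Nov 14 14:01:42 k17-ru21 haproxy[506882]: 69.135.X.X:16805 [14/Nov/2023:14:01:12.928] app~ appservers/APP02 0/0/30040 198 sH 24/22/2/2/0 0/0",
    "Nov 14 14:01:43 k17-ru21 haproxy[506882]: 69.135.X.X:37404 [14/Nov/2023:14:01:12.865] app~ appservers/APP02 0/1/30163 198 sH 24/22/1/1/0 0/0",
    "Nov 14 14:01:43 k17-ru21 haproxy[506882]: 69.135.X.X:59170 [14/Nov/2023:14:01:13.211] app~ appservers/APP02 0/0/30098 198 sH 24/22/0/0/0 0/0",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        entry = parse_line(line)
        if entry is not None:
            print(entry, "->", describe_state(entry.state))
```

In all four lines Tc is 0 or 1 ms while Tt is just over 30000 ms, so the TCP connection to APP02 is established quickly and the time is spent waiting for response headers.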
Is it possible that there is some kind of routing or network problem between HAProxy and APP02 that would be causing these 504s? With a direct connection to APP02, most requests are served in milliseconds. But put HAProxy in front, and they take 30+ seconds.
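One way to take HAProxy out of the picture while keeping the same network path is to time plain HTTP requests from the HAProxy host straight to each backend IP. The following is only a sketch: `time_request` is a hypothetical helper, and the path, Host header and attempt count are assumptions that would need adjusting to whatever the IIS sites actually expect.

```python
import http.client
import sys
import time
from typing import Optional, Tuple


def time_request(host: str, port: int = 80, path: str = "/",
                 host_header: Optional[str] = None,
                 timeout: float = 60.0) -> Tuple[int, float]:
    """Send one GET and return (status, seconds until response headers arrive)."""
    headers = {"Host": host_header} if host_header else {}
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    start = time.monotonic()
    try:
        conn.request("GET", path, headers=headers)
        resp = conn.getresponse()  # returns once status line and headers are read
        elapsed = time.monotonic() - start
        resp.read()  # drain the body so the connection closes cleanly
        return resp.status, elapsed
    finally:
        conn.close()


if __name__ == "__main__":
    # Run on the HAProxy host, for example:
    #   python3 probe.py 192.168.10.50 192.168.10.51
    # Does nothing when no backend IPs are given.
    for backend_ip in sys.argv[1:]:
        for attempt in range(20):  # assumption: 20 tries is enough to hit an intermittent stall
            try:
                status, elapsed = time_request(backend_ip)
                print(f"{backend_ip} try {attempt}: HTTP {status} in {elapsed:.3f}s")
            except OSError as exc:  # covers timeouts and connection errors
                print(f"{backend_ip} try {attempt}: failed ({exc})")
```

If APP02 also stalls here while APP01 does not, the problem is likely between the HAProxy host and APP02 (or on APP02's NIC) rather than in HAProxy itself. If both respond quickly, the difference is more likely in how HAProxy talks to the server.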
I’m at my wits’ end here…
Thanks for any help or info anyone can provide.