@eedwards-sk wrote:
I’m attempting to use HAProxy Resolvers along with SRV Records and server-template to allow services on dynamic ports to register with HAProxy.
I’m using AWS Service Discovery (with Route53, TTL: 10s) and ECS.
It works successfully, given enough time, and any services in the DNS record eventually become available backends.
If I have 2 containers running for a service and 4 slots defined with server-template, then the first 2 slots show up “green” and the remaining 2 show up “red”.
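As I understand it from the server-template docs (so treat this as a sketch, not a verified description), a 4-slot template expands into four numbered server entries that the resolver fills from whatever SRV answers it currently has; with only 2 answers, the remaining slots are left without an address. Using the example record name from further down:

    backend EXAMPLE_SKETCH
        # sketch only: roughly what
        #   server-template web 4 _foo_.my.service resolvers aws-sd check init-addr none
        # turns into when the SRV record has two answers (A and B)
        server web1 _foo_.my.service resolvers aws-sd check init-addr none   # filled from answer A -> "green"
        server web2 _foo_.my.service resolvers aws-sd check init-addr none   # filled from answer B -> "green"
        server web3 _foo_.my.service resolvers aws-sd check init-addr none   # no SRV answer -> "red"
        server web4 _foo_.my.service resolvers aws-sd check init-addr none   # no SRV answer -> "red"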
During an HA deployment, where the 2 containers are replaced 1 by 1, HAProxy fails to register the updated records in time to prevent an outage.
So e.g. during a deployment, you might have an SRV record with 2 results:

_foo_.my.service:
- A._foo.my.service
- B._foo.my.service
As the first container (A) is stopped, the SRV record only returns 1 result:

_foo_.my.service:
- B._foo.my.service
At this point, I would expect HAProxy to remove the server from the server list, so it would appear “red”, similar to the other servers that were missing when the service started.

However, instead the server ends up marked as “MAINT” (orange) due to “resolution”, and it can sit “stuck” that way for 5+ minutes, failing to acquire the new IP information.
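(For reference, the per-server state can also be dumped over the admin socket from the config below; “show servers state” is a standard runtime-API command, and DEV_HOME is one of my backends:)

    $ echo "show servers state DEV_HOME" | socat stdio /run/haproxy/admin.sock
    # dumps per-server state for the backend (name, address, operational state,
    # admin state), including slots held in maintenance by DNS resolution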
Meanwhile, the SRV record is updated again as the services are replaced/updated:

_foo_.my.service:
- B._foo.my.service
- C._foo.my.service
Then again as B is removed:

_foo_.my.service:
- C._foo.my.service
And finally, D is added:

_foo_.my.service:
- C._foo.my.service
- D._foo.my.service
This whole time, running

dig SRV _foo_.my.service @{DNS_IP}

on the HAProxy host immediately returns the correct service IPs and ports at each of the deployment steps above, so the problem is not stale upstream DNS.

This makes the SRV system basically useless to me currently, because even with a rolling deployment of HA services, I end up with an outage.
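For example, at the C+D step the answer section looks something like this (ports and exact target names here are only illustrative of what the AWS Cloud Map records contain; the TTL is the 10s I configured):

    $ dig SRV _foo_.my.service @{DNS_IP} +noall +answer
    _foo_.my.service. 10 IN SRV 1 1 32768 C._foo.my.service.
    _foo_.my.service. 10 IN SRV 1 1 32770 D._foo.my.service.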
I have 2 HAProxy servers and the behavior is not identical between them, either (even though they’re identically configured).
Whether one of the server entries stays in “MAINT” for long seems to vary between them.
Eventually it does resolve, but waiting 5+ minutes with the services completely unavailable (even though they’re up, DNS is updated, and they’re ready to receive traffic) is not adequate for production usage.
Here’s a sanitized and trimmed config excerpt:
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

    # Default SSL material locations
    ca-base /etc/ssl/certs
    crt-base /etc/ssl/private

    # See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
    ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
    ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11 no-tls-tickets

    spread-checks 5

defaults
    log global
    mode http
    option httplog
    option dontlognull
    timeout connect 5000
    timeout client 50000
    timeout server 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
    option httpclose
    monitor-uri /elb-check
    maxconn 60000
    rate-limit sessions 100
    backlog 60000

resolvers aws-sd
    accepted_payload_size 8192
    hold valid 5s # keep valid answer for up to 5s
    nameserver aws-sd1 169.254.169.253:53

listen stats
    bind 0.0.0.0:9000
    mode http
    balance
    stats enable
    stats uri /stats
    stats realm HAProxy\ Statistics

frontend HTTP_IN
    bind 0.0.0.0:80

    capture request header User-Agent len 200
    capture request header Host len 54
    capture request header Origin len 54
    capture request header X-Forwarded-For len 35
    capture request header X-Forwarded-Proto len 5
    capture response header status len 3

    option http-server-close
    option forwardfor except #sanitized#
    option forwardfor except #sanitized#

    # environments
    acl dev hdr_beg(host) #sanitized#. #sanitized#.

    # web-services routes
    acl locations path_beg /locations

    # dev backend
    use_backend DEV_HOME if dev !locations
    use_backend DEV_LOCATIONS if dev locations

backend DEV_HOME
    balance roundrobin
    option httpchk GET /healthcheck
    http-check expect status 200
    default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
    server-template web 4 _http._tcp.web-service-home-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4

backend DEV_LOCATIONS
    balance roundrobin
    option httpchk GET /locations/healthcheck
    http-check expect status 200
    default-server inter 10s downinter 2s fastinter 2s rise 5 fall 2
    server-template web 4 _http._tcp.web-service-locations-dev-web.my.service resolvers aws-sd check init-addr none resolve-opts allow-dup-ip resolve-prefer ipv4
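For completeness, these are the resolvers-section timing parameters that, as far as I can tell from the docs, control how long a template slot can stay in that resolution-triggered MAINT state; the values below are only illustrative, not something I have verified as a fix:

    resolvers aws-sd
        nameserver aws-sd1 169.254.169.253:53
        accepted_payload_size 8192
        resolve_retries 3        # attempts before a resolution is considered failed
        timeout resolve 1s       # interval at which resolutions are triggered
        timeout retry   1s       # delay between retries of a failed resolution
        hold valid    5s         # how long the last valid answer is kept
        hold obsolete 30s        # how long to keep a server whose record disappeared from the answer
        hold timeout  30s        # how long to keep the last answer after a query timeout
        hold refused  30s        # ... after a REFUSED response
        hold nx       30s        # ... after an NXDOMAIN response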