AWS strongly recommends against using the stickiness features in their Elastic Load Balancer. Indeed, it’s nice to evenly distribute all requests and not worry about another layer of logic around “why a request went here.”
Unfortunately, in the real world, some applications really want stickiness. For example, at work we run a geoip service, which returns the location of an IP address. All 20+ nodes running this service have a local copy of the database, to avoid having an off-node dependency on the data. Lookups via the tornado-based service are not fast, however. So we toss varnish in front to cache 4+ GB worth of responses. If clients were hopping around between all nodes in an AZ, cache hit rates would suck. Point being, there are times that you want stickiness, and it doesn’t necessarily mean you have poor application design (so stop saying that, Amazon).
Now, when you point your CNAME at a region’s ELB CNAME, you’ll find that it returns 5 or more IP addresses. Some IPs are associated with individual AZs where you have live instances. There is no load balancing to determine which AZ gets which traffic; instead that’s handled by DNS (and to some extent, ELBs forwarding traffic between themselves). When a client hits ELB(s) in an individual AZ, requests are balanced across all instances in that AZ. This means that you can will observe all your repeated requests end up in a single AZ, but rotating between all instances. Unless you enable stickiness, of course.
So you enable stickiness because you want requests from the same client to end up at the same node (for caching, in this example).
Then the inevitable happens — a few of your nodes in the same AZ have hardware failures. You notice the load getting way too high in the remaining 2 instances, and decide that you’d really prefer to balance out that traffic between three other AZs. Like any logical person, you remove the instance in the AZ from the ELB, expecting the ELB(s) to notice there are no longer any instances in that AZ, and stop sending traffic (there’s nothing to receive it!).
This is when you start hearing about failed requests.
If you only remove instances from an ELB, and don’t remove the AZ itself from the ELB, the cluster of ELBs will still try very hard to forward traffic toward the AZ associated with the shared sticky state they know about. Your traffic is forwarded to, and hits an LB with no instances, and dies.
The moral of the story is: if you use stickiness, and want to disable a zone, you must remove the ZONE from the ELB, not just all instances.
That was an annoying lesson learned 😉
Quick side-note: if you use ELB-generated cookies for stickiness, the ELBs will remove your Cache-Control headers!
It’s best to configure varnish/apache/etc set a Served-By cookie, and tell the ELB to use that. That also makes it really easy to tell which node served a request, if you’re debugging via ‘curl -v’ or similar.
1 Comment »