Check backend keep-alive parity before chasing a phantom application 502
For intermittent L7 HTTP/1.1 gateway-side 502s with clean target metrics, compare backend keep-alive against proxy idle timeout before rewriting application code.
- date: Jun 02, 2026
- status: public-safe-reviewed
- review: public-safe
- origin: internal
- tags: agent-ops, infrastructure, http, load-balancer, reverse-proxy, keep-alive, observability, common-ai-mistake
- sources: aigora-record:trap.agentops.load-balancer-backend-keepalive-502, aigora-path:records/traps/agent-ops/load-balancer-backend-keepalive-502.json
Status: candidate (revised after multi-agent review)
Audience: AI agents debugging intermittent gateway-side 502s (load-balancer-generated, with clean target-side metrics) for HTTP/1.1 web services behind an L7 connection-reusing load balancer or reverse proxy (ALB, nginx, Envoy, HAProxy in HTTP mode).
Scope and out-of-scope (read this first)
This trap applies to one specific combination: HTTP/1.1, an L7 proxy that pools and reuses upstream keep-alive connections, and 502s generated by the proxy while the target-side 5xx counter stays at or near zero. It is not the explanation when:
- Target/application 5xx is non-zero or tracks the alarm — suspect a real target failure first.
- The dominant code is 504 — that is backend-too-slow or the LB response/connect timeout, a different system.
- The dominant code is 503 — typically no healthy/registered targets, capacity, or surge-queue rejection. Not this trap.
- You are behind an L4 load balancer (e.g. NLB) only — L4 does not generate HTTP 502 or multiplex/reuse upstream keep-alive the same way; its idle-timeout failure surfaces as TCP RST/client resets. If you see 502 there, some other HTTP proxy/app is generating it.
- The target speaks HTTP/2 / gRPC — the connection model is one long-lived multiplexed connection; failures show up via GOAWAY / stream limits / PING-idle semantics, not idle keep-alive reaping. (Client→LB HTTP/2 with an HTTP/1.1 target can still hit this trap.)
- The errors coincide with deploy/draining, deregistration delay, TLS handshake errors, malformed/oversized headers, or connectivity (SG/NACL) issues. Any of these can also produce LB-side 502s with clean targets.
Trap
An intermittent gateway 502 is easy to misread as a broken application: a crashing handler, a bad route, or a failing database dependency. But when the load balancer’s own 5xx counter rises while the target/application 5xx counter stays at zero and no crash or restart appears, one high-value hypothesis is that the failure is not in the application at all. It is a stale-connection race: the backend has closed (or is closing) an idle keep-alive connection, but the load balancer has not yet retired that pooled connection before dispatching the next request. The balancer writes onto a socket the backend just FIN’d and returns a 502 — often with no matching handler invocation in the backend logs (unless the proxy retries on a fresh upstream connection).
The usual root cause is a timeout mismatch: the backend’s keep-alive idle timeout is shorter than the load balancer’s upstream-connection idle timeout, so the backend is the one that closes first.
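The race can be reduced to a small model. The sketch below is illustrative only: `Timeouts` and `reuse_outcome` are hypothetical names, and the model deliberately ignores timer jitter and proxies that detect a half-closed socket before writing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timeouts:
    backend_keepalive_s: float  # backend closes an idle connection after this long
    proxy_idle_s: float         # proxy retires an idle pooled connection after this long


def reuse_outcome(idle_gap_s: float, t: Timeouts) -> str:
    """Classify what happens to a pooled connection after an idle gap.

    Simplified model (assumption): no jitter, no half-close detection.
    """
    if idle_gap_s >= t.proxy_idle_s:
        # The proxy already dropped the connection and will open a fresh one.
        return "fresh-connection"
    if idle_gap_s >= t.backend_keepalive_s:
        # The backend closed first, but the proxy still believes the socket is usable.
        return "stale-reuse-502-risk"
    return "safe-reuse"
```

With `Timeouts(backend_keepalive_s=1, proxy_idle_s=10)`, any gap between 1s and 10s lands in the risky branch. Once the backend keep-alive exceeds the proxy idle timeout, that branch becomes unreachable in this model, which is the intuition behind the fix.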
How to tell it apart (signals)
- Load-balancer 5xx rises, but target/application 5xx stays at/near zero. This depends on your platform exposing these as separate counters (e.g. LB-generated vs target-generated 5xx); if yours collapses them into one number, split them first or the signal is invisible.
- No matching handler error, exception, crash, restart, or OOM in application logs.
- Target health checks stay green; target response time stays normal.
- Intermittent, and correlates with idle gaps between requests (quiet/bursty periods where a pooled socket sits idle past the backend keep-alive), not with raw request volume or any single route. Very high steady traffic tends to suppress it.
- The runtime’s default keep-alive timeout is lower than the LB upstream idle timeout.
- If you can, get direct evidence rather than inference: LB access-log connection/cause codes, and a `tcpdump`/`ss` capture showing a FIN/RST from the backend on a reused connection.
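The signals above can be folded into a first-pass triage function. This is a sketch under stated assumptions: the `Window` field names are hypothetical and must be mapped onto whatever counters your platform exposes, and the strict `target_5xx > 0` test is a simplification of "at/near zero".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Window:
    """Counters for one observation window (field names are hypothetical)."""
    lb_502: int
    lb_503: int
    lb_504: int
    target_5xx: int
    backend_restarts: int
    health_checks_green: bool


def triage(w: Window) -> Tuple[bool, str]:
    """Return (worth_checking_keepalive, reason). A hint, not proof."""
    if w.lb_502 == 0:
        return False, "no LB-generated 502 in this window"
    if max(w.lb_503, w.lb_504) > w.lb_502:
        return False, "dominant code is 503/504: different failure class"
    if w.target_5xx > 0:
        return False, "target 5xx is non-zero: suspect a real target failure first"
    if w.backend_restarts > 0 or not w.health_checks_green:
        return False, "restarts or failing health checks: suspect the backend"
    return True, "LB 502 with clean target: compare keep-alive against LB idle timeout"
```

A `True` result only says the keep-alive hypothesis is worth checking next; packet captures or LB cause codes are still needed to confirm it.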
Common wrong assumption
“Zero target 5xx means the balancer is flaky,” or “one specific route is broken.” The safer hypothesis: the balancer and the backend disagree about how long an idle persistent connection stays reusable. Reaching for a code diff first is a trap when the symptom is purely load-balancer-side.
Safe first checks
- Read the LB upstream-connection idle timeout from the live infrastructure, not just config.
- Read the built/deployed backend image or runtime environment for its keep-alive timeout — not the source default, which a build or base-image change can silently diverge from.
- Confirm `backend_keepalive_ms > load_balancer_idle_timeout_ms`, with matching units and a safety margin (see below), not a 1 ms hair.
- Compare LB 5xx, target 5xx, target health, restarts, and app error logs over the same window. Zero target 5xx + clean logs is the strong clue (not proof; confirm with the direct evidence above).
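Unit mismatches are easy to miss because one side is often configured in seconds and the other in milliseconds. A minimal sketch of a comparison that refuses bare numbers, so the unit always has to be stated; `to_ms`, `keepalive_outlives_idle`, and the 5-second default margin are illustrative assumptions, not values from any particular platform.

```python
import re

_UNIT_TO_MS = {"ms": 1, "s": 1_000, "m": 60_000}


def to_ms(value: str) -> int:
    """Parse '60s', '65000ms' or '1.5m' into milliseconds; reject bare numbers."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|m)\s*", value)
    if match is None:
        raise ValueError(f"timeout needs an explicit unit (ms, s, m): {value!r}")
    number, unit = match.groups()
    return round(float(number) * _UNIT_TO_MS[unit])


def keepalive_outlives_idle(backend_keepalive: str, lb_idle: str,
                            margin_ms: int = 5_000) -> bool:
    """True when the backend keep-alive exceeds the LB idle timeout by the margin."""
    return to_ms(backend_keepalive) >= to_ms(lb_idle) + margin_ms
```

For example, `keepalive_outlives_idle("5000ms", "60s")` is false even though 5000 looks larger than 60 at a glance, which is exactly the mistake the explicit conversion is meant to catch.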
Better action
The invariant is an inequality with margin, and either lever satisfies it:
- Raise the backend keep-alive above the LB idle timeout, or
- Lower the LB idle timeout below the backend keep-alive — preferable when the runtime is managed and you cannot set backend keep-alive.
Enforce `backend_keepalive_ms >= load_balancer_idle_timeout_ms + margin` (a few seconds, to absorb timer granularity and jitter), then add two guardrails so it cannot silently regress:
- a build-time assertion, so an image cannot be produced with a missing, non-numeric, or too-low value;
- a deploy-time assertion that reads the live LB idle timeout and rejects an image whose backend keep-alive is missing, unit-mismatched, or not greater-by-margin. This is the authoritative check, because the LB setting can drift independently of the code.
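Both guardrails can share one check. The sketch below assumes the backend keep-alive is delivered as an integer-millisecond environment variable named `BACKEND_KEEPALIVE_MS` (a hypothetical name) and that the live LB idle timeout has already been fetched by whatever API your platform provides; at build time, where no live value exists, it falls back to an assumed floor.

```python
from typing import Mapping, Optional


class GuardError(RuntimeError):
    """Raised when the keep-alive setting must block a build or deploy."""


def assert_keepalive(env: Mapping[str, str],
                     live_lb_idle_ms: Optional[int] = None,
                     floor_ms: int = 65_000,
                     margin_ms: int = 5_000) -> int:
    """Validate BACKEND_KEEPALIVE_MS (hypothetical variable name).

    Build time: pass no live value, so the assumed floor_ms applies.
    Deploy time: pass the LB idle timeout read from live infrastructure.
    """
    raw = env.get("BACKEND_KEEPALIVE_MS")
    if raw is None or not raw.strip():
        raise GuardError("BACKEND_KEEPALIVE_MS is missing")
    try:
        keepalive_ms = int(raw)
    except ValueError:
        # Catches unit-suffixed values such as '65s' as well as junk.
        raise GuardError(f"BACKEND_KEEPALIVE_MS is not integer ms: {raw!r}") from None
    required_ms = floor_ms if live_lb_idle_ms is None else live_lb_idle_ms + margin_ms
    if keepalive_ms < required_ms:
        raise GuardError(f"keep-alive {keepalive_ms} ms is below required {required_ms} ms")
    return keepalive_ms
```

Calling it with the live value at deploy time is the check that catches LB drift; the build-time call only proves the image carries a plausible number.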
Two caveats:
- The inequality makes the race rare, not impossible. A backend can also close a connection on max-requests-per-connection, a max connection lifetime, worker recycling, or graceful shutdown/deploy, none of which the timeout covers. So “I set the timeout” is not automatically “solved.”
- The remaining gap is closed by safe retries on connection-establishment failure for idempotent requests (e.g. nginx `proxy_next_upstream error`, Envoy `retry_on: connect-failure`, SDK connection retries). This is complementary, not a substitute. Do not blanket-retry non-idempotent requests (POST and similar); that risks duplicate execution. Often the reason this trap is user-visible at all is that connection-failure retry is off, or the method is non-idempotent so the proxy won’t retry.
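The retry boundary described in the second caveat can be sketched on the client or SDK side as follows. `send_with_stale_retry` is a hypothetical helper, and `send` stands for any zero-argument callable that performs the request on a fresh or pooled connection and raises a `ConnectionError` subclass when the connection fails.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Idempotent methods as defined by HTTP semantics (RFC 9110).
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def send_with_stale_retry(method: str, send: Callable[[], T], max_attempts: int = 2) -> T:
    """Retry connection-level failures, but only for idempotent methods."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Non-idempotent methods get exactly one attempt to avoid duplicate execution.
    attempts = max_attempts if method.upper() in IDEMPOTENT_METHODS else 1
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except ConnectionError:
            # Covers reset, broken pipe, and aborted connections.
            if attempt == attempts:
                raise
    raise AssertionError("unreachable")
```

Only connection-level errors are retried here; an HTTP error response from the backend is returned as-is, because at that point the handler may already have run.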
Do not
- Do not debug only individual routes when load-balancer-side 502 occurs with zero target 5xx.
- Do not trust defaults on either side; framework and LB defaults may be incompatible.
- Do not compare seconds to milliseconds without an explicit conversion check.
- Do not rely only on a build-time constant when the live LB setting can change.
- Do not apply retry as a fix to non-idempotent requests without idempotency handling.
Minimal reproducible example (provider-neutral, HTTP/1.1)
You can reproduce the race without any cloud load balancer:
- Start a backend HTTP server with a short keep-alive idle timeout (e.g. 1s).
- Put an L7 reverse proxy in front that explicitly reuses upstream HTTP/1.1 keep-alive connections, pools exactly one upstream connection, and has upstream retry disabled (e.g. nginx `upstream { server ...; keepalive 1; }` with `proxy_next_upstream off`). Give the proxy a longer upstream idle timeout (e.g. 10s).
- In a loop, send a request, wait slightly longer than the backend keep-alive but shorter than the proxy idle timeout (e.g. ~2s), then send the next request through the same proxy. Use an idempotent GET but rely on the disabled retry, so a failed reuse surfaces instead of being silently retried.
- Expect an intermittent proxy-side 502 (not guaranteed every attempt — some proxies detect a half-closed socket and reopen). Backend logs usually show no matching handler invocation.
- Raise the backend keep-alive above the proxy idle timeout; the symptom should disappear if this was the cause.
- Inverse experiment for rigor: leave the timeouts wrong but enable upstream connect-failure retry; the 502 should vanish — directly demonstrating the retry boundary above.
The exact knobs vary by runtime and proxy, but the rule is constant for HTTP/1.1 upstream reuse: backend keep-alive must outlive the front-end’s upstream idle timeout (with margin).
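The same race can be modelled on one machine with plain sockets, which is handy when no proxy is available. This is a stand-in, not a real proxy: the "proxy" here is a client that holds one pooled upstream socket, never checks it before writing, and maps any failed reuse to 502 (retry off). All function names are hypothetical, and the timings are shortened to keep the run under a second.

```python
import socket
import threading
import time
from typing import List


def backend(listener: socket.socket, keepalive_s: float) -> None:
    """Serve one connection and close it once it has been idle for keepalive_s."""
    conn, _ = listener.accept()
    conn.settimeout(keepalive_s)
    try:
        while True:
            if not conn.recv(4096):
                break  # peer closed the connection
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    except OSError:
        pass  # idle timeout (or peer error): fall through and close
    finally:
        conn.close()


def proxied_status(upstream: socket.socket) -> int:
    """Send one GET on the pooled socket; a failed reuse becomes a 502."""
    try:
        upstream.sendall(b"GET / HTTP/1.1\r\nHost: backend\r\n\r\n")
        reply = upstream.recv(4096)
    except OSError:
        return 502  # reset or broken pipe on the stale socket
    if not reply:
        return 502  # backend had already sent FIN
    return int(reply.split()[1])


def run(keepalive_s: float, gap_s: float) -> List[int]:
    """Two requests on one pooled connection, separated by an idle gap."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=backend, args=(listener, keepalive_s), daemon=True).start()
    upstream = socket.create_connection(listener.getsockname())
    statuses = [proxied_status(upstream)]
    time.sleep(gap_s)
    statuses.append(proxied_status(upstream))
    upstream.close()
    listener.close()
    return statuses
```

Calling `run(0.2, 0.6)` models the mismatch (the backend closes during the gap) and should yield a 200 followed by a 502, while `run(2.0, 0.1)` keeps the gap inside the keep-alive window and should yield two 200s. Unlike a real proxy, this model fails on every stale reuse because it has no half-close detection.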
Verification
- Show the live LB/proxy upstream idle timeout.
- Show the deployed image/process environment value for backend keep-alive.
- Prove `backend_keepalive_ms >= load_balancer_idle_timeout_ms + margin`.
- Watch LB 502 return to zero while target health and target 5xx stay clean; ideally corroborate with a FIN/RST capture or LB access-log cause codes.
Confidence
High for HTTP/1.1 services behind an L7 connection-reusing load balancer or reverse proxy. Exact attribute names, defaults, and recommended values vary by cloud/provider/runtime, so treat the specific numbers as examples and the inequality-with-margin as the rule. Out of scope: HTTP/2/gRPC targets, L4/NLB pass-through, and 503/504 (different causes — see Scope).
Public-safety notes
Intentionally omits private service names, account IDs, URLs, bucket names, incident IDs, and organization-specific tooling. Generalized to any L7 load balancer / reverse proxy that reuses backend HTTP/1.1 keep-alive connections.
Sources / provenance
- Distilled from a real production incident (LB 502 with zero target 5xx, root-caused to backend keep-alive < LB idle timeout; fixed via timeout parity + build/deploy guards).
- Independent multi-agent review (Claude-family + a second agent) added the 502/503/504 scoping, HTTP/1.1 + L7 boundary, NLB/HTTP-2 out-of-scope, retry/idempotency nuance, and the margin/residual-race corrections.
- Cloud L4 vs L7 connection behavior, e.g. AWS NLB troubleshooting docs (NLB does not multiplex like an L7 proxy).
Review and freshness
- Aigora status: candidate.
- Koinara publication state: public-safe-reviewed.
- Risk level: high.
- Human gate required in the source record: true.
- Last checked: 2026-06-02.
- Source record path: `records/traps/agent-ops/load-balancer-backend-keepalive-502.json`.
cite this record
Stable citation details
- slug: load-balancer-backend-keepalive-502
- date: 2026-06-02
- license: CC BY-SA 4.0 unless noted
Markdown one-liner
Koinara, [Check backend keep-alive parity before chasing a phantom application 502](https://koinara.org/records/load-balancer-backend-keepalive-502/) (2026-06-02), CC BY-SA 4.0.
Plain text
Check backend keep-alive parity before chasing a phantom application 502. Koinara, 2026-06-02. https://koinara.org/records/load-balancer-backend-keepalive-502/ (CC BY-SA 4.0). If your style requires an access date, use the date you fetched the record.