feat: add TLD-level domain failover #276

Open
sesky4 wants to merge 1 commit into TencentCloud:master from sesky4:endpoint-fallback

sesky4 commented Apr 20, 2026

No description provided.

sesky4 changed the title from "feat: add TLD-level domain failover for all signing methods" to "feat: add TLD-level domain failover" on Apr 20, 2026
sesky4 force-pushed the endpoint-fallback branch 2 times, most recently from 47ce105 to 342802f on April 24, 2026 09:42
sesky4 force-pushed the endpoint-fallback branch from a82ee27 to d47aeb0 on May 19, 2026 07:41
sesky4 force-pushed the endpoint-fallback branch from b8c0be2 to b204044 on June 1, 2026 12:09

Rebuild the SDK's region-failover mechanism around a single OkHttp
interceptor that retries the request against an alternate host when
the current one is unhealthy. Replaces the per-client regionBreaker
plumbing in AbstractClient and the legacy "backup endpoint" branch
that used to live alongside it.

Plan-then-execute pipeline
--------------------------
intercept(request)
  → planFor(request) decides candidate hosts and order
  → plan.run(chain) walks candidates with per-host circuit breakers,
    re-signs each request for its target host, aggregates failures
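
The "walk candidates, aggregate failures" step can be sketched without OkHttp as follows. This is a minimal illustration, not the interceptor's code: `FailoverPlanSketch`, `Attempt`, and `run` are hypothetical names, breakers and re-signing are left out, and using suppressed exceptions for the aggregate is an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: try candidates in order, aggregate failures. */
final class FailoverPlanSketch {

    /** Stand-in for "send the request to this host"; not an OkHttp type. */
    interface Attempt<T> {
        T send(String host) throws IOException;
    }

    /** Returns the first successful result, or throws one aggregated error. */
    static <T> T run(List<String> candidates, Attempt<T> attempt) throws IOException {
        List<IOException> failures = new ArrayList<>();
        for (String host : candidates) {
            try {
                return attempt.send(host);
            } catch (IOException e) {
                // Remember which host failed and why, then move on.
                failures.add(new IOException(host + ": " + e.getMessage(), e));
            }
        }
        IOException aggregated =
                new IOException("all " + candidates.size() + " candidate hosts failed");
        for (IOException failure : failures) {
            aggregated.addSuppressed(failure);
        }
        throw aggregated;
    }
}
```

Calling `FailoverPlanSketch.run(hosts, host -> ...)` returns as soon as one host answers; if none does, the thrown exception carries one suppressed entry per attempted host.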

Two modes share the pipeline:

  * backupEndpoint mode (legacy, opt-in):
      candidates = [origin, <service>.<backupEndpoint>]
      eligible for any host the user configured — including
      region-pinned hosts and proxies.

  * TLD rotation mode (default):
      rotate within the host's TLD family. Three families recognised:
        - plain     tencentcloudapi.{com,cn,com.cn}
        - ai        ai.tencentcloudapi.{com,cn,com.cn}
        - internal  internal.tencentcloudapi.{com,cn,com.cn}
      Region-pinned hosts (cvm.ap-guangzhou.tencentcloudapi.com etc.)
      and unrecognised hosts skip TLD rotation — failing them over
      would silently change the resolved region or send the request
      to a bogus alternate. Only the explicit backupEndpoint opt-in
      may override this.
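
A rough sketch of how TLD-rotation candidates might be derived from a host, under the rules listed above. `TldRotationSketch` and `candidates` are hypothetical names; the ordering of the alternates after the origin and the assumption that hosts arrive lowercased are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of TLD-family classification and rotation. */
final class TldRotationSketch {
    private static final String BASE = "tencentcloudapi.";
    private static final List<String> TLDS = Arrays.asList("com", "cn", "com.cn");
    private static final List<String> FAMILY_LABELS = Arrays.asList("ai", "internal");

    /** Origin first, then the same host under the other TLDs; origin only if not eligible. */
    static List<String> candidates(String host) {
        List<String> originOnly = Collections.singletonList(host);
        int marker = host.lastIndexOf("." + BASE);
        if (marker <= 0) return originOnly;                 // unrecognised host

        String tld = host.substring(marker + 1 + BASE.length());
        if (!TLDS.contains(tld)) return originOnly;         // unknown TLD

        // Labels left of "tencentcloudapi": "cvm", "cvm.ai", "cvm.ap-guangzhou", ...
        String[] labels = host.substring(0, marker).split("\\.", -1);
        boolean plain = labels.length == 1;
        boolean prefixed = labels.length == 2 && FAMILY_LABELS.contains(labels[1]);
        if (labels[0].isEmpty() || !(plain || prefixed)) {
            return originOnly;                              // region-pinned or otherwise unknown
        }

        String stem = host.substring(0, marker + 1 + BASE.length());
        List<String> out = new ArrayList<>();
        out.add(host);
        for (String other : TLDS) {
            if (!other.equals(tld)) out.add(stem + other);
        }
        return out;
    }
}
```

With this sketch, `cvm.ap-guangzhou.tencentcloudapi.com` yields only itself, which mirrors the region-pinned skip described above.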

Failure classification
----------------------
A candidate counts as failed and the next one is tried when:

  Transport-level (chain.proceed throws):
    UnknownHostException, SSL{Handshake,PeerUnverified}Exception,
    ConnectException, NoRouteToHostException,
    PortUnreachableException, SocketTimeoutException

  Protocol-level (chain.proceed returns):
    HTTP status != 200
    Content-Type advertised as JSON but body is not a parseable
    JSON object/array (transparent-proxy block pages, ISP hijacks,
    poisoned bodies)

The body is buffered and the Response rebuilt so downstream parsing
sees a fresh body. SSE and binary responses skip the body check and
look at the status code only. Application errors raised inside the
SDK (signing, deserialisation) propagate immediately.
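
The transport-level and status-code rules above could be expressed roughly like this. `FailureClassifierSketch` and its methods are hypothetical names; the JSON-body check is omitted, and treating every unlisted exception as "not a failover trigger" is an assumption of the sketch, since the description does not say how the interceptor handles them.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.PortUnreachableException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import javax.net.ssl.SSLHandshakeException;
import javax.net.ssl.SSLPeerUnverifiedException;

/** Hypothetical sketch of the failure classification rules. */
final class FailureClassifierSketch {

    /** True for the exception types listed under "Transport-level" (subclasses included). */
    static boolean isTransportFailure(Throwable t) {
        return t instanceof UnknownHostException
                || t instanceof SSLHandshakeException
                || t instanceof SSLPeerUnverifiedException
                || t instanceof ConnectException
                || t instanceof NoRouteToHostException
                || t instanceof PortUnreachableException
                || t instanceof SocketTimeoutException;
    }

    /** True when the status code alone marks the candidate as failed. */
    static boolean isStatusFailure(int httpStatus) {
        return httpStatus != 200;
    }
}
```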

Cost: a 4xx caused by a malformed user request is now retried 3×
before surfacing. Accepted as a trade-off — at the interceptor layer
"user-error 4xx" and "proxy-block 4xx" are indistinguishable, and
the latter is the case worth defending.

Per-host circuit breakers
-------------------------
FailoverState holds a Map<host, CircuitBreaker> plus
preferred_tld_idx and origin_probe_after_ms. Successive failures
trip a host's breaker Open for 60 s; further attempts are skipped
without hitting the network. The last-known-working TLD is tried
first on subsequent calls; the user's original TLD is reprobed
once its cooldown elapses so traffic can return to it after recovery.
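
A per-host breaker map along these lines might look like the sketch below. `HostBreakersSketch` is a hypothetical name, the failure threshold of 3 is an assumption (the description only says "successive failures"), and the injectable clock exists only to make the sketch testable. TLD preference and origin reprobing are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch of per-host circuit breakers with a 60 s open period. */
final class HostBreakersSketch {
    static final int FAILURE_THRESHOLD = 3;   // assumption, not stated in the description
    static final long OPEN_MILLIS = 60_000L;  // 60 s, as described

    private static final class Breaker {
        int consecutiveFailures;
        long openUntilMillis;
    }

    private final Map<String, Breaker> breakers = new HashMap<>();
    private final LongSupplier clock;

    HostBreakersSketch(LongSupplier clock) {
        this.clock = clock;
    }

    /** False while the host's breaker is open, so the attempt is skipped. */
    synchronized boolean allows(String host) {
        Breaker b = breakers.get(host);
        return b == null || clock.getAsLong() >= b.openUntilMillis;
    }

    /** Counts a failure; at the threshold the breaker (re)opens for OPEN_MILLIS. */
    synchronized void recordFailure(String host) {
        Breaker b = breakers.computeIfAbsent(host, h -> new Breaker());
        b.consecutiveFailures++;
        if (b.consecutiveFailures >= FAILURE_THRESHOLD) {
            b.openUntilMillis = clock.getAsLong() + OPEN_MILLIS;
        }
    }

    /** A success clears the host's failure history. */
    synchronized void recordSuccess(String host) {
        breakers.remove(host);
    }
}
```

Once the open period has elapsed the next call is allowed through as a probe; a failed probe reopens the breaker immediately because the failure count is still at the threshold.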

State is scoped per AbstractClient instance, not process-global.
Callers wanting shared state across clients can reuse one client;
callers wanting isolation can construct multiple. This matches the
convention of AWS / Azure SDKs and resilience4j / Hystrix.

Re-signing
----------
Signing inputs are recovered from the outgoing Request and the
credential is read live from AbstractClient on every retry, so
STS / OIDC / CVM-role rotation is honoured between attempts.
Supports TC3-HMAC-SHA256, HmacSHA1, HmacSHA256, and the
"Authorization: SKIP" mode used by some streaming endpoints.

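To illustrate why each retry needs a fresh signature, here is a simplified sketch that assumes, as the re-signing step implies, that the target host is part of the signed input. It is a plain HMAC-SHA256 over a made-up string and is not the TC3-HMAC-SHA256 algorithm; `ResignSketch`, `sign`, and `signForAttempt` are hypothetical names. The `Supplier` stands in for reading the credential live on every attempt.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.function.Supplier;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: host-bound signature, credential read per attempt. */
final class ResignSketch {

    /** Hex HMAC-SHA256 over "host:<host>\n<payload>"; an illustrative format only. */
    static String sign(String secretKey, String host, String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] digest = mac.doFinal(
                    ("host:" + host + "\n" + payload).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the secret at call time so a rotated credential is picked up on retry. */
    static String signForAttempt(Supplier<String> liveSecretKey, String host, String payload) {
        return sign(liveSecretKey.get(), host, payload);
    }
}
```

Because the host is mixed into the signed string, a signature computed for one candidate host is not valid for another, and passing a supplier rather than a captured key means a rotation between attempts changes the next signature.
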
Configuration
-------------
HttpProfile.setDomainFailover(boolean) — opt-out switch
ClientProfile.setBackupEndpoint(String) — unchanged, now routed
through the same pipeline.
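
A usage fragment for the two settings named above, as they might be combined in application code. `setDomainFailover` is the method this PR describes; the backup endpoint value is a hypothetical placeholder, and whether the two settings interact is not stated in the description.

```java
import com.tencentcloudapi.common.profile.ClientProfile;
import com.tencentcloudapi.common.profile.HttpProfile;

// Opt out of TLD rotation for this client.
HttpProfile httpProfile = new HttpProfile();
httpProfile.setDomainFailover(false);

ClientProfile clientProfile = new ClientProfile();
clientProfile.setHttpProfile(httpProfile);

// Legacy opt-in: fall back to <service>.<backupEndpoint>.
// The value below is a placeholder, not a recommendation.
clientProfile.setBackupEndpoint("ap-guangzhou.tencentcloudapi.com");
```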

Backwards compatibility
-----------------------
The following public API on AbstractClient and SSEResponseModel
is retained as @deprecated no-ops / delegates so existing user code
continues to link:

  AbstractClient.{get,set}RegionBreaker(...)
  AbstractClient.processResponseSSE(resp, type, breakerToken)
  AbstractClient.processResponseJson(resp, type, breakerToken)
  SSEResponseModel.setToken(...)

Tests
-----
48 tests in EndpointFailoverInterceptorTest cover host classification,
each family's TLD rotation, region-pinned skip, backupEndpoint mode,
breaker open/close lifecycle, origin reprobe, aggregated failure,
non-200 / invalid-JSON triggering, credential rotation between
retries, and signing-mode-specific re-sign correctness.

sesky4 force-pushed the endpoint-fallback branch from b204044 to a38e5d1 on June 1, 2026 12:14