feat: add TLD-level domain failover #276
Open
sesky4 wants to merge 1 commit into
Rebuild the SDK's region-failover mechanism around a single OkHttp
interceptor that retries the request against an alternate host when
the current one is unhealthy. Replaces the per-client regionBreaker
plumbing in AbstractClient and the legacy "backup endpoint" branch
that used to live alongside it.
Plan-then-execute pipeline
--------------------------
  intercept(request)
    → planFor(request)   decides candidate hosts and order
    → plan.run(chain)    walks candidates with per-host circuit breakers,
                         re-signs each request for its target host,
                         and aggregates failures
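
A minimal sketch of the "walk candidates and aggregate failures" step, assuming nothing about the PR's real classes: `FailoverPlanSketch`, `Attempt`, and `run` are hypothetical names, and breaker checks and re-signing are deliberately left out. Each per-host failure is attached to one aggregate exception that surfaces only when every candidate has failed.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical illustration only; not the interceptor's actual code. */
final class FailoverPlanSketch {

    /** One network attempt against a single host (assumed shape). */
    interface Attempt<T> {
        T call(String host) throws IOException;
    }

    /**
     * Tries each candidate in order and returns the first success.
     * If all candidates fail, throws one IOException that carries every
     * per-host failure as a suppressed exception.
     */
    static <T> T run(List<String> candidates, Attempt<T> attempt) throws IOException {
        IOException aggregate = null;
        for (String host : candidates) {
            try {
                return attempt.call(host);
            } catch (IOException e) {
                if (aggregate == null) {
                    aggregate = new IOException("all candidate hosts failed");
                }
                // Keep the host name next to its cause for diagnostics.
                aggregate.addSuppressed(new IOException(host + ": " + e.getMessage(), e));
            }
        }
        if (aggregate == null) {
            throw new IOException("no candidate hosts");
        }
        throw aggregate;
    }
}
```

In the real interceptor the attempt would presumably wrap `chain.proceed` on a re-signed request; here it is an arbitrary callback so the control flow can be read in isolation.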
Two modes share the pipeline:

  * backupEndpoint mode (legacy, opt-in):
      candidates = [origin, <service>.<backupEndpoint>]
      Eligible for any host the user configured, including
      region-pinned hosts and proxies.

  * TLD rotation mode (default):
      Rotate within the host's TLD family. Three families are recognised:
        - plain     tencentcloudapi.{com,cn,com.cn}
        - ai        ai.tencentcloudapi.{com,cn,com.cn}
        - internal  internal.tencentcloudapi.{com,cn,com.cn}
      Region-pinned hosts (cvm.ap-guangzhou.tencentcloudapi.com etc.)
      and unrecognised hosts skip TLD rotation, because failing them
      over would silently change the resolved region or send the
      request to a bogus alternate. Only the explicit backupEndpoint
      opt-in may override this.
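
The TLD rotation rules above could be sketched roughly as follows. `TldRotationSketch` and `candidatesFor` are hypothetical names, and the sketch assumes hosts have the shape `<service>.<family>.<tld>` with a single-label service prefix. It always puts the origin first and ignores the last-known-working preference described later in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; not the PR's host classifier. */
final class TldRotationSketch {
    private static final List<String> TLDS = Arrays.asList("com", "cn", "com.cn");

    // More specific families first, so "x.ai.tencentcloudapi.com" is not
    // mistaken for a plain-family host with a dotted prefix.
    private static final List<String> FAMILIES = Arrays.asList(
            "ai.tencentcloudapi", "internal.tencentcloudapi", "tencentcloudapi");

    /** Returns the hosts to try, origin first; a single entry means "no rotation". */
    static List<String> candidatesFor(String host) {
        for (String family : FAMILIES) {
            for (String tld : TLDS) {
                String suffix = "." + family + "." + tld;
                if (!host.endsWith(suffix)) {
                    continue;
                }
                String service = host.substring(0, host.length() - suffix.length());
                // Extra labels (e.g. "cvm.ap-guangzhou") mean a region-pinned host.
                if (service.isEmpty() || service.contains(".")) {
                    return Collections.singletonList(host);
                }
                List<String> out = new ArrayList<>();
                out.add(host);
                for (String other : TLDS) {
                    if (!other.equals(tld)) {
                        out.add(service + "." + family + "." + other);
                    }
                }
                return out;
            }
        }
        // Unrecognised host: never guess an alternate.
        return Collections.singletonList(host);
    }
}
```

For example, `candidatesFor("cvm.tencentcloudapi.com")` yields the `.com`, `.cn`, and `.com.cn` variants in that order, while `cvm.ap-guangzhou.tencentcloudapi.com` and `example.com` each come back unchanged.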
Failure classification
----------------------
A candidate counts as failed, and the next one is tried, when:

  Transport-level (chain.proceed throws):
    UnknownHostException, SSL{Handshake,PeerUnverified}Exception,
    ConnectException, NoRouteToHostException,
    PortUnreachableException, SocketTimeoutException

  Protocol-level (chain.proceed returns):
    - HTTP status != 200
    - Content-Type advertised as JSON but body is not a parseable
      JSON object/array (transparent-proxy block pages, ISP hijacks,
      poisoned bodies)

The body is buffered and the Response rebuilt so downstream parsing
sees a fresh body. SSE and binary responses skip the body check and
look at the status code only. Application errors raised inside the
SDK (signing, deserialisation) propagate immediately.
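
As a rough illustration of the classification above, the sketch below uses hypothetical names (`FailoverClassifierSketch` and its methods). The body check is a crude stand-in that only looks at the first and last non-whitespace characters; the interceptor is described as actually parsing the body, which this sketch does not do.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.PortUnreachableException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.Locale;
import javax.net.ssl.SSLHandshakeException;
import javax.net.ssl.SSLPeerUnverifiedException;

/** Hypothetical illustration only; not the PR's classifier. */
final class FailoverClassifierSketch {

    /** True for the transport exceptions listed above; anything else propagates. */
    static boolean isTransportFailure(Throwable e) {
        return e instanceof UnknownHostException
                || e instanceof SSLHandshakeException
                || e instanceof SSLPeerUnverifiedException
                || e instanceof ConnectException
                || e instanceof NoRouteToHostException
                || e instanceof PortUnreachableException
                || e instanceof SocketTimeoutException;
    }

    /** True when a returned response should still count as a failed candidate. */
    static boolean isProtocolFailure(int status, String contentType, String body) {
        if (status != 200) {
            return true;
        }
        boolean advertisedJson = contentType != null
                && contentType.toLowerCase(Locale.ROOT).contains("json");
        // SSE and binary content types fall through: status code only.
        return advertisedJson && !looksLikeJsonContainer(body);
    }

    // Crude stand-in for real parsing: checks the outer delimiters only.
    private static boolean looksLikeJsonContainer(String body) {
        String t = body == null ? "" : body.trim();
        if (t.isEmpty()) {
            return false;
        }
        char first = t.charAt(0);
        char last = t.charAt(t.length() - 1);
        return (first == '{' && last == '}') || (first == '[' && last == ']');
    }
}
```

An HTML block page served with `Content-Type: application/json` fails the delimiter check and is therefore treated as a failed candidate, which is the case the body check is meant to catch.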
Cost: a 4xx caused by a malformed user request is now attempted on
every candidate (up to 3 hosts in TLD rotation mode) before
surfacing. Accepted as a trade-off: at the interceptor layer
"user-error 4xx" and "proxy-block 4xx" are indistinguishable, and
the latter is the case worth defending.
Per-host circuit breakers
-------------------------
FailoverState holds a Map<host, CircuitBreaker> plus
preferred_tld_idx and origin_probe_after_ms. Successive failures
trip a host's breaker Open for 60 s; further attempts are skipped
without hitting the network. The last-known-working TLD is tried
first on subsequent calls; the user's original TLD is reprobed
once its cooldown elapses so traffic can return to it after recovery.
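
A per-host breaker along these lines might look like the sketch below. `HostBreakersSketch` is a hypothetical name; the failure threshold is an assumed parameter, since the text only says "successive failures", and the clock is injected so the 60 s window can be exercised without sleeping. Preferred-TLD tracking and origin reprobing are not modelled.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not the PR's FailoverState. */
final class HostBreakersSketch {

    private static final class Breaker {
        int consecutiveFailures;
        long openUntilMs; // 0 means closed
    }

    private final Map<String, Breaker> breakers = new ConcurrentHashMap<>();
    private final int failureThreshold; // assumed; not stated in the source
    private final long openDurationMs;  // 60_000 in the description above
    private final LongSupplier clockMs;

    HostBreakersSketch(int failureThreshold, long openDurationMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.openDurationMs = openDurationMs;
        this.clockMs = clockMs;
    }

    /** False while the host's breaker is open, so the caller skips the network. */
    boolean allows(String host) {
        Breaker b = breakers.get(host);
        if (b == null) {
            return true;
        }
        synchronized (b) {
            return clockMs.getAsLong() >= b.openUntilMs;
        }
    }

    void recordFailure(String host) {
        Breaker b = breakers.computeIfAbsent(host, h -> new Breaker());
        synchronized (b) {
            b.consecutiveFailures++;
            // The count is not reset on trip, so one failed probe after the
            // cooldown re-opens the breaker immediately.
            if (b.consecutiveFailures >= failureThreshold) {
                b.openUntilMs = clockMs.getAsLong() + openDurationMs;
            }
        }
    }

    void recordSuccess(String host) {
        Breaker b = breakers.get(host);
        if (b == null) {
            return;
        }
        synchronized (b) {
            b.consecutiveFailures = 0;
            b.openUntilMs = 0;
        }
    }
}
```

Constructing it as `new HostBreakersSketch(2, 60_000, System::currentTimeMillis)` would give a 60 s open window after two consecutive failures on the same host.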
State is scoped per AbstractClient instance, not process-global.
Callers wanting shared state across clients can reuse one client;
callers wanting isolation can construct multiple. This matches the
convention of AWS / Azure SDKs and resilience4j / Hystrix.
Re-signing
----------
Signing inputs are recovered from the outgoing Request and the
credential is read live from AbstractClient on every retry, so
STS / OIDC / CVM-role rotation is honoured between attempts.
Supports TC3-HMAC-SHA256, HmacSHA1, HmacSHA256, and the
"Authorization: SKIP" mode used by some streaming endpoints.
Configuration
-------------
  HttpProfile.setDomainFailover(boolean)    opt-out switch
  ClientProfile.setBackupEndpoint(String)   unchanged, now routed
                                            through the same pipeline
Backwards compatibility
-----------------------
The following public API on AbstractClient and SSEResponseModel
is retained as @deprecated no-ops / delegates so existing user code
continues to link:
AbstractClient.{get,set}RegionBreaker(...)
AbstractClient.processResponseSSE(resp, type, breakerToken)
AbstractClient.processResponseJson(resp, type, breakerToken)
SSEResponseModel.setToken(...)
Tests
-----
48 tests in EndpointFailoverInterceptorTest cover host classification,
each family's TLD rotation, region-pinned skip, backupEndpoint mode,
breaker open/close lifecycle, origin reprobe, aggregated failure,
non-200 / invalid-JSON triggering, credential rotation between
retries, and signing-mode-specific re-sign correctness.