-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Summary
v5.8.2 added DNS change detection to handle dynamic MDCB endpoints, but the implementation is inefficient and causes unnecessary retries. The code retries ALL network errors hoping DNS changed, instead of proactively checking if the DNS resolution actually changed.
Current (Inefficient) Approach
File: rpc/rpc_client.go:516-541 and rpc/dns_resolver.go:86-106
Flow
1. CallTimeout() fails → "Cannot obtain response during timeout"
2. isNetworkError() returns true (matches timeout pattern)
3. checkAndHandleDNSChange() called:
- Resolves DNS for connection string
- Compares with cached IP
- If same: returns (false, false) → Don't retry
- If different: reconnects and returns (true, true) → Retry
4. If DNS unchanged: Returns error → backoff retries 3 times
5. Each retry waits full 30-second timeout
6. Total: up to 90 seconds blocked
Problems
-
Assumes network error = DNS changed - Wrong! Network errors happen for many reasons:
- MDCB service is down
- Network partition
- Firewall blocking connection
- MDCB is overloaded
- Connection timeout
-
Only checks DNS AFTER failure - Reactive, not proactive
-
Doesn't prevent retries when DNS unchanged - Still retries 3 times × 30s each
-
Blocks request thread - User waits 90 seconds for nothing
Proposed Better Approach
Option 1: Proactive DNS Monitoring (Recommended)
Monitor DNS changes in background, not during request failures:
type DNSMonitor struct {
hostname string
currentIPs []net.IP
lastCheck time.Time
checkInterval time.Duration
onChange func(oldIPs, newIPs []net.IP)
}
func (m *DNSMonitor) Start(ctx context.Context) {
ticker := time.NewTicker(m.checkInterval) // e.g., every 30 seconds
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
m.checkDNS()
}
}
}
func (m *DNSMonitor) checkDNS() {
newIPs, err := net.LookupIP(m.hostname)
if err != nil {
log.WithError(err).Warning("DNS lookup failed")
return
}
if !m.ipsEqual(m.currentIPs, newIPs) {
log.WithFields(logrus.Fields{
"old_ips": m.currentIPs,
"new_ips": newIPs,
}).Info("DNS change detected, reconnecting RPC")
oldIPs := m.currentIPs
m.currentIPs = newIPs
if m.onChange != nil {
m.onChange(oldIPs, newIPs)
}
}
}RPC Client Integration:
func Connect(...) {
// ... existing connection code ...
// Start DNS monitor
dnsMonitor := &DNSMonitor{
hostname: parseHostname(connConfig.ConnectionString),
checkInterval: 30 * time.Second,
onChange: func(oldIPs, newIPs []net.IP) {
// Reconnect RPC client with new IPs
safeReconnectRPCClient(false)
},
}
go dnsMonitor.Start(ctx)
}Benefits:
- ✅ Detects DNS changes before they cause failures
- ✅ Automatic reconnection when IP changes
- ✅ No request blocking - all done in background
- ✅ Faster recovery time (30s vs wait for request to fail)
- ✅ Can configure check interval (e.g., 10s, 30s, 60s)
Option 2: Smarter Error Handling
If we must keep reactive approach, at least check DNS before retrying:
func FuncClientSingleton(funcName string, request interface{}) (result interface{}, err error) {
be := backoff.Retry(func() error {
if !values.ClientIsConnected() {
return ErrRPCIsDown
}
result, err = funcClientSingleton.CallTimeout(funcName, request, GlobalRPCCallTimeout)
if err != nil {
// Check if DNS actually changed
dnsChanged := checkDNSChange(values.Config().ConnectionString)
if dnsChanged {
// DNS changed! Reconnect and retry
safeReconnectRPCClient(false)
return nil // Tell backoff to retry
}
// DNS unchanged - fail fast, don't retry
return backoff.Permanent(err)
}
return nil
}, backoff.WithMaxRetries(
backoff.NewConstantBackOff(10*time.Millisecond), 3,
))
if be != nil {
err = be
}
return
}
func checkDNSChange(connectionString string) bool {
hostname := parseHostname(connectionString)
// Get current IPs
newIPs, err := net.LookupIP(hostname)
if err != nil {
return false // DNS lookup failed, don't retry
}
// Compare with cached IPs
cachedIPs := getCachedIPs(hostname)
if !ipsEqual(cachedIPs, newIPs) {
updateCachedIPs(hostname, newIPs)
return true
}
return false
}Benefits:
- ✅ Only retries if DNS actually changed
- ✅ Fails fast if DNS unchanged (~30ms instead of 90s)
- ✅ Uses
backoff.Permanent()correctly
Current Code Issues
Issue 1: Retries Even When DNS Unchanged
File: rpc/dns_resolver.go:86-106
func checkAndHandleDNSChange(connectionString string, suppressRegister bool) (dnsChanged bool, shouldRetry bool) {
// Skip if we've already checked DNS after an error
if values.GetDNSCheckedAfterError() || values.GetEmergencyMode() {
Log.Debug("Skipping DNS check - already checked or in emergency mode")
return false, false // ← Returns "don't retry" but...
}
// Mark that we've checked DNS
values.SetDNSCheckedAfterError(true)
// Check if DNS has changed and reconnect if needed
if changed := checkDNSAndReconnect(connectionString, suppressRegister); changed {
return true, true // DNS changed, should retry
}
// DNS hasn't changed
Log.Warning("MDCB connection failed. DNS unchanged - check MDCB service and network.")
return false, false // ← DNS unchanged, says "don't retry"
}File: rpc/rpc_client.go:524-530
if err != nil {
if handleRPCError(err, values.Config().ConnectionString) {
return nil // Retry
}
return err // ← But backoff.Retry DOES retry on error!
}Problem: Returning err tells backoff to retry, even though DNS didn't change!
Issue 2: Checks DNS on EVERY Network Error
File: rpc/utils.go:24-36
func isNetworkError(err error) bool {
if err == nil {
return false
}
errStr := err.Error()
// Check for specific network-related error patterns
return strings.Contains(errStr, "unexpected response type: <nil>. Expected *dispatcherResponse") ||
strings.Contains(errStr, "Cannot obtain response during timeout") ||
strings.Contains(errStr, "rpc is either down or was not configured") ||
strings.Contains(errStr, "Cannot decode response")
}Problem: Treats ALL these errors as "maybe DNS changed", when most of them are NOT DNS-related:
"Cannot obtain response during timeout"→ MDCB is down or slow, not DNS!"rpc is either down or was not configured"→ Configuration issue, not DNS!"Cannot decode response"→ Protocol issue, not DNS!
Only legitimate DNS-related errors:
dial tcp: lookup mdcb.example.com: no such hostdial tcp: lookup mdcb.example.com: i/o timeout
Impact
Current Behavior (v5.8.6)
When MDCB is down:
- Request calls RPC → timeout after 30s
- Checks DNS → unchanged
- Returns error → backoff retries
- Retry Fixed some typos #2 → timeout after 30s
- Returns error → backoff retries
- Retry Fixed small typo :) #3 → timeout after 30s
- Total: 90 seconds blocked
With Option 1 (Proactive Monitoring)
When MDCB is down:
- Background DNS monitor detects MDCB hostname still resolves to same IP
- Request calls RPC → timeout after 30s
- Checks if DNS changed recently (background monitor says no)
- Returns
backoff.Permanent(err)→ no retries - Total: 30 seconds blocked (just 1 attempt)
When DNS actually changes:
- Background monitor detects IP change every 30s
- Automatically reconnects RPC client
- Requests use new connection immediately
- Zero downtime for DNS changes
With Option 2 (Smarter Error Handling)
When MDCB is down:
- Request calls RPC → timeout after 30s
- Checks DNS inline → unchanged
- Returns
backoff.Permanent(err)→ no retries - Total: 30 seconds blocked (just 1 attempt)
When DNS actually changes:
- Request calls RPC → timeout after 30s
- Checks DNS inline → changed!
- Reconnects and retries → success
- Total: 30-60 seconds (1-2 attempts)
Recommended Solution
Implement Option 1 (Proactive DNS Monitoring)
Why:
- Detects DNS changes immediately (within check interval)
- No request blocking - all done in background
- Automatic recovery without waiting for failures
- Configurable check interval
- Clean separation of concerns
Implementation:
- Add
DNSMonitorcomponent torpc/dns_monitor.go - Start monitor in
rpc.Connect() - Monitor calls
safeReconnectRPCClient()when IP changes - Remove reactive DNS check from
handleRPCError() - Use
backoff.Permanent()for non-retryable errors
Configuration:
{
"slave_options": {
"use_rpc": true,
"connection_string": "mdcb.example.com:9091",
"dns_monitor": {
"enabled": true,
"check_interval": 30
}
}
}Testing Plan
Test 1: DNS Change Detection
func TestDNSMonitorDetectsChange(t *testing.T) {
// Start with DNS pointing to 192.168.1.10
monitor := &DNSMonitor{
hostname: "mdcb.local",
checkInterval: 1 * time.Second,
}
changedChan := make(chan bool)
monitor.onChange = func(old, new []net.IP) {
changedChan <- true
}
go monitor.Start(context.Background())
// Change DNS to point to 192.168.1.20
updateDNS("mdcb.local", "192.168.1.20")
// Wait for detection (should be within check interval)
select {
case <-changedChan:
// Success!
case <-time.After(5 * time.Second):
t.Fatal("DNS change not detected")
}
}Test 2: No Unnecessary Retries
func TestNoRetriesWhenDNSUnchanged(t *testing.T) {
// Stop MDCB
mdcb.Stop()
// Make request
start := time.Now()
_, err := FuncClientSingleton("GetSession", "org123")
duration := time.Since(start)
// Should fail after 1 attempt only
assert.Error(t, err)
assert.Less(t, duration, 35*time.Second, "Should not retry")
}Test 3: Automatic Reconnection on DNS Change
func TestAutoReconnectOnDNSChange(t *testing.T) {
// Start with MDCB at 192.168.1.10
Connect(...)
// Make successful request
_, err := FuncClientSingleton("Login", apiKey)
assert.NoError(t, err)
// Move MDCB to 192.168.1.20 and update DNS
mdcb.MoveTo("192.168.1.20")
updateDNS("mdcb.local", "192.168.1.20")
// Wait for monitor to detect and reconnect
time.Sleep(2 * time.Second)
// Next request should work
_, err = FuncClientSingleton("GetSession", "org123")
assert.NoError(t, err)
}Related Issues
- Request pipeline blocked by synchronous RPC calls every 10 minutes when MDCB unavailable #7428 - Blocking RPC calls in request path
- EnforceOrgDataAge forcibly enabled in RPC mode, breaking analytics when MDCB unavailable #7429 - EnforceOrgDataAge configuration issue
- Blocking RPC calls in critical request path cause request latency and timeouts #7432 - Architectural problem of blocking RPC calls
This issue specifically addresses the inefficient DNS change detection that makes #7432 worse by causing unnecessary retries.
Migration Path
- v5.8.x patch: Use
backoff.Permanent()to prevent retries when DNS unchanged (quick fix) - v5.9.0: Implement proactive DNS monitoring
- v5.9.x: Add configuration options for DNS monitor
References
Current Implementation Files
rpc/rpc_client.go:516-541- RPC call with error handlingrpc/rpc_client.go:543-561- handleRPCError functionrpc/utils.go:24-36- isNetworkError (too broad)rpc/dns_resolver.go:86-106- checkAndHandleDNSChange (reactive)
New Files Needed
rpc/dns_monitor.go- Proactive DNS monitoring component
Bottom line: The current approach is reactive and inefficient. We should monitor DNS changes proactively in the background, not during request failures. This will eliminate unnecessary retries and provide faster recovery when DNS actually changes.