Inefficient DNS change detection causes unnecessary RPC retries and request blocking #7433

@buger

Description

Summary

v5.8.2 added DNS change detection to handle dynamic MDCB endpoints, but the implementation is inefficient and causes unnecessary retries. The code retries ALL network errors hoping DNS changed, instead of proactively checking if the DNS resolution actually changed.

Current (Inefficient) Approach

File: rpc/rpc_client.go:516-541 and rpc/dns_resolver.go:86-106

Flow

1. CallTimeout() fails → "Cannot obtain response during timeout"
2. isNetworkError() returns true (matches timeout pattern)
3. checkAndHandleDNSChange() called:
   - Resolves DNS for connection string
   - Compares with cached IP
   - If same: returns (false, false) → Don't retry
   - If different: reconnects and returns (true, true) → Retry
4. If DNS unchanged: Returns error → backoff retries 3 times
5. Each retry waits full 30-second timeout
6. Total: up to 90 seconds blocked

Problems

  1. Assumes network error = DNS changed - Wrong! Network errors happen for many reasons:

    • MDCB service is down
    • Network partition
    • Firewall blocking connection
    • MDCB is overloaded
    • Connection timeout
  2. Only checks DNS AFTER failure - Reactive, not proactive

  3. Doesn't prevent retries when DNS unchanged - Still retries 3 times × 30s each

  4. Blocks request thread - User waits 90 seconds for nothing

Proposed Better Approach

Option 1: Proactive DNS Monitoring (Recommended)

Monitor DNS changes in the background, not during request failures:

type DNSMonitor struct {
    hostname       string
    currentIPs     []net.IP
    lastCheck      time.Time
    checkInterval  time.Duration
    onChange       func(oldIPs, newIPs []net.IP)
}

func (m *DNSMonitor) Start(ctx context.Context) {
    ticker := time.NewTicker(m.checkInterval)  // e.g., every 30 seconds
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            m.checkDNS()
        }
    }
}

func (m *DNSMonitor) checkDNS() {
    newIPs, err := net.LookupIP(m.hostname)
    if err != nil {
        log.WithError(err).Warning("DNS lookup failed")
        return
    }
    
    if !m.ipsEqual(m.currentIPs, newIPs) {
        log.WithFields(logrus.Fields{
            "old_ips": m.currentIPs,
            "new_ips": newIPs,
        }).Info("DNS change detected, reconnecting RPC")
        
        oldIPs := m.currentIPs
        m.currentIPs = newIPs
        
        if m.onChange != nil {
            m.onChange(oldIPs, newIPs)
        }
    }
}

RPC Client Integration:

func Connect(...) {
    // ... existing connection code ...
    
    // Start DNS monitor
    dnsMonitor := &DNSMonitor{
        hostname: parseHostname(connConfig.ConnectionString),
        checkInterval: 30 * time.Second,
        onChange: func(oldIPs, newIPs []net.IP) {
            // Reconnect RPC client with new IPs
            safeReconnectRPCClient(false)
        },
    }
    go dnsMonitor.Start(ctx)
}

Benefits:

  • ✅ Detects DNS changes before they cause failures
  • ✅ Automatic reconnection when IP changes
  • ✅ No request blocking - all done in background
  • ✅ Faster recovery time (30s vs wait for request to fail)
  • ✅ Can configure check interval (e.g., 10s, 30s, 60s)

Option 2: Smarter Error Handling

If we must keep the reactive approach, at least check whether DNS actually changed before retrying:

func FuncClientSingleton(funcName string, request interface{}) (result interface{}, err error) {
    be := backoff.Retry(func() error {
        if !values.ClientIsConnected() {
            return ErrRPCIsDown
        }
        
        result, err = funcClientSingleton.CallTimeout(funcName, request, GlobalRPCCallTimeout)
        
        if err != nil {
            // Check if DNS actually changed
            dnsChanged := checkDNSChange(values.Config().ConnectionString)
            
            if dnsChanged {
                // DNS changed! Reconnect and retry
                safeReconnectRPCClient(false)
                return nil  // Tell backoff to retry
            }
            
            // DNS unchanged - fail fast, don't retry
            return backoff.Permanent(err)
        }
        
        return nil
    }, backoff.WithMaxRetries(
        backoff.NewConstantBackOff(10*time.Millisecond), 3,
    ))
    if be != nil {
        err = be
    }
    return
}

func checkDNSChange(connectionString string) bool {
    hostname := parseHostname(connectionString)
    
    // Get current IPs
    newIPs, err := net.LookupIP(hostname)
    if err != nil {
        return false  // DNS lookup failed, don't retry
    }
    
    // Compare with cached IPs
    cachedIPs := getCachedIPs(hostname)
    if !ipsEqual(cachedIPs, newIPs) {
        updateCachedIPs(hostname, newIPs)
        return true
    }
    
    return false
}

Benefits:

  • ✅ Only retries if DNS actually changed
  • ✅ Fails fast if DNS unchanged (~30ms instead of 90s)
  • ✅ Uses backoff.Permanent() correctly

Current Code Issues

Issue 1: Retries Even When DNS Unchanged

File: rpc/dns_resolver.go:86-106

func checkAndHandleDNSChange(connectionString string, suppressRegister bool) (dnsChanged bool, shouldRetry bool) {
    // Skip if we've already checked DNS after an error
    if values.GetDNSCheckedAfterError() || values.GetEmergencyMode() {
        Log.Debug("Skipping DNS check - already checked or in emergency mode")
        return false, false  // ← Returns "don't retry" but...
    }
    
    // Mark that we've checked DNS
    values.SetDNSCheckedAfterError(true)
    
    // Check if DNS has changed and reconnect if needed
    if changed := checkDNSAndReconnect(connectionString, suppressRegister); changed {
        return true, true // DNS changed, should retry
    }
    
    // DNS hasn't changed
    Log.Warning("MDCB connection failed. DNS unchanged - check MDCB service and network.")
    return false, false  // ← DNS unchanged, says "don't retry"
}

File: rpc/rpc_client.go:524-530

if err != nil {
    if handleRPCError(err, values.Config().ConnectionString) {
        return nil  // Retry
    }
    return err  // ← But backoff.Retry DOES retry on error!
}

Problem: Returning err tells backoff to retry, even though DNS didn't change!

Issue 2: Checks DNS on EVERY Network Error

File: rpc/utils.go:24-36

func isNetworkError(err error) bool {
    if err == nil {
        return false
    }
    
    errStr := err.Error()
    
    // Check for specific network-related error patterns
    return strings.Contains(errStr, "unexpected response type: <nil>. Expected *dispatcherResponse") ||
        strings.Contains(errStr, "Cannot obtain response during timeout") ||
        strings.Contains(errStr, "rpc is either down or was not configured") ||
        strings.Contains(errStr, "Cannot decode response")
}

Problem: Treats ALL these errors as "maybe DNS changed", when most of them are NOT DNS-related:

  • "Cannot obtain response during timeout" → MDCB is down or slow, not DNS!
  • "rpc is either down or was not configured" → Configuration issue, not DNS!
  • "Cannot decode response" → Protocol issue, not DNS!

Only legitimate DNS-related errors:

  • dial tcp: lookup mdcb.example.com: no such host
  • dial tcp: lookup mdcb.example.com: i/o timeout

Impact

Current Behavior (v5.8.6)

When MDCB is down:

  1. Request calls RPC → timeout after 30s
  2. Checks DNS → unchanged
  3. Returns error → backoff retries
  4. Retry #2 → timeout after 30s
  5. Returns error → backoff retries
  6. Retry #3 → timeout after 30s
  7. Total: 90 seconds blocked

With Option 1 (Proactive Monitoring)

When MDCB is down:

  1. Background DNS monitor detects MDCB hostname still resolves to same IP
  2. Request calls RPC → timeout after 30s
  3. Checks if DNS changed recently (background monitor says no)
  4. Returns backoff.Permanent(err) → no retries
  5. Total: 30 seconds blocked (just 1 attempt)

When DNS actually changes:

  1. Background monitor detects IP change every 30s
  2. Automatically reconnects RPC client
  3. Requests use new connection immediately
  4. Zero downtime for DNS changes

With Option 2 (Smarter Error Handling)

When MDCB is down:

  1. Request calls RPC → timeout after 30s
  2. Checks DNS inline → unchanged
  3. Returns backoff.Permanent(err) → no retries
  4. Total: 30 seconds blocked (just 1 attempt)

When DNS actually changes:

  1. Request calls RPC → timeout after 30s
  2. Checks DNS inline → changed!
  3. Reconnects and retries → success
  4. Total: 30-60 seconds (1-2 attempts)

Recommended Solution

Implement Option 1 (Proactive DNS Monitoring)

Why:

  • Detects DNS changes immediately (within check interval)
  • No request blocking - all done in background
  • Automatic recovery without waiting for failures
  • Configurable check interval
  • Clean separation of concerns

Implementation:

  1. Add DNSMonitor component to rpc/dns_monitor.go
  2. Start monitor in rpc.Connect()
  3. Monitor calls safeReconnectRPCClient() when IP changes
  4. Remove reactive DNS check from handleRPCError()
  5. Use backoff.Permanent() for non-retryable errors

Configuration:

{
  "slave_options": {
    "use_rpc": true,
    "connection_string": "mdcb.example.com:9091",
    "dns_monitor": {
      "enabled": true,
      "check_interval": 30
    }
  }
}

Testing Plan

Test 1: DNS Change Detection

func TestDNSMonitorDetectsChange(t *testing.T) {
    // Start with DNS pointing to 192.168.1.10
    monitor := &DNSMonitor{
        hostname: "mdcb.local",
        checkInterval: 1 * time.Second,
    }
    
    changedChan := make(chan bool)
    monitor.onChange = func(old, new []net.IP) {
        changedChan <- true
    }
    
    go monitor.Start(context.Background())
    
    // Change DNS to point to 192.168.1.20
    updateDNS("mdcb.local", "192.168.1.20")
    
    // Wait for detection (should be within check interval)
    select {
    case <-changedChan:
        // Success!
    case <-time.After(5 * time.Second):
        t.Fatal("DNS change not detected")
    }
}

Test 2: No Unnecessary Retries

func TestNoRetriesWhenDNSUnchanged(t *testing.T) {
    // Stop MDCB
    mdcb.Stop()
    
    // Make request
    start := time.Now()
    _, err := FuncClientSingleton("GetSession", "org123")
    duration := time.Since(start)
    
    // Should fail after 1 attempt only
    assert.Error(t, err)
    assert.Less(t, duration, 35*time.Second, "Should not retry")
}

Test 3: Automatic Reconnection on DNS Change

func TestAutoReconnectOnDNSChange(t *testing.T) {
    // Start with MDCB at 192.168.1.10
    Connect(...)
    
    // Make successful request
    _, err := FuncClientSingleton("Login", apiKey)
    assert.NoError(t, err)
    
    // Move MDCB to 192.168.1.20 and update DNS
    mdcb.MoveTo("192.168.1.20")
    updateDNS("mdcb.local", "192.168.1.20")
    
    // Wait for monitor to detect and reconnect
    time.Sleep(2 * time.Second)
    
    // Next request should work
    _, err = FuncClientSingleton("GetSession", "org123")
    assert.NoError(t, err)
}

Related Issues

This issue specifically addresses the inefficient DNS change detection that makes #7432 worse by causing unnecessary retries.

Migration Path

  1. v5.8.x patch: Use backoff.Permanent() to prevent retries when DNS unchanged (quick fix)
  2. v5.9.0: Implement proactive DNS monitoring
  3. v5.9.x: Add configuration options for DNS monitor

References

Current Implementation Files

  • rpc/rpc_client.go:516-541 - RPC call with error handling
  • rpc/rpc_client.go:543-561 - handleRPCError function
  • rpc/utils.go:24-36 - isNetworkError (too broad)
  • rpc/dns_resolver.go:86-106 - checkAndHandleDNSChange (reactive)

New Files Needed

  • rpc/dns_monitor.go - Proactive DNS monitoring component

Bottom line: The current approach is reactive and inefficient. We should monitor DNS changes proactively in the background, not during request failures. This will eliminate unnecessary retries and provide faster recovery when DNS actually changes.
