Inefficient DNS change detection causes unnecessary RPC retries and request blocking #7433

@buger

Description

Summary

v5.8.2 added DNS change detection to handle dynamic MDCB endpoints, but the implementation is inefficient and causes unnecessary retries. The code retries ALL network errors hoping DNS changed, instead of proactively checking if the DNS resolution actually changed.

Current (Inefficient) Approach

File: rpc/rpc_client.go:516-541 and rpc/dns_resolver.go:86-106

Flow

1. CallTimeout() fails → "Cannot obtain response during timeout"
2. isNetworkError() returns true (matches timeout pattern)
3. checkAndHandleDNSChange() called:
   - Resolves DNS for connection string
   - Compares with cached IP
   - If same: returns (false, false) → Don't retry
   - If different: reconnects and returns (true, true) → Retry
4. If DNS unchanged: Returns error → backoff retries 3 times
5. Each retry waits full 30-second timeout
6. Total: up to 90 seconds blocked

Problems

  1. Assumes network error = DNS changed - Wrong! Network errors happen for many reasons:

    • MDCB service is down
    • Network partition
    • Firewall blocking connection
    • MDCB is overloaded
    • Connection timeout
  2. Only checks DNS AFTER failure - Reactive, not proactive

  3. Doesn't prevent retries when DNS unchanged - Still retries 3 times × 30s each

  4. Blocks request thread - User waits 90 seconds for nothing

Proposed Better Approach

Option 1: Proactive DNS Monitoring (Recommended)

Monitor DNS changes in the background, not during request failures:

type DNSMonitor struct {
    hostname       string
    currentIPs     []net.IP
    lastCheck      time.Time
    checkInterval  time.Duration
    onChange       func(oldIPs, newIPs []net.IP)
}

func (m *DNSMonitor) Start(ctx context.Context) {
    ticker := time.NewTicker(m.checkInterval)  // e.g., every 30 seconds
    defer ticker.Stop()
    
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            m.checkDNS()
        }
    }
}

func (m *DNSMonitor) checkDNS() {
    newIPs, err := net.LookupIP(m.hostname)
    if err != nil {
        log.WithError(err).Warning("DNS lookup failed")
        return
    }
    
    if !m.ipsEqual(m.currentIPs, newIPs) {
        log.WithFields(logrus.Fields{
            "old_ips": m.currentIPs,
            "new_ips": newIPs,
        }).Info("DNS change detected, reconnecting RPC")
        
        oldIPs := m.currentIPs
        m.currentIPs = newIPs
        
        if m.onChange != nil {
            m.onChange(oldIPs, newIPs)
        }
    }
}

RPC Client Integration:

func Connect(...) {
    // ... existing connection code ...
    
    // Start DNS monitor
    dnsMonitor := &DNSMonitor{
        hostname: parseHostname(connConfig.ConnectionString),
        checkInterval: 30 * time.Second,
        onChange: func(oldIPs, newIPs []net.IP) {
            // Reconnect RPC client with new IPs
            safeReconnectRPCClient(false)
        },
    }
    go dnsMonitor.Start(ctx)
}

Benefits:

  • ✅ Detects DNS changes before they cause failures
  • ✅ Automatic reconnection when IP changes
  • ✅ No request blocking - all done in background
  • ✅ Faster recovery time (30s vs wait for request to fail)
  • ✅ Can configure check interval (e.g., 10s, 30s, 60s)

Option 2: Smarter Error Handling

If we must keep the reactive approach, at least check whether DNS actually changed before retrying:

func FuncClientSingleton(funcName string, request interface{}) (result interface{}, err error) {
    be := backoff.Retry(func() error {
        if !values.ClientIsConnected() {
            return ErrRPCIsDown
        }
        
        result, err = funcClientSingleton.CallTimeout(funcName, request, GlobalRPCCallTimeout)
        
        if err != nil {
            // Check if DNS actually changed
            dnsChanged := checkDNSChange(values.Config().ConnectionString)
            
            if dnsChanged {
                // DNS changed! Reconnect and retry
                safeReconnectRPCClient(false)
                return nil  // Tell backoff to retry
            }
            
            // DNS unchanged - fail fast, don't retry
            return backoff.Permanent(err)
        }
        
        return nil
    }, backoff.WithMaxRetries(
        backoff.NewConstantBackOff(10*time.Millisecond), 3,
    ))
    if be != nil {
        err = be
    }
    return
}

func checkDNSChange(connectionString string) bool {
    hostname := parseHostname(connectionString)
    
    // Get current IPs
    newIPs, err := net.LookupIP(hostname)
    if err != nil {
        return false  // DNS lookup failed, don't retry
    }
    
    // Compare with cached IPs
    cachedIPs := getCachedIPs(hostname)
    if !ipsEqual(cachedIPs, newIPs) {
        updateCachedIPs(hostname, newIPs)
        return true
    }
    
    return false
}

Benefits:

  • ✅ Only retries if DNS actually changed
  • ✅ Fails fast if DNS unchanged (~30ms instead of 90s)
  • ✅ Uses backoff.Permanent() correctly

Current Code Issues

Issue 1: Retries Even When DNS Unchanged

File: rpc/dns_resolver.go:86-106

func checkAndHandleDNSChange(connectionString string, suppressRegister bool) (dnsChanged bool, shouldRetry bool) {
    // Skip if we've already checked DNS after an error
    if values.GetDNSCheckedAfterError() || values.GetEmergencyMode() {
        Log.Debug("Skipping DNS check - already checked or in emergency mode")
        return false, false  // ← Returns "don't retry" but...
    }
    
    // Mark that we've checked DNS
    values.SetDNSCheckedAfterError(true)
    
    // Check if DNS has changed and reconnect if needed
    if changed := checkDNSAndReconnect(connectionString, suppressRegister); changed {
        return true, true // DNS changed, should retry
    }
    
    // DNS hasn't changed
    Log.Warning("MDCB connection failed. DNS unchanged - check MDCB service and network.")
    return false, false  // ← DNS unchanged, says "don't retry"
}

File: rpc/rpc_client.go:524-530

if err != nil {
    if handleRPCError(err, values.Config().ConnectionString) {
        return nil  // Retry
    }
    return err  // ← But backoff.Retry DOES retry on error!
}

Problem: Returning err tells backoff to retry, even though DNS didn't change!

Issue 2: Checks DNS on EVERY Network Error

File: rpc/utils.go:24-36

func isNetworkError(err error) bool {
    if err == nil {
        return false
    }
    
    errStr := err.Error()
    
    // Check for specific network-related error patterns
    return strings.Contains(errStr, "unexpected response type: <nil>. Expected *dispatcherResponse") ||
        strings.Contains(errStr, "Cannot obtain response during timeout") ||
        strings.Contains(errStr, "rpc is either down or was not configured") ||
        strings.Contains(errStr, "Cannot decode response")
}

Problem: Treats ALL these errors as "maybe DNS changed", when most of them are NOT DNS-related:

  • "Cannot obtain response during timeout" → MDCB is down or slow, not DNS!
  • "rpc is either down or was not configured" → Configuration issue, not DNS!
  • "Cannot decode response" → Protocol issue, not DNS!

Only legitimate DNS-related errors:

  • dial tcp: lookup mdcb.example.com: no such host
  • dial tcp: lookup mdcb.example.com: i/o timeout

Impact

Current Behavior (v5.8.6)

When MDCB is down:

  1. Request calls RPC → timeout after 30s
  2. Checks DNS → unchanged
  3. Returns error → backoff retries
  4. Retry #2 → timeout after 30s
  5. Returns error → backoff retries
  6. Retry #3 → timeout after 30s
  7. Total: 90 seconds blocked

With Option 1 (Proactive Monitoring)

When MDCB is down:

  1. Background DNS monitor detects MDCB hostname still resolves to same IP
  2. Request calls RPC → timeout after 30s
  3. Checks if DNS changed recently (background monitor says no)
  4. Returns backoff.Permanent(err) → no retries
  5. Total: 30 seconds blocked (just 1 attempt)

When DNS actually changes:

  1. Background monitor detects IP change every 30s
  2. Automatically reconnects RPC client
  3. Requests use new connection immediately
  4. Zero downtime for DNS changes

With Option 2 (Smarter Error Handling)

When MDCB is down:

  1. Request calls RPC → timeout after 30s
  2. Checks DNS inline → unchanged
  3. Returns backoff.Permanent(err) → no retries
  4. Total: 30 seconds blocked (just 1 attempt)

When DNS actually changes:

  1. Request calls RPC → timeout after 30s
  2. Checks DNS inline → changed!
  3. Reconnects and retries → success
  4. Total: 30-60 seconds (1-2 attempts)

Recommended Solution

Implement Option 1 (Proactive DNS Monitoring)

Why:

  • Detects DNS changes immediately (within check interval)
  • No request blocking - all done in background
  • Automatic recovery without waiting for failures
  • Configurable check interval
  • Clean separation of concerns

Implementation:

  1. Add DNSMonitor component to rpc/dns_monitor.go
  2. Start monitor in rpc.Connect()
  3. Monitor calls safeReconnectRPCClient() when IP changes
  4. Remove reactive DNS check from handleRPCError()
  5. Use backoff.Permanent() for non-retryable errors

Configuration:

{
  "slave_options": {
    "use_rpc": true,
    "connection_string": "mdcb.example.com:9091",
    "dns_monitor": {
      "enabled": true,
      "check_interval": 30
    }
  }
}

Testing Plan

Test 1: DNS Change Detection

func TestDNSMonitorDetectsChange(t *testing.T) {
    // Start with DNS pointing to 192.168.1.10
    monitor := &DNSMonitor{
        hostname: "mdcb.local",
        checkInterval: 1 * time.Second,
    }
    
    changedChan := make(chan bool)
    monitor.onChange = func(old, new []net.IP) {
        changedChan <- true
    }
    
    go monitor.Start(context.Background())
    
    // Change DNS to point to 192.168.1.20
    updateDNS("mdcb.local", "192.168.1.20")
    
    // Wait for detection (should be within check interval)
    select {
    case <-changedChan:
        // Success!
    case <-time.After(5 * time.Second):
        t.Fatal("DNS change not detected")
    }
}

Test 2: No Unnecessary Retries

func TestNoRetriesWhenDNSUnchanged(t *testing.T) {
    // Stop MDCB
    mdcb.Stop()
    
    // Make request
    start := time.Now()
    _, err := FuncClientSingleton("GetSession", "org123")
    duration := time.Since(start)
    
    // Should fail after 1 attempt only
    assert.Error(t, err)
    assert.Less(t, duration, 35*time.Second, "Should not retry")
}

Test 3: Automatic Reconnection on DNS Change

func TestAutoReconnectOnDNSChange(t *testing.T) {
    // Start with MDCB at 192.168.1.10
    Connect(...)
    
    // Make successful request
    _, err := FuncClientSingleton("Login", apiKey)
    assert.NoError(t, err)
    
    // Move MDCB to 192.168.1.20 and update DNS
    mdcb.MoveTo("192.168.1.20")
    updateDNS("mdcb.local", "192.168.1.20")
    
    // Wait for monitor to detect and reconnect
    time.Sleep(2 * time.Second)
    
    // Next request should work
    _, err = FuncClientSingleton("GetSession", "org123")
    assert.NoError(t, err)
}

Related Issues

This issue specifically addresses the inefficient DNS change detection that makes #7432 worse by causing unnecessary retries.

Migration Path

  1. v5.8.x patch: Use backoff.Permanent() to prevent retries when DNS unchanged (quick fix)
  2. v5.9.0: Implement proactive DNS monitoring
  3. v5.9.x: Add configuration options for DNS monitor

References

Current Implementation Files

  • rpc/rpc_client.go:516-541 - RPC call with error handling
  • rpc/rpc_client.go:543-561 - handleRPCError function
  • rpc/utils.go:24-36 - isNetworkError (too broad)
  • rpc/dns_resolver.go:86-106 - checkAndHandleDNSChange (reactive)

New Files Needed

  • rpc/dns_monitor.go - Proactive DNS monitoring component

Bottom line: The current approach is reactive and inefficient. We should monitor DNS changes proactively in the background, not during request failures. This will eliminate unnecessary retries and provide faster recovery when DNS actually changes.
