| 1 | +## Bug: PR #17171 introduces circular import deadlock that prevents proxy server from starting |
| 2 | + |
| 3 | +### Description |
| 4 | + |
| 5 | +PR #17171 ("Lazy-load utils to reduce memory + import time") introduced a circular import deadlock that prevents the FastAPI proxy server from starting. This causes health check failures in containerized deployments (ECS, Kubernetes, etc.) because Uvicorn never initializes the HTTP server. |
| 6 | + |
| 7 | +### Symptoms |
| 8 | + |
| 9 | +- ✅ Container starts successfully |
| 10 | +- ✅ Database migrations complete |
| 11 | +- ✅ APScheduler background jobs start |
| 12 | +- ❌ **Uvicorn HTTP server never starts** (no "Started server process" or "Uvicorn running" logs) |
| 13 | +- ❌ Health check endpoints (`/health`, `/health/liveness`) unreachable |
| 14 | +- ❌ Container marked as UNHEALTHY and terminated after repeated failures |
| 15 | + |
| 16 | +### Root Cause |
| 17 | + |
| 18 | +The lazy loading system in `litellm/__init__.py` creates a circular import deadlock: |
| 19 | + |
| 20 | +```text |
| 21 | +# Import chain that causes the deadlock: |
| 22 | + |
| 23 | +1. proxy_server.py:56 → from litellm.utils import load_credentials_from_list |
| 24 | + ↓ |
| 25 | +2. utils.py:56 → import litellm |
| 26 | + ↓ |
| 27 | +3. litellm/__init__.py → Sets up __getattr__ for lazy loading |
| 28 | + ↓ |
| 29 | +4. [later] Code accesses litellm.ModelResponse |
| 30 | + ↓ |
| 31 | +5. __getattr__("ModelResponse") → _lazy_import_utils("ModelResponse") |
| 32 | + ↓ |
| 33 | +6. Tries: from .utils import ModelResponse |
| 34 | + ↓ |
| 35 | +7. DEADLOCK: utils.py is still being imported from step 2! |
| 36 | +``` |
| 37 | + |
| 38 | +**Why this hangs:** When `proxy_server.py` imports from `litellm.utils` before `litellm` has finished initializing, Python starts loading `utils.py`. When `utils.py` then runs `import litellm`, the package sets up lazy loading via `__getattr__`. Later, when code accesses `litellm.ModelResponse`, the `__getattr__` handler tries to import from `utils` again, but `utils.py` is still being loaded from step 2, so the import waits indefinitely. |
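Steps 3 to 6 rely on a module-level `__getattr__` hook (PEP 562). The sketch below shows that mechanism in isolation on an in-memory module. The names `lazy_demo`, `_LAZY_EXPORTS`, and `resolved` are hypothetical and chosen for illustration; this is not the code in litellm's `_lazy_imports.py`.

```python
import importlib
import types

# Hypothetical stand-in for a package that exposes names lazily.
lazy_demo = types.ModuleType("lazy_demo")

# Maps an exported name to the module that really defines it (illustrative only).
_LAZY_EXPORTS = {"dumps": "json", "OrderedDict": "collections"}

resolved = []  # records each name that went through the lazy path


def _lazy_getattr(name):
    """Called only when normal attribute lookup on the module fails (PEP 562)."""
    source = _LAZY_EXPORTS.get(name)
    if source is None:
        raise AttributeError(f"module 'lazy_demo' has no attribute {name!r}")
    resolved.append(name)
    value = getattr(importlib.import_module(source), name)
    setattr(lazy_demo, name, value)  # cache so later lookups skip the hook
    return value


# Binding __getattr__ in the module namespace enables the hook.
lazy_demo.__getattr__ = _lazy_getattr

print(lazy_demo.dumps([1, 2]))  # first access triggers the deferred import
print(lazy_demo.dumps([3]))     # second access is served from the cached attribute
print(resolved)                 # the hook ran once, for "dumps"
```

The import inside the hook runs at attribute-access time rather than at package import time, which is how step 5 can reach back into `utils` while that module is still loading.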
| 39 | + |
| 40 | +### Affected Files |
| 41 | + |
| 42 | +The circular dependency involves: |
| 43 | +- `litellm/__init__.py` (added `__getattr__` lazy loading) |
| 44 | +- `litellm/_lazy_imports.py` (new file, handles deferred imports) |
| 45 | +- `litellm/utils.py` (imports `litellm` at line 56) |
| 46 | +- `litellm/proxy/proxy_server.py` (imports from `litellm.utils` at line 56) |
| 47 | + |
| 48 | +### Reproduction |
| 49 | + |
| 50 | +1. Deploy litellm proxy with commit `56328e6535` or later |
| 51 | +2. Start the container |
| 52 | +3. Observe that migrations complete but Uvicorn never starts |
| 53 | +4. Health checks fail → container terminated |
| 54 | + |
| 55 | +Or, locally: |
| 56 | + |
| 57 | +```bash |
| 58 | +# This will hang indefinitely with the lazy loading: |
| 59 | +python -m litellm.proxy.proxy_cli --port 4000 |
| 60 | +``` |
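To find where startup stalls, one option is to wrap the import in a `faulthandler` watchdog that dumps every thread's stack if the import does not finish in time. This is a diagnostic sketch only: `import_with_watchdog` and the `import_hang.log` file name are hypothetical, and the example call uses a standard-library module so it can run anywhere.

```python
import faulthandler
import importlib
import os
import tempfile


def import_with_watchdog(module_name, timeout_s=30.0, log_path=None):
    """Import module_name; if it takes longer than timeout_s, dump all thread stacks to log_path."""
    if log_path is None:
        # Hypothetical default location for the stack dump.
        log_path = os.path.join(tempfile.gettempdir(), "import_hang.log")
    with open(log_path, "w") as log:
        # faulthandler writes the traceback from a watchdog thread, so it still
        # fires when the main thread is blocked inside the import.
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            return importlib.import_module(module_name)
        finally:
            faulthandler.cancel_dump_traceback_later()


# A module that imports normally returns well before the timeout fires.
module = import_with_watchdog("json", timeout_s=5.0)
print(module.__name__)
```

For the proxy, the call would presumably be `import_with_watchdog("litellm.proxy.proxy_server")`; the log should then show which import statement each thread is sitting on. Passing `exit=True` to `faulthandler.dump_traceback_later` makes the process terminate after the dump instead of continuing to wait.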
| 61 | + |
| 62 | +### Workaround/Fix |
| 63 | + |
| 64 | +The fix is to revert the lazy loading system and restore direct imports in `litellm/__init__.py`, **ensuring ALL functions imported by `proxy_server.py` are included**: |
| 65 | + |
| 66 | +```python |
| 67 | +# Replace lazy loading __getattr__ with direct imports: |
| 68 | +from .utils import ( |
| 69 | + client, |
| 70 | + exception_type, |
| 71 | + get_optional_params, |
| 72 | + # ... all other utils functions |
| 73 | + ModelResponse, |
| 74 | + ModelResponseStream, |
| 75 | + load_credentials_from_list, # CRITICAL: Used by proxy_server.py:56 |
| 76 | + _add_custom_logger_callback_to_specific_event, # CRITICAL: Used by proxy_server.py:464 |
| 77 | + # etc. |
| 78 | +) |
| 79 | + |
| 80 | +from .cost_calculator import completion_cost, cost_per_token, response_cost_calculator |
| 81 | +from litellm.litellm_core_utils.litellm_logging import Logging, modify_integration |
| 82 | + |
| 83 | +# Remove the __getattr__ function entirely |
| 84 | +``` |
| 85 | + |
| 86 | +**Note**: The initial fix (commit `f6bc8d2f62`) was incomplete because it didn't include `load_credentials_from_list` and `_add_custom_logger_callback_to_specific_event` in the import list, causing the circular import to persist when `proxy_server.py` imported these functions. |
| 87 | + |
| 88 | +### Related Commits |
| 89 | + |
| 90 | +- **Breaking change:** `56328e6535` - PR #17171 (Dec 3, 2025) |
| 91 | +- **Incomplete fix:** `f6bc8d2f62` - "fix(agentcore): remove lazy loading to resolve circular import deadlock" (missing 2 imports) |
| 92 | +- **Complete fix:** TBD - adds `load_credentials_from_list` and `_add_custom_logger_callback_to_specific_event` to the import list |
| 93 | + |
| 94 | +### Why the Initial Fix Failed |
| 95 | + |
| 96 | +The first fix (commit `f6bc8d2f62`) restored direct imports but was incomplete. `proxy_server.py` still runs this import at line 56: |
| 97 | +```python |
| 98 | +from litellm.utils import load_credentials_from_list |
| 99 | +``` |
| 100 | + |
| 101 | +Since `load_credentials_from_list` wasn't in the pre-imported list in `litellm/__init__.py`, Python had to dynamically import it from `utils.py`, which triggered `import litellm` at `utils.py:56`, recreating the circular dependency. |
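A small check can guard against this kind of omission by listing which names are not bound directly in a module's namespace. The sketch below is an assumption about how such a check might look; `missing_eager_names` is a hypothetical helper, and the example runs against the standard-library `json` module rather than litellm.

```python
import importlib
from types import ModuleType
from typing import Iterable, List


def missing_eager_names(module: ModuleType, names: Iterable[str]) -> List[str]:
    """Return the names that are not bound directly in the module's namespace.

    vars(module) reads the module dict and bypasses any module-level
    __getattr__, so names that are only resolved lazily count as missing.
    """
    namespace = vars(module)
    return [name for name in names if name not in namespace]


json_module = importlib.import_module("json")
print(missing_eager_names(json_module, ["dumps", "loads", "not_a_real_name"]))
```

Run against `litellm` with `load_credentials_from_list` and `_add_custom_logger_callback_to_specific_event`, an empty result would suggest both names are imported eagerly in `litellm/__init__.py`.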
| 102 | + |
| 103 | +### Impact |
| 104 | + |
| 105 | +- **Severity:** Critical - Proxy server cannot start |
| 106 | +- **Affected deployments:** All containerized deployments with health checks (ECS, Kubernetes, Docker) |
| 107 | +- **Version:** All versions from commit `56328e6535` (Dec 3, 2025) onward |
| 108 | + |
| 109 | +### Suggested Solution |
| 110 | + |
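| 111 | +1. **Short-term:** Revert PR #17171, or apply the fix from commit `f6bc8d2f62` together with the two missing imports (`load_credentials_from_list` and `_add_custom_logger_callback_to_specific_event`) |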
| 112 | +2. **Long-term:** If lazy loading is desired for performance, restructure the imports to break the circular dependency: |
| 113 | + - Replace the module-level `import litellm` in `utils.py` with function-local imports where needed |
| 114 | + - Or ensure `proxy_server.py` doesn't import from `litellm.utils` before `litellm` is fully initialized |
| 115 | + - Or use a different lazy loading mechanism that doesn't rely on `__getattr__` during module initialization |
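The first long-term option can be sketched with a throwaway package. Everything here is hypothetical: `lazy_demo_pkg`, `make_response`, and the file contents are written to a temporary directory to illustrate the pattern, and they are not litellm's actual layout.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source for a hypothetical package, lazy_demo_pkg, written to a temp directory.
INIT_SRC = '''
def __getattr__(name):
    # Lazy export resolved on first attribute access (PEP 562).
    if name == "ModelResponse":
        from .utils import ModelResponse
        return ModelResponse
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
'''

UTILS_SRC = '''
class ModelResponse:
    pass


def make_response():
    # Function-local import: runs at call time, after utils has finished
    # loading, so the lazy hook can import from utils safely.
    import lazy_demo_pkg
    return lazy_demo_pkg.ModelResponse()
'''


def load_demo_utils(root):
    """Write the demo package under root and import its utils module."""
    pkg_dir = Path(root) / "lazy_demo_pkg"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text(INIT_SRC)
    (pkg_dir / "utils.py").write_text(UTILS_SRC)
    sys.path.insert(0, str(root))
    try:
        importlib.invalidate_caches()
        return importlib.import_module("lazy_demo_pkg.utils")
    finally:
        sys.path.remove(str(root))


with tempfile.TemporaryDirectory() as tmp:
    utils = load_demo_utils(tmp)
    response = utils.make_response()

print(type(response).__name__)
```

Because `utils.py` no longer touches the package at module level, nothing can trigger the lazy hook until `utils` is fully loaded. The trade-off is a repeated (cheap, cached) import statement inside each function that needs the package.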
| 116 | + |
| 117 | +### Environment |
| 118 | + |
| 119 | +- Python version: 3.12.8 |
| 120 | +- Deployment: AWS ECS (also affects any containerized deployment) |
| 121 | +- Base image: Chainguard Wolfi |
| 122 | + |
| 123 | +--- |
| 124 | + |
| 125 | +**Note:** This issue was discovered when health checks failed in a production ECS deployment. The fix has been tested and confirmed to resolve the issue. |