54 changes: 54 additions & 0 deletions build/457.json
@@ -0,0 +1,54 @@
{
"id": "457",
Collaborator comment: should be 458
"title": "Implement Sigmoid MoE Router with Bias Correction",
"difficulty": "hard",
"category": "Deep Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [],
"pytorch_difficulty": "hard",
"description": "## Problem\n\nWrite a Python function `sigmoid_moe_router(hidden_states, gate_weight, score_bias, top_k)` that implements the sigmoid-based Mixture of Experts routing used in MiniMax M2.5. The function takes:\n- `hidden_states`: array of shape `(num_tokens, hidden_dim)`\n- `gate_weight`: array of shape `(num_experts, hidden_dim)`\n- `score_bias`: array of shape `(num_experts,)` for expert load balancing\n- `top_k`: number of experts to select per token\n\nThe function should:\n1. Compute router logits via matrix multiplication\n2. Apply sigmoid activation (not softmax) to get routing weights\n3. Add the score bias to determine expert selection\n4. Select the top-k experts per token\n5. Gather the actual sigmoid weights (without bias) for selected experts\n6. Normalize the selected weights to sum to 1\n\nReturn a tuple of `(top_k_weights, top_k_indices)` where weights has shape `(num_tokens, top_k)` and indices has shape `(num_tokens, top_k)`. Only use numpy.",
"learn_section": "# **Sigmoid MoE Router with Bias Correction**\n\n## **1. Definition**\nThe Sigmoid MoE Router is the expert routing mechanism used in MiniMax M2.5. Unlike traditional MoE routers that use softmax scoring, this router uses **sigmoid activation** with a **learned bias correction** for expert selection, followed by weight normalization.\n\n## **2. Why Sigmoid Instead of Softmax?**\n- **Independent scoring:** Sigmoid scores each expert independently, while softmax creates competition between experts. This allows for more flexible expert utilization.\n- **Better load balancing:** The learned bias correction term helps distribute tokens more evenly across experts without auxiliary losses dominating training.\n- **Simpler gradient flow:** Sigmoid gradients don't depend on other experts' scores.\n\n## **3. Algorithm**\nGiven hidden states $H \in \mathbb{R}^{T \times d}$, gate weights $W_g \in \mathbb{R}^{E \times d}$, and bias correction $b \in \mathbb{R}^{E}$:\n\n**Step 1: Compute logits**\n$$\text{logits} = H \cdot W_g^T \quad \in \mathbb{R}^{T \times E}$$\n\n**Step 2: Apply sigmoid**\n$$w = \sigma(\text{logits}) = \frac{1}{1 + e^{-\text{logits}}} \quad \in \mathbb{R}^{T \times E}$$\n\n**Step 3: Bias-corrected scores for selection**\n$$s = w + b \quad \in \mathbb{R}^{T \times E}$$\n\n**Step 4: Top-k selection**\n$$\text{indices} = \text{argsort}(s, \text{descending})[:, :k]$$\n\n**Step 5: Gather actual weights (without bias)**\n$$w_{\text{selected}} = \text{gather}(w, \text{indices})$$\n\n**Step 6: Normalize**\n$$\hat{w} = \frac{w_{\text{selected}}}{\sum_{j=1}^{k} w_{\text{selected}, j}}$$\n\n## **4. Key Design Choices**\n- The **bias is only used for selection**, not for the final weights. This decouples load balancing from the actual routing weights.\n- The bias $b$ is a **learned parameter** that adjusts which experts are preferred, compensating for natural imbalances.\n- The final weights are **normalized** so they sum to 1, making the weighted combination of expert outputs a proper convex combination.\n\n## **5. Role in MiniMax M2.5**\n- **256 experts** per layer, **top-8** selected per token\n- Gate: `Linear(3072, 256, bias=False)`\n- Bias correction: learned vector of size 256\n- Only ~3% of experts activated per token (8/256)",
"starter_code": "import numpy as np\n\n# Implement your function below.\ndef sigmoid_moe_router(hidden_states, gate_weight, score_bias, top_k):\n \"\"\"\n Implement sigmoid-based MoE routing with bias correction.\n\n Args:\n hidden_states (np.ndarray): Token representations, shape (num_tokens, hidden_dim).\n gate_weight (np.ndarray): Gate projection weights, shape (num_experts, hidden_dim).\n score_bias (np.ndarray): Learned bias for load balancing, shape (num_experts,).\n top_k (int): Number of experts to select per token.\n\n Returns:\n tuple: (top_k_weights, top_k_indices)\n - top_k_weights: Normalized routing weights, shape (num_tokens, top_k).\n - top_k_indices: Selected expert indices, shape (num_tokens, top_k).\n \"\"\"\n pass",
"solution": "import numpy as np\n\ndef sigmoid_moe_router(hidden_states, gate_weight, score_bias, top_k):\n logits = hidden_states @ gate_weight.T\n routing_weights = 1.0 / (1.0 + np.exp(-logits))\n scores_for_choice = routing_weights + score_bias\n\n top_k_indices = np.argsort(-scores_for_choice, axis=-1)[:, :top_k]\n\n top_k_weights = np.take_along_axis(routing_weights, top_k_indices, axis=-1)\n top_k_weights = top_k_weights / np.sum(top_k_weights, axis=-1, keepdims=True)\n\n return top_k_weights.astype(float), top_k_indices",
"example": {
"input": "import numpy as np\nhidden = np.array([[1.0, 0.0], [0.0, 1.0]])\ngate = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])\nbias = np.zeros(4)\nw, idx = sigmoid_moe_router(hidden, gate, bias, 2)\nprint(np.round(w, 4))\nprint(idx)",
"output": "[[0.5938 0.4062]\n [0.5938 0.4062]]\n[[0 1]\n [1 0]]",
"reasoning": "For token [1,0]: logits=[1,0,-1,0], sigmoid=[0.731,0.5,0.269,0.5]. Top-2 by score are experts 0,1 with weights [0.731,0.5]. Normalized: [0.731/1.231, 0.5/1.231] = [0.5938, 0.4062]."
},
"test_cases": [
{
"test": "import numpy as np\nhidden = np.array([[1.0, 0.0], [0.0, 1.0]])\ngate = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])\nbias = np.zeros(4)\nw, idx = sigmoid_moe_router(hidden, gate, bias, 2)\nprint(np.round(w, 4))\nprint(idx)",
"expected_output": "[[0.5938 0.4062]\n [0.5938 0.4062]]\n[[0 1]\n [1 0]]"
},
{
"test": "import numpy as np\nhidden = np.array([[0.0, 0.0]])\ngate = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])\nbias = np.array([0.0, 0.0, 1.0])\nw, idx = sigmoid_moe_router(hidden, gate, bias, 2)\nprint(np.round(w, 4))\nprint(idx)",
"expected_output": "[[0.5 0.5]]\n[[2 0]]"
},
{
"test": "import numpy as np\nnp.random.seed(42)\nhidden = np.random.randn(2, 4)\ngate = np.random.randn(6, 4)\nbias = np.array([0.1, -0.1, 0.2, -0.2, 0.0, 0.0])\nw, idx = sigmoid_moe_router(hidden, gate, bias, 3)\nprint(np.round(w, 4))\nprint(idx)",
"expected_output": "[[0.6001 0.2588 0.1412]\n [0.6644 0.2488 0.0868]]\n[[5 4 0]\n [5 0 2]]"
},
{
"test": "import numpy as np\nhidden = np.array([[1.0, 1.0]])\ngate = np.array([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]])\nbias = np.zeros(3)\nw, idx = sigmoid_moe_router(hidden, gate, bias, 1)\nprint(np.round(w, 4))\nprint(idx)",
"expected_output": "[[1.]]\n[[1]]"
}
],
"pytorch_starter_code": "import torch\n\ndef sigmoid_moe_router(hidden_states, gate_weight, score_bias, top_k):\n \"\"\"\n Implement sigmoid-based MoE routing with bias correction.\n\n Args:\n hidden_states (torch.Tensor): Token representations, shape (num_tokens, hidden_dim).\n gate_weight (torch.Tensor): Gate projection weights, shape (num_experts, hidden_dim).\n score_bias (torch.Tensor): Learned bias for load balancing, shape (num_experts,).\n top_k (int): Number of experts to select per token.\n\n Returns:\n tuple: (top_k_weights, top_k_indices)\n \"\"\"\n pass",
"pytorch_solution": "import torch\n\ndef sigmoid_moe_router(hidden_states, gate_weight, score_bias, top_k):\n logits = hidden_states @ gate_weight.T\n routing_weights = torch.sigmoid(logits)\n scores_for_choice = routing_weights + score_bias\n\n top_k_indices = torch.topk(scores_for_choice, top_k, dim=-1).indices\n\n top_k_weights = torch.gather(routing_weights, 1, top_k_indices)\n top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)\n\n return top_k_weights.float(), top_k_indices",
"pytorch_test_cases": [
{
"test": "import torch\nhidden = torch.tensor([[1.0, 0.0], [0.0, 1.0]])\ngate = torch.tensor([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])\nbias = torch.zeros(4)\nw, idx = sigmoid_moe_router(hidden, gate, bias, 2)\nprint(torch.round(w, decimals=4))\nprint(idx)",
Collaborator comment: there is an issue with this test case; it does not align with the given solution.
"expected_output": "tensor([[0.5938, 0.4062],\n [0.5938, 0.4062]])\ntensor([[0, 1],\n [1, 0]])"
},
{
"test": "import torch\nhidden = torch.tensor([[0.0, 0.0]])\ngate = torch.tensor([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])\nbias = torch.tensor([0.0, 0.0, 1.0])\nw, idx = sigmoid_moe_router(hidden, gate, bias, 2)\nprint(torch.round(w, decimals=4))\nprint(idx)",
"expected_output": "tensor([[0.5000, 0.5000]])\ntensor([[2, 0]])"
},
{
"test": "import torch\nhidden = torch.tensor([[1.0, 1.0]])\ngate = torch.tensor([[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]])\nbias = torch.zeros(3)\nw, idx = sigmoid_moe_router(hidden, gate, bias, 1)\nprint(torch.round(w, decimals=4))\nprint(idx)",
"expected_output": "tensor([[1.]])\ntensor([[1]])"
}
]
}
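As a quick sanity check of the `457.json` reference solution, the numpy implementation from the diff can be run standalone against the stated example (nothing is assumed beyond the code already in this file):

```python
import numpy as np

def sigmoid_moe_router(hidden_states, gate_weight, score_bias, top_k):
    # Router logits: (num_tokens, num_experts)
    logits = hidden_states @ gate_weight.T
    # Sigmoid (not softmax): each expert is scored independently
    routing_weights = 1.0 / (1.0 + np.exp(-logits))
    # The learned bias is used only to pick experts, never in the final weights
    scores_for_choice = routing_weights + score_bias
    top_k_indices = np.argsort(-scores_for_choice, axis=-1)[:, :top_k]
    # Gather the raw sigmoid weights of the chosen experts, renormalize to sum to 1
    top_k_weights = np.take_along_axis(routing_weights, top_k_indices, axis=-1)
    top_k_weights = top_k_weights / np.sum(top_k_weights, axis=-1, keepdims=True)
    return top_k_weights, top_k_indices

hidden = np.array([[1.0, 0.0], [0.0, 1.0]])
gate = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
bias = np.zeros(4)
w, idx = sigmoid_moe_router(hidden, gate, bias, 2)
print(np.round(w, 4))  # [[0.5938 0.4062]  [0.5938 0.4062]]
print(idx)             # [[0 1]  [1 0]]
```

One caveat: in this example each token has a 0.5-vs-0.5 tie for the second slot (experts 1 and 3 for the first token, 0 and 2 for the second). `np.argsort` resolves ties by index order, whereas `torch.topk` does not document its tie-breaking, which is a plausible source of the PyTorch test-case mismatch noted in review.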
58 changes: 58 additions & 0 deletions build/458.json
@@ -0,0 +1,58 @@
{
"id": "458",
Collaborator comment: 459
"title": "Implement Lightning Attention (Linear Attention)",
"difficulty": "hard",
"category": "Deep Learning",
"video": "",
"likes": "0",
"dislikes": "0",
"contributor": [],
"pytorch_difficulty": "hard",
"description": "## Problem\n\nWrite a Python function `lightning_attention(Q, K, V, decay)` that implements causal linear attention with exponential decay, as used in Lightning Attention. The function takes:\n- `Q`: query array of shape `(seq_len, head_dim)`\n- `K`: key array of shape `(seq_len, head_dim)`\n- `V`: value array of shape `(seq_len, head_dim)`\n- `decay`: a float decay factor (lambda) between 0 and 1\n\nInstead of computing softmax(QK^T)V, compute the output using the recurrent form:\n- Maintain a state `S` of shape `(head_dim, head_dim)` initialized to zeros\n- At each timestep t: `S_t = decay * S_{t-1} + K_t^T @ V_t`, then `O_t = Q_t @ S_t`\n\nReturn the output array of shape `(seq_len, head_dim)` as floats. Only use numpy.",
"learn_section": "# **Lightning Attention (Linear Attention with Decay)**\n\n## **1. Definition**\nLightning Attention is a linear attention mechanism that replaces the quadratic softmax attention with a recurrent formulation. It achieves $O(nd^2)$ complexity instead of $O(n^2d)$, making it efficient for very long sequences. It was developed by the MiniMax team and used in MiniMax-01.\n\n## **2. Standard Attention vs Linear Attention**\n\n**Standard (Softmax) Attention:**\n$$O = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V \quad \in O(n^2 d)$$\n\n**Linear Attention (kernel trick):**\n$$O_t = Q_t \cdot \sum_{s \leq t} K_s^T V_s = Q_t \cdot S_t \quad \in O(n d^2)$$\n\nWhere $S_t = \sum_{s \leq t} K_s^T V_s$ is a running state that accumulates key-value outer products.\n\n## **3. Adding Exponential Decay**\nTo prevent the state from growing unboundedly and to focus on more recent tokens, Lightning Attention adds an exponential decay factor $\lambda$:\n\n$$S_t = \lambda \cdot S_{t-1} + K_t^T V_t$$\n$$O_t = Q_t \cdot S_t$$\n\nWhere:\n- $S_t \in \mathbb{R}^{d \times d}$ is the recurrent state\n- $\lambda \in (0, 1)$ is the decay factor\n- $K_t^T V_t$ is the outer product of key and value at position $t$\n\n## **4. Recurrent Interpretation**\nThe decay creates an exponentially weighted sum over history:\n\n$$S_t = \sum_{s=1}^{t} \lambda^{t-s} K_s^T V_s$$\n\nMore recent tokens contribute more strongly than distant ones, providing a natural notion of locality.\n\n## **5. Advantages**\n- **Linear complexity:** $O(nd^2)$ instead of $O(n^2d)$ — better for long sequences when $n \gg d$\n- **Constant memory per step:** Only need to maintain the $d \times d$ state matrix\n- **Streamable:** Can process tokens one at a time in a recurrent fashion\n- **Infinite context (in theory):** No fixed context window limitation\n\n## **6. Role in MiniMax Architecture**\nIn MiniMax-01, the predecessor to M2.5, Lightning Attention was used in a hybrid pattern: 7 linear attention layers followed by 1 softmax attention layer, repeated across the network. This combined the efficiency of linear attention with the expressiveness of softmax attention for long-range dependencies.",
"starter_code": "import numpy as np\n\n# Implement your function below.\ndef lightning_attention(Q, K, V, decay):\n \"\"\"\n Implement causal linear attention with exponential decay.\n\n Args:\n Q (np.ndarray): Query array of shape (seq_len, head_dim).\n K (np.ndarray): Key array of shape (seq_len, head_dim).\n V (np.ndarray): Value array of shape (seq_len, head_dim).\n decay (float): Exponential decay factor (lambda), between 0 and 1.\n\n Returns:\n np.ndarray: Output array of shape (seq_len, head_dim).\n \"\"\"\n pass",
"solution": "import numpy as np\n\ndef lightning_attention(Q, K, V, decay):\n seq_len, head_dim = Q.shape\n S = np.zeros((head_dim, head_dim))\n output = np.zeros((seq_len, head_dim))\n\n for t in range(seq_len):\n S = decay * S + np.outer(K[t], V[t])\n output[t] = Q[t] @ S\n\n return output.astype(float)",
"example": {
"input": "import numpy as np\nQ = np.ones((3, 2))\nK = np.ones((3, 2))\nV = np.ones((3, 2))\nprint(np.round(lightning_attention(Q, K, V, 0.5), 4))",
"output": "[[2. 2. ]\n [3. 3. ]\n [3.5 3.5]]",
"reasoning": "At t=0: S = outer([1,1],[1,1]) = [[1,1],[1,1]], O = [1,1]@S = [2,2]. At t=1: S = 0.5*S + outer = [[1.5,1.5],[1.5,1.5]], O = [3,3]. At t=2: S = 0.5*S + outer = [[1.75,1.75],[1.75,1.75]], O = [3.5,3.5]."
},
"test_cases": [
{
"test": "import numpy as np\nQ = np.array([[1.0, 0.0]])\nK = np.array([[1.0, 0.0]])\nV = np.array([[1.0, 2.0]])\nprint(np.round(lightning_attention(Q, K, V, 0.9), 4))",
"expected_output": "[[1. 2.]]"
},
{
"test": "import numpy as np\nQ = np.array([[1.0, 0.0], [0.0, 1.0]])\nK = np.array([[1.0, 0.0], [0.0, 1.0]])\nV = np.array([[1.0, 2.0], [3.0, 4.0]])\nprint(np.round(lightning_attention(Q, K, V, 1.0), 4))",
"expected_output": "[[1. 2.]\n [3. 4.]]"
},
{
"test": "import numpy as np\nQ = np.ones((3, 2))\nK = np.ones((3, 2))\nV = np.ones((3, 2))\nprint(np.round(lightning_attention(Q, K, V, 0.5), 4))",
"expected_output": "[[2. 2. ]\n [3. 3. ]\n [3.5 3.5]]"
},
{
"test": "import numpy as np\nnp.random.seed(42)\nQ = np.random.randn(4, 3)\nK = np.random.randn(4, 3)\nV = np.random.randn(4, 3)\nprint(np.round(lightning_attention(Q, K, V, 0.9), 4))",
"expected_output": "[[ 0.3988 -0.0812 0.8431]\n [-0.8582 0.538 -1.0621]\n [ 1.4379 -4.9831 0.7769]\n [-0.9745 -0.3103 -2.1484]]"
},
{
"test": "import numpy as np\nQ = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])\nK = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])\nV = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])\nprint(np.round(lightning_attention(Q, K, V, 0.0), 4))",
"expected_output": "[[1. 0.]\n [1. 0.]\n [1. 0.]]"
}
],
"pytorch_starter_code": "import torch\n\ndef lightning_attention(Q, K, V, decay):\n \"\"\"\n Implement causal linear attention with exponential decay.\n\n Args:\n Q (torch.Tensor): Query tensor of shape (seq_len, head_dim).\n K (torch.Tensor): Key tensor of shape (seq_len, head_dim).\n V (torch.Tensor): Value tensor of shape (seq_len, head_dim).\n decay (float): Exponential decay factor (lambda), between 0 and 1.\n\n Returns:\n torch.Tensor: Output tensor of shape (seq_len, head_dim).\n \"\"\"\n pass",
"pytorch_solution": "import torch\n\ndef lightning_attention(Q, K, V, decay):\n seq_len, head_dim = Q.shape\n S = torch.zeros((head_dim, head_dim), dtype=Q.dtype)\n output = torch.zeros_like(Q)\n\n for t in range(seq_len):\n S = decay * S + torch.outer(K[t], V[t])\n output[t] = Q[t] @ S\n\n return output.float()",
"pytorch_test_cases": [
{
"test": "import torch\nQ = torch.tensor([[1.0, 0.0]])\nK = torch.tensor([[1.0, 0.0]])\nV = torch.tensor([[1.0, 2.0]])\nprint(torch.round(lightning_attention(Q, K, V, 0.9), decimals=4))",
"expected_output": "tensor([[1., 2.]])"
},
{
"test": "import torch\nQ = torch.ones((3, 2))\nK = torch.ones((3, 2))\nV = torch.ones((3, 2))\nprint(torch.round(lightning_attention(Q, K, V, 0.5), decimals=4))",
"expected_output": "tensor([[2.0000, 2.0000],\n [3.0000, 3.0000],\n [3.5000, 3.5000]])"
},
{
"test": "import torch\nQ = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])\nK = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])\nV = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])\nprint(torch.round(lightning_attention(Q, K, V, 0.0), decimals=4))",
"expected_output": "tensor([[1., 0.],\n [1., 0.],\n [1., 0.]])"
}
]
}
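As a cross-check of the `458.json` recurrent solution: unrolling the recurrence gives $O_t = \sum_{s \le t} \lambda^{t-s} (Q_t \cdot K_s) V_s$, i.e. the output equals $((QK^T) \odot D)V$ where $D[t,s] = \lambda^{t-s}$ for $s \le t$ and $0$ otherwise. The `lightning_attention_quadratic` helper below is written only for this comparison and is not part of the diff:

```python
import numpy as np

def lightning_attention(Q, K, V, decay):
    # Recurrent form from the diff: S_t = decay * S_{t-1} + K_t^T V_t, O_t = Q_t S_t
    seq_len, head_dim = Q.shape
    S = np.zeros((head_dim, head_dim))
    out = np.zeros((seq_len, head_dim))
    for t in range(seq_len):
        S = decay * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def lightning_attention_quadratic(Q, K, V, decay):
    # Unrolled form: O_t = sum_{s<=t} decay^(t-s) * (Q_t . K_s) * V_s
    seq_len = Q.shape[0]
    t = np.arange(seq_len)
    # Causal decay mask: decay^(t-s) on and below the diagonal, 0 above
    D = np.where(t[:, None] >= t[None, :],
                 decay ** (t[:, None] - t[None, :]), 0.0)
    return ((Q @ K.T) * D) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 4))
print(np.allclose(lightning_attention(Q, K, V, 0.9),
                  lightning_attention_quadratic(Q, K, V, 0.9)))  # True
```

The quadratic version materializes the full `(seq_len, seq_len)` decay mask, so it is only a reference implementation; the recurrent loop is what yields the $O(nd^2)$ cost discussed in the learn section.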