feat(mariadb): add 11.4 replication and semisync hardening by weicao · Pull Request #2633 · apecloud/kubeblocks-addons

weicao · 2026-05-09T07:06:28Z

Summary

Add MariaDB 11.4 standalone, replication, semisync, and Galera chart resources.
Version the replication script ConfigMap and wire replication and semisync ComponentDefinitions to the same versioned script name.
Harden semisync startup, role publication, switchover fencing, preStop handling, rebuilt-old-primary rejoin fencing, and fresh bootstrap role publication.
Gate semisync primary role publication on both user-facing root and internal-root local writes while read_only=0.
Keep remote root table writes fenced during secondary states while preserving the privileges needed for follow and status repair, including REPLICATION MASTER ADMIN.
Truncate the switchover action contract to the kbagent 60s ceiling: action keeps DCS switchover, old primary local read_only fence, and a bounded candidate remote root write probe; post-DCS convergence is delegated to roleProbe + KB endpoint controller.
Move kubeblocks.kb_health_check 1062/1146 repair into the secondary roleProbe path with a precise signature trigger and kb_internal_root maintenance writes.
Add shell specs for replication member join, role probe, switchover, semisync rejoin fencing, and standalone template mapping.

Local validation

git diff --check
bash -n addons/mariadb/scripts/replication-switchover.sh
bash -n addons/mariadb/scripts/replication-roleprobe.sh
helm lint addons/mariadb
helm template addons/mariadb (rendered switchover.timeoutSeconds = 60 in cmpd-semisync and cmpd-replication; rendered scripts contain secondary_kb_health_check_repair_attempt and wait_candidate_remote_root_write_ready)
shellspec -I shellspec addons/mariadb/scripts-ut-spec passed: 145 examples, 0 failures
PR body and branch commits were checked for attribution text.

Current retest package

Latest commit: 77bfef3e.
Chart: 1.1.1-alpha.69 (alpha.69 v1). The chart is bumped from 1.1.1-alpha.68 because the alpha.69 v1 fix mutates cmpd-semisync.yaml ensure_internal_local_admin SQL with 3 new statements: CREATE DATABASE IF NOT EXISTS kubeblocks + CREATE TABLE IF NOT EXISTS kubeblocks.kb_health_check BEFORE the @'%' GRANT block (closes Error 1146 bootstrap precondition); plus GRANT SELECT ON mysql.user TO 'kb_internal_root'@'%' AFTER the existing three @'%' grants (closes Error 1044 syncer init_db handshake against the /mysql default DB in syncer's connection DSN). The KubeBlocks ComponentDefinition immutability rule applies the same as for alpha.64 v3 → alpha.65 → alpha.66 → alpha.67 → alpha.68. Closes Jack 17:57 alpha.68 v2 install/script live-gate RED 3 evidence chains (1146 + 1044 + 2002 downstream cross-pod listener stuck on 127.0.0.1).
merge-base for canonical alpha.59-.69 diff: git merge-base origin/main HEAD = 69b3b6d90f758c14434da0d38699827abafe8645.
alpha.69 v1-only diff sha256 (6795fd42..HEAD -- addons/mariadb): c1dda065d00332ae8a9cfd975806a1348b9b1903cdfa392c043ed0d036011905.
Whole alpha.59-.69 diff sha256 (<merge-base>..HEAD -- addons/mariadb, base = 69b3b6d9): c07b9d61455171748e57b0a8d1aa20096aba08ed10088be34b853d0f5ed48169.
Package sha256: da2cafb03fe715f982753e9f34c439d9094ceba6290faf1ed30ac0d4f6016a1a.
Rendered manifest sha256: 1caf721b5ea6969a6866fd0e754f9d062899ded5003b81180ebc6982e24620a3.
Source SHAs: Chart.yaml 73d8bc437b52470fbd17d277f5a9603e4ad8a8ef988e370ccca2dab7f6506d89 (alpha.68 → alpha.69 + cumulative immutability + bootstrap precondition + narrow init_db grant + MariaDB 11.4 SHOW GRANTS normalization comment block); templates/cmpd-semisync.yaml 9d2a8ec63ceda2c8d290cbbc5c379eaac5d905093f9769192d17e200b667081e (ensure_internal_local_admin SQL + 3 lines + extensive comment expansion); scripts-ut-spec/replication_switchover_spec.sh 27d00254b637300095e58161e23d679dd5b6bf178329488dfac8a5480d0bec67 (chart-bump regression literals updated + alpha.66 v1 SUPERSEDED allowlist regex extended to 4 grants + new Describe with 5 examples).
Static: bash -n rc=0 / dash -n rc=0 / helm lint ✓ / helm template rc=0 (rendered manifest contains name: mariadb-semisync-1.1.1-alpha.69, mariadb-replication-1.1.1-alpha.69, mariadb-galera-1.1.1-alpha.69 plus -pcr variants).
alpha.64 v1+v2+v3 + alpha.65 v1+v2 + alpha.66 v1 + alpha.67 v1 + alpha.68 v2 contract spot-check (all preserved unchanged; alpha.69 v1 only ADDS 3 SQL statements inside ensure_internal_local_admin and bumps chart version, does not modify any prior contract).
ShellSpec source-tree replication_switchover_spec.sh: 164 examples / 0 failures.
ShellSpec package-extraction basis (Helen self-verified in fake-repo /tmp/alpha69v1-fakerepo/ mirroring real repo layout; Helen ship checklist step from alpha.65 v1→v2): 164 examples / 0 failures.
Pre-existing ShellSpec failures in other spec files (carried as alpha.70+ cleanup item; NOT caused by alpha.69 v1, baseline alpha.68 v2 has same fail count): 58/60 in semisync_rejoin_fence_template_spec.sh + 1/1 cwd path-bug in standalone_template_mapping_spec.sh. Confirmed via git stash to alpha.68 v2 baseline and re-running both spec files — same fail count.

alpha.69 v1 fix scope (closes alpha.68 v2 install/script live-gate RED 3 evidence chains: 1146 + 1044 + 2002)

alpha.68 v2 install/script live-gate came back RED at runtime (Jack 17:57 closeout msg xxxx). Three evidence chains:

Chain 1 — Error 1146 bootstrap precondition gap: ensure_internal_local_admin runs from wait_for_internal_local_admin "startup-before-role-decision", which executes BEFORE primary_local_root_write_ready can create kubeblocks.kb_health_check. On a fresh boot the table therefore does not exist when the @'%' GRANT runs, the GRANT fails with Error 1146, ensure_internal_local_admin returns rc=1, wait_for_internal_local_admin loops forever, role decision never reaches expose_sql_listener_for_*_role, mariadbd stays bound to 127.0.0.1, and cross-pod TCP connections see Error 2002 (Chain 3 = downstream consequence of Chain 1).

Chain 2 — Error 1044 mysql DB access denial: syncer's connection URL (apecloud/syncer engines/mysql/config.go line 71) is root:@tcp(127.0.0.1:3306)/mysql?multiStatements=true — the /mysql segment is the default database for go-sql-driver, which issues init_db = mysql during the TCP handshake. alpha.68 v2 grants on @'%' are global (REPLICATION CLIENT + REPLICATION MASTER ADMIN ON *.*) plus table-specific on kubeblocks.kb_health_check; the cross-pod init_db handshake fails with Error 1044 because the @'%' account has no privilege on the mysql schema. Cross-pod auth never establishes.

Chain 3 — Error 2002 cross-pod connection refused: downstream of Chain 1 (wait_for_internal_local_admin infinite loop → no role decision → no expose_sql_listener_for_*_role → mariadbd stays bound to 127.0.0.1).
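To make the init_db step in Chain 2 concrete, the shell sketch below pulls the default database out of a go-sql-driver style DSN. The helper name dsn_default_db is hypothetical and is not part of syncer or the chart; it assumes the password contains no "?" and only illustrates which DSN segment becomes the default schema.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from syncer or the chart): print the default
# database named in a DSN of the form "user:pass@proto(addr)/dbname?params".
dsn_default_db() {
  local dsn="$1" without_params
  without_params="${dsn%%\?*}"            # drop "?params" (assumes no "?" in the password)
  printf '%s\n' "${without_params##*/}"   # keep the text after the last "/"
}
```

Called with the DSN quoted in Chain 2, the helper prints mysql, which is the schema the @'%' account must be able to enter for the handshake to succeed.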

Jack 18:20 alpha.69 v1 design ACCEPT with one runtime-acceptance tightening (MariaDB 11.4 SHOW GRANTS normalization: GRANT REPLICATION CLIENT ON *.* source syntax displays as BINLOG MONITOR ON *.* in SHOW GRANTS output — MariaDB 11.4 split REPLICATION CLIENT into BINLOG MONITOR + SLAVE MONITOR. Source-side ShellSpec tests check the literal source SQL; runtime live-gate SHOW GRANTS acceptance accepts BINLOG MONITOR as the positive normalized form. BINLOG MONITOR ≠ BINLOG ADMIN — the latter remains in the forbidden admin-bypass list).
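A minimal sketch of how the runtime SHOW GRANTS acceptance could treat the normalized form, assuming the grants text is piped in on stdin. Both function names are hypothetical and not part of the chart; the forbidden list is copied from the live-gate acceptance criteria in this PR body.

```shell
#!/usr/bin/env bash
# Hypothetical acceptance helpers; both read SHOW GRANTS output on stdin.

# Returns 0 when no forbidden admin-bypass privilege appears, 1 otherwise.
grants_have_no_admin_bypass() {
  local forbidden='ALL PRIVILEGES|SUPER|READ_ONLY ADMIN|CONNECTION ADMIN|BINLOG ADMIN|REPLICATION SLAVE ADMIN'
  # Require a separator (or line edge) on both sides so BINLOG MONITOR and
  # REPLICATION MASTER ADMIN are not mistaken for forbidden entries.
  ! grep -Eq "(^|[ ,])(${forbidden})([ ,]|\$)"
}

# Returns 0 when the replication-client capability is present in either the
# source spelling or the MariaDB 11.4 SHOW GRANTS spelling.
grants_have_replication_client() {
  grep -Eq 'REPLICATION CLIENT|BINLOG MONITOR'
}
```

The two checks are kept separate so a gate can report "forbidden privilege present" and "expected privilege missing" as distinct failures.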

alpha.69 v1 fix (chart-only; syncer source untouched):

cmpd-semisync.yaml ensure_internal_local_admin SQL @'%' section now (3 new statements; comments + body):

CREATE DATABASE IF NOT EXISTS kubeblocks;                                     -- 1146 fix: precondition
CREATE TABLE IF NOT EXISTS kubeblocks.kb_health_check(type INT,
  check_ts BIGINT, PRIMARY KEY(type));                                        -- 1146 fix: precondition
CREATE USER IF NOT EXISTS '${user}'@'%' IDENTIFIED BY '${password}';          -- existing
ALTER USER '${user}'@'%' ACCOUNT UNLOCK;                                      -- existing
REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${user}'@'%';                       -- existing
GRANT REPLICATION CLIENT ON *.* TO '${user}'@'%';                             -- existing
GRANT REPLICATION MASTER ADMIN ON *.* TO '${user}'@'%';                       -- existing
GRANT SELECT, INSERT, UPDATE ON kubeblocks.kb_health_check TO '${user}'@'%';  -- existing
GRANT SELECT ON mysql.user TO '${user}'@'%';                                  -- 1044 fix: narrow init_db grant
FLUSH PRIVILEGES;

CREATE DATABASE + CREATE TABLE are idempotent and safe to run over the ROOT_LOCAL socket, which has GRANT ALL PRIVILEGES locally. primary_local_root_write_ready / primary_internal_root_write_ready / clear_local_kb_health_check_table still run later post-role-decision; all are idempotent on the table. GRANT SELECT ON mysql.user is the narrow table-specific privilege satisfying init_db; net attack-surface delta = 0 vs root@'%' which already has SELECT ON *.* via alpha.64 v1 CMPD_EXPLICIT_PRIMARY_GRANT_BODY (root and kb_internal_root share MARIADB_ROOT_PASSWORD).

Chart.yaml bump 1.1.1-alpha.68 → 1.1.1-alpha.69 with cumulative comment block (alpha.65 v1 + alpha.66 v1 + alpha.67 v1 + alpha.68 v2 + alpha.69 v1 rationale all preserved for audit history).

ShellSpec increments (164 examples / 0 failures in replication_switchover_spec.sh; +8 net vs alpha.68 v2 156): 3 chart-version literal regression tests (alpha.65/.66/.67 chart-bump regression) updated to assert literal 1.1.1-alpha.69. alpha.66 v1 SUPERSEDED allowlist regex extended to include 4th grant GRANT SELECT ON mysql.user. New Describe alpha.69 v1 ensure_internal_local_admin bootstrap SQL ordering + mysql.user narrow grant with 5 examples in 3 contexts:

"1146 fix — CREATE DATABASE/TABLE before @'%' GRANT" (2): function body contains CREATE DATABASE IF NOT EXISTS kubeblocks and CREATE TABLE IF NOT EXISTS kubeblocks.kb_health_check; ordering CREATE DATABASE < CREATE TABLE < CREATE USER @'%'.
"1044 fix — narrow GRANT SELECT ON mysql.user" (2): positive GRANT SELECT ON mysql.user TO '${user}'@'%'; no broader mysql.* grants (negative scope check).
"1146/1044 fix SQL ordering" (1): awk-scoped 9-step ordering inside ensure_internal_local_admin function body with comment lines filtered out (critical implementation detail because the same regex patterns appear in the function's comment block — using grep would yield false positives matching comments instead of the actual SQL statements).

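The comment-filtered ordering check described in the last example above can be sketched as follows. This is an illustration under stated assumptions, not the actual spec code: both function names are hypothetical, the function is assumed to open with "name() {" and to close at the first line that is only "}", and only comment-only lines (starting with # or --) are dropped.

```shell
#!/usr/bin/env bash
# Print the body of shell function $1 from file $2, minus comment-only lines.
function_body_without_comments() {
  awk -v fn="$1" '
    { t = $0; sub(/^[ \t]+/, "", t) }                    # t = line without leading indent
    !inside && index(t, fn "() {") == 1 { inside = 1; next }
    inside && t == "}"                  { exit }
    inside && t !~ /^(#|--)/            { print }
  ' "$2"
}

# Succeed only if each fixed-string pattern first appears on a later line
# than the first appearance of the previous pattern.
patterns_in_order() {
  local body="$1" prev=0 line pattern
  shift
  for pattern in "$@"; do
    line=$(printf '%s\n' "$body" | grep -nF -- "$pattern" | head -n 1 | cut -d: -f1)
    [ -n "$line" ] && [ "$line" -gt "$prev" ] || return 1
    prev="$line"
  done
}
```

Filtering before matching is the point of the design: a comment that mentions a later statement would otherwise be found first and make a correct ordering look wrong, or a wrong ordering look correct.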
alpha.69 v1 caveats:

alpha.70+ mandatory blocking debt (renamed; was alpha.69 in earlier planning, renamed because chart-only short-term unblock ships as alpha.69): syncer source change so cross-member GetDBConnWithAddr uses dedicated lower-priv credential AND removes /mysql from the connection DSN (or syncer-side mechanism replaces direct cross-pod admin SQL such as setSemiSyncSourceTimeout). alpha.70+ goal state restores kb_internal_root@'%' to alpha.67 v1 LOCKED + zero-priv (clean security boundary). alpha.69 v1 is bounded short-term unblock, NOT a final design.
Live-gate runtime acceptance (per Jack 18:20 ACCEPT + 18:24 BINLOG ADMIN vs BINLOG MONITOR clarification): new CmpD mariadb-semisync-1.1.1-alpha.69 Available; syncer log hits switch to admin db; local AdminDB CURRENT_USER() = kb_internal_root@127.0.0.1; cross-pod (pod1 → pod0 IP) AdminDB CURRENT_USER() = kb_internal_root@%; SHOW GRANTS FOR 'kb_internal_root'@'%' contains positive normalized forms only: USAGE + (BINLOG MONITOR == REPLICATION CLIENT normalized positive) + REPLICATION MASTER ADMIN + SELECT/INSERT/UPDATE on kubeblocks.kb_health_check + SELECT on mysql.user; not allowed: BINLOG ADMIN (forbidden admin bypass), ALL PRIVILEGES, SUPER, READ_ONLY ADMIN, CONNECTION ADMIN, REPLICATION SLAVE ADMIN; user-facing root @%/@localhost/@127.0.0.1 still no admin bypass; stable window 0 hit on Error 4151 / 1142 / 1044 / 1049 / 1146 / 2002 / Take the leader failed / demote failed / RoleProbeNotDone / Tier B fail_closed token; vcluster-only execution in idc/idc1/idc2/idc4.
Pre-existing ShellSpec failures in semisync_rejoin_fence_template_spec.sh (58/60) and standalone_template_mapping_spec.sh (1/1 cwd path-bug) are carried as alpha.70+ cleanup item (NOT caused by alpha.69 v1, baseline alpha.68 v2 has same fail count).
alpha.63 product code remains STILL UNVALIDATED at runtime; alpha.69 fresh switchover N=1 will be the first runtime exercise reaching alpha.63 verifier + alpha.64 v1+v2+v3 + alpha.65 chart-bump rule + alpha.66 syncer HA executor + alpha.67 v1 zero-priv enforcement + alpha.68 v2 cross-member grant allowlist + alpha.69 v1 bootstrap precondition + narrow init_db grant together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → later scope.
Preserved scenes accumulated to 11 (5 RED + 6 CmpD orphans); the alpha.68 CmpD orphan is added after alpha.69 install (12 total).
Production syncer image IsRunning → switch to admin db auto-switch logic risk carry-forward from alpha.66 v1 (live gate switch to admin db log appearance is the runtime-validation gate).
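The stable-window criterion in the live-gate acceptance item above could be checked with a log scan along these lines. This is a sketch only: the function name is hypothetical, the log source is whatever the gate already collects, and the Tier B fail_closed token is left out because its exact text is not given here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines (stdin) that hit any failure token
# named in the stable-window criteria, and print the count.
stable_window_hits() {
  # grep -c exits 1 when the count is 0, so "|| true" keeps the helper's
  # own exit status clean; callers compare the printed number instead.
  grep -cE 'Error (4151|1142|1044|1049|1146|2002)([^0-9]|$)|Take the leader failed|demote failed|RoleProbeNotDone' || true
}
```

The trailing ([^0-9]|$) guard keeps a longer error number that merely starts with one of these codes from being counted as a hit.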

alpha.68 v2 fix scope (superseded by alpha.69 v1 above for the `@'%'` bootstrap precondition + narrow init_db grant; alpha.68 v2 UNLOCK + cross-member grant allowlist preserved unchanged)

alpha.68 v2 fix scope (closes alpha.67 v1 install/script live-gate RED on cross-member syncer auth via LOCKED `@'%'`)

alpha.67 v1 install/script live-gate came back RED at runtime (Jack 15:39 closeout). alpha.67 v1 LOCKED + zero-priv kb_internal_root@'%' detection-only record correctly satisfied syncer IsAdminCreated host='%' detection, but broke cross-member syncer auth: syncer GetMemberConnection uses Admin credential (= kb_internal_root via MYSQL_ADMIN_USER) for cross-pod TCP, which authenticates via @'%'. LOCKED leads to Error 4151 Access denied; secondary cannot poll leader health; cluster stays in RoleProbeNotDone forever (1064 instances observed in alpha.67 v1 live gate).

Helen 15:53 SQL matrix audit established the cross-member exact grant requirement: IsReadonly (USAGE only), IsMemberLagging / ReadCheck (SELECT on kubeblocks.kb_health_check), IsMemberHealthy leader-only WriteCheck (INSERT/UPDATE on kubeblocks.kb_health_check), setSemiSyncSourceTimeout (Follow secondary → leader: REPLICATION MASTER ADMIN).
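For reference, the matrix above can be written as a small lookup. The function name is hypothetical and the operation names are taken as written in the audit summary; this is a restatement for readability, not code from syncer or the chart.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: print the grant a cross-member syncer operation needs
# on kb_internal_root@'%', per the SQL matrix summarized above.
required_cross_member_grant() {
  case "$1" in
    IsReadonly)                 printf '%s\n' 'USAGE' ;;
    IsMemberLagging|ReadCheck)  printf '%s\n' 'SELECT ON kubeblocks.kb_health_check' ;;
    IsMemberHealthy|WriteCheck) printf '%s\n' 'INSERT, UPDATE ON kubeblocks.kb_health_check' ;;
    setSemiSyncSourceTimeout)   printf '%s\n' 'REPLICATION MASTER ADMIN ON *.*' ;;
    *) return 1 ;;              # unknown operation: no grant is implied
  esac
}
```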

Jack 15:58 refined checkpoint #3: no NEW net capability vs root@'%' which already has REPLICATION MASTER ADMIN via alpha.64 v1 contract; root and kb_internal_root share MARIADB_ROOT_PASSWORD, so net attack-surface delta = 0 for REPLICATION MASTER ADMIN. Still hard-forbidden: ALL PRIVILEGES / SUPER / READ_ONLY ADMIN / CONNECTION ADMIN / BINLOG ADMIN / REPLICATION SLAVE ADMIN / DELETE / DROP / CREATE USER / schema-wide DML / CREATE on kubeblocks.*.

alpha.68 v2 changes (chart-only; syncer source untouched): cmpd-semisync.yaml ensure_internal_local_admin SQL @'%' from LOCKED+REVOKE to UNLOCK+REVOKE+3 grants. Ordering: CREATE → UNLOCK → REVOKE → GRANT REPLICATION CLIENT → GRANT REPLICATION MASTER ADMIN → GRANT SELECT, INSERT, UPDATE ON kubeblocks.kb_health_check. No CREATE grant on kubeblocks.* because primary_local_root_write_ready pre-creates the table during local primary bootstrap before role publish (this bootstrap precondition assumption was later broken in the alpha.68 v2 install/script live gate; alpha.69 v1 closes it).

Chart.yaml bump 1.1.1-alpha.67 → 1.1.1-alpha.68 (KB CmpD immutability rule). Cumulative comment block (alpha.65 + alpha.66 v1 + alpha.67 v1 rationale preserved); alpha.68 v2 block documents Direction B risk acceptance (NOT zero risk) explicitly, plus alpha.69 mandatory blocking debt (later renamed alpha.70+ when chart-only alpha.69 v1 ships).

ShellSpec increments (+8 net, 279 examples / 0 failures in replication_switchover_spec.sh at alpha.68 v2 ship time): renamed 3 chart-version literals to alpha.68. SUPERSEDED tests (alpha.66 v1 @'%' LOCK → "not LOCK"; zero-GRANT @'%' → allowlist exact-match — only the 3 expected GRANTs). Removed alpha.67 v1 ordering test (was CREATE → REVOKE → LOCK; alpha.68 v2 introduces new ordering). New Describe alpha.68 v2 ensure_internal_local_admin cross-member health grant allowlist with 8 examples in 3 contexts.

alpha.67 v1 fix scope (superseded by alpha.68 v2 above for cross-member syncer auth; alpha.67 v1 write-site REVOKE pattern preserved at lower priority — alpha.68 v2 still REVOKE + UNLOCK + 3 grants, alpha.69 v1 adds 4th grant)

alpha.67 v1 fix scope (closes alpha.66 v1 package-level review HOLD on @'%' zero-priv write-site contract gap)

alpha.66 v1 introduced a detection-only kb_internal_root@'%' record so syncer's IsAdminCreated (which queries mysql.user WHERE host='%') can detect the admin user and trigger the AdminDB swap in IsRunning. The security contract said this @'%' record carries ACCOUNT LOCK AND zero privileges so even if LOCK is somehow bypassed, the account cannot run any SQL. But the contract was only declarative: CREATE USER IF NOT EXISTS does NOT clear an existing account's privileges, and ACCOUNT LOCK is NOT a revoke. If kb_internal_root@'%' happened to pre-exist (misconfigured prior install or upgrade) with grants, alpha.66 v1 would lock the account but leave the privileges intact — violating the security contract.

Jack's alpha.66 v1 package-level review (Slock thread #mariadb:a09341b9 msg 97e74e30, 12:56) flagged this as an XP Class 2 (write-site contract not enforced) blocker: the ShellSpec asserted "no GRANT", but never asserted "REVOKE happens".

alpha.67 v1 fix (chart-only; cmpd-semisync.yaml + Chart.yaml + spec):

cmpd-semisync.yaml ensure_internal_local_admin SQL: insert an explicit REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${user}'@'%'; between CREATE USER IF NOT EXISTS '${user}'@'%' IDENTIFIED BY '${password}'; and ALTER USER '${user}'@'%' ACCOUNT LOCK;. The pattern matches the alpha.64 v1 LOCK paths (set_local/remote_root_account_state LOCK and lock_local_root_for_prestop) which already use the same REVOKE statement before re-applying the non-bypass grant body — production-tested through alpha.64 v1+v2+v3 + alpha.65 v1+v2 + alpha.66 v1 install/script live gates without error.
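As a sketch of the resulting alpha.67 v1 statement order, the function below renders the three @'%' statements quoted above for a given user and password. The function name is hypothetical, the real ensure_internal_local_admin body is not reproduced here, and the password is interpolated without SQL escaping, so this is for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical renderer for the alpha.67 v1 detection-only @'%' record:
# CREATE, then an explicit REVOKE, then ACCOUNT LOCK, with no GRANT statement.
render_detection_only_admin_sql() {
  local user="$1" password="$2"
  cat <<SQL
CREATE USER IF NOT EXISTS '${user}'@'%' IDENTIFIED BY '${password}';
REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${user}'@'%';
ALTER USER '${user}'@'%' ACCOUNT LOCK;
SQL
}
```

The REVOKE sits between CREATE and LOCK so that a pre-existing account with leftover grants ends up with zero privileges before it is locked.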

Chart.yaml bump 1.1.1-alpha.66 → 1.1.1-alpha.67 with cumulative comment block (alpha.65 v1 + alpha.66 v1 + alpha.67 v1 rationale all preserved for audit history).

ShellSpec increments (272 examples / 0 failures, alpha.66 v1 → alpha.67 v1 = +4 net): renamed 2 chart-version literal tests (alpha.65 v1 + alpha.66 v1 chart-bump regression tests) to assert current 1.1.1-alpha.67. New Describe alpha.67 v1 ensure_internal_local_admin write-site zero-priv enforcement with 4 examples in 3 contexts:

"chart bump alpha.66 → alpha.67" (1): chart version is exactly 1.1.1-alpha.67.
"ensure_internal_local_admin write-site REVOKE step" (2): function body contains REVOKE ALL PRIVILEGES, GRANT OPTION FROM followed by @'%' (positive); statement ordering CREATE USER ... @'%' < REVOKE ... @'%' < ALTER USER ... @'%' ACCOUNT LOCK (positive ordering check).
"alpha.66 v1 negative + alpha.64+.65 invariants preserved" (1): function body still has zero GRANT to @'%' (negative scan retained from alpha.66 v1 — alpha.67 v1 only ADDS a REVOKE statement, it does not introduce any GRANT).

alpha.67 v1 caveats (carry forward from alpha.66 v1):

Raw grep "GRANT ALL PRIVILEGES" rendered.yaml continues to report 1 hit at line 1920 (case pattern matcher) + 2 hits in ensure_internal_local_admin (legitimate kb_internal_root@localhost + @127.0.0.1 internal exception); the new alpha.67 v1 REVOKE statement does NOT add any new "GRANT ALL PRIVILEGES" hit.
Live-gate runtime acceptance unchanged from alpha.66 v1 (per Jack 12:39 design ACCEPT 3 tightening): new CmpD mariadb-semisync-1.1.1-alpha.67 Available; syncer log hits switch to admin db; local AdminDB CURRENT_USER() = kb_internal_root@127.0.0.1; SHOW CREATE USER 'kb_internal_root'@'%' contains ACCOUNT LOCK; SHOW GRANTS FOR 'kb_internal_root'@'%' returns USAGE only / no admin priv (alpha.66 v1 declarative → alpha.67 v1 enforced via REVOKE even if pre-existing grants); user-facing root @%, @localhost, @127.0.0.1 still no admin bypass; stable window 0 hit on Error 1227, Take the leader failed, demote failed, RoleProbeNotDone; vcluster-only execution in idc/idc1/idc2/idc4.
alpha.63 product code remains STILL UNVALIDATED at runtime; alpha.67 fresh switchover N=1 will be the first runtime exercise reaching alpha.63 verifier + alpha.64 v1+v2+v3 + alpha.65 chart-bump rule + alpha.66 syncer HA executor + alpha.67 v1 zero-priv enforcement together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → later scope.
Preserved scenes accumulated to 7 (5 RED + 2 CmpD orphans); the alpha.66 CmpD orphan is added after alpha.67 install (8 total).
Production syncer image IsRunning → switch to admin db auto-switch logic risk carry-forward from alpha.66 v1 (live gate switch to admin db log appearance is the runtime-validation gate).

alpha.66 v1 fix scope (superseded by alpha.67 v1 above for the @'%' zero-priv write-site enforcement; alpha.66 v1 syncer HA executor swap + chart bump path preserved)

alpha.66 v1 fix scope (closes alpha.65 v2 install/script live-gate RED on syncer HA executor privilege mismatch)

alpha.65 v2 install/script live-gate came back RED at runtime even though every static + ShellSpec gate passed (Slock thread #mariadb:a09341b9 Jack closeout msg 5889a760, 12:18). The pods reached 3/3 Running and the alpha.65 CmpD trio was Available, but the live cluster never converged: Cluster/Component stayed in Creating with Healthy=False reason=RoleProbeNotDone for ~140 seconds. mariadb log on pod0 showed a tight loop of Take the leader failed: SET GLOBAL rpl_semi_sync_slave_enabled = 0 ... Error 1227 ... REPLICATION SLAVE ADMIN and demote failed: turn on readonly failed: ... Error 1227 ... READ_ONLY ADMIN. SHOW GRANTS for root@%/127.0.0.1/localhost confirmed all three host views had no admin-bypass privileges — exactly as alpha.64 v1 / alpha.66 v1 root fence requires.

The 5-layer elimination ran cleanly down to the product layer; first blocker = product/addon executor division-of-labor contract gap. alpha.64 v1 correctly removed admin-bypass privileges from user-facing root, but the chart's syncer container env block (cmpd-semisync.yaml line 1854) bound KB_SERVICE_USER to MARIADB_ROOT_USER (user-facing root), and the syncer engine then used user-facing root credentials for HA Promote/Demote SQL, including SET GLOBAL rpl_semi_sync_slave_enabled = 0 (requires REPLICATION SLAVE ADMIN) and SET GLOBAL read_only = ON (requires READ_ONLY ADMIN). The fence executor and HA-maintenance executor were the same user, so fence-tightening on user-facing root broke the HA path.

Investigation of upstream apecloud/syncer source (worktree syncer-pr142) found a clean fix path that does not require either (a) restoring admin bypass to user-facing root or (b) modifying syncer source:

syncer's engines/mysql/config.go already supports a 3-tier credential model: Root (KB_SERVICE_USER), Admin (MYSQL_ADMIN_USER, falls back to Root), Replication (MYSQL_REPLICATION_USER, falls back to Admin).
syncer's engines/mysql/manager.go IsRunning() (called before every HA cycle in highavailability/ha.go) checks IsAdminCreated() and swaps mgr.DB = mgr.AdminDB once the admin user is detected. After the swap, all HA SQL (Promote / Demote / EnableSemiSync* / TurnOn|OffReadOnly in semi_sync.go + slave.go) goes through the admin executor.
IsAdminCreated() calls ListUsers() which queries mysql.user WHERE host = '%'. So the admin user must have a host='%' row in mysql.user — even if the actual admin connection from 127.0.0.1:3306 matches a more specific host.
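The fallback chain in the first item above can be illustrated in shell. syncer itself is Go, so this is only a sketch of the described behavior: the function name is hypothetical, and treating an empty variable the same as an unset one is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the 3-tier credential fallback:
# Admin falls back to Root, Replication falls back to Admin.
resolve_syncer_credentials() {
  local root_user="${KB_SERVICE_USER:-}"
  local admin_user="${MYSQL_ADMIN_USER:-$root_user}"
  local repl_user="${MYSQL_REPLICATION_USER:-$admin_user}"
  printf 'root=%s admin=%s replication=%s\n' "$root_user" "$admin_user" "$repl_user"
}
```

With only KB_SERVICE_USER set, all three tiers resolve to the same user, which is the pre-alpha.66 situation where the fence executor and the HA executor were identical.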

alpha.66 v1 changes (chart-only; syncer source untouched):

cmpd-semisync.yaml env block additions (closes Jack 12:34 design-HOLD blockers 1 + 2):

MYSQL_ADMIN_USER: literal kb_internal_root (NOT a $(MARIADB_INTERNAL_ROOT_USER) env-substitution because the K8s env expansion order is not guaranteed; literal closes blocker 2).
MYSQL_ADMIN_PASSWORD: $(MARIADB_ROOT_PASSWORD) (shared with root password per the existing ensure_internal_local_admin pattern).
KB_SERVICE_USER / KB_SERVICE_PASSWORD unchanged (still bound to MARIADB_ROOT_USER / MARIADB_ROOT_PASSWORD); root remains the readiness-ping path so syncer startup is not blocked when kb_internal_root does not yet exist.

cmpd-semisync.yaml ensure_internal_local_admin SQL additions (closes Jack 12:39 design-ACCEPT tightening 3):

Existing kb_internal_root@localhost + @127.0.0.1 paths preserved verbatim (full GRANT ALL PRIVILEGES ... WITH GRANT OPTION; this is what syncer's AdminDB connection from 127.0.0.1:3306 actually authenticates against — host match priority falls on @127.0.0.1 first).
New detection-only kb_internal_root@'%' record: CREATE USER ... @'%' IDENTIFIED BY <pwd>; ALTER USER ... @'%' ACCOUNT LOCK; — intentionally zero GRANT statements. Required so syncer's IsAdminCreated() WHERE host='%' query can detect kb_internal_root and trigger the AdminDB swap, without expanding the remote attack surface (LOCK rejects remote auth; even if LOCK is somehow bypassed, the @'%' record has no privileges).

Chart.yaml bump 1.1.1-alpha.65 → 1.1.1-alpha.66 with the existing alpha.65 v1 immutability rationale comment block preserved and a new alpha.66 v1 block appended explaining the syncer HA executor swap rationale.

ShellSpec increments (268 examples / 0 failures, alpha.65 v2 → alpha.66 v1 = +10 net): renamed the alpha.65 v1 chart version bump test to assert the current literal 1.1.1-alpha.66. New Describe alpha.66 v1 syncer HA executor + chart bump with 10 examples in 4 contexts:

"chart bump for CmpD immutability" (2): chart version is exactly 1.1.1-alpha.66; appVersion still contains 11.4.10.
"syncer executor contract" (3): chart env contains MYSQL_ADMIN_USER literal kb_internal_root (NOT env-substitution); chart env contains MYSQL_ADMIN_PASSWORD referencing MARIADB_ROOT_PASSWORD; chart env still contains KB_SERVICE_USER referencing MARIADB_ROOT_USER (poll/readiness path unchanged).
"detection-only @'%' record contract" (4): ensure_internal_local_admin body creates @'%' with IDENTIFIED BY; body locks @'%' via ACCOUNT LOCK; body has zero GRANT to @'%' (negative awk-block scan inside the function); body retains GRANT ALL PRIVILEGES to @localhost AND @127.0.0.1 (internal exception preserved for syncer's 127.0.0.1 AdminDB connection).
"alpha.64+.65 contract no-regression spot-check" (1): invariant counts equal 1 1 4 1 16 2.

alpha.66 v1 caveats (carry forward + new):

Raw grep "GRANT ALL PRIVILEGES" rendered.yaml continues to report 1 hit at line 1920 (case pattern matcher inside the prestop watchdog block, NOT an active GRANT statement); the new ensure_internal_local_admin paths add 2 more hits which are the legitimate kb_internal_root@localhost + @127.0.0.1 internal exception.
Live-gate runtime acceptance (per Jack 12:39 design ACCEPT 3 tightening): new CmpD mariadb-semisync-1.1.1-alpha.66 Available; syncer log hits switch to admin db; local AdminDB proof CURRENT_USER() = kb_internal_root@127.0.0.1; SHOW CREATE USER 'kb_internal_root'@'%' contains ACCOUNT LOCK; SHOW GRANTS FOR 'kb_internal_root'@'%' returns USAGE only / no admin privilege; user-facing root @%, @localhost, @127.0.0.1 still no admin bypass; stable window 0 hit on Error 1227 (REPLICATION SLAVE ADMIN / READ_ONLY ADMIN), Take the leader failed, demote failed, RoleProbeNotDone; vcluster-only execution in idc/idc1/idc2/idc4 (per westonnnn 12:18 directive).
alpha.63 product code (alpha.63 v1+v2 verifier impl + grant_option_residual + I-2 multi-line 1044 probe extraction) remains STILL UNVALIDATED at runtime; alpha.66 fresh switchover/role-transition under-load N=1 will be the first runtime exercise that reaches alpha.63 verifier + alpha.64 v1 grant body + alpha.64 v2 caller propagation + alpha.64 v3 multi-word MONITOR fix + alpha.65 chart-bump rule + alpha.66 syncer HA executor swap together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → later scope.
Preserved scenes accumulated to 7 (alpha.62 + alpha.63 fresh-gatefix + alpha.64 v2-RED + alpha.64 v3 RED + alpha.65 v2-RED + alpha.64 CmpD orphan + alpha.65 CmpD orphan after alpha.66 install).
New risk (acknowledged in handoff): the production syncer image must include the IsRunning -> switch to admin db auto-switch logic. If live gate evidence does not show switch to admin db log appearance in the bounded window, the first-blocker upgrades to a syncer image / version mismatch (Class 4 sentinel), requiring the syncer team to upgrade the image before the alpha.66 v1 fix can be runtime-validated.

alpha.65 v2 fix scope (superseded by alpha.66 v1 above for the syncer HA executor privilege mismatch; alpha.65 v1+v2 chart bump path + Chart.yaml comment-grep removal preserved)

alpha.65 v2 fix scope (vs v1 commit `ea4e7aa0`)

v1 was caught at Jack's package-level review (Slock thread #mariadb:a09341b9 msg 721ad0a3, 11:45) for one ShellSpec example that depended on Chart.yaml comment text:

It "alpha.65 v1: Chart.yaml documents the CmpD immutability rationale [doc-marker]"
  When call grep -E "alpha.65 v1.*Jack 11:35.*live-gate RED" "${CHART_FILE}"
End

The test passed in the source-tree (where Chart.yaml retains the v1 root-cause comment block) but failed when ShellSpec was rerun inside an extracted package — helm package canonicalizes Chart.yaml (alphabetizes keys + removes blank lines/comments + strips quotes), so the source comment block was not in the in-package Chart.yaml. The package was therefore not internally self-consistent: the in-package spec file pointed at a comment that the in-package Chart.yaml did not have.

v2 fix: drop the doc-marker test (1 example removed; ShellSpec total 259 → 258). The CmpD-immutability rationale now lives in:

Source Chart.yaml comment block (visible to git users)
The Describe leading comment in replication_switchover_spec.sh (preserved verbatim in the in-package spec file because helm package does NOT canonicalize ShellSpec source files)
PR body
Slock handoff thread audit trail
Sediment doc backlog (committed after the alpha.65 fresh switchover N=1 GREEN cell)

Helen ship checklist gains a new step: before uploading any patch package, also run ShellSpec inside a fake-repo that mirrors the real repo layout (extract the package into <fakerepo>/addons/mariadb/..., copy .shellspec + shellspec/spec_helper.sh to <fakerepo>/, and run shellspec -I shellspec addons/mariadb/scripts-ut-spec/). Source-tree ShellSpec ≠ package-extraction ShellSpec for any test that depends on Chart.yaml or other helm-canonicalized chart metadata.
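The fake-repo step above can be sketched as a small helper. The function name and all paths are placeholders, and the sketch assumes the package unpacks into a top-level mariadb/ directory, as helm package archives named after the chart do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lay out an extracted chart package like the real repo.
# $1 = chart package (.tgz), $2 = real repo root, $3 = fake repo root
prepare_fake_repo() {
  local pkg="$1" repo_root="$2" fake_root="$3"
  mkdir -p "$fake_root/addons" "$fake_root/shellspec" || return 1
  # The archive's top-level directory (mariadb/) lands under addons/.
  tar -xzf "$pkg" -C "$fake_root/addons" || return 1
  cp "$repo_root/.shellspec" "$fake_root/.shellspec" || return 1
  cp "$repo_root/shellspec/spec_helper.sh" "$fake_root/shellspec/spec_helper.sh" || return 1
}

# Usage (placeholder paths):
#   prepare_fake_repo ./mariadb-<version>.tgz . /tmp/fakerepo
#   (cd /tmp/fakerepo && shellspec -I shellspec addons/mariadb/scripts-ut-spec/)
```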

ShellSpec increments (258 examples / 0 failures, alpha.65 v1 → v2 = -1 net): removed the v1 It "alpha.65 v1: Chart.yaml documents the CmpD immutability rationale [doc-marker]" example. Added a comment block above the Describe "alpha.65 v1 chart version bump for CmpD immutability" block explaining the removal rationale and where the documentation now lives.

Meta-observation (5th methodology sample for the sediment backlog): alpha.65 v1 HOLD is a ship-pipeline basis gap: the review-side artifact (extracted package) ≠ the author-side artifact (source tree) for tests that depend on Chart.yaml or other helm-canonicalized chart metadata. This is a different reverse-evidence dimension from the previous 4 (alpha.62/.63/.64v2 ShellSpec runtime-realism mock-coverage gap × 3 + alpha.64 v3 design-time CmpD-mutation-without-chart-bump assumption × 1). Cumulative sediment topic for after fresh switchover N=1 GREEN: "ShellSpec runtime-realism + design-time process check + ship-pipeline basis are three independent gap dimensions that all require explicit closure".

v2 caveats (carry forward unchanged from alpha.65 v1):

Raw grep "GRANT ALL PRIVILEGES" rendered.yaml reports 1 hit at line 1920, but that line is a case pattern matcher, NOT an active GRANT statement.
Live-gate runtime negative gate (alpha.64 v3 inherits + alpha.65 v1 new CmpD positive verify) is unchanged from alpha.65 v1.
alpha.63 product code remains STILL UNVALIDATED at runtime; alpha.65 fresh switchover/role-transition under-load N=1 will be the first runtime exercise that reaches alpha.63 verifier + alpha.64 v1 grant body + alpha.64 v2 caller propagation + alpha.64 v3 multi-word MONITOR fix together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → later scope.
alpha.64 v3 RED preserved scene unchanged (helm rev61 + alpha.64 CmpD orphan).

alpha.65 v1 fix scope (superseded by v2 above for the Chart.yaml comment-grep ShellSpec example removal; Chart.yaml chart version bump + cmpd-semisync.yaml content all preserved unchanged)

alpha.65 v1 fix scope (closes alpha.64 v3 install/script live-gate RED on KubeBlocks ComponentDefinition immutability)

alpha.64 v3 was caught at Jack's install/script live-gate (Slock thread #mariadb:a09341b9 msg d767bfb4, 11:35): all alpha.64 v3 static + ShellSpec gates passed at the package level (sha lock + helm upgrade rc=0 + chart applied), but the live cluster never started a fresh namespace because the bounded CmpD/CmpV gate observed mariadb-semisync-1.1.1-alpha.64 stay Unavailable: 3:3 for ~140 seconds with message="immutable fields can't be updated" and KB controller log directly reporting ComponentDefinition mariadb-semisync-1.1.1-alpha.64 ... immutable fields can't be updated.

5-layer elimination ran cleanly to the product/package-contract layer; first blocker = KubeBlocks treats the existing ComponentDefinition spec as immutable. alpha.64 v2 + v3 mutated templates/cmpd-semisync.yaml (which is the rendered ComponentDefinition spec body) under the same chart version 1.1.1-alpha.64, so on helm upgrade Helm could apply the manifest but the KB reconciler refused to update the existing CmpD object. The alpha.64 v3 multi-word MONITOR fix was therefore runtime-unvalidated: the fresh install was not even started.

The historical pattern that masked this: alpha.61 v2/v3, alpha.62 v2, alpha.63 v2 all reused the same chart version because they only mutated replication-switchover.sh (which lives inside a versioned ConfigMap mariadb-replication-scripts-1-1-1-alpha-N; the script CM data is mutable). alpha.64 v1 was the first within-cycle change to mutate cmpd-semisync.yaml directly, but it bumped the chart from .63 to .64 and so created a fresh CmpD. alpha.64 v2 + v3 stayed at .64 and triggered the immutability rejection. The ShellSpec coverage for the v2/v3 cycles never asserted the chart-version-bump-when-CmpD-mutates rule, so the package gate did not catch it; the live gate was the first observation.

alpha.65 v1 fixes (Chart.yaml + ShellSpec; cmpd-semisync.yaml unchanged):

Chart.yaml: bump version from 1.1.1-alpha.64 to 1.1.1-alpha.65. appVersion remains 11.4.10 because the mariadb engine version is unchanged; this bump is packaging-contract only. The version-line comment block now documents the CmpD immutability rationale and the rule that any future patch within an alpha cycle that mutates cmpd-*.yaml MUST bump the chart version (so that KubeBlocks creates a new CmpD object instead of trying to mutate an existing one).
cmpd-semisync.yaml content is preserved verbatim from alpha.64 v3 (sha 237eddbc42acc662329fd5b6a654633a80dce94756de4331af48db3c23d3999a). All alpha.64 v1 grant body alignment, alpha.64 v2 caller-side rc propagation + tier annotation + preStop fail-closed token, and alpha.64 v3 multi-word inline-quoted MONITOR list remain in place.
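
The bump rule can be expressed as a small guard. This is only a sketch under stated assumptions: cmpd_bump_ok and its four inputs are hypothetical, and a release script would have to collect them itself (for example a sha256 of the cmpd template and the version: field from Chart.yaml, for both the previous and the candidate package).

```sh
# Return 0 when the packaging contract holds, 1 when a CmpD template changed
# without a chart version bump. All four arguments are hypothetical inputs.
cmpd_bump_ok() {
  old_version=$1; old_sha=$2; new_version=$3; new_sha=$4
  if [ "$old_sha" != "$new_sha" ] && [ "$old_version" = "$new_version" ]; then
    echo "cmpd_mutated_without_chart_bump version=$new_version" >&2
    return 1
  fi
  return 0
}
```

A guard of this shape would have rejected the alpha.64 v2 and v3 packages at the package gate, since both changed the template while staying at 1.1.1-alpha.64.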

ShellSpec increments (259 examples / 0 failures, alpha.64 v3 → alpha.65 v1 = +4 net): new Describe alpha.65 v1 chart version bump for CmpD immutability with 4 examples — chart version: field is exactly 1.1.1-alpha.65 (positive literal); appVersion: field still contains 11.4.10 (mariadb engine version unchanged); Chart.yaml comment block contains the documentation marker alpha.65 v1.*Jack 11:35.*live-gate RED (so future readers see the rule at the version-bump site); cmpd-semisync.yaml still contains the alpha.64 v3 root-cause comment marker (proves the CmpD spec content is preserved verbatim from alpha.64 v3).

alpha.65 v1 caveats (carry forward unchanged from alpha.64 v1+v2+v3):

Raw grep "GRANT ALL PRIVILEGES" rendered.yaml reports 1 hit at line 1920, but that line is a case pattern matcher, NOT an active GRANT statement.
Live-gate runtime negative gate continues to watch for standalone single-word tokens, the v2 preStop fail-closed token, and the v2 caller-propagation Tier B fail-closed tokens; alpha.65 v1 adds the new positive expectation that kubectl get cmpd mariadb-semisync-1.1.1-alpha.65 becomes Available within the bounded gate window.
alpha.63 product code (alpha.63 v1+v2 verifier impl + grant_option_residual + I-2 multi-line 1044 probe extraction) remains STILL UNVALIDATED at runtime. alpha.65 fresh switchover/role-transition under-load N=1 will be the first runtime exercise that reaches alpha.63 verifier + alpha.64 v1 grant body + alpha.64 v2 caller propagation + alpha.64 v3 multi-word MONITOR fix together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → later scope.
alpha.64 v3 RED preserved scene: helm revision 61 with mariadb-semisync-1.1.1-alpha.64 Unavailable orphan is left as evidence of the immutability gap. Fresh alpha.65 install creates a new mariadb-semisync-1.1.1-alpha.65 CmpD; the alpha.64 orphan does not block the new install and can be optionally cleaned up later via kubectl delete cmpd mariadb-semisync-1.1.1-alpha.64.

alpha.64 v3 fix scope (superseded by alpha.65 v1 above for the chart-version bump triggered by KubeBlocks ComponentDefinition immutability; alpha.64 v1+v2+v3 cmpd-semisync.yaml content all preserved verbatim in alpha.65)

alpha.64 v3 fix scope (vs v2 commit `73072452`)

v2 was caught at Jack's install/script live-gate (Slock thread #mariadb:a09341b9 msg 2777e671, 11:14): fresh install passed sha lock + helm upgrade rc=0 + ComponentVersion + 3 alpha.64 CmpD all Available + script CM RV bumped + every static negative grep clean, but the live cluster never converged: pods were 2/2 Ready, yet Cluster/Component stayed in Creating with Component condition Healthy=False reason=RoleProbeNotDone. prestop-watchdog.log showed a tight loop of local-root-optional-privilege privilege=BINLOG ... rc=1 tier=monitor-best-effort 1227_swallowed=true, then privilege=MONITOR, privilege=SLAVE, and privilege=MONITOR (4 entries per round); root grants only contained BINLOG MONITOR (no SLAVE MONITOR); mariadb logs reported repeated Error 1227 ... need SLAVE MONITOR from SHOW SLAVE STATUS, breaking promote/demote.

5-layer elimination ran cleanly to the product layer: install/script gate front-half passed (Helm/CmpV/CmpD/script CM all healthy); first blocker = product / addon templates/cmpd-semisync.yaml runtime shell contract. The CMPD_OPTIONAL_MONITOR_PRIVS="BINLOG MONITOR SLAVE MONITOR" string contains MULTI-WORD privilege names but was iterated via unquoted for privilege in ${CMPD_OPTIONAL_MONITOR_PRIVS} at both grant_optional_local_root_privileges and grant_optional_remote_root_privileges. An unquoted parameter expansion in a POSIX for loop undergoes field splitting on IFS (whitespace), yielding 4 single-word tokens (BINLOG / MONITOR / SLAVE / MONITOR); GRANT BINLOG ON *.* ... is invalid SQL, so root never acquired the actual SLAVE MONITOR privilege. v1/v2 ShellSpec strong-bound the constant value (grep CMPD_OPTIONAL_MONITOR_PRIVS= saw the full multi-word string) but never asserted the loop expansion behavior; the runtime gate filled that ShellSpec runtime-realism mock-coverage gap.

v3 fixes (cmpd-semisync.yaml only; v1 grant body alignment + v2 caller propagation + tier annotation + preStop fail-closed token all preserved unchanged):

Both grant_optional_local_root_privileges (line 664) and grant_optional_remote_root_privileges (line 810) now iterate the inline quoted list for privilege in "BINLOG MONITOR" "SLAVE MONITOR"; do ... ; done. POSIX for with quoted args preserves multi-word tokens.
The CMPD_OPTIONAL_MONITOR_PRIVS constant is retained for v1 ShellSpec strong-bind (focal #4: alpha.64 v1 contracts not regressed), but its declaration block adds an extensive root-cause comment explaining the IFS-splitting trap and pointing to the inline-quoted-list pattern at the callsites; the constant is now documentation + ShellSpec strong-bind only.
Per-callsite docstrings updated with a v3 note pointing to the constant block.
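
The IFS-splitting trap and the inline-quoted fix can be reproduced in isolation. The sketch below uses a stand-in variable PRIVS and two illustrative function names; it prints tokens instead of issuing GRANT statements.

```sh
PRIVS="BINLOG MONITOR SLAVE MONITOR"

# Unquoted expansion: field splitting yields four single-word tokens.
split_unquoted() {
  for privilege in ${PRIVS}; do
    printf '[%s]' "$privilege"
  done
}

# Inline quoted list: each multi-word privilege name stays one token.
split_quoted() {
  for privilege in "BINLOG MONITOR" "SLAVE MONITOR"; do
    printf '[%s]' "$privilege"
  done
}

split_unquoted; echo   # [BINLOG][MONITOR][SLAVE][MONITOR]
split_quoted; echo     # [BINLOG MONITOR][SLAVE MONITOR]
```

Quoting the variable as "${PRIVS}" would not help either, because it would produce one four-word token; the inline list is what keeps exactly two privilege names.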

ShellSpec increments (255 examples / 0 failures, alpha.64 v2 → v3 = +6 net): new Describe alpha.64 v3 cmpd-semisync multi-word MONITOR priv loop with 6 examples in 3 contexts — Context "no unquoted CMPD_OPTIONAL_MONITOR_PRIVS for-loop residual" (2 examples: negative grep for both braced ${...} and no-brace $... variants in active code, with comment lines stripped so the v3 root-cause docstring is allowed); Context "inline quoted MONITOR list at both callsites" (2 examples: per-function awk-extract function body with comment lines stripped, asserts the v3 inline-quoted pattern present and the v1/v2 unquoted-loop pattern absent); Context "live-gate runtime negative gate documentation" (2 examples: documentation marker for the v3 root-cause comment + a contract-no-regression spot-check that asserts the v1+v2 invariant token counts equal 1 1 4 1 16 — CMPD_EXPLICIT_PRIMARY_GRANT_BODY= + CMPD_SECONDARY_FENCE_GRANT_BODY= + if ! set_replica_read_only × 4 + prestop_lock_failed_both fail_closed=true tier=required + 16 tier-annotated swallow lines).

v3 caveats (unchanged from v1+v2):

Raw grep "GRANT ALL PRIVILEGES" rendered.yaml reports 1 hit at line 1920, but that line is a case pattern matcher, NOT an active GRANT statement.
Live-gate runtime negative gate now also watches for standalone single-word tokens (privilege=BINLOG / privilege=MONITOR / privilege=SLAVE followed by label=) in prestop-watchdog.log; only the multi-word success entries privilege=BINLOG MONITOR ... rc=0 and privilege=SLAVE MONITOR ... rc=0 should appear.
alpha.63 product code (alpha.63 v1+v2 verifier impl + grant_option_residual + I-2 multi-line 1044 probe extraction) remains STILL UNVALIDATED at runtime; alpha.64 fresh switchover/role-transition under-load N=1 will be the first runtime exercise that reaches alpha.63 verifier + alpha.64 v1 grant body + alpha.64 v2 caller propagation + alpha.64 v3 multi-word MONITOR fix together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → alpha.65+ scope.
v2-RED preserved scene mariadb-alpha64-livegate-110546 is now retained as the runtime evidence baseline for the v3 multi-word fix to compare against.
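
A sketch of the single-word negative gate, assuming the log line layout implied by the tokens quoted above (privilege=<name> immediately followed by label=). The function name single_word_priv_gate is hypothetical.

```sh
# Return 0 (gate clean) when no standalone single-word privilege token appears.
# Multi-word entries such as "privilege=BINLOG MONITOR label=..." do not match.
single_word_priv_gate() {
  log=$1
  if grep -E -q 'privilege=(BINLOG|MONITOR|SLAVE) label=' "$log"; then
    return 1
  fi
  return 0
}

# Usage (not executed here): single_word_priv_gate prestop-watchdog.log
```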

alpha.64 v2 fix scope (superseded by v3 above for the multi-word MONITOR priv shell-splitting; v2 caller-side rc propagation + tier annotation + preStop fail-closed token all preserved)

alpha.64 v2 fix scope (vs v1 commit `222d36bf`)

v1 was caught at Jack's package-level review (Slock thread #mariadb:a09341b9 msg 0b5b4556, 10:32) for two contract gaps that v1 left at the caller side even though the helper-side return semantics were correctly tightened by the grant-body alignment:

Blocker 1 (Tier B required LOCK swallowed by caller || true): v1 enforced grant body alignment inside set_local_root_account_state / set_remote_root_account_state / lock_local_root_for_prestop and the helpers correctly returned 1 on grant failure, but the caller path inside set_replica_read_only, keep_replica_pending_until_healthy, expose_sql_listener_for_safe_role, publish_replica_after_rejoin_ready, configure_replication_from_primary_service_once, and reconcile_sql_listener_for_syncer_secondary_once still wrapped them in || true. So a 1227-fenced grant (or any other lock failure) was logged structurally but did not stop ready / role / sql-listener publish. This violated the 10:07 Tier B contract: required LOCK/UNLOCK/prestop failure must return 1 AND caller must check rc.
Blocker 2 (preStop double-failure masked by trailing || true): lock_local_root_for_prestop "prestop" "socket" || lock_local_root_for_prestop "prestop" "tcp" || true swallowed double-failure. preStop already removes the ready marker so the impact is bounded, but the contract-level wording in v1 handoff said "Tier B required → caller checks rc" and the code did not.

v2 fixes (cmpd-semisync.yaml only; v1 grant body alignment is preserved unchanged):

Tier B caller propagation:

set_replica_read_only body now tracks rc across the three required steps (remote LOCK, fail-closed read_only, local LOCK) and returns 1 on any failure, logging the structured token tier=required fail_closed=true.
keep_replica_pending_until_healthy body uses the same pattern; existing if ! callers automatically propagate.
expose_sql_listener_for_safe_role checks required local LOCK + fail-closed read_only via if ! ...; then return 1; fi BEFORE touch .sql-listener-ready.
publish_replica_after_rejoin_ready replaces both set_replica_read_only || true callsites with if ! set_replica_read_only; then return 1; fi; mark_replication_ready is reached only after every required step succeeded.
configure_replication_from_primary_service_once checks set_replica_read_only at function entry via if ! and returns 1 on failure.
reconcile_sql_listener_for_syncer_secondary_once checks set_replica_read_only via if ! and marks pending + returns 1 on failure BEFORE any path can reach mark_replication_ready.
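
The caller-propagation pattern can be sketched with stubs. Everything suffixed _sketch and the three stub helpers are hypothetical stand-ins; running all three steps before returning is an assumption about what "tracks rc across the three required steps" means in the real body.

```sh
# Stubs standing in for the real helpers; each returns the rc held in a variable.
REMOTE_LOCK_RC=0; READ_ONLY_RC=0; LOCAL_LOCK_RC=0
lock_remote_root() { return "$REMOTE_LOCK_RC"; }
force_read_only()  { return "$READ_ONLY_RC"; }
lock_local_root()  { return "$LOCAL_LOCK_RC"; }

# Track rc across the three required steps; any failure is reported to the caller.
set_replica_read_only_sketch() {
  rc=0
  lock_remote_root || rc=1
  force_read_only  || rc=1
  lock_local_root  || rc=1
  if [ "$rc" -ne 0 ]; then
    echo "set_replica_read_only tier=required fail_closed=true" >&2
    return 1
  fi
  return 0
}

# Caller checks rc BEFORE publishing anything.
publish_sketch() {
  if ! set_replica_read_only_sketch; then
    return 1
  fi
  echo "ready"   # stands in for mark_replication_ready
}
```

With the v1 shape (set_replica_read_only || true), publish_sketch would print ready even when a required step failed.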

preStop double-failure:

Replaces the socket || tcp || true chain with explicit if ! lock_local_root_for_prestop "prestop" "socket"; then if ! lock_local_root_for_prestop "prestop" "tcp"; then prestop_log "prestop_lock_failed_both fail_closed=true tier=required"; fi; fi block. Live-gate runtime negative gate watches for the new prestop_lock_failed_both fail_closed=true tier=required token (must NOT appear in healthy install windows).
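
The difference between the two preStop shapes is easy to see with a stubbed lock helper. The stub and the prestop_v1 / prestop_v2 wrapper names are illustrative only; the chain and the log token are the ones quoted above.

```sh
SOCKET_RC=1; TCP_RC=1
lock_local_root_for_prestop() {   # stub: $2 selects the transport
  case $2 in
    socket) return "$SOCKET_RC" ;;
    tcp)    return "$TCP_RC" ;;
  esac
}
prestop_log() { echo "$*"; }

# v1 shape: the trailing `|| true` hides a double failure.
prestop_v1() {
  lock_local_root_for_prestop "prestop" "socket" || lock_local_root_for_prestop "prestop" "tcp" || true
}

# v2 shape: a double failure emits the fail-closed token.
prestop_v2() {
  if ! lock_local_root_for_prestop "prestop" "socket"; then
    if ! lock_local_root_for_prestop "prestop" "tcp"; then
      prestop_log "prestop_lock_failed_both fail_closed=true tier=required"
    fi
  fi
}
```

With both transports failing, prestop_v1 prints nothing while prestop_v2 prints the token that the live gate greps for.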

Tier annotation auditable list (per Jack 10:38 review-checkpoint 3):

Every allowed lock_(local|remote)_root_writes ... || true callsite carries an inline # tier=startup-defensive|error-recovery|fail-path-defensive|monitor-best-effort annotation. Total 16 annotated callsites distributed as: 4 startup-defensive (pre-role-decision and wait-primary-loop-entry), 4 error-recovery (fresh-health-check repair paths and SQL thread start failures), 8 fail-path-defensive (already-failing primary write gate, no-primary existing datadir, GTID divergence, after-expose-not-healthy in publish_replica). The pattern lets git grep tier= audit the full list with a single command.
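
One way to audit the annotation invariant outside ShellSpec is a two-stage grep. This is a sketch, not the spec's own check; unannotated_swallows is a hypothetical name and the file argument would be the rendered manifest or the template.

```sh
# Print lock_(local|remote)_root_writes `|| true` lines that lack a tier
# annotation. Exit status 0 means every such callsite is annotated.
unannotated_swallows() {
  file=$1
  if grep -E 'lock_(local|remote)_root_writes.*[|][|] true' "$file" \
      | grep -E -v '# tier=(startup-defensive|error-recovery|fail-path-defensive|monitor-best-effort)'; then
    return 1
  fi
  return 0
}
```

The second grep prints the offending lines, so a failing run doubles as the list of callsites to fix.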

ShellSpec increments (249 examples / 0 failures, alpha.64 v1 → v2 = +12 net): new Describe alpha.64 v2 cmpd-semisync Tier B caller propagation contract with 12 examples — Context "tier annotation auditable list" (4 examples: tier annotation required on every required-pattern || true line, NO set_replica_read_only || true residual, NO lock_local_root_for_prestop ... || true residual, tier annotation count == required-pattern || true count invariant); Context "Tier B caller-side rc propagation pattern" (6 examples covering the 6 caller functions, asserting each function body contains the new if ! ...; return 1 pattern AND does NOT contain the v1 swallow patterns); Context "preStop double-failure fail-closed token" (2 examples: rendered manifest contains the prestop_lock_failed_both fail_closed=true tier=required literal, and the preStop block uses if ! lock_local_root_for_prestop instead of trailing || true).

v2 caveats (unchanged from v1):

Raw grep "GRANT ALL PRIVILEGES" rendered.yaml reports 1 hit at line 1920, but that line is a case pattern matcher (*"GRANT ALL PRIVILEGES"*)) inside the prestop watchdog error-classification block, NOT an active GRANT statement.
The live-gate runtime negative gate on prestop-watchdog.log, covering both the v1 admin-bypass-free behavior and the new v2 prestop_lock_failed_both token, is to be evidenced during the install/script live gate.
alpha.63 product code (alpha.63 v1+v2 verifier impl + grant_option_residual + I-2 multi-line 1044 probe extraction) is STILL UNVALIDATED at runtime; alpha.64 fresh switchover/role-transition under-load N=1 will be the first runtime exercise that reaches both alpha.63 verifier and the alpha.64 v1+v2 cmpd-side fix together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → alpha.65+ scope.

alpha.64 v1 fix scope (superseded by v2 above for caller-side rc propagation; v1 grant body alignment preserved unchanged)

alpha.64 v1 fix scope (closes alpha.63 fresh-gatefix switchover N=1 RED root cause)

The alpha.63 fresh-gatefix switchover N=1 cell came back RED with the alpha.63 v1+v2 verifier reporting bypass_priv_residual against user-facing root for SUPER / READ_ONLY ADMIN / BINLOG ADMIN / CONNECTION ADMIN / REPLICATION SLAVE ADMIN / REPLICATION MASTER ADMIN, despite the switchover-side and roleProbe-side fence both having been tightened across alpha.59-.63 to grant only the non-admin-bypass minimum list.

5-layer elimination ran cleanly to the product layer, with the smoking gun in two read-only evidence sources (Slock thread #mariadb:a09341b9 Jack closeout msg 2219dcb5):

prestop-watchdog.log 8 lines at 01:25:11-13Z: local-root-account-UNLOCK mode=full-access label=primary-read-write host=127.0.0.1 rc=0 followed by 7 local-root-optional-privilege lines for the exact admin bypass privs the verifier later observed
6-sample SHOW GRANTS timeline 01:53:19Z→01:53:31Z: root@127.0.0.1 and root@localhost consistently GRANT ALL PRIVILEGES WITH GRANT OPTION; root@% mostly explicit primary list but sample=2 transiently GRANT ALL PRIVILEGES — confirming the sql-listener-fence reconcile loop persistently re-grants, NOT a transient flap

First blocker = product / addon templates/cmpd-semisync.yaml runtime UNLOCK / LOCK / prestop-fence paths re-grant admin bypass to user-facing root via 7 callsite functions. The "user-facing root contains no admin bypass" contract (committed in alpha.59 onward) had actually never been enforced in the cmpd-yaml runtime path; switchover-side and roleProbe-side were tightened, but cmpd-yaml runtime sql-listener-fence reconcile loop persistently re-granted them back. Earlier alpha.59-.62 verifiers were not fine-grained enough to observe; alpha.63 v1+v2 verifier finally observed → first fail-closed → root cause exposed.

alpha.64 v1 fixes (cmpd-semisync.yaml only — templates/cmpd-replication.yaml and templates/cmpd-galera.yaml were scanned and contain no GRANT/REVOKE SQL, scope confirmed):

Three new constants in templates/cmpd-semisync.yaml (line 153-176 region, manually aligned to the equivalent constants in scripts/replication-switchover.sh since helm template cannot directly source the script):

CMPD_EXPLICIT_PRIMARY_GRANT_BODY — equivalent to SWITCHOVER_EXPLICIT_PRIMARY_GRANT_BODY (SELECT/INSERT/UPDATE/DELETE/CREATE/DROP plus the standard primary-role privs through CREATE USER, including REPLICATION MASTER ADMIN for follow + status repair)
CMPD_SECONDARY_FENCE_GRANT_BODY — equivalent to SWITCHOVER_SECONDARY_FENCE_GRANT_BODY (SELECT/PROCESS/RELOAD/REPLICATION SLAVE/REPLICATION CLIENT/REPLICATION MASTER ADMIN — no admin bypass)
CMPD_OPTIONAL_MONITOR_PRIVS — BINLOG MONITOR SLAVE MONITOR only (read-only legitimate observability)

7 callsite alignments (all on user-facing root account class; kb_internal_root maintenance executor remains legit with full ALL PRIVILEGES exception):

| # | line | function | violation before | fix after |
|---|------|----------|------------------|-----------|
| 1 | 590 | `grant_optional_local_root_privileges` | 7-priv loop including admin bypass | `${CMPD_OPTIONAL_MONITOR_PRIVS}` only |
| 2 | 622 | `set_local_root_account_state LOCK` | GRANT … SUPER … | `${CMPD_SECONDARY_FENCE_GRANT_BODY}` (drops SUPER) |
| 3 | 633 | `set_local_root_account_state UNLOCK` | GRANT ALL PRIVILEGES | `${CMPD_EXPLICIT_PRIMARY_GRANT_BODY}` |
| 4 | 672 | `set_remote_root_account_state LOCK` | GRANT … SUPER … | `${CMPD_SECONDARY_FENCE_GRANT_BODY}` |
| 5 | 683 | `set_remote_root_account_state UNLOCK` | GRANT ALL PRIVILEGES | `${CMPD_EXPLICIT_PRIMARY_GRANT_BODY}` |
| 6 | 696 | `grant_optional_remote_root_privileges` | 5-priv loop including admin bypass | `${CMPD_OPTIONAL_MONITOR_PRIVS}` only |
| 7 | 1564-1587 | `lock_local_root_for_prestop` | GRANT … SUPER … | inline literal grant body equivalent to CMPD_SECONDARY_FENCE_GRANT_BODY (preStop hook runs in an independent `/bin/sh -c` shell scope and cannot reuse the main shell's variables) |

Tier A vs Tier B failure semantics (per Jack 10:05 contract tightening):

Tier A (best-effort monitor) — callsites 1 and 6 (grant_optional_*_root_privileges): per-priv grant failure logs prestop_watchdog_log "...rc=1 tier=monitor-best-effort 1227_swallowed=true" and continues the loop; caller returns 0; ready / role marker is still written.
Tier B (required, fail_closed) — callsites 2/3/4/5/7 (LOCK / UNLOCK / prestop fence): grant failure logs prestop_watchdog_log "...rc=1 tier=required 1227_swallowed=true fail_closed=true" and return 1; caller checks rc != 0 and does NOT publish ready / role marker.
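
The two tiers differ only in what happens after the log line. A minimal sketch with a stubbed grant call; run_grant, GRANT_RC and the _sketch function names are hypothetical, while the log tokens are the ones specified above.

```sh
GRANT_RC=1
run_grant() { return "$GRANT_RC"; }          # stub for the real SQL call
prestop_watchdog_log() { echo "$*" >&2; }

# Tier A: log and keep going; the caller always sees success.
grant_optional_sketch() {
  for privilege in "BINLOG MONITOR" "SLAVE MONITOR"; do
    if ! run_grant "$privilege"; then
      prestop_watchdog_log "privilege=$privilege rc=1 tier=monitor-best-effort 1227_swallowed=true"
    fi
  done
  return 0
}

# Tier B: log and return 1 so the caller can refuse to publish ready / role.
lock_required_sketch() {
  if ! run_grant "fence"; then
    prestop_watchdog_log "rc=1 tier=required 1227_swallowed=true fail_closed=true"
    return 1
  fi
  return 0
}
```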

scripts-ut-spec/semisync_rejoin_fence_template_spec.sh also updated (2 pre-existing tests on the rejoin fence template, lines 104-114) to assert the new alpha.64 grant body without SUPER, aligning with the new contract.

ShellSpec increments (237 examples / 0 failures, alpha.63 v2 → alpha.64 v1 = +9 net): new Describe alpha.64 v1 cmpd-semisync grant body contract with 8 examples — (1) CMPD_EXPLICIT_PRIMARY_GRANT_BODY strong-bind contains SELECT/INSERT/UPDATE/DELETE/CREATE/DROP/RELOAD but not SUPER/READ_ONLY ADMIN/BINLOG ADMIN/CONNECTION ADMIN/ALL PRIVILEGES; (2) CMPD_SECONDARY_FENCE_GRANT_BODY strong-bind contains the non-admin-bypass minimum list but not SUPER/INSERT/UPDATE/DELETE/ALL PRIVILEGES; (3) CMPD_OPTIONAL_MONITOR_PRIVS strong-bind contains BINLOG MONITOR + SLAVE MONITOR only, no admin bypass; (4) negative awk-block analysis on rendered cmpd-semisync.yaml grant blocks excludes user-facing root host admin bypass privs (skipping MARIADB_INTERNAL_ROOT_USER context within ±30 lines so the kb_internal_root maintenance executor is not over-matched); (5) kb_internal_root positive allowlist — GRANT ALL PRIVILEGES and admin grants for kb_internal_root remain visible (legitimate maintenance executor exception); (6) MONITOR positive allowlist — BINLOG MONITOR + SLAVE MONITOR remain visible in the rendered manifest (read-only observability); (7) Tier A log token grep — tier=monitor-best-effort 1227_swallowed=true token present at the matching callsites; (8) Tier B log token grep — tier=required 1227_swallowed=true fail_closed=true token present at LOCK / UNLOCK / prestop fence callsites; plus 2 updated tests in semisync_rejoin_fence_template_spec.sh aligning to the new grant body without SUPER.

v1 caveats:

Raw grep "GRANT ALL PRIVILEGES" rendered.yaml reports 1 hit at line 1920, but that line is a case pattern matcher (*"GRANT ALL PRIVILEGES"*)) inside the prestop watchdog error-classification block, NOT an active GRANT statement. ShellSpec uses awk-block analysis instead of raw grep so the false positive does not mask the real contract; both behaviors are documented in the spec example comments.
Live-gate runtime negative gate on prestop-watchdog.log for the new admin-bypass-free behavior is to be evidenced during the install / script live gate (handoff does NOT include live evidence; live gate to follow ACCEPT).
alpha.63 product code (alpha.63 v1+v2 verifier impl + grant_option_residual + I-2 multi-line 1044 probe extraction) is STILL UNVALIDATED at runtime (alpha.63 fresh-gatefix switchover N=1 RED'd before exercising the verifier write path); alpha.64 fresh switchover/role-transition under-load N=1 will be the first runtime exercise that reaches both alpha.63 verifier and alpha.64 cmpd-side fix together.
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe (independently attached, NOT covered by alpha.64 ACCEPT).
DRIFT D upgraded from "alpha.65+ scope" to "alpha.64 v1 main fix" — no longer carry-forward.
alpha.61 v3 prepare/fence inner SQL helper budget caveat → continues into alpha.65+ scope.
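
The raw-grep false positive from the first caveat can be reproduced and filtered as follows. This is a simplified stand-in for the spec's awk-block analysis, not a copy of it; active_grant_all_count is a hypothetical helper that skips comment lines and shell case pattern labels before counting.

```sh
# Count active GRANT ALL PRIVILEGES lines, ignoring comment lines and shell
# `case` pattern labels such as  *"GRANT ALL PRIVILEGES"*)
active_grant_all_count() {
  awk '
    /^[ \t]*#/ { next }                    # skip comments
    /^[ \t]*\*?"[^"]*"\*?\)/ { next }      # skip case pattern labels
    /GRANT ALL PRIVILEGES/ { n++ }
    END { print n + 0 }
  ' "$1"
}
```

A plain grep -c on the same file counts the case label and any comment mention as hits, which is why the caveat has to be carried alongside every raw-grep report.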

alpha.63 v2 fix scope (superseded by alpha.64 v1 above for the cmpd-side runtime grant body alignment; alpha.63 v1+v2 verifier impl + grant_option_residual contract retained — STILL UNVALIDATED at runtime)

alpha.63 v2 fix scope (vs v1 commit `423703eb`)

v1 was caught at Jack's package-level review (Slock thread #mariadb:a09341b9 msg 4cfdd261, 08:36) for one contract field that the v1 implementation left unenforced at the verifier read site:

The 05:26 design contract said: "non-proxy WITH GRANT OPTION must fail-closed". v1 only removed GRANT OPTION literal token from SWITCHOVER_USER_FACING_WRITE_PATTERN and added the line-anchored proxy whitelist SWITCHOVER_GRANTS_IGNORED_LINE_PATTERN — closing the false-RED on the default GRANT PROXY ... WITH GRANT OPTION row, but a SELECT-only-with-GRANT-OPTION input like GRANT SELECT ON *.* TO 'root'@'%' WITH GRANT OPTION would now false-PASS:

not whitelisted (doesn't match ^GRANT PROXY ON .*)
SELECT not in user-facing-write pattern → write_residual empty
no SUPER/READ_ONLY ADMIN/etc. → bypass_residual empty
verifier returned ok_by_grants_only

The v1 ShellSpec example for "non-proxy WITH GRANT OPTION must fail-closed" used GRANT INSERT, UPDATE ... WITH GRANT OPTION, which fail-closes via INSERT/UPDATE in the write residual scan — so the WITH-GRANT-OPTION-as-bypass-token semantic was never actually exercised.

v2 fix:

New grant_option_residual check in _verify_host_is_fenced AFTER the proxy whitelist filter and AFTER the user-facing-write residual check. The check awks for any line containing literal WITH GRANT OPTION (with leading space — the trailing clause marker). Since PROXY rows have already been removed by the whitelist filter, any remaining match is non-proxy → fail-closed with a distinct sentinel reason=grant_option_residual (NOT folded into bypass_priv_residual, so closeout can grep specifically for this token-level violation vs priv-name-level violations). Structured log adds grants_bypass=GRANT_OPTION field plus a separate grant_option_residual_dump_begin/end block dumping the offending lines.
Short-circuit order locked (per Jack 08:37 ACK): bypass_priv_residual (admin bypass priv names) → bypass_priv_residual:<write_priv> (INSERT/UPDATE/...) → grant_option_residual (WITH GRANT OPTION clause). ShellSpec asserts this precedence so a real GRANT INSERT WITH GRANT OPTION input still produces bypass_priv_residual:INSERT,UPDATE (NOT grant_option_residual), preserving alpha.63 v1 semantics for that case while v2 catches the WITH-GRANT-OPTION-only edge case.
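
The locked precedence can be sketched as a single classifier over SHOW GRANTS output. The pattern lists below are shortened illustrations, not the real SWITCHOVER_* constants, and classify_grants is a hypothetical name; the real verifier also logs the matched privilege names rather than a fixed suffix.

```sh
IGNORED='^GRANT PROXY ON .* TO .* WITH GRANT OPTION$'
BYPASS='SUPER|READ_ONLY ADMIN|BINLOG ADMIN|CONNECTION ADMIN'   # illustrative subset
WRITE='INSERT|UPDATE|DELETE|CREATE|DROP|ALTER'                 # illustrative subset

# Read SHOW GRANTS output on stdin and print a single reason token.
classify_grants() {
  grants=$(grep -E -v "$IGNORED" || :)     # whitelist filter runs first
  if printf '%s\n' "$grants" | grep -E -q "$BYPASS"; then
    echo "bypass_priv_residual"
  elif printf '%s\n' "$grants" | grep -E -q "$WRITE"; then
    echo "bypass_priv_residual:write_priv"
  elif printf '%s\n' "$grants" | grep -q ' WITH GRANT OPTION'; then
    echo "grant_option_residual"
  else
    echo "ok_by_grants_only"
  fi
}
```

Because the GRANT OPTION check comes last, a grant that carries both a write privilege and WITH GRANT OPTION is still reported under the write-privilege reason, matching the precedence asserted by ShellSpec.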

ShellSpec increments (228 examples / 0 failures, alpha.63 v1 → v2 = +2 net): NEW GRANT SELECT ... WITH GRANT OPTION (no write priv name + GRANT OPTION clause) → fail-closed reason=grant_option_residual + grants_bypass=GRANT_OPTION + grant_option_residual_dump assertions (Jack reproducer); NEW short-circuit precedence lock — GRANT INSERT, UPDATE ... WITH GRANT OPTION still hits bypass_priv_residual:INSERT,UPDATE, NOT grant_option_residual.

v2 caveats unchanged from v1: DRIFT D / alpha.61 v3 inner SQL helper budget caveat / pod0 admin bypass residual carry-forward / alpha.61 process miss preserved at 5a6390fe. Cadence-discipline candidate topic restated as "process-discipline + runtime-validation are independent dimensions".

alpha.63 v1 fix scope (superseded by v2 above for the GRANT OPTION token semantic; v1 I-1 + I-2 fixes retained)

alpha.63 v1 fix scope (closes alpha.62 switchover N=1 RED)

alpha.62 switchover/role-transition under-load N=1 came back RED at the pre-DCS verifier (Slock thread #mariadb:a09341b9 Jack closeout msg e89a5559):

Switchover failed: could not fence current primary remote root before DCS switchover

5-layer elimination narrowed the fault to the product layer; first blocker = product / addon switchover action verifier implementation drift (NOT contract drift; alpha.62 design contract was correct, two implementation bugs in the new verifier escaped both ShellSpec coverage and 8-class XP review because they only surface against runtime-realism inputs):

I-1: SWITCHOVER_USER_FACING_WRITE_PATTERN included GRANT OPTION literal token, which over-matched the default GRANT PROXY ON ''@'%' TO 'root'@'%' WITH GRANT OPTION row that mariadb auto-creates and survives REVOKE ALL PRIVILEGES (PROXY priv is in a separate priv class). Verifier reported bypass_priv_residual:GRANT OPTION for all hosts (% / localhost) even though the actual fence main grant was clean.
I-2: _local_root_write_probe_127 returned printf '%s|%s|%s' rc errno out on stdout. When out contained the multi-line SQL stderr that mariadb client emits, the caller's cut -d'|' -f2 returned 1044\n<line2-of-stderr> (cut operates per line; lines 2+ have no |, so cut returned the whole line for field 2). The case-statement against this multi-line value never matched the 1044|1290|1142 literals → real priv-based fence misclassified as probe_account_mismatch.
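
The I-2 failure mode can be reproduced without a database. In this sketch the error text is canned and classify_errno with its priv_fenced label is a hypothetical stand-in for the real case statement; only the printf format, the cut invocation, and the errno literals come from the description above.

```sh
# A probe result joined with '|' whose third field spans two lines.
result=$(printf '%s|%s|%s' 1 1044 "ERROR 1044 (42000): Access denied
for user 'root'@'127.0.0.1'")

# cut works per line: the second line has no '|', so it is passed through whole.
errno=$(printf '%s\n' "$result" | cut -d'|' -f2)

classify_errno() {
  case $1 in
    1044|1290|1142) echo "priv_fenced" ;;
    *)              echo "probe_account_mismatch" ;;
  esac
}

classify_errno "$errno"   # the two-line value misses every literal
classify_errno 1044       # the value the caller expected to see
```

cut -s would suppress the delimiter-less line, but it would still leave the caller dependent on the shape of the client's stderr.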

Direct evidence: kbagent action cost 2.216s (NOT 60s cap; no alpha.61 v3 sentinel hit; only entered prepare stage); writer double_writable=0 (race not surfaced; fence semantics correct; the action just fail-closed at the verifier check); data 33/561 vs 33/561 (no data loss; pod0=secondary, pod1=primary); all three perspectives consistent ✓.

alpha.63 v1 fixes (per Jack 05:24 instrumentation tightening):

I-1 fix: GRANT OPTION token REMOVED from SWITCHOVER_USER_FACING_WRITE_PATTERN (it was a trailing modifier, not a priv name; remaining tokens INSERT/UPDATE/DELETE/CREATE/DROP/ALTER/CREATE USER are unambiguous priv names). Defense-in-depth: line-anchored whitelist constant SWITCHOVER_GRANTS_IGNORED_LINE_PATTERN='^GRANT PROXY ON .* TO .* WITH GRANT OPTION$' is applied BEFORE the bypass / write residual scan via three independent helpers (_filter_grants_keep_unmatched, _count_grants_matched_whitelist, _dump_grants_matched_whitelist — each in its own $(...) subshell to avoid the "globals do not survive command substitution" pitfall). The verifier log adds grants_ignored_count=<N> to every line and dumps ignored lines after the main grants_dump on failure paths. Surprise lines like GRANT INSERT ... WITH GRANT OPTION are NOT silently whitelisted (line-anchored pattern is precise, not broad grep -v PROXY).
I-2 fix: _local_root_write_probe_127 now writes its three result fields into module-scope global variables __PROBE_RC / __PROBE_ERRNO / __PROBE_OUT instead of joining with | and echoing on stdout. Caller pre-clears the three globals BEFORE the call (defends against stale value reuse) and post-validates that __PROBE_RC is non-empty numeric (else fail-closed probe_result_malformed) and __PROBE_ERRNO is in the 5-value valid set {1044, 1290, 1142, 0, other} (else fail-closed probe_result_malformed_errno). Multi-line SQL stderr is preserved intact in __PROBE_OUT and dumped after the structured log line on failure paths.
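
The global-variable hand-off with pre-clear and post-validate can be sketched as below. _probe_sketch returns canned values in place of the mariadb client call, and run_probe_checked is a hypothetical wrapper; the variable names, the valid errno set, and the two malformed sentinels are the ones named in the I-2 fix.

```sh
# Stub probe: writes results into globals instead of echoing a joined string.
_probe_sketch() {
  __PROBE_RC=1
  __PROBE_ERRNO=1044
  __PROBE_OUT="ERROR 1044 (42000): Access denied
for user 'root'@'127.0.0.1'"
}

run_probe_checked() {
  __PROBE_RC=; __PROBE_ERRNO=; __PROBE_OUT=    # pre-clear stale values
  "$@" || :                                    # called directly, not inside $(...)
  case $__PROBE_RC in
    ''|*[!0-9]*) echo "probe_result_malformed"; return 1 ;;
  esac
  case $__PROBE_ERRNO in
    1044|1290|1142|0|other) ;;
    *) echo "probe_result_malformed_errno"; return 1 ;;
  esac
  echo "errno=$__PROBE_ERRNO"
}
```

The probe must be invoked in the current shell: wrapping it in a command substitution would run it in a subshell and the three globals would be empty on return.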

ShellSpec increments (226 examples / 0 failures, alpha.62 v2 → alpha.63 v1 = +11 net): Context "grants whitelist helpers" (5 examples covering _filter_grants_keep_unmatched happy / _count_grants_matched_whitelist returns 1 / non-PROXY WITH GRANT OPTION not whitelisted / count=0 case / multiple PROXY rows count=2); Context "_verify_host_is_fenced() runtime-realism: GRANT PROXY default row" (2 examples — % and localhost host both with PROXY default + main fence grant → reason=ok_by_grants_only + grants_ignored_count=1; the precise fix sample for alpha.62 v1/v2 false-RED bypass_priv_residual:GRANT OPTION); Context "_local_root_write_probe_127() global var hardening" (4 examples — pre-clear stale defense; 127.0.0.1 with multi-line SQL stderr containing 1044 → __PROBE_ERRNO=1044 correctly extracted, alpha.62 RED root cause closed; post-validate __PROBE_RC non-numeric → probe_result_malformed; post-validate __PROBE_ERRNO out-of-set → probe_result_malformed_errno).

v1 caveats carried:

DRIFT D (cmpd-semisync.yaml UNLOCK GRANT ALL PRIVILEGES) → alpha.64+ scope
alpha.61 v3 prepare/fence inner SQL helper budget caveat → alpha.64+ scope
pod0 secondary admin bypass residual after failed switchover (Jack alpha.62 RED carry-forward observation) → alpha.63+ candidate; not in this v1 scope unless callsite-pair scan finds new evidence pointing to a path other than I-1/I-2
alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe
alpha.62 RED already invalidated the cadence-discipline candidate topic's "alpha.62 review-pass→execute clean = success" narrative (Cindy 05:14 directive ba10ff18): process clean + runtime RED actually exposed ShellSpec mock-coverage's runtime-realism gap. Cadence-discipline candidate restated as "process-discipline + runtime-validation are independent dimensions; both required".

alpha.62 v2 fix scope (superseded by alpha.63 v1 above for the verifier implementation; alpha.62 v2 design contract retained)

alpha.62 v2 fix scope (vs v1 commit `675f5371`)

v1 (675f5371) was caught at Jack's package-level review (Slock thread #mariadb:a09341b9 msg c66d35bf) for two issues that don't change runtime behavior but break the v2 design's live-gate negative-grep contract:

Live-gate negative grep blocker: v1 design committed to alpha.62 live gate negative grep on the literal function names grant_remote_root_optional_admin_privileges_for_secondary and remote_root_has_full_access. The function bodies were correctly removed/renamed in v1, but the comments in the rewritten functions still referenced the old names verbatim — would false-RED the live gate or carry a comment-only caveat (alpha.61 already has the same anti-pattern for $SECONDS / $'\n' comments — alpha.62 should not double down). Fix: rewrite the four comment mentions to descriptive text ("legacy optional secondary admin grant helper", "legacy full-access rollback verifier", etc.) so the literal old function names appear nowhere in source nor rendered manifest. Verified by direct grep: 0 hits in helm template mariadb mariadb-1.1.1-alpha.62.tgz for either name.
grants_sha format tightening: v1 returned grants_sha=unavailable:hash_tool_unavailable (single colon-joined field) per the v1 design. v2 design instead splits this into two structured fields: grants_sha=<hash|unavailable> reason_hash=<sha256|sha1|md5|hash_tool_unavailable>. This avoids grep/awk needing to disambiguate colon semantics in downstream parsers and matches the rest of the structured log style. Fix: compute_grants_sha now emits <hash>|<algo> (pipe-separated internal token used for direct comparison in drift detection); a new helper split_grants_sha_field produces the two-field log fragment grants_sha=<hash> reason_hash=<algo> for inline use in verifier log lines. All verifier log lines (_verify_host_is_fenced and _verify_host_has_explicit_primary_grant) now embed ${grants_sha_field} (already-formatted) instead of the legacy grants_sha=${grants_sha} template. The host_list_sha debug logs (informational, not part of the structured verifier contract) keep the internal <hash>|<algo> form for drift comparison.
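A minimal sketch of the two helpers, assuming the hash tools are the usual `sha256sum` / `sha1sum` / `md5sum` binaries (the description does not name them). The output for a single-token input in `split_grants_sha_field` is an assumed behavior for the defensive case.

```sh
# Emit "<hash>|<algo>", falling back sha256 -> sha1 -> md5 -> unavailable.
compute_grants_sha() {
  for _tool in sha256sum sha1sum md5sum; do
    if command -v "$_tool" >/dev/null 2>&1; then
      _hash=$(printf '%s' "$1" | "$_tool" | awk '{print $1}')
      printf '%s|%s\n' "$_hash" "${_tool%sum}"
      return 0
    fi
  done
  printf 'unavailable|hash_tool_unavailable\n'
}

# Turn the internal "<hash>|<algo>" token into the two-field log fragment.
split_grants_sha_field() {
  case "$1" in
    *'|'*) printf 'grants_sha=%s reason_hash=%s\n' "${1%%|*}" "${1#*|}" ;;
    # Defensive single-token case; "unknown" is an assumed placeholder.
    *)     printf 'grants_sha=%s reason_hash=unknown\n' "$1" ;;
  esac
}

grants_sha_field=$(split_grants_sha_field "$(compute_grants_sha 'GRANT SELECT ON *.* TO root@%')")
echo "verified_host=% ${grants_sha_field} reason=ok_by_grants_only"
```

Keeping the pipe-joined token internal means drift detection can compare one string, while log consumers only ever see space-separated `key=value` fields.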

ShellSpec increments (215 examples / 0 failures, v1 → v2 = +3 net): renamed Context "compute_grants_sha()" → "compute_grants_sha() / split_grants_sha_field()" (5 examples total: sha256 happy path, unavailable|hash_tool_unavailable token format, split happy path, split unavailable case, split defensive single-token case); updated _verify_host_is_fenced 127.0.0.1 ok_by_local_probe:1044 example to assert grants_sha + reason_hash=sha256 fields appear as TWO separate fields (not colon-joined).

v2 caveats unchanged from v1: DRIFT D / alpha.61 v3 inner SQL helper budget caveat / alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post 5a6390fe — all remain alpha.63+ scope or independently attached. v1 attachment 05e92860-... had Slock fetch transient; v2 fresh attachment id available in handoff message.

alpha.62 v1 fix scope (superseded by v2 above for comment-level cleanup + grants_sha format; runtime contract unchanged)

alpha.62 v1 fix scope (closes alpha.61 switchover N=1 RED)

alpha.61 switchover/role-transition under-load N=1 came back RED at the pre-DCS local_remote_root_is_fenced_for_secondary verifier (Slock thread #mariadb:a09341b9 Jack closeout msg 40e83143):

Switchover failed: current primary remote root fence was not verified before DCS switchover

Five-layer elimination narrowed the cause to the product layer; first blocker = addon switchover action pre-DCS remote-root fence + rollback verifier contract drift between switchover-side callsites and roleProbe-side callsites: the alpha.61 v3 tightening did NOT propagate across them. Direct evidence: kbagent action cost 1.945s (NOT 60s cap; no v3 deadline/timeout sentinel hit; the action only entered the prepare stage); writer double_writable=0 (race not surfaced; fence semantics are in fact tightening, and the action fail-closed correctly); data 54/1485 vs 54/1485 (no data loss observed; no data-loss conclusion is written in this RED cell).

alpha.62 v1 fixes (per Jack 04:08 v1 design review + 04:10 v2 ACCEPT + 04:12 6 review focal points + 04:13 boundary lock):

DRIFT A — switchover pre-DCS supplementary admin grant: removed grant_remote_root_optional_admin_privileges_for_secondary entirely. fence_local_remote_root_for_secondary previously called it immediately after the main fence, granting BINLOG ADMIN / CONNECTION ADMIN / READ_ONLY ADMIN back to user-facing root — defeating alpha.61 secondary fence tightening in the same callsite.
DRIFT B — rollback verifier required GRANT ALL PRIVILEGES that alpha.60 v2 unfence no longer grants: renamed remote_root_has_full_access → remote_root_has_explicit_primary_grant. New verifier reads grants via kb_internal_root view, requires the core write subset (INSERT/UPDATE/DELETE/CREATE/DROP), rejects GRANT ALL PRIVILEGES, rejects admin bypass privileges.
DRIFT C: local_remote_root_is_fenced_for_secondary observability gap + criteria drift. Replaced with a strong-semantics, observable per-host verifier: it reads grants via kb_internal_root (avoids root self-query loop); explicitly rejects bypass privileges and user-facing write privileges; emits a structured single-line log with grants_sha (sha256 → sha1 → md5 → unavailable:hash_tool_unavailable fallback chain), grants_bypass list, write_probe_attempted, write_probe_rc, write_probe_errno, verified_host, probe_host attribution, reason; dumps the full grants after the sentinel line on failure; and scopes the write probe per host: 127.0.0.1 → TCP probe expecting 1044/1290 errno; localhost → grants-only (no socket probe attempted); % wildcard → grants-only (not locally probable).
Per-host enumeration (Jack 04:08 Blocker 1 Option B): replaces single-host root@${MARIADB_ROOT_HOST:-%} fence with mysql.user enumeration through kb_internal_root. host_list is read ONCE in prepare_current_primary_for_switchover and passed to fence + verifier; functions called externally fall back to self-enumerate with drift detection (sha mismatch → fail-closed root_host_list_drift, NEVER silent).
Single-source-of-truth constants: SWITCHOVER_BYPASS_PRIVILEGES_PATTERN, SWITCHOVER_USER_FACING_WRITE_PATTERN, SWITCHOVER_SECONDARY_FENCE_GRANT_BODY, SWITCHOVER_EXPLICIT_PRIMARY_GRANT_BODY, SWITCHOVER_PRIMARY_CORE_WRITE_PRIVS. ShellSpec strong-binds SWITCHOVER_EXPLICIT_PRIMARY_GRANT_BODY to contain the core write privs, preventing future drift.
candidate_is_primary lost its legacy remote_root_has_full_access check (the legacy GRANT ALL PRIVILEGES signature no longer exists post-alpha.60 v2 unfence + alpha.61 v3 roleProbe primary fence). Remaining 4 signals (read_only=0 + no slave_status + remote_root_write_ready INSERT probe + syncer role=primary) are sufficient — the write_ready INSERT probe on the candidate is itself the strongest signal.
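The DRIFT B verifier rules (reject `GRANT ALL PRIVILEGES`, reject admin bypass privileges, require the core write subset) can be sketched as a pure text check. This is illustrative only: `verify_explicit_primary_grant` is a hypothetical name, the two constants carry assumed values, and the real verifier reads grants through kb_internal_root.

```sh
# Assumed values for illustration; the real constants live in the switchover script.
SWITCHOVER_PRIMARY_CORE_WRITE_PRIVS='INSERT UPDATE DELETE CREATE DROP'
SWITCHOVER_BYPASS_PRIVILEGES_PATTERN='READ_ONLY ADMIN|SUPER|BINLOG ADMIN'

# $1 = SHOW GRANTS text for one host. Prints a reason and returns 0 only on success.
verify_explicit_primary_grant() {
  _grants=$1
  if printf '%s\n' "$_grants" | grep -q 'ALL PRIVILEGES'; then
    echo "reason=all_privileges_residual"; return 1
  fi
  _bypass=$(printf '%s\n' "$_grants" |
    grep -Eo "$SWITCHOVER_BYPASS_PRIVILEGES_PATTERN" | head -n 1)
  if [ -n "$_bypass" ]; then
    echo "reason=admin_bypass_residual:$_bypass"; return 1
  fi
  for _priv in $SWITCHOVER_PRIMARY_CORE_WRITE_PRIVS; do
    # Match the privilege as a whole list item, so CREATE USER does not satisfy CREATE.
    if ! printf '%s\n' "$_grants" | grep -Eq "(^GRANT |, )$_priv(,| ON )"; then
      echo "reason=core_write_priv_missing"; return 1
    fi
  done
  echo "reason=ok"
}
```

Checking for the explicit list rather than for `ALL PRIVILEGES` is what lets the rollback verifier agree with the alpha.60 v2 unfence path.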

ShellSpec increments (212 examples / 0 failures, alpha.61 v3 → alpha.62 v1 = +17 net): updated existing fence test to assert NO bypass priv grants + alpha.62 GRANT body grep; updated unfence test to assert per-host invocation + no admin bypass + grant body invariant strong-bind; added enumerate_user_facing_root_hosts stubs to 4 run_switchover tests; new Describe "alpha.62 v1 helpers and verifiers" with compute_grants_sha (2 examples), enumerate_user_facing_root_hosts (2 examples: rc=0 happy + rc!=0 → fail-closed root_host_query_failed), _verify_host_is_fenced (7 examples covering all reason branches: ok_by_local_probe:1044, ok_by_grants_only:localhost_socket_not_attempted, ok_by_grants_only:wildcard_or_remote_not_locally_probable, bypass_priv_residual:READ_ONLY ADMIN, bypass_priv_residual:INSERT user-facing-write, writable_unexpected, grants_query_failed), _verify_host_has_explicit_primary_grant (4 examples: happy path, all_privileges_residual, core_write_priv_missing, admin_bypass_residual:READ_ONLY ADMIN), drift detection (1 example: root_host_list_drift).

v1 caveats carried: cmpd-semisync.yaml UNLOCK paths still GRANT ALL PRIVILEGES (DRIFT D, alpha.63+ scope); alpha.61 v3 caveat (prepare/fence inner SQL helpers no per-call remaining-budget) deferred to alpha.63+; alpha.61 process miss (live gate ACCEPT → execute commit boundary) independently attached to alpha.61 v3 live gate ACCEPT post — not affected by alpha.62 runtime closeout.

alpha.61 v3 fix scope (superseded by alpha.62 v1 above for the switchover-fence contract drift; alpha.61 v3 deadline/timeout/POSIX semantics retained)

alpha.61 v3 fix scope (vs v2 commit `44f55dea`)

v2 shipped two contract gaps that Jack's package-level review caught (Slock thread #mariadb:a09341b9):

Blocker 1 — stage budget computed at entry but stage body unbound. v2 computed prepare_budget / dcs_budget / fence_budget but only used them for log lines. The stage bodies (syncerctl_switchover, the SQL helpers inside prepare_current_primary_for_switchover and fence_current_primary_local_writes_after_dcs) had no wall-clock cap, so the same failure mode v1 had can recur: stage entry budget>0, stage body hangs, kbagent 60s cap kills the action.

Blocker 2 — timeout(1) absence not fail-fast at action entry. v2 set SWITCHOVER_HAS_TIMEOUT=0 and continued through prepare/DCS/fence, failing only at the promote stage. This contradicted the inline comment "absence of timeout fails the action BEFORE we touch DCS" and the 02:01 fail-closed boundary agreement.

v3 fixes (per Jack 02:23 review tightening):

timeout(1) hard dependency at action entry: initialize_action_clock now returns 1 with reason=external_timeout_unavailable cause=command_v_timeout_failed when command -v timeout fails, BEFORE any DCS-touching work. The subsequent SWITCHOVER_HAS_TIMEOUT gate is preserved as defense-in-depth.
syncerctl_switchover wraps timeout(1) when caller passes stage_budget: wall = min(SYNCERCTL_PER_CALL_TIMEOUT_SECONDS, dcs_budget). timeout(1) exit codes 124 (default SIGTERM after timeout), 125 (timeout's own error), 137 (SIGKILL via --kill-after, defensive) are mapped to a distinct sentinel reason=syncerctl_timeout stage=dcs stage_budget=Ns rc=R so closeout can tell wall-clock budget exhaustion from a real syncerctl failure (rc!=0 from syncerctl itself or a zero-status non-success message). The legacy naked path is preserved when the caller omits stage_budget.
Per-stage post-body overrun check for prepare / dcs / fence: After each of these stage bodies returns 0, run_switchover re-checks remaining_action_budget. If <=0 (stage body wall-clock exceeded budget) OR if the clock has failed mid-action, emit action_deadline_exhausted_<stage>_overrun + return 1 BEFORE entering the next stage. This bounds the stage body even though the inner SQL helpers do not yet enforce the budget per-call (caveat below).
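The timeout wrapping and exit-code mapping described above can be sketched as follows. `run_with_stage_budget` is a hypothetical name and wraps an arbitrary command rather than syncerctl; it assumes a coreutils-style `timeout` that exits 124 on expiry.

```sh
# Usage: run_with_stage_budget <per_call_seconds> <stage_budget_seconds> <cmd> [args...]
run_with_stage_budget() {
  _per_call=$1; _stage_budget=$2; shift 2
  if ! command -v timeout >/dev/null 2>&1; then
    echo "reason=external_timeout_unavailable"; return 1
  fi
  # wall = min(per_call, stage_budget)
  _wall=$_per_call
  if [ "$_stage_budget" -lt "$_wall" ]; then _wall=$_stage_budget; fi
  if timeout "$_wall" "$@"; then _rc=0; else _rc=$?; fi
  case "$_rc" in
    0) return 0 ;;
    # 124: timed out, 125: timeout itself failed, 137: killed by SIGKILL
    124|125|137)
      echo "reason=syncerctl_timeout stage=dcs stage_budget=${_stage_budget}s rc=$_rc"
      return 1 ;;
    *)
      echo "syncerctl exited with rc=$_rc"
      return 1 ;;
  esac
}
```

Mapping the three timeout-owned codes to their own sentinel is what lets a closeout distinguish budget exhaustion from a genuine command failure.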

ShellSpec increments (195 examples / 0 failures, v2 → v3 = +4): updated alpha.61 v2 initialize_action_clock() example for v3 fail-fast at action entry; updated 4 per-stage tests (prepare/dcs/fence + clock-failure mid-prepare) to expect _overrun sentinels (since v3's post-body check fires first); new Describe "alpha.61 v3 syncerctl_switchover() timeout sentinel" (4 examples: rc=124 → syncerctl_timeout, rc=7 → legacy syncerctl exited with rc= sentinel, success path, legacy-naked-path preserved).

v3 caveat (scope cap): inner SQL helpers in prepare/fence stage bodies do NOT yet enforce the stage budget per-call. They use only mariadb client --connect-timeout (5s default). The v3 post-body overrun check catches the wall-clock excess at the stage boundary, but a single inner SQL hang of up to ~stage_budget + connect_timeout can still slip through before the boundary check fires. Per-call SQL helper bounded budget is deferred to alpha.61 v4 / alpha.62 to keep the alpha.60 revoke main path untouched in this round and to avoid regression risk.

alpha.61 v2 fix scope (superseded by v3 above)

alpha.61 v2 fix scope (vs v1 commit `63f91d18`)

v1 shipped two runtime contract holes that Jack's package-level review caught (Slock thread #mariadb:a09341b9):

Blocker 1 — bash-only $SECONDS / $'\n' under #!/bin/sh shebang. replication-switchover.sh declares #!/bin/sh but used bash-only $SECONDS for the deadline expression and $'\n' case patterns for parsing syncerctl multi-line role output. Under dash (the actual runtime sh in the mariadb image) $SECONDS is not auto-incrementing, so (SECONDS - started) evaluates to 0 forever and the polling loops would only be bounded by the kbagent 60s ceiling — defeating the v1 deadline fix entirely. The $'\n' case patterns also do not match in dash. Reproduced locally: SECONDS=<> started=<> elapsed_expr=0.

Blocker 2 — global deadline only enforced on 2 of 5 stages. v1 only checked remaining_action_budget before the candidate-promote and write-probe stages. The earlier prepare / dcs / fence stages had no stage budget and no action_deadline_exhausted_<stage> sentinel — contradicting the v1 commit's stated 4-stage / 5-sentinel contract.

v2 fixes (per Jack 02:00 review tightening):

POSIX wall-clock helpers replace $SECONDS: now_epoch() (POSIX date +%s; rc=2 on failure or non-numeric output — NOT silent 0 fallback), initialize_action_clock() (captures action_started_epoch + probes command -v timeout; date failure is fatal so we never run with a silently broken clock), remaining_action_budget() (rc=2 on clock failure; caller MUST treat as fail-closed), stage_budget_or_exit() (computes min(stage_max, remaining); on remaining<=0 OR clock failure emits action_deadline_exhausted_<stage> + cause=action_clock_unavailable when applicable), extract_syncerctl_role() (POSIX printf | awk line-based parser replacing $'\n' case patterns).
Five-stage deadline enforcement: each of prepare / dcs / fence / promote / write checks remaining_action_budget BEFORE invoking the stage body. Stage budgets are independently configurable (SWITCHOVER_PREPARE_STAGE_BUDGET_SECONDS=10, SWITCHOVER_DCS_STAGE_BUDGET_SECONDS=15, SWITCHOVER_FENCE_STAGE_BUDGET_SECONDS=15, CANDIDATE_PROMOTED_VIA_SYNCERCTL_WAIT_SECONDS=30, CANDIDATE_REMOTE_ROOT_WRITE_PROBE_WAIT_SECONDS=10) and clamped at runtime by remaining_action_budget. Sentinel reasons: action_deadline_exhausted_{prepare,dcs,fence,promote,write}.
External-tool timeout enforcement: wait_candidate_promoted_via_syncerctl explicitly checks SWITCHOVER_HAS_TIMEOUT before entering its loop; if timeout(1) is absent it fails closed with reason=external_timeout_unavailable (NOT silent fallback to unbounded syncerctl call). run_syncerctl_getrole_with_timeout picks min(per_call, stage_budget) so a single syncerctl call cannot exceed the remaining stage budget. SQL probes inherit MARIADB_CONNECT_TIMEOUT_SECONDS on connect and stage budget on the polling loop.
ShellSpec increments (191 examples / 0 failures, v1 → v2 = +25): Describe "alpha.61 v2 POSIX clock helpers" (10 examples covering now_epoch / remaining_action_budget / stage_budget_or_exit), Describe "alpha.61 v2 initialize_action_clock()" (3 examples), Describe "alpha.61 v2 extract_syncerctl_role()" (4 examples), Describe "alpha.61 v2 wait_candidate_promoted_via_syncerctl() timeout-availability gate" (1 example), Describe "run_switchover() alpha.61 v2 per-stage deadline enforcement" (6 examples — one per stage + one mid-action clock-failure example), Describe "alpha.61 v2 POSIX shell self-check" (2 examples: dash -n and bash -n static parse must succeed).
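A minimal sketch of the wall-clock helpers named above, assuming a 55s global deadline held in a hypothetical `ACTION_DEADLINE_SECONDS` variable. The function names follow the description; the bodies are illustrative and omit the `timeout` probe that `initialize_action_clock` also performs.

```sh
ACTION_DEADLINE_SECONDS=${ACTION_DEADLINE_SECONDS:-55}   # hypothetical variable name

# Print epoch seconds; rc=2 on failure or non-numeric output (no silent 0 fallback).
now_epoch() {
  _now=$(date +%s 2>/dev/null) || return 2
  case "$_now" in ''|*[!0-9]*) return 2 ;; esac
  printf '%s\n' "$_now"
}

# Print seconds left before the global deadline; rc=2 if the clock is unusable.
remaining_action_budget() {
  _now=$(now_epoch) || return 2
  printf '%s\n' $(( $ACTION_DEADLINE_SECONDS - ($_now - $action_started_epoch) ))
}

# Print min(stage_max, remaining), or emit a sentinel on stderr and fail closed.
stage_budget_or_exit() {
  _stage=$1; _stage_max=$2
  _remaining=$(remaining_action_budget) || {
    echo "reason=action_deadline_exhausted_${_stage} cause=action_clock_unavailable" >&2
    return 1
  }
  if [ "$_remaining" -le 0 ]; then
    echo "reason=action_deadline_exhausted_${_stage}" >&2
    return 1
  fi
  if [ "$_stage_max" -lt "$_remaining" ]; then echo "$_stage_max"; else echo "$_remaining"; fi
}

action_started_epoch=$(now_epoch) || exit 1
prepare_budget=$(stage_budget_or_exit prepare 10) || exit 1
echo "stage=prepare stage_budget=${prepare_budget}s"
```

Unlike `$SECONDS`, `date +%s` behaves the same under dash and bash, so the elapsed expression cannot silently evaluate to 0.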

alpha.61 fix scope (v1 — superseded by v2 above)

The alpha.60 switchover N=1 was caught at runtime by the candidate write probe failing within 8s. Triple-source root cause: kbagent action cost 13.004s (NOT 60s cap), 8 attempts of candidate-remote-root-write-ready rc=1, writer double_writable=0, post-failure self-converged. Causal hypothesis (strongly supported but inferential): alpha.59 GREEN was a false-PASS via admin-priv bypass — root held READ_ONLY ADMIN/SUPER/BINLOG ADMIN that bypassed candidate's read_only=1, so INSERT succeeded before candidate was actually promoted. alpha.60 REVOKE removed that bypass; INSERT now requires candidate to be actually writable (read_only=0). The DCS → candidate read_only flip propagation took >8s in this N=1 run.

alpha.61 fix (per Jack 01:40 8-class XP design-contract review):

Action sequence becomes prepare → DCS → fence-old-primary → wait_candidate_promoted_via_syncerctl (NEW) → wait_candidate_remote_root_write_ready. All four steps share a single global deadline (default 55s with 5s buffer below kbagent 60s ceiling). Per-stage budget is min(stage_max, remaining_global_budget).
wait_candidate_promoted_via_syncerctl polls syncerctl getrole on candidate FQDN expecting "primary"; per-attempt log captures role/rc/stderr; sentinels separated per Jack class 4 (role_query_failed, role_unknown, role_not_primary, candidate_fqdn_not_found). Fail-closed reason: candidate_not_promoted_via_dcs_in_budget.
wait_candidate_remote_root_write_ready now captures full SQL stderr per failed attempt (no opaque rc=1) and accepts a stage_deadline parameter clamped by remaining global budget.
roleProbe apply_remote_root_fence "secondary" tightened: removes SUPER from the GRANT list and removes best-effort GRANT READ_ONLY ADMIN + GRANT CONNECTION ADMIN. User-facing root on secondary no longer carries any admin-bypass priv. CONNECTION ADMIN dropped by minimum-priv principle. BINLOG MONITOR / SLAVE MONITOR remain (read-only monitoring); REPLICATION MASTER ADMIN remains (CHANGE MASTER / START SLAVE for follow-time maintenance). kb_internal_root keeps READ_ONLY ADMIN via its own grant chain for secondary_kb_health_check_repair_attempt.
ShellSpec totals 166 examples / 0 failures / 0 warnings, with new wait_candidate_promoted_via_syncerctl describe (5 examples covering all sentinels) + new run_switchover() alpha.61 global deadline example asserting earlier-stage deadline burn fails the next stage closed.

alpha.60 v3 fix scope

The v2 build had a class 1 silent fallback in the host enumeration query: || true swallowed any rc!=0 from SELECT Host FROM mysql.user, so a permission/connection/SQL failure was silently treated as "root account does not exist" and the function returned 0 having done nothing. v3 captures stdout AND rc explicitly: rc!=0 → reason=root_host_query_failed, fail-closed; rc=0 with empty stdout → root_account_not_found skip (unchanged); rc=0 with non-empty stdout → per-host / per-priv loop (unchanged). ShellSpec adds 1 new example asserting that a mocked SELECT-Host rc=1 causes the function to fail-closed without entering REVOKE / FLUSH / verify. Total: 160 examples / 0 failures / 0 warnings.
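The difference between swallowing the failure with `|| true` and capturing rc explicitly can be sketched like this. `query_root_hosts` is a hypothetical mock standing in for the `SELECT Host FROM mysql.user` call, and `enumerate_root_hosts_fail_closed` is an illustrative name.

```sh
# Mock of the host enumeration query (assumption: one host per output line).
query_root_hosts() { printf '%%\nlocalhost\n'; }

enumerate_root_hosts_fail_closed() {
  # Capture stdout AND rc; never append "|| true" here.
  if _hosts=$(query_root_hosts); then _rc=0; else _rc=$?; fi
  if [ "$_rc" -ne 0 ]; then
    echo "reason=root_host_query_failed rc=$_rc" >&2
    return 1                      # query failed: fail closed
  fi
  if [ -z "$_hosts" ]; then
    echo "reason=root_account_not_found" >&2
    return 0                      # genuine "no root rows": skip
  fi
  printf '%s\n' "$_hosts"         # non-empty: caller runs the per-host loop
}
```

With `|| true`, the first two branches are indistinguishable, which is exactly the silent fallback v3 removes.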

alpha.60 v2 fix scope

The first alpha.60 build (commit 6efe0c60) was caught at design-contract review for two gaps; both are closed in 0cf4a481:

Per-host batched REVOKE could short-circuit on first 1141 and silently mark the host as already-fenced even when SUPER and BINLOG ADMIN survived. v2 splits into per-privilege REVOKE in a fixed inner loop, with separate sentinel reasons per privilege; 1141 on a single privilege only marks that priv absent. After all per-priv REVOKEs for a host, a second SHOW GRANTS asserts no bypass priv survives — if any does, the host is revoke_residual_bypass and the function fail-closes regardless of per-priv counts.
The rollback path unfence_local_remote_root_for_primary still issued GRANT ALL PRIVILEGES. v2 grants the same explicit non-bypass privilege list that the roleProbe primary path uses, so a future switchover does not have to re-fight admin bypass introduced by rollback.
ShellSpec totals 159 examples / 0 failures / 0 warnings, including the new partial-1141 trip-wire test (READ_ONLY ADMIN 1141 + post-revoke SHOW GRANTS still has SUPER → fail-closed).
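The per-privilege loop with the post-revoke residual check might be sketched as below. `run_sql` is a hypothetical mock of the kb_internal_root client (it prints server output and returns the client rc), and `revoke_bypass_privs_for_host` is an illustrative name; the reason strings follow the ones listed in this PR.

```sh
# Mock client: every statement succeeds, SHOW GRANTS returns a fenced grant row.
run_sql() {
  case "$1" in
    SHOW*) echo "GRANT SELECT, INSERT ON *.* TO root@%" ;;
  esac
}

revoke_bypass_privs_for_host() {
  _host=$1
  for _priv in "READ_ONLY ADMIN" "SUPER" "BINLOG ADMIN"; do
    if _err=$(run_sql "REVOKE $_priv ON *.* FROM 'root'@'$_host'" 2>&1); then
      echo "host=$_host priv=$_priv reason=revoked"
    else
      case "$_err" in
        # 1141 only marks THIS privilege absent; the loop keeps going.
        *1141*) echo "host=$_host priv=$_priv reason=privilege_absent_already_fenced" ;;
        *)      echo "host=$_host priv=$_priv reason=revoke_failed"; return 1 ;;
      esac
    fi
  done
  # Second SHOW GRANTS: fail closed if any bypass privilege survived.
  if ! _grants=$(run_sql "SHOW GRANTS FOR 'root'@'$_host'" 2>&1); then
    echo "host=$_host reason=grants_query_failed"; return 1
  fi
  if printf '%s\n' "$_grants" | grep -Eq 'READ_ONLY ADMIN|SUPER|BINLOG ADMIN|ALL PRIVILEGES'; then
    echo "host=$_host reason=revoke_residual_bypass"; return 1
  fi
  echo "host=$_host reason=ok"
}
```

The final check is independent of the per-privilege outcomes, so a 1141 on one privilege can never mask another that is still granted.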

alpha.60 review-fix scope

The first alpha.59 build reached live-test and passed bootstrap N=1 but switchover N=1 was caught at runtime by verify_post_dcs_local_root_write_fenced (added in alpha.59 v2 per design-contract review): user-facing root local INSERT succeeded after read_only=ON. Triple-source root cause: kbagent action cost 2.793s (NOT 60s cap; alpha.59 contract truncation works), action stderr exact-match, SHOW GRANTS FOR root@% contains READ_ONLY ADMIN, mysql.user shows root@127.0.0.1/root@localhost Insert_priv=Y Super_priv=Y. Causal chain through documented MariaDB 10.11+ privilege semantics: GRANT ALL PRIVILEGES bundles READ_ONLY ADMIN/SUPER/BINLOG ADMIN which bypass @@global.read_only=ON.

alpha.60 fix (per Jack 23:28 8-class XP design-contract review):

New revoke_user_facing_root_admin_privileges_for_secondary enumerates mysql.user for actual root host rows, REVOKEs each bypass priv (READ_ONLY ADMIN/SUPER/BINLOG ADMIN) by name with distinct sentinel reasons; 1141 treated as already-fenced; any other REVOKE error fail-closed; called from fence_current_primary_local_writes_after_dcs between local_read_only_is "1" and verify_post_dcs_local_root_write_fenced. kb_internal_root intentionally OUT of scope (preserves alpha.59 secondary roleProbe 1062 repair).
apply_remote_root_fence "primary" in roleprobe.sh: GRANT ALL PRIVILEGES replaced with explicit privilege list excluding SUPER/READ_ONLY ADMIN/BINLOG ADMIN; GRANT OPTION only via trailing WITH GRANT OPTION clause (not in privilege list — syntax-error in some MariaDB versions). Prevents alpha.61 from re-introducing the same bypass via normal role transitions.
ShellSpec totals 157 examples / 0 failures / 0 warnings, with new sentinel coverage and a negative trip-wire asserting that the verify probe is not called when revoke fails.
Caveat: cmpd-semisync.yaml's set_local_root_account_state / set_remote_root_account_state UNLOCK paths still re-grant ALL PRIVILEGES; those are runtime sql-listener-fence transitions, not switchover-time. alpha.60 trusts switchover-time revoke as the immediate fix; comprehensive cleanup is alpha.61+ scope.
Rendered switchover script sha256: 8fe6de1e07e60ed99073daea83b3c0734a591540c2d2e639aef58955e5826c6b.
Rendered roleprobe script sha256: 765124d7d2570a66c54d94737057d2d71cad93e6fba4c6ee055da83c38663a8f.

alpha.59 v2 review-fix scope

The first alpha.59 build (commit 1c4310f8) was caught at pair design-contract review for two gaps; both are closed in a1f2a064:

fence_current_primary_local_writes_after_dcs now ends with verify_post_dcs_local_root_write_fenced, which runs a localhost user-facing root INSERT and accepts only an explicit 1290 / read-only rejection. Read_only=ON without a verified write rejection no longer counts as a passed fence.
secondary_kb_health_check_repair_attempt no longer toggles @@global.read_only (previously it briefly set OFF then back ON for the maintenance DELETE). The repair runs only through kb_internal_root (which holds READ_ONLY ADMIN) so the secondary fence remains continuously enforced and there is no double-writable window during repair. ShellSpec gains both negative assertions and now totals 150 examples / 0 failures / 0 warnings.
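The accept/reject rule of the post-DCS write-fence verification can be sketched as a small classifier. `classify_fence_probe` and its reason strings are hypothetical; the real function runs the INSERT itself, while this sketch only takes the probe's rc and stderr as inputs.

```sh
# $1 = rc of the user-facing root INSERT probe, $2 = its stderr text.
classify_fence_probe() {
  _rc=$1; _stderr=$2
  if [ "$_rc" -eq 0 ]; then
    # The INSERT succeeded while read_only=ON: the fence is not enforced.
    echo "reason=fence_not_enforced"; return 1
  fi
  case "$_stderr" in
    # Only an explicit 1290 / read-only rejection counts as a verified fence.
    *1290*|*read-only*) echo "reason=fence_verified"; return 0 ;;
    # Any other failure (no client, unrelated SQL error) also fails closed.
    *) echo "reason=probe_failed_unrelated"; return 1 ;;
  esac
}
```

The point of the design is that `read_only=ON` alone proves nothing; only an observed rejection at the write site does.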

Latest focused evidence

Chart 1.1.1-alpha.58 live gate and fresh bootstrap/role-publish each passed only their single scopes. Bootstrap converged in eleven bounded-gate attempts after Cluster transitioned through Creating and a transient Failed phase before reaching Running with one published primary, single Primary Service endpoint, and healthy IO/SQL on the secondary via kb_internal_root.
Chart 1.1.1-alpha.58 switchover and role-transition under load, 1 sample, failed only that scope and the namespace was preserved for inspection. The OpsRequest validated and started, then was reported timedOut: action timed-out. The writer saw 62 successful writes, 27 failures, 23 no-primary samples, and zero confirmed double-writable samples; final row counts on both pods were equal so no data-loss conclusion was supported.
Source-code review of apecloud/kubeblocks HEAD pkg/kbagent/service/action_utils.go:54-64,176-184 shows maxActionCallTimeout = 60 * time.Second is hardcoded and actionCallTimeoutContext uses min(timeout, maxActionCallTimeout). Any CmpD lifecycleActions.<action>.timeoutSeconds greater than 60 is silently truncated. Runtime evidence (kbagent log cost: 60060 ms plus result: timedOut on the action HTTP call) confirmed the action script was killed at the 60s mark right after candidate remote root write probe rc=0.
Follow-up fix in chart 1.1.1-alpha.59 declares switchover.timeoutSeconds: 60 so the contract reflects what kbagent enforces, shrinks the action to the three steps that fit inside the ceiling, makes the candidate remote root write probe synchronous and bounded, and migrates kb_health_check 1062/1146 repair into the secondary roleProbe path with a precise signature trigger.

Boundary

These results cover the listed scopes only.
This is not a release pass claim, not a long-running soak claim, and not a full operation matrix claim.
A separate KB issue/RFC will be filed to either remove the kbagent 60s hard cap or surface a CmpD validation error when timeoutSeconds > 60. That work is not in this addon PR and does not block alpha.59 retest.

Add MariaDB 11.4 standalone, replication, semisync, and Galera chart resources. Harden semisync startup, role publication, switchover fencing, and script distribution. Add shell specs for replication member join, role probe, switchover, and standalone template mapping.

codecov-commenter · 2026-05-09T08:03:30Z

Codecov Report

❌ Patch coverage is 0% with 1208 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (69b3b6d) to head (77bfef3).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...iadb/scripts-ut-spec/replication_roleprobe_spec.sh	0.00%	458 Missing ⚠️
...db/scripts-ut-spec/replication_member_join_spec.sh	0.00%	432 Missing ⚠️
...pts-ut-spec/semisync_rejoin_fence_template_spec.sh	0.00%	281 Missing ⚠️
...cripts-ut-spec/standalone_template_mapping_spec.sh	0.00%	37 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #2633     +/-   ##
=======================================
  Coverage   0.00%   0.00%             
=======================================
  Files         73      78      +5     
  Lines       9197   12334   +3137     
=======================================
- Misses      9197   12334   +3137

Keep the KubeBlocks health-check table schema on fresh replicas and clear only local rows before starting or repairing SQL replication. This prevents the replica repair path from changing a duplicate-key error into a missing-table replication error.

Require internal local admin read-only privileges before role decisions. Track primary read/write readiness after local root unlock and read_only repair. Repair syncer primary reconciliation when the listener is already exposed but local write readiness is missing.

KB kbagent enforces a hardcoded `maxActionCallTimeout = 60 * time.Second` in `pkg/kbagent/service/action_utils.go::actionCallTimeoutContext`, so any CmpD `switchover.timeoutSeconds` greater than 60 is silently truncated. alpha.58 declared 240; live-test evidence (cost=60060ms result=timedOut on the kbagent action HTTP call) confirmed the action script was killed mid-flight at exactly 60 seconds.

alpha.59 contract:

* CmpD `switchover.timeoutSeconds: 240 -> 60` in cmpd-semisync.yaml and cmpd-replication.yaml so the declared contract reflects what kbagent actually enforces.
* `run_switchover` shrinks to three required steps that all must fit inside the 60s ceiling:
  1. `prepare_current_primary_for_switchover` (local prep, ~3s)
  2. `syncerctl_switchover` (DCS record, ~5s)
  3. `fence_current_primary_local_writes_after_dcs` (local read_only fence, ~1s) - retained synchronously because it is the double-writable defense and must be true before action returns
  4. `wait_candidate_remote_root_write_ready` (bounded ~8s probe, fail-closed) - the third leg of the action-success contract: never return 0 with a non-writable candidate.
* `wait_switchover_done`, `wait_post_switchover_stabilization`, `wait_primary_service_routes_candidate`, `wait_current_secondary_remote_root_fenced` are no longer invoked from `run_switchover`. Post-DCS convergence is delegated to roleProbe + KB endpoint controller. The negative assertion that none of these helpers fires lives in the new `replication_switchover_spec.sh` "alpha.59 contract" tests.
* `kubeblocks.kb_health_check` 1062/1146 repair migrates from the switchover action wait loops into the secondary roleProbe path (`secondary_kb_health_check_repair_attempt` in `replication-roleprobe.sh`). The repair has a precise signature (`Last_(SQL_)?Errno: 1062|1146` AND `kubeblocks.kb_health_check` in the slave error text), uses `kb_internal_root` (READ_ONLY ADMIN), is best-effort, idempotent, and logs each attempt with rc. Other SQL errors are NOT swallowed.
* ShellSpec gains six new examples for `secondary_kb_health_check_repair_attempt` covering the precise signature, the cli-user choice, the wrong-table negative case, the wrong-errno negative case, and the empty-status negative case, plus six examples for `slave_status_has_kb_health_check_repairable_error`.
* Two new `run_switchover` examples exercise the alpha.59 contract: the negative assertion that the wait_* helpers are never invoked, and the fail-closed path when the candidate write probe does not close inside the bounded budget. Three obsolete examples (which exercised `wait_switchover_done` directly) are removed.
* The runner-side post-OpsRequest convergence gate is test-runner-owned (separate change; out of this addon patch).

References:

- apecloud/kubeblocks `pkg/kbagent/service/action_utils.go:64` (`maxActionCallTimeout = 60 * time.Second`)
- addon-test-runner-write-after-bounded-role-gate guide
- bootstrap-runner-preload-after-bounded-role-gate-case
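The repair signature can be sketched as a two-condition text match on the slave status output. The function name comes from the commit message; the body is an assumption, and it groups the errno alternation as `(1062|1146)` so that a bare `1146` elsewhere in the output does not trigger the repair.

```sh
# $1 = SHOW SLAVE STATUS text. Returns 0 only when BOTH conditions hold.
slave_status_has_kb_health_check_repairable_error() {
  _status=$1
  # Condition 1: the errno is one of the two repairable codes.
  printf '%s\n' "$_status" | grep -Eq 'Last_(SQL_)?Errno: (1062|1146)' || return 1
  # Condition 2: the error text names the health-check table.
  printf '%s\n' "$_status" | grep -q 'kubeblocks\.kb_health_check' || return 1
}
```

Requiring both conditions keeps duplicate-key or missing-table errors on user tables out of the automatic repair path.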

Two design-contract gaps caught in pair review: 1. fence_current_primary_local_writes_after_dcs previously verified only @@global.read_only=1, never that a user-facing root INSERT was actually rejected by the read-only fence. The contract field was non-empty but unenforced at the write site (xp design-contract class 2). Add verify_post_dcs_local_root_write_fenced: runs a localhost user-facing root INSERT into kubeblocks.kb_post_dcs_fence_probe and requires either rc=0 (fence not enforced -> fail closed) or rc!=0 with stderr containing 1290/read-only (fence verified). Other failure modes (no client, unrelated SQL error) also fail closed. Documentation in the function header records the contract change. 2. secondary_kb_health_check_repair_attempt previously did SET GLOBAL read_only=OFF -> DELETE -> SET GLOBAL read_only=ON, creating a small but real write window during which any client could have written to the secondary. This contradicts the double_writable=0 invariant the post-OpsRequest convergence test is meant to prove. Remove the read_only flip entirely: the repair now uses kb_internal_root (which holds READ_ONLY ADMIN from the addon's remote-root-fence path) and writes through while @@global.read_only=1 stays in place. If kb_internal_root cannot write for any reason, log rc and return; the next roleProbe tick re-evaluates. ShellSpec changes: * New Describe "verify_post_dcs_local_root_write_fenced()" with 4 examples: 1290 rejection -> success; rc=0 -> fail-closed; unrelated error -> fail-closed; no client binary -> fail-closed. * secondary_kb_health_check_repair_attempt "alpha.59 invariant" example now negative-asserts on SET GLOBAL read_only=OFF and SET GLOBAL read_only=ON. The earlier "fires repair" example drops the now-incorrect positive assertions for those two SQL statements. * Existing happy-path run_switchover examples gain a verify_post_dcs_local_root_write_fenced stub (return 0) so the fence verification still passes inside the SQL-mock environment. 
Total: 150 examples, 0 failures, 0 warnings.
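
The three-way outcome of the post-DCS fence probe can be sketched as follows. This is a minimal illustration, not the addon's code: `classify_fence_probe` and its reason strings are hypothetical names, and the probe's rc and stderr are passed in as arguments instead of being produced by a real mariadb client call.

```shell
#!/bin/sh
# Hypothetical helper (not from the addon): classify the outcome of a
# localhost root INSERT attempted while @@global.read_only=1.
# $1 = probe exit code, $2 = probe stderr text.
classify_fence_probe() {
  probe_rc=$1
  probe_err=$2
  if [ "$probe_rc" -eq 0 ]; then
    # The INSERT succeeded, so the fence is not enforced: fail closed.
    echo "fence_not_enforced"
    return 1
  fi
  case "$probe_err" in
    *1290*|*read-only*)
      # Rejected by the read-only fence: this is the only passing case.
      echo "fence_verified"
      return 0
      ;;
  esac
  # Missing client, unrelated SQL error, and so on: also fail closed.
  echo "probe_failed_unrelated"
  return 1
}
```

Only a rejection that names 1290 or read-only counts as proof; every other outcome, including an apparently harmless unrelated error, is treated as "not verified".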

…witchover alpha.59 switchover N=1 RED with first-blocker = addon product / switchover post-DCS root fence contract. Triple-source evidence: kbagent action cost 2.793s (NOT 60s cap; alpha.59 contract truncation works), action stderr "post-DCS local-root write fence not enforced; user-facing root INSERT succeeded after read_only=ON", SHOW GRANTS FOR root@% contains READ_ONLY ADMIN, mysql.user shows root@127.0.0.1/root@localhost Insert_priv=Y Super_priv=Y. Causal chain: addon apply_remote_root_fence "primary" granted ALL PRIVILEGES (which in MariaDB 10.11+ bundles READ_ONLY ADMIN / SUPER / BINLOG ADMIN), so user-facing root bypassed @@global.read_only=ON; the alpha.59 verify_post_dcs_local_root_write_fenced caught it. This gap existed in alpha.58 too but was masked by the absence of a verify probe. alpha.60 hard contract (per Jack 23:28 8-class XP review): * New revoke_user_facing_root_admin_privileges_for_secondary in replication-switchover.sh: - Enumerates mysql.user for actual root host rows (does not hardcode %/127.0.0.1/localhost; covers whatever the live DB actually has) - For each host: SHOW GRANTS first; if READ_ONLY ADMIN / SUPER / BINLOG ADMIN / ALL PRIVILEGES is present, REVOKE each bypass priv by name (never REVOKE ALL PRIVILEGES, never REVOKE GRANT OPTION as a privilege) - Distinct sentinel reasons per Jack class 4 (root_account_not_found, privilege_absent_already_fenced, revoked, revoke_failed) so closeout can attribute precisely - 1141 (no such grant) on REVOKE is treated as already-fenced; any other REVOKE error is fail-closed (Jack class 1: never silent fallback) - kb_internal_root is intentionally OUT of scope; it must keep READ_ONLY ADMIN for the alpha.59 secondary roleProbe 1062 repair path - All SQL is via the kb_internal_root client (ROOT_LOCAL bypass not used; revoking your own privilege mid-statement is risky) - FLUSH PRIVILEGES + mysql.user snapshot logged at end * fence_current_primary_local_writes_after_dcs gains the revoke step between 
local_read_only_is "1" and verify_post_dcs_local_root_write_fenced. Failed revoke -> immediate return 1; no partial fence. * apply_remote_root_fence "primary" in replication-roleprobe.sh: the GRANT ALL PRIVILEGES is replaced with an explicit privilege list that EXCLUDES SUPER / READ_ONLY ADMIN / BINLOG ADMIN. GRANT OPTION is now only via the trailing WITH GRANT OPTION clause (per Jack: putting it in the comma-separated privilege list is a syntax error in some MariaDB versions). This prevents alpha.61 from re-introducing the same bypass through normal role transitions. ShellSpec increments (10 new examples, 0 failures, 0 warnings, 157 total): * Describe "revoke_user_facing_root_admin_privileges_for_secondary()" 6 examples covering each sentinel: account-not-found skip, multi-host revoke success, multi-host with one fail-closed, 1141 already-fenced, no-bypass-priv already-fenced, no-client fail-closed * Describe "fence_current_primary_local_writes_after_dcs() revoke fail-closed" 1 example asserts verify probe is NOT called when revoke fails (negative trip-wire) * Existing happy-path run_switchover examples gain revoke_user_facing_root_admin_privileges_for_secondary stub (return 0) alongside the existing verify_post_dcs stub * roleprobe primary fence example asserts the new grant: REVOKE ALL PRIVILEGES present, GRANT ALL PRIVILEGES NOT present, SUPER NOT present, READ_ONLY ADMIN NOT present, BINLOG ADMIN NOT present, ", GRANT OPTION," (in the privilege list) NOT present, WITH GRANT OPTION (trailing clause) present Caveat: cmpd-semisync.yaml's set_local_root_account_state and set_remote_root_account_state UNLOCK paths still re-grant ALL PRIVILEGES; those are runtime sql-listener-fence transitions, not switchover-time operations. Post-switchover their re-grant would have to be revoked again on next switchover. Cleaning those up is alpha.61+ scope; alpha.60 trusts switchover-time revoke as the immediate fix. 
References: - alpha.59 RED closeout msg 80e3b77c (4-source confirmation) - alpha.59 design contract review msg 9e722fa8 (8-class) - addon-test-runner-write-after-bounded-role-gate-guide.md (companion methodology for the fence-correctness invariant)

…residual check alpha.60 v1 (commit 6efe0c6) had a class 4 / class 1 contract gap caught in pair review: * All three bypass privileges (READ_ONLY ADMIN, SUPER, BINLOG ADMIN) were REVOKEd in a single SQL batch per host. If the first REVOKE returned 1141 (no such grant) the batch could short-circuit, leaving SUPER and BINLOG ADMIN un-revoked, and the code recorded the entire host as privilege_absent_already_fenced. This is exactly the false-safety window the alpha.60 contract was meant to close. * The rollback path unfence_local_remote_root_for_primary still issued GRANT ALL PRIVILEGES, conflicting with the design direction that user-facing root must not carry admin bypass privileges between role transitions. alpha.60 v2 fixes: revoke_user_facing_root_admin_privileges_for_secondary now performs per-privilege REVOKE in a fixed inner loop (READ_ONLY ADMIN, SUPER, BINLOG ADMIN), and records a separate sentinel reason per privilege. 1141 on a single privilege is treated as that privilege absent only; SUPER and BINLOG ADMIN are still attempted. After all per-privilege REVOKEs for a host finish, the function re-issues SHOW GRANTS for that host and asserts no bypass privilege survives; if any does, the host is marked revoke_residual_bypass and the function fail-closes regardless of per-privilege rc. The defense-in-depth means even an intermediate SHOW GRANTS that missed a bypass priv cannot create a silent pass. unfence_local_remote_root_for_primary (rollback path) now issues the same explicit non-bypass GRANT list that the roleProbe primary path uses. SUPER, READ_ONLY ADMIN, BINLOG ADMIN are excluded from the privilege list and GRANT OPTION is only in the trailing WITH GRANT OPTION clause. This prevents the rollback path from re-introducing admin bypass that the next switchover would have to fight. 
ShellSpec: * Three v1 examples updated to the new per-privilege expectations (multi-host happy path with alternating SHOW GRANTS responses; 1141-on-one-priv now requires SUPER and BINLOG ADMIN to still be attempted; defense-in-depth for the no-bypass-priv-from-start case). * New example: 1141 on READ_ONLY ADMIN, REVOKE SUPER appears to succeed, but post-revoke SHOW GRANTS still shows SUPER -> revoke_residual_bypass -> fail-closed. * New example: unfence_local_remote_root_for_primary issues REVOKE ALL PRIVILEGES + explicit non-bypass GRANT list; assertions that GRANT ALL PRIVILEGES, SUPER, READ_ONLY ADMIN, BINLOG ADMIN, and ", GRANT OPTION," are all absent. * Total: 159 examples, 0 failures, 0 warnings. Caveat unchanged: cmpd-semisync.yaml's set_local_root_account_state and set_remote_root_account_state UNLOCK paths still issue GRANT ALL PRIVILEGES. These are runtime sql-listener-fence transitions, not switchover-time. alpha.60 trusts the post-DCS revoke as the immediate fix; comprehensive cleanup of those paths is alpha.61+ scope.
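
A rough sketch of the per-privilege revoke loop with the post-revoke residual check, under stated assumptions: `run_sql` is a stand-in stub for the real client call, `revoke_bypass_privs_for_host` is a hypothetical name, and the log format is illustrative. The reason strings are the ones named in the commit messages.

```shell
#!/bin/sh
# Stand-in for the real SQL client call (assumption, for illustration only).
# This stub pretends READ_ONLY ADMIN is already absent (errno 1141).
run_sql() {
  case "$1" in
    "REVOKE READ_ONLY ADMIN"*)
      echo "ERROR 1141 (42000): There is no such grant defined" >&2
      return 1
      ;;
    "SHOW GRANTS"*)
      echo 'GRANT SELECT ON *.* TO `root`@`%`'
      ;;
  esac
  return 0
}

# Revoke each bypass privilege separately, then re-read the grants.
revoke_bypass_privs_for_host() {
  host=$1
  for priv in "READ_ONLY ADMIN" "SUPER" "BINLOG ADMIN"; do
    # Capture stderr only; stdout of the REVOKE is discarded.
    err=$(run_sql "REVOKE ${priv} ON *.* FROM 'root'@'${host}'" 2>&1 >/dev/null)
    rc=$?
    if [ "$rc" -eq 0 ]; then
      echo "host=${host} priv=${priv} reason=revoked"
    else
      case "$err" in
        *1141*)
          # Absent for this privilege only; keep going with the others.
          echo "host=${host} priv=${priv} reason=privilege_absent_already_fenced"
          ;;
        *)
          echo "host=${host} priv=${priv} reason=revoke_failed rc=${rc}"
          return 1
          ;;
      esac
    fi
  done
  # Defense in depth: no bypass privilege may survive, whatever the rcs said.
  grants=$(run_sql "SHOW GRANTS FOR 'root'@'${host}'") || return 1
  if printf '%s\n' "$grants" | grep -Eq 'READ_ONLY ADMIN|SUPER|BINLOG ADMIN|ALL PRIVILEGES'; then
    echo "host=${host} reason=revoke_residual_bypass"
    return 1
  fi
  return 0
}
```

Calling `revoke_bypass_privs_for_host '%'` with the stub above logs one line per privilege and returns 0; swapping the stub's SHOW GRANTS output for a row that still contains SUPER makes the function return 1 with `revoke_residual_bypass`.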

alpha.60 v2 (commit 0cf4a48) had a class 1 silent fallback caught in pair review:

The host enumeration query

    hosts=$(mariadb ... -e "SELECT Host FROM mysql.user WHERE User='${root_user}';" \
      2>/dev/null || true)
    if [ -z "${hosts}" ]; then
      reason=root_account_not_found
      return 0
    fi

collapses two different states into one skip:
* rc=0 with empty stdout = genuinely no root account row -> skip OK
* rc!=0 (permission denied / connection broken / SQL error) = enumeration itself failed -> the function returned 0 having done nothing, then the downstream local-root write probe could only prove the current connection path is fenced, not that root@localhost / root@127.0.0.1 / root@% were actually enumerated and revoked.

This is a class 1 silent fallback against the alpha.60 contract, which is "enumerate actual root host rows then per-host per-priv revoke".

alpha.60 v3 fix:
* Capture the host query stdout AND rc (no `|| true`, no 2>/dev/null swallow)
* rc != 0 -> log reason=root_host_query_failed with rc and stderr; return 1 fail-closed
* rc == 0 with empty stdout -> reason=root_account_not_found, skip rc=0
* rc == 0 with non-empty stdout -> proceed into per-host / per-priv loop

ShellSpec: 1 new example asserts that a mocked SELECT-Host failure (mariadb returns rc=1 with permission-denied stderr) causes the function to fail-closed without entering REVOKE / FLUSH / verify.
Total: 160 examples / 0 failures / 0 warnings.

Caveat unchanged: cmpd-semisync.yaml runtime sql-listener-fence UNLOCK paths still re-grant ALL PRIVILEGES; alpha.61+ scope.
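
The v3 shape of the enumeration, keeping rc and stdout separate, might look roughly like this. It is a sketch under assumptions: `query_root_hosts` is a stand-in for the real mariadb invocation, `enumerate_root_hosts` is a hypothetical name, and `mktemp` is assumed to be available in the image.

```shell
#!/bin/sh
# Stand-in for the real host query (assumption); replace with the mariadb
# client call that selects Host from mysql.user.
query_root_hosts() {
  printf '%s\n' 'localhost' '127.0.0.1' '%'
}

# Distinguish "query failed" from "query succeeded with no rows".
enumerate_root_hosts() {
  err_file=$(mktemp) || return 1
  hosts=$(query_root_hosts 2>"$err_file")
  rc=$?
  err=$(cat "$err_file")
  rm -f "$err_file"
  if [ "$rc" -ne 0 ]; then
    # Enumeration itself failed: never treat this as "no root account".
    echo "reason=root_host_query_failed rc=${rc} stderr=${err}" >&2
    return 1
  fi
  if [ -z "$hosts" ]; then
    # Genuinely no rows: skipping is safe.
    echo "reason=root_account_not_found" >&2
    return 0
  fi
  # Non-empty list: hand it to the per-host / per-priv loop.
  printf '%s\n' "$hosts"
}
```

The point of the extra temp file is that stderr is preserved for the log instead of being discarded, while the exit code is read before any fallback can mask it.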

…dline alpha.60 switchover N=1 RED with first-blocker = addon product / candidate remote root write probe did not close within 8s. Triple-source evidence: kbagent action cost 13.004s (NOT 60s cap), action stderr "candidate remote root write probe did not close ... within 8s" with 8 attempts rc=1, writer double_writable=0, post-failure system self- converged. Causal hypothesis (strongly supported but inferential): alpha.59 fresh rerun GREEN was a false-PASS via admin-priv bypass — root held READ_ONLY ADMIN/SUPER/BINLOG ADMIN that bypassed candidate read_only=1, so INSERT succeeded before candidate was actually promoted. alpha.60 REVOKE removed that bypass; INSERT now requires candidate to have read_only=0, which means the action must observe DCS-side promotion before testing writability. The DCS switchover record → candidate read_only flip propagation took >8s in this N=1 run. alpha.61 fixes (per Jack 01:40 8-class XP design-contract review): action sequence becomes prepare → DCS → fence-old-primary → wait_candidate_promoted_via_syncerctl (NEW) → wait_candidate_remote_root_write_ready, all four steps sharing a single global deadline (default 55s, leaving 5s buffer below kbagent 60s ceiling). Per-stage budget is the smaller of its configured maximum and the remaining global budget. Stage timeouts each have distinct fail-closed reasons (action_deadline_exhausted_*). wait_candidate_promoted_via_syncerctl polls syncerctl getrole on the candidate FQDN expecting "primary"; per-attempt log records role / rc / stderr; sentinels distinct per Jack class 4 (role_query_failed, role_unknown, role_not_primary, candidate_fqdn_not_found). Fail-closed reason: candidate_not_promoted_via_dcs_in_budget. wait_candidate_remote_root_write_ready (existing) now captures full SQL stderr per failed attempt instead of opaque rc=1, and accepts a stage deadline parameter so the caller can clamp it by remaining global budget. 
roleProbe apply_remote_root_fence "secondary" tightened: removes SUPER from the GRANT list (was bundled in the secondary follow-time grant), and removes the best-effort GRANT READ_ONLY ADMIN and GRANT CONNECTION ADMIN that were applied after the main grant. The legitimate need for read_only-bypass on secondary (kb_health_check 1062 repair) uses kb_internal_root in secondary_kb_health_check_repair_attempt; user-facing root no longer carries any admin-bypass priv. CONNECTION ADMIN dropped by minimum-priv principle. BINLOG MONITOR and SLAVE MONITOR remain as read-only monitoring privileges. REPLICATION MASTER ADMIN remains so the secondary can run CHANGE MASTER / START SLAVE for follow-time maintenance. ShellSpec increments (6 new examples, 0 failures, 0 warnings, 166 total): * Describe "wait_candidate_promoted_via_syncerctl()" 5 examples cover: immediate primary success, empty FQDN fail-closed, rc!=0 retry-then-success with stderr captured, not-primary retry-then-success, stage budget exhaustion fail-closed. * New Describe "run_switchover() alpha.61 global deadline" 1 example asserts that an earlier stage (prepare) consuming the deadline causes the next stage entry to fail-closed with action_deadline_exhausted_*. * Existing happy-path run_switchover examples gain wait_candidate_promoted_via_syncerctl stub (return 0) alongside the existing verify_post_dcs_local_root_write_fenced and revoke_user_facing_root_admin_privileges_for_secondary stubs. * Existing alpha.59 contract example for write probe budget exhaustion updated to match the new sentinel reason (candidate_remote_root_write_not_ready_in_budget). * roleProbe secondary fence example asserts the new grant: REVOKE ALL PRIVILEGES + GRANT non-bypass list (no SUPER, no READ_ONLY ADMIN, no BINLOG ADMIN, no CONNECTION ADMIN); BINLOG MONITOR + SLAVE MONITOR still present. Caveat unchanged: cmpd-semisync.yaml runtime sql-listener-fence UNLOCK paths still re-grant ALL PRIVILEGES; alpha.62+ scope.

…ment alpha.61 v1 (63f91d1) shipped with two runtime contract holes that Jack's package-level review caught (msg b7a1e283 + 02:00 follow-up 7423d58d): Blocker 1 — bash-only $SECONDS / $'\n' under #!/bin/sh shebang: replication-switchover.sh declares #!/bin/sh but used bash-only $SECONDS for the deadline expression and $'\n' case patterns for parsing syncerctl multi-line role output. Under dash (the actual runtime sh in the mariadb image) $SECONDS is not auto-incrementing, so (SECONDS - started) evaluates to 0 forever and the polling loops would only be bounded by the kbagent 60s ceiling — defeating the v1 deadline fix entirely. The $'\n' case patterns also do not match in dash. Reproduced locally: SECONDS=<> started=<> elapsed_expr=0. Blocker 2 — global deadline only enforced on 2 of 5 stages: v1 only checked remaining_action_budget before the candidate-promote and write-probe stages. The earlier prepare / dcs / fence stages had no stage budget and no action_deadline_exhausted_<stage> sentinel — contradicting the v1 commit's stated 4-stage / 5-sentinel contract. alpha.61 v2 fixes: POSIX wall-clock helpers replace $SECONDS: * now_epoch() — POSIX `date +%s`; rc=2 on failure or non-numeric output (NOT silent 0 fallback). * initialize_action_clock() — captures action_started_epoch + probes `command -v timeout`; date failure is fatal so we never run with a silently broken clock. * remaining_action_budget() — rc=2 on clock failure (caller MUST treat as fail-closed, NOT as "0 seconds remaining"). * stage_budget_or_exit() — computes min(stage_max, remaining); on remaining<=0 OR clock failure, emits action_deadline_exhausted_<stage> + cause=action_clock_unavailable when applicable. * extract_syncerctl_role() — POSIX `printf|awk` line-based parser replacing $'\n' case patterns. * run_syncerctl_getrole_with_timeout() — wraps syncerctl with `timeout <wall>` where wall=min(per_call, stage_budget); caller must verify SWITCHOVER_HAS_TIMEOUT=1 first. 
Five-stage deadline enforcement (per Jack 02:00 review #2): Each of prepare / dcs / fence / promote / write checks remaining_action_budget BEFORE invoking the stage body. Stage budgets are independently configurable (SWITCHOVER_PREPARE_STAGE_BUDGET_SECONDS=10, SWITCHOVER_DCS_STAGE_BUDGET_SECONDS=15, SWITCHOVER_FENCE_STAGE_BUDGET_SECONDS=15, CANDIDATE_PROMOTED_VIA_SYNCERCTL_WAIT_SECONDS=30, CANDIDATE_REMOTE_ROOT_WRITE_PROBE_WAIT_SECONDS=10) and clamped at runtime by remaining_action_budget. Sentinel reasons are action_deadline_exhausted_{prepare,dcs,fence,promote,write}. External-tool timeout enforcement (per Jack 02:00 review #2): wait_candidate_promoted_via_syncerctl explicitly checks SWITCHOVER_HAS_TIMEOUT before entering its loop; if `timeout(1)` is absent it fails closed with reason=external_timeout_unavailable (NOT silent fallback to unbounded syncerctl call). The wrapper picks min(per_call, stage_budget) so a single syncerctl call cannot exceed remaining stage budget. SQL probes inherit MARIADB_CONNECT_TIMEOUT_SECONDS on connect and stage budget on the polling loop. ShellSpec increments (191 examples, 0 failures, up from v1 166): * Describe "alpha.61 v2 POSIX clock helpers": now_epoch (3 examples incl. rc=2 on date failure / non-numeric), remaining_action_budget (3 examples incl. rc=2 on missing started_epoch / clock failure), stage_budget_or_exit (4 examples). * Describe "alpha.61 v2 initialize_action_clock()" (3 examples): HAS_TIMEOUT=1 / 0 detection, date failure → fail-closed. * Describe "alpha.61 v2 extract_syncerctl_role()" (4 examples): single-line / multi-line POSIX parsing, no partial-match confusion. * Describe "alpha.61 v2 wait_candidate_promoted_via_syncerctl() timeout-availability gate" (1 example). * Describe "run_switchover() alpha.61 v2 per-stage deadline enforcement" (6 examples): one per stage (prepare/dcs/fence/promote/write) + 1 mid-action clock-failure example. 
* Describe "alpha.61 v2 POSIX shell self-check" (2 examples): `dash -n` and `bash -n` static parse must succeed. Caveat unchanged: cmpd-semisync.yaml runtime sql-listener-fence UNLOCK paths still re-grant ALL PRIVILEGES; alpha.62+ scope. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
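
The wall-clock helpers described in the alpha.61 v2 commit can be approximated as below. This is a sketch, not the shipped script: the function names `now_epoch` and `remaining_action_budget` come from the commit message, while `stage_budget`, `ACTION_DEADLINE_SECONDS`, and the exact log text are assumptions made for the example.

```shell
#!/bin/sh
# Print epoch seconds; return 2 when date fails or prints something
# non-numeric, so a broken clock is never read as "0".
now_epoch() {
  now=$(date +%s 2>/dev/null) || return 2
  case "$now" in
    ''|*[!0-9]*) return 2 ;;
  esac
  printf '%s\n' "$now"
}

# Print seconds left before the global deadline; return 2 on clock failure.
remaining_action_budget() {
  [ -n "${action_started_epoch:-}" ] || return 2
  current=$(now_epoch) || return 2
  printf '%s\n' $(( ACTION_DEADLINE_SECONDS - (current - action_started_epoch) ))
}

# Print min(stage_max, remaining); fail closed when nothing is left or the
# clock is unavailable. (Hypothetical name, modelled on stage_budget_or_exit.)
stage_budget() {
  stage=$1
  stage_max=$2
  remaining=$(remaining_action_budget) || {
    echo "reason=action_deadline_exhausted_${stage} cause=action_clock_unavailable" >&2
    return 1
  }
  if [ "$remaining" -le 0 ]; then
    echo "reason=action_deadline_exhausted_${stage}" >&2
    return 1
  fi
  if [ "$stage_max" -lt "$remaining" ]; then
    printf '%s\n' "$stage_max"
  else
    printf '%s\n' "$remaining"
  fi
}

# Global deadline below the kbagent 60s ceiling (55s per the commit message).
ACTION_DEADLINE_SECONDS=55
action_started_epoch=$(now_epoch) || exit 1
```

A caller would then write something like `budget=$(stage_budget dcs 15) || return 1` before each stage body, so that a clock failure and an exhausted deadline both stop the action instead of degrading to an unbounded loop.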

…ntinel alpha.61 v2 (44f55de) shipped two contract gaps that Jack's package-level review caught (msg bbf30db6 + 02:23 follow-up 8b1f12fe): Blocker 1 — stage budget computed at entry but stage body unbound: v2 computed prepare_budget / dcs_budget / fence_budget but only used them for log lines. The stage bodies (syncerctl_switchover, the SQL helpers inside prepare_current_primary_for_switchover and fence_current_primary_local_writes_after_dcs) had no wall-clock cap, so the same failure mode v1 had can recur: stage entry budget>0, stage body hangs, kbagent 60s cap kills the action. Blocker 2 — timeout(1) absence not fail-fast at action entry: v2 set SWITCHOVER_HAS_TIMEOUT=0 and continued through prepare/DCS/fence, failing only at the promote stage. This contradicted the inline comment "absence of timeout fails the action BEFORE we touch DCS" and the 02:01 fail-closed boundary agreement. alpha.61 v3 fixes (per Jack 02:23 review tightening): 1. timeout(1) hard dependency at action entry: initialize_action_clock now `return 1` with reason=external_timeout_unavailable when `command -v timeout` fails, BEFORE any DCS-touching work. The subsequent SWITCHOVER_HAS_TIMEOUT gate is preserved as defense-in-depth. 2. syncerctl_switchover wraps timeout(1) when caller passes stage_budget: wall = min(SYNCERCTL_PER_CALL_TIMEOUT_SECONDS, dcs_budget). timeout(1) exit codes 124 (default SIGTERM after timeout), 125 (timeout's own error), 137 (SIGKILL via --kill-after, defensive) are mapped to a distinct sentinel `reason=syncerctl_timeout stage=dcs stage_budget=Ns rc=R` so closeout can tell wall-clock budget exhaustion from a real syncerctl failure (rc!=0 from syncerctl itself or a zero-status non-success message). The legacy naked path is preserved when the caller omits stage_budget (no current production caller does, but tests exercise both). 3. 
Per-stage post-body overrun check for prepare / dcs / fence: After each of these stage bodies returns 0, run_switchover re-checks remaining_action_budget. If <=0 (stage body wall-clock exceeded budget) OR if the clock has failed mid-action, emit the distinct sentinel action_deadline_exhausted_<stage>_overrun + return 1 BEFORE entering the next stage. This bounds the stage body even though the inner SQL helpers do not yet enforce the budget per-call (caveat below). ShellSpec increments (195 examples, 0 failures, up from v2 191): * Updated Describe "alpha.61 v2 initialize_action_clock()" example: v3: timeout(1) absence is now fail-closed at action entry with reason=external_timeout_unavailable cause=command_v_timeout_failed (instead of v2's silent SWITCHOVER_HAS_TIMEOUT=0 + return 0). * Updated Describe "run_switchover() alpha.61 v2 per-stage deadline enforcement" examples for prepare / dcs / fence: each now expects action_deadline_exhausted_<stage>_overrun (since v3's post-body check fires first when the prior stage body burns the deadline). The mid- action clock-failure example now expects action_deadline_exhausted_prepare_overrun (the post-prepare check catches the broken clock). * New Describe "alpha.61 v3 syncerctl_switchover() timeout sentinel" (4 examples): rc=124 → emits reason=syncerctl_timeout with full attribution; rc=7 from syncerctl itself → legacy 'syncerctl exited with rc=' sentinel (NOT timeout); success path preserved; legacy callers (no stage_budget arg) still hit the naked path with no timeout wrapper invocation. Caveat (v3 scope cap): inner SQL helpers in prepare/fence stage bodies do NOT yet enforce the stage budget per-call. They use only mariadb client --connect-timeout (5s default). The v3 post-body overrun check catches the wall-clock excess at the stage boundary, but a single inner SQL hang of up to ~stage_budget+connect_timeout can still slip through before the boundary check fires. 
Per-call SQL helper bounded budget is deferred to alpha.61 v4 / alpha.62 to keep alpha.60 revoke main path untouched in this round and to avoid regression risk. Caveat carried from v2: cmpd-semisync.yaml runtime sql-listener-fence UNLOCK paths still re-grant ALL PRIVILEGES; alpha.62+ scope. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
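
The exit-code mapping for the timeout(1) wrapper can be illustrated as follows. The sketch assumes GNU coreutils `timeout` (exit 124 on expiry, 125 on its own failure, 137 after a KILL); `run_with_stage_timeout` is a hypothetical name, and the wrapped command is arbitrary here, not the real syncerctl call.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command under timeout(1) and attribute the
# failure either to the wall-clock budget or to the command itself.
# $1 = stage name, $2 = budget in seconds, rest = command to run.
run_with_stage_timeout() {
  stage=$1
  budget=$2
  shift 2
  timeout "$budget" "$@"
  rc=$?
  case "$rc" in
    0)
      return 0
      ;;
    124|125|137)
      # Budget exhaustion (or timeout's own failure): distinct sentinel.
      echo "reason=syncerctl_timeout stage=${stage} stage_budget=${budget}s rc=${rc}" >&2
      return "$rc"
      ;;
    *)
      # The wrapped command failed on its own terms.
      echo "syncerctl exited with rc=${rc}" >&2
      return "$rc"
      ;;
  esac
}
```

For example, `run_with_stage_timeout dcs 1 sleep 3` returns 124 with the timeout sentinel, while a command that exits 7 by itself is reported through the legacy message, which is what lets a closeout separate the two causes.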

….62 v1) alpha.61 switchover/role-transition under-load N=1 came back RED at the pre-DCS local_remote_root_is_fenced_for_secondary verifier (Jack closeout msg 40e83143):

    Switchover failed: current primary remote root fence was not verified before DCS switchover

A 5-layer elimination narrowed the cause to the product layer; first blocker = addon switchover action pre-DCS remote-root fence + rollback verifier contract drift between switchover-side callsites and roleProbe-side callsites that were tightened in alpha.61 v3 but did NOT propagate.

Direct evidence:
* kbagent action cost 1.945s (NOT 60s cap; no v3 deadline/timeout sentinel hit — only entered prepare stage)
* writer double_writable=0 (race not surfaced; fence semantics actually in tightening — action fail-closed correctly)
* data 54/1485 vs 54/1485 (no data loss; no data-loss conclusion is written in this RED cell)
* preserved scene mariadb-task427-switchover-alpha61-sw-n1-n1-194559 remains untouched until alpha.62 ships

alpha.62 v1 fixes (per Jack 04:08 v1 design review + 04:10 v2 ACCEPT + 04:12 6 review focal points + 04:13 boundary lock):

DRIFT A — switchover pre-DCS supplementary admin grant: Removed grant_remote_root_optional_admin_privileges_for_secondary entirely. fence_local_remote_root_for_secondary previously called it immediately after the main fence, granting BINLOG ADMIN / CONNECTION ADMIN / READ_ONLY ADMIN back to user-facing root — defeating alpha.61 secondary fence tightening in the same callsite.

DRIFT B — rollback verifier requires GRANT ALL PRIVILEGES that alpha.60 v2 unfence no longer grants: Renamed remote_root_has_full_access → remote_root_has_explicit_primary_grant. New verifier reads grants via kb_internal_root view, requires the core write subset (INSERT/UPDATE/DELETE/CREATE/DROP), rejects GRANT ALL PRIVILEGES, rejects admin bypass privileges.
DRIFT C — local_remote_root_is_fenced_for_secondary observability gap + criteria drift: Replaced with strong-semantics observable per-host verifier:
* reads grants via kb_internal_root (avoids root self-query loop)
* explicit reject of bypass privileges and user-facing write privileges
* structured single-line log with grants_sha (sha256 → sha1 → md5 → unavailable:hash_tool_unavailable fallback chain), grants_bypass list, write_probe_attempted, write_probe_rc, write_probe_errno, verified_host, probe_host attribution, reason
* full grants dump after sentinel line on failure
* 127.0.0.1 host: TCP write probe expecting 1044/1290 errno
* localhost host: grants-only (no socket probe attempted)
* % wildcard host: grants-only (not locally probable)
* distinct reason values: ok_by_local_probe:<errno> / ok_by_grants_only:<why> / grants_query_failed / bypass_priv_residual:<list> / writable_unexpected / probe_account_mismatch / account_grants_empty_or_1141 / account_not_fenced

Per-host enumeration (Jack 04:08 Blocker 1 Option B): replaces single-host root@${MARIADB_ROOT_HOST:-%} fence with mysql.user enumeration through kb_internal_root. host_list is read ONCE in prepare_current_primary_for_switchover and passed to fence + verifier; functions called externally fall back to self-enumerate with drift detection (sha mismatch → fail-closed root_host_list_drift, NEVER silent).

`timeout(1)` and `command -v` semantics unchanged from alpha.61 v3.

Per Jack 04:08 Tightening 2 (DRIFT D out-of-scope): cmpd-semisync.yaml runtime sql-listener-fence UNLOCK paths still GRANT ALL PRIVILEGES — remains alpha.63+ scope. alpha.62 live gate must include negative grep to confirm switchover script CM does NOT contain grant_remote_root_optional_admin_privileges_for_secondary nor admin bypass GRANT statements for user-facing root.
Single-source-of-truth constants (Jack 04:08 Tightening 3 / 04:10 coding guardrail 3) defined at top of replication-switchover.sh: * SWITCHOVER_BYPASS_PRIVILEGES_PATTERN * SWITCHOVER_USER_FACING_WRITE_PATTERN * SWITCHOVER_SECONDARY_FENCE_GRANT_BODY * SWITCHOVER_EXPLICIT_PRIMARY_GRANT_BODY * SWITCHOVER_PRIMARY_CORE_WRITE_PRIVS ShellSpec strong-binds the EXPLICIT_PRIMARY_GRANT_BODY contains the core write privs, preventing future drift between unfence_local_remote_root_for_primary GRANT body and remote_root_has_explicit_primary_grant verifier. candidate_is_primary lost its remote_root_has_full_access check (the legacy GRANT ALL PRIVILEGES signature no longer exists post-alpha.60 v2 unfence + alpha.61 v3 roleProbe primary fence). Remaining 4 signals (read_only=0 + no slave_status + remote_root_write_ready INSERT probe + syncer role=primary) are sufficient — the write_ready INSERT probe on the candidate is itself the strongest signal. ShellSpec increments (212 examples, 0 failures, alpha.61 v3 was 195): * Updated existing fence test to assert NO bypass priv grants, alpha.62 GRANT body grep * Updated unfence test to assert per-host invocation + no admin bypass + grant body invariant strong-bind * Added 4 run_switchover stubs for enumerate_user_facing_root_hosts * New Describe "alpha.62 v1 helpers and verifiers" with: - compute_grants_sha (2 examples: sha256 happy path, all-tools-missing fallback to unavailable:hash_tool_unavailable sentinel) - enumerate_user_facing_root_hosts (2 examples: rc=0 with host list, rc!=0 → fail-closed root_host_query_failed) - _verify_host_is_fenced (7 examples: 127.0.0.1 ok_by_local_probe:1044, localhost ok_by_grants_only, % ok_by_grants_only, READ_ONLY ADMIN bypass residual, INSERT user-facing-write residual, write probe rc=0 writable_unexpected, grants_query_failed unrelated stderr) - _verify_host_has_explicit_primary_grant (4 examples: happy path, all_privileges_residual, core_write_priv_missing, admin_bypass_residual) - 
fence_local_remote_root_for_secondary drift detection (1 example: double-enumeration sha mismatch → fail-closed root_host_list_drift) Caveats carried: * cmpd-semisync.yaml UNLOCK paths still GRANT ALL PRIVILEGES (DRIFT D, alpha.63+) * alpha.61 v3 caveat: prepare/fence inner SQL helpers no per-call remaining-budget (alpha.63+ or independent PR) * alpha.61 process miss (live gate ACCEPT → execute commit boundary) independently attached to alpha.61 v3 live gate ACCEPT post; not affected by alpha.62 runtime closeout Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…drift alpha.62 v1 (commit 675f537) was caught at Jack's package-level review (msg c66d35bf) for two issues that don't change runtime behavior but break the v2 design's live-gate negative-grep contract: Live-gate negative grep blocker: v1 design committed to alpha.62 live gate negative grep on the literal function names `grant_remote_root_optional_admin_privileges_for_secondary` and `remote_root_has_full_access`. The function bodies were correctly removed/renamed in v1, but the *comments* in the rewritten functions still referenced the old names verbatim. Rendered manifest grep would hit those comments and either false-RED the live gate or be forced to carry a comment-only caveat (alpha.61 already has the same anti-pattern for `$SECONDS` / `$'\n'` comments — alpha.62 should not double down). Fix: rewrite the four comment mentions to descriptive text ("legacy optional secondary admin grant helper", "legacy full-access rollback verifier", etc.) so the literal old function names appear nowhere in source nor rendered manifest. grants_sha format tightening: v1 returned `grants_sha=unavailable:hash_tool_unavailable` (single colon-joined field) per the v1 design. v2 design instead splits this into two structured fields: `grants_sha=<hash|unavailable> reason_hash=<sha256|sha1|md5|hash_tool_unavailable>`. This avoids grep/awk needing to disambiguate colon semantics in downstream parsers and matches the rest of the structured log style. Fix: compute_grants_sha now emits `<hash>|<algo>` (pipe-separated internal token used for direct comparison in drift detection); a new helper split_grants_sha_field produces the two-field log fragment `grants_sha=<hash> reason_hash=<algo>` for inline use in verifier log lines. All verifier log lines (_verify_host_is_fenced and _verify_host_has_explicit_primary_grant) now embed ${grants_sha_field} (already-formatted) instead of the legacy `grants_sha=${grants_sha}` template. 
The host_list_sha debug logs (informational, not part of the structured verifier contract) keep the internal `<hash>|<algo>` form for drift comparison. ShellSpec increments (215 examples, 0 failures, alpha.62 v1 was 212): * Renamed Context "compute_grants_sha()" → "compute_grants_sha() / split_grants_sha_field()" (5 examples total: sha256 happy path, unavailable|hash_tool_unavailable token format, split happy path, split unavailable case, split defensive single-token case). * Updated _verify_host_is_fenced 127.0.0.1 ok_by_local_probe:1044 example to assert grants_sha + reason_hash=sha256 fields appear as TWO separate fields (not colon-joined). v2 caveats unchanged from v1: * DRIFT D (cmpd-semisync.yaml UNLOCK GRANT ALL PRIVILEGES) → alpha.63+ * alpha.61 v3 prepare/fence inner SQL helper budget caveat → alpha.63+ * alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post `5a6390fe` (NOT covered by alpha.62 runtime closeout per Cindy 03:13 directive) * Slock attachment fetch transient (Jack v1 attachment download fail-back) — v2 re-attached as fresh attachment id Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
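
The hash fallback chain and the two-field log fragment could be sketched like this. The function names match the commit message, but the bodies are a guess at one plausible implementation: reading grants on stdin, the `sha256sum` / `sha1sum` / `md5sum` tool names, and the `reason_hash=unknown` output for a single-token input are all assumptions.

```shell
#!/bin/sh
# Read grants text on stdin and print "<hash>|<algo>", falling back through
# sha256 -> sha1 -> md5 -> an explicit "unavailable" token.
compute_grants_sha() {
  input=$(cat)
  for tool in sha256sum sha1sum md5sum; do
    if command -v "$tool" >/dev/null 2>&1; then
      hash=$(printf '%s' "$input" | "$tool" | awk '{print $1}')
      # Strip the "sum" suffix to get the algorithm label.
      printf '%s|%s\n' "$hash" "${tool%sum}"
      return 0
    fi
  done
  printf '%s\n' 'unavailable|hash_tool_unavailable'
}

# Turn the internal "<hash>|<algo>" token into two structured log fields.
split_grants_sha_field() {
  token=$1
  case "$token" in
    *'|'*)
      printf 'grants_sha=%s reason_hash=%s\n' "${token%%|*}" "${token#*|}"
      ;;
    *)
      # Defensive single-token case; this output shape is an assumption.
      printf 'grants_sha=%s reason_hash=unknown\n' "$token"
      ;;
  esac
}
```

Keeping the pipe-joined token internal and formatting it only at log time means drift detection can compare tokens directly, while downstream parsers see two plain `key=value` fields.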

alpha.62 switchover/role-transition under-load N=1 came back RED at the pre-DCS verifier (Jack closeout msg 7f880bea) with two implementation bugs in the alpha.62 v1/v2 verifier code that escaped both ShellSpec coverage and 8-class XP review because they only surface against runtime-realism inputs: I-1 — `GRANT OPTION` over-match against default GRANT PROXY row: SWITCHOVER_USER_FACING_WRITE_PATTERN included `GRANT OPTION` as a literal token. mariadb auto-creates a default `GRANT PROXY ON ''@'%' TO 'root'@'%' WITH GRANT OPTION` row that survives REVOKE ALL PRIVILEGES (PROXY priv is in a separate priv class). The verifier's grep matched `GRANT OPTION` substring inside that row and reported bypass_priv_residual:GRANT OPTION even when the actual fence main grant was clean. I-2 — multi-line SQL stderr broke pipe-separated probe parser: _local_root_write_probe_127 returned `printf '%s|%s|%s' rc errno out` on stdout. When `out` contained the multi-line SQL stderr that mariadb client emits (e.g., `ERROR 1044 (42000) at line N:\n Access denied for user...\n to database 'kubeblocks'`), the caller's `cut -d'|' -f2` returned `1044\n<line2-of-stderr>` (cut operates per line; lines 2+ have no `|`, so cut returned the whole line for field 2). The case-statement against this multi-line value never matched the `1044|1290|1142` literals, so a real priv-based fence was misclassified as `probe_account_mismatch`. alpha.63 v1 fixes (per Jack 05:24 instrumentation tightening): I-1 fix: `GRANT OPTION` token REMOVED from SWITCHOVER_USER_FACING_WRITE_PATTERN (it was a trailing modifier, not a priv name; the remaining tokens INSERT/UPDATE/DELETE/CREATE/DROP/ALTER/CREATE USER are unambiguous priv names). 
Defense-in-depth: a new line-anchored SWITCHOVER_GRANTS_IGNORED_LINE_PATTERN whitelist (`^GRANT PROXY ON .* TO .* WITH GRANT OPTION$`) is applied BEFORE the bypass / write residual scan via three independent helpers: * _filter_grants_keep_unmatched (echoes filtered grants on stdout) * _count_grants_matched_whitelist (echoes integer count) * _dump_grants_matched_whitelist (echoes matched lines for audit) Each helper is invoked in its own `$(...)` subshell so the count + dump aren't lost to the subshell-globals problem. The verifier log adds `grants_ignored_count=<N>` to every line and dumps ignored lines after the main grants_dump on failure paths. Surprise lines like `GRANT INSERT ... WITH GRANT OPTION` are NOT silently whitelisted (line-anchored pattern is precise, not broad `grep -v PROXY`). I-2 fix: _local_root_write_probe_127 now writes its three result fields into module-scope global variables __PROBE_RC, __PROBE_ERRNO, __PROBE_OUT instead of joining with `|` and echoing on stdout. Caller pre-clears the three globals BEFORE the call (defends against stale value reuse) and post-validates that __PROBE_RC is non-empty numeric (else fail-closed `probe_result_malformed`) and __PROBE_ERRNO is in the 5-value valid set {1044, 1290, 1142, 0, other} (else fail-closed `probe_result_malformed_errno`). Multi-line SQL stderr is preserved intact in __PROBE_OUT and dumped after the structured log line on failure paths. 
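
The I-2 parsing failure is easy to reproduce in isolation. The snippet below is a standalone demonstration, not the addon's code: the stderr text is shortened from the shape quoted in the commit message, and `probe` is a hypothetical stand-in for `_local_root_write_probe_127`.

```shell
#!/bin/sh
# Multi-line stderr in the shape the mariadb client emits.
out='ERROR 1044 (42000) at line 1:
 Access denied for user'

# Old shape: join rc, errno, and output with "|" and parse with cut.
joined=$(printf '%s|%s|%s' 1 1044 "$out")
field2=$(printf '%s\n' "$joined" | cut -d'|' -f2)
# cut works line by line. The second line has no delimiter, so cut passes it
# through whole, and field2 now holds two lines instead of just the errno.

old_verdict=probe_account_mismatch
case "$field2" in
  1044|1290|1142) old_verdict=fenced ;;
esac

# New shape: the probe stores its results in globals, so embedded newlines
# in the output cannot leak into the errno field.
probe() {
  __PROBE_RC=1
  __PROBE_ERRNO=1044
  __PROBE_OUT=$out
}
__PROBE_RC= __PROBE_ERRNO= __PROBE_OUT=   # pre-clear against stale values
probe

new_verdict=probe_account_mismatch
case "$__PROBE_ERRNO" in
  1044|1290|1142) new_verdict=fenced ;;
esac
```

With the joined format the `case` never matches, because `field2` is no longer a bare errno; with the globals the same probe result classifies as fenced.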
ShellSpec increments (226 examples, 0 failures, alpha.62 v2 was 215): * Context "grants whitelist helpers" (5 examples): - _filter_grants_keep_unmatched filters PROXY default row from output - _count_grants_matched_whitelist returns 1 for one PROXY row - _filter_grants_keep_unmatched does NOT whitelist non-proxy `WITH GRANT OPTION` lines - _count_grants_matched_whitelist returns 0 for no PROXY shape - _count_grants_matched_whitelist returns 2 for multiple PROXY rows * Context "_verify_host_is_fenced() runtime-realism: GRANT PROXY default row" (2 examples): - % host with non-bypass main grant + default GRANT PROXY row → reason=ok_by_grants_only:wildcard_or_remote_not_locally_probable + grants_ignored_count=1 in log (alpha.62 v1/v2 was false-RED bypass_priv_residual:GRANT OPTION — closed) - localhost host with PROXY default row → ok_by_grants_only + grants_ignored_count=1 (alpha.62 RED parity case) * Context "_local_root_write_probe_127() global var hardening" (4 examples): - pre-clear globals defends against stale value reuse - 127.0.0.1 with multi-line SQL stderr containing 1044 → __PROBE_ERRNO=1044 correctly extracted (alpha.62 RED root cause closed) - post-validate __PROBE_RC non-numeric → fail-closed `probe_result_malformed` - post-validate __PROBE_ERRNO not in valid set → fail-closed `probe_result_malformed_errno` Carry-forward notes (Jack RED closeout): * pod0 secondary admin bypass residual after failed switchover: noted by Jack as alpha.63+ scope candidate. Not in this fix scope unless callsite-pair scan finds new evidence pointing to a path other than I-1/I-2. Independent of DRIFT D / live-grants validation. * DRIFT D (cmpd-semisync.yaml UNLOCK GRANT ALL PRIVILEGES) → still alpha.64+ scope. * alpha.61 v3 prepare/fence inner SQL helper budget caveat → still alpha.64+ scope. * alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post `5a6390fe`. 
* alpha.62 RED has invalidated the narrative that "alpha.62 review-pass→execute clean cadence" holds on its own (Cindy 05:14 directive `ba10ff18`): a clean process followed by a runtime RED instead exposed the runtime-realism gap in ShellSpec mock coverage. The cadence-discipline candidate topic is restated as "process-discipline + runtime-validation are independent dimensions". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
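The I-2 failure mode can be reproduced outside the addon. The sketch below is illustrative only: `probe_joined` and `probe_globals` are hypothetical stand-ins for the alpha.62 and alpha.63 shapes of `_local_root_write_probe_127`, `classify` is an invented helper, and the stderr text is a made-up sample in the shape quoted in the commit message.

```shell
#!/bin/sh
# Hypothetical multi-line stderr in the shape the mariadb client emits.
sample_err="ERROR 1044 (42000) at line 1:
 Access denied for user 'root'@'127.0.0.1'
 to database 'kubeblocks'"

# alpha.62 shape: join three fields with '|' and echo them on stdout.
probe_joined() {
  printf '%s|%s|%s' "1" "1044" "$sample_err"
}

# alpha.63 shape: publish results through globals; nothing is parsed back.
probe_globals() {
  __PROBE_RC=1
  __PROBE_ERRNO=1044
  __PROBE_OUT=$sample_err
}

# Map an errno to a (simplified) verdict.
classify() {
  case "$1" in
    1044|1290|1142) echo "priv_fence" ;;
    *)              echo "probe_account_mismatch" ;;
  esac
}

# cut works per line; lines without '|' are passed through whole,
# so field 2 comes back as "1044" followed by the remaining stderr lines.
joined_errno=$(probe_joined | cut -d'|' -f2)
joined_class=$(classify "$joined_errno")

# Pre-clear, call, then post-validate the rc before classifying.
__PROBE_RC="" __PROBE_ERRNO="" __PROBE_OUT=""
probe_globals
case "$__PROBE_RC" in
  ''|*[!0-9]*) globals_class="probe_result_malformed" ;;
  *)           globals_class=$(classify "$__PROBE_ERRNO") ;;
esac

echo "joined:  $joined_class"
echo "globals: $globals_class"
```

The joined variant misclassifies the real 1044 fence because the case patterns never see a single-line value, while the globals variant classifies on the errno field alone and keeps the full stderr available for dumping.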

…alpha.63 v2) alpha.63 v1 (commit 423703e) was caught at Jack's package-level review (msg 4cfdd261, 08:36) for one contract field that the v1 implementation left unenforced at the verifier read site: The 05:26 design contract said: "non-proxy `WITH GRANT OPTION` must fail-closed". v1 only: * removed `GRANT OPTION` literal token from SWITCHOVER_USER_FACING_WRITE_PATTERN * added line-anchored proxy whitelist SWITCHOVER_GRANTS_IGNORED_LINE_PATTERN which closed the false-RED on the default `GRANT PROXY ... WITH GRANT OPTION` row, but a SELECT-only-with-GRANT-OPTION input like `GRANT SELECT ON *.* TO 'root'@'%' WITH GRANT OPTION` would now false-PASS: * not whitelisted (doesn't match `^GRANT PROXY ON .*`) * SELECT not in user-facing-write pattern → write_residual empty * no SUPER/READ_ONLY ADMIN/etc. → bypass_residual empty * verifier returns ok_by_grants_only The v1 ShellSpec example for "non-proxy WITH GRANT OPTION must fail-closed" used `GRANT INSERT, UPDATE ... WITH GRANT OPTION`, which fail-closes via INSERT/UPDATE in the write residual scan — so the WITH-GRANT-OPTION-as-bypass-token semantic was never actually exercised. alpha.63 v2 fix (per Jack 08:36 review HOLD): * Add explicit `grant_option_residual` check in _verify_host_is_fenced AFTER the proxy whitelist filter and AFTER the user-facing-write residual check. The check awks for any line containing literal ` WITH GRANT OPTION` (with leading space — the trailing clause marker). Since PROXY rows have already been removed by the whitelist filter, any remaining match is non-proxy → fail-closed with a distinct sentinel `reason=grant_option_residual` (NOT folded into bypass_priv_residual, so closeout can grep specifically for this token-level violation). * Structured log adds `grants_bypass=GRANT_OPTION` field plus a separate `grant_option_residual_dump_begin/end` block dumping the offending lines. Short-circuit order is: bypass_priv_residual (admin bypass priv names) → write_priv_residual (INSERT/UPDATE/...) 
→ grant_option_residual (WITH GRANT OPTION clause). Tests lock this precedence so a real `GRANT INSERT WITH GRANT OPTION` input still produces `bypass_priv_residual:INSERT,UPDATE` (NOT grant_option_residual), preserving alpha.63 v1 semantics for that case while v2 catches the WITH-GRANT-OPTION-only edge case. ShellSpec increments (228 examples, 0 failures, alpha.63 v1 was 226): * NEW: `GRANT SELECT ... WITH GRANT OPTION` (no write priv name + GRANT OPTION clause) → fail-closed reason=grant_option_residual + grants_bypass=GRANT_OPTION + grant_option_residual_dump * NEW: short-circuit precedence lock — `GRANT INSERT, UPDATE ... WITH GRANT OPTION` still hits bypass_priv_residual:INSERT,UPDATE, NOT grant_option_residual v2 caveats unchanged from v1: * DRIFT D (cmpd-semisync.yaml UNLOCK GRANT ALL PRIVILEGES) → alpha.64+ * alpha.61 v3 prepare/fence inner SQL helper budget caveat → alpha.64+ * pod0 secondary admin bypass residual after failed switchover (alpha.62 RED carry-forward) → alpha.64+ candidate * alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post `5a6390fe` * cadence-discipline candidate topic restated as "process-discipline + runtime-validation are independent dimensions" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
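The whitelist-then-short-circuit order described above can be sketched as follows. This is a minimal illustration, not the addon's verifier: `verify_grants` is a hypothetical stand-in for `_verify_host_is_fenced`, the two privilege lists are abbreviated assumptions, and the reason tokens are simplified to the three stage names (the real verifier's exact tokens and suffixes may differ).

```shell
#!/bin/sh
# Whitelist pattern quoted from the commit message; the priv lists below are
# abbreviated assumptions, not the addon's full constants.
IGNORED='^GRANT PROXY ON .* TO .* WITH GRANT OPTION$'
BYPASS='SUPER|READ_ONLY ADMIN|BINLOG ADMIN|CONNECTION ADMIN'
WRITE='INSERT|UPDATE|DELETE|CREATE|DROP|ALTER'

# verify_grants: takes SHOW GRANTS output as $1 and echoes one reason token.
verify_grants() {
  # Step 1: drop whitelisted default PROXY rows before any residual scan.
  # grep exits 1 when nothing is kept, which is not an error here.
  kept=$(printf '%s\n' "$1" | grep -Ev "$IGNORED") || true
  # Step 2: short-circuit in the documented order.
  if printf '%s\n' "$kept" | grep -Eq "$BYPASS"; then
    echo "bypass_priv_residual"
  elif printf '%s\n' "$kept" | grep -Eq "$WRITE"; then
    echo "write_priv_residual"
  elif printf '%s\n' "$kept" | awk '/ WITH GRANT OPTION/ { found = 1 } END { exit !found }'; then
    echo "grant_option_residual"
  else
    echo "ok_by_grants_only"
  fi
}

proxy="GRANT PROXY ON ''@'%' TO 'root'@'%' WITH GRANT OPTION"

# Clean main grant plus the default PROXY row.
verify_grants "GRANT SELECT ON *.* TO 'root'@'%'
$proxy"
# SELECT-only grant that still carries the WITH GRANT OPTION clause.
verify_grants "GRANT SELECT ON *.* TO 'root'@'%' WITH GRANT OPTION
$proxy"
# Write privileges are caught before the clause check is reached.
verify_grants "GRANT INSERT, UPDATE ON *.* TO 'root'@'%' WITH GRANT OPTION"
```

Because the PROXY row is removed first, any ` WITH GRANT OPTION` match that survives to the last stage is necessarily non-proxy, which is what lets the final check fail closed without a second whitelist.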

…ract (alpha.64 v1) alpha.63 fresh-gatefix switchover N=1 RED root cause completely confirmed via Jack's two read-only evidence sets (msg 2219dcb5 09:56): (1) prestop-watchdog.log at 01:25:11-13Z (RED window) shows 8 lines exactly matching the bypass priv extras observed in fence + rollback verifier dumps: prestop-watchdog local-root-account-UNLOCK mode=full-access label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog local-root-optional-privilege privilege=REPLICATION SLAVE ADMIN label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog local-root-optional-privilege privilege=REPLICATION MASTER ADMIN label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog local-root-optional-privilege privilege=BINLOG ADMIN label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog local-root-optional-privilege privilege=BINLOG MONITOR label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog local-root-optional-privilege privilege=SLAVE MONITOR label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog local-root-optional-privilege privilege=CONNECTION ADMIN label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog local-root-optional-privilege privilege=READ_ONLY ADMIN label=primary-read-write host=127.0.0.1 rc=0 prestop-watchdog remote-root-account-UNLOCK mode=full-access label=primary-read-write host=% rc=0 (2) 6-sample SHOW GRANTS timeline (01:53:19Z→01:53:31Z) shows root@127.0.0.1 + root@localhost stable GRANT ALL PRIVILEGES with GRANT OPTION across all samples — proving the cmpd-semisync.yaml sql-listener-fence reconcile loop persistently re-grants admin bypass privileges, NOT a transient flap. 
Smoking gun: cmpd-semisync.yaml `set_local_root_account_state UNLOCK` + `set_remote_root_account_state UNLOCK` paths grant `GRANT ALL PRIVILEGES`, and `grant_optional_local_root_privileges` + `grant_optional_remote_root_privileges` add admin bypass privileges (BINLOG ADMIN / READ_ONLY ADMIN / CONNECTION ADMIN / REPLICATION SLAVE ADMIN / REPLICATION MASTER ADMIN). All these target user-facing root (root@%/127.0.0.1/localhost) and run on every reconcile loop iteration, racing the switchover-side fence script. This means the alpha.59-onwards "user-facing root contains no admin bypass" contract has actually NEVER been enforced in cmpd-yaml UNLOCK/LOCK paths. switchover-side (alpha.62/63) and roleProbe-side (alpha.61 v3) fences were tightened to non-bypass list, but cmpd-yaml runtime kept re-granting them back. Previous alpha.59-.62 verifiers weren't fine-grained enough to observe; alpha.63 v2 verifier (post 47-min RED→root-cause-fully-closed analysis) finally fail-closed and exposed the root cause. alpha.64 v1 fix scope (per Jack 10:01 v2 design ack + 10:05 Tier A/B boundary + 10:13 Cindy 4 ship-gate): cmpd-semisync.yaml 7 callsite alignments (account class: writer-visible user-facing root; kb_internal_root maintenance executor remains legit with full ALL PRIVILEGES exception): 1. grant_optional_local_root_privileges (line 590-604): drop REPLICATION SLAVE ADMIN, REPLICATION MASTER ADMIN, BINLOG ADMIN, CONNECTION ADMIN, READ_ONLY ADMIN; only CMPD_OPTIONAL_MONITOR_PRIVS (BINLOG MONITOR, SLAVE MONITOR) remain. 2. set_local_root_account_state LOCK (line 622): drop SUPER from grant body; use CMPD_SECONDARY_FENCE_GRANT_BODY (SELECT, PROCESS, RELOAD, REPLICATION SLAVE, REPLICATION CLIENT, REPLICATION MASTER ADMIN). 3. set_local_root_account_state UNLOCK (line 633): replace GRANT ALL PRIVILEGES with CMPD_EXPLICIT_PRIMARY_GRANT_BODY (aligned with switchover.sh SWITCHOVER_EXPLICIT_PRIMARY_GRANT_BODY). 4. 
set_remote_root_account_state LOCK (line 672): drop SUPER; use CMPD_SECONDARY_FENCE_GRANT_BODY. 5. set_remote_root_account_state UNLOCK (line 683): replace GRANT ALL PRIVILEGES with CMPD_EXPLICIT_PRIMARY_GRANT_BODY. 6. grant_optional_remote_root_privileges (line 696-715): drop BINLOG ADMIN, CONNECTION ADMIN, READ_ONLY ADMIN; only CMPD_OPTIONAL_MONITOR_PRIVS remain. 7. lock_local_root_for_prestop (line 1564-1587): drop SUPER from grant body; align with secondary fence semantics. Note: preStop hook is a separate /bin/sh -c shell scope so the CMPD_SECONDARY_FENCE_GRANT_BODY constant is NOT in scope; the literal grant list is duplicated with comment marking the keep-in-sync requirement; ShellSpec rendered grep enforces both callsites. Constants (defined at line 153 area): CMPD_EXPLICIT_PRIMARY_GRANT_BODY (matches switchover.sh SWITCHOVER_EXPLICIT_PRIMARY_GRANT_BODY) CMPD_SECONDARY_FENCE_GRANT_BODY (matches switchover.sh SWITCHOVER_SECONDARY_FENCE_GRANT_BODY) CMPD_OPTIONAL_MONITOR_PRIVS = "BINLOG MONITOR SLAVE MONITOR" Tier A vs Tier B fail-closed (Jack 10:05 boundary): * Tier A — best-effort MONITOR grant: failure logs `tier=monitor-best-effort 1227_swallowed=true rc=1` and continues (does not propagate failure to caller; MONITOR types don't gate primary-write/secondary-fence semantics) * Tier B — required account-state / LOCK / UNLOCK / prestop fence grant: failure logs `tier=required 1227_swallowed=true fail_closed=true rc=1` and returns 1; caller MUST NOT publish ready/role markers Account class separation (Cindy 10:13 directive): kb_internal_root grants in ensure_internal_local_admin (line 482-495) + grant_internal_admin_runtime_privileges (line 458-481) legitimately use GRANT ALL PRIVILEGES + 7-priv loop including admin bypass — those are the maintenance executor and need full admin to run STOP SLAVE / SET GLOBAL read_only / SET GLOBAL rpl_semi_sync_master_enabled / etc. 
Negative grep contract uses awk-based block analysis to skip lines within 30-line windows preceded by `user="$(sql_quote "${MARIADB_INTERNAL_ROOT_USER}")"`, ensuring kb_internal_root grants are not flagged as violations. cmpd-replication.yaml + cmpd-galera.yaml scanned clean (no GRANT/REVOKE statements involving root account state). ShellSpec increments (237 examples, 0 failures, alpha.63 v2 was 228 → +9): * Describe "alpha.64 v1 cmpd-semisync grant body contract" (8 examples): - cmpd constants strong-bind alignment (3 examples covering all 3 constants) - rendered manifest user-facing root negative grep (1 example using awk block analysis to skip kb_internal_root context, asserts empty output) - kb_internal_root account class allowlist positive (1 example) - MONITOR positive allowlist (BINLOG MONITOR / SLAVE MONITOR present in user-facing root context) - Tier A monitor priv grant emits `tier=monitor-best-effort 1227_swallowed=true` (review-tightening) - Tier B required grant emits `tier=required ... 
fail_closed=true` (product-blocker) - Live-gate runtime contract documented in source (assert source contains `alpha.64 v1.*Jack 09:35 RED` marker) * Updated existing semisync_rejoin_fence_template_spec.sh tests: - "locks local root without granting table writes" → now asserts `GRANT ${CMPD_SECONDARY_FENCE_GRANT_BODY}` (no SUPER literal) - New: "secondary fence grant body constant explicitly excludes SUPER" assertion on the constant definition Live-gate runtime contract (Jack tightening + Cindy ship-gate, evidenced in handoff): * static negative grep at install-script gate: 0 hits for admin bypass GRANT statements targeting user-facing root (root@%/127.0.0.1/localhost) * runtime negative gate: fresh install stable window (~30-60s) check `prestop-watchdog.log` 0 hits for `local-root-optional-privilege privilege=BINLOG ADMIN|CONNECTION ADMIN|READ_ONLY ADMIN|REPLICATION SLAVE ADMIN|REPLICATION MASTER ADMIN`; distinguishes fresh scene from preserved scene log Boundary (per Cindy 10:13 + Jack 10:07 5 review focal): * alpha.63 N=1 RED/PRESERVED stays as product conclusion * alpha.64 v1 ship is patch-version + static + live gate; NOT product GREEN * fresh `alpha.64 switchover/role-transition under-load N=1` runtime required to change product conclusion (verify alpha.63 v1+v2 verifier + alpha.64 v1 grant body alignment all hold under real switchover) * preserved scenes (gatefix-fglxq + 005935 baseline) all unchanged as evidence baseline Carry-forward removed: * DRIFT D upgraded from "alpha.65+ deferred" to "alpha.64 v1 main fix"; no longer carry-forward. Carry-forward unchanged: * alpha.61 v3 prepare/fence inner SQL helper budget caveat → alpha.65+ * alpha.61 process miss preserved at alpha.61 v3 live gate ACCEPT post `5a6390fe` * Cadence-discipline candidate restated as "process-discipline + runtime-validation are independent dimensions" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

….64 v2) Closes Jack 10:32 HOLD blockers on alpha.64 v1: Blocker 1: Tier B required LOCK failures were swallowed by caller-side `|| true` patterns. v1 only enforced rc inside lock_(local|remote)_root_writes helpers; callers (set_replica_read_only / publish_replica_after_rejoin_ready / keep_replica_pending_until_healthy / expose_sql_listener_for_safe_role / configure_replication_from_primary_service_once / reconcile_sql_listener_for_syncer_secondary_once) still wrapped them in `|| true` so a 1227-fenced grant did not stop ready/role publish. Blocker 2: lock_local_root_for_prestop double-failure was masked by trailing `|| true` after socket+tcp fallback chain. v2 changes (cmpd-semisync.yaml only; switchover.sh / roleProbe.sh untouched): Tier B caller propagation: - set_replica_read_only: track rc across remote+read-only+local LOCK; return 1 on any failure with structured log fail_closed=true. - keep_replica_pending_until_healthy: same pattern; return 1 propagates to existing `if !` callers. - expose_sql_listener_for_safe_role: required local LOCK + read_only checked via `if ! ...; then return 1; fi`; touch .sql-listener-ready only after both succeed. - publish_replica_after_rejoin_ready: replace `set_replica_read_only || true` with `if ! set_replica_read_only; then return 1; fi` for both call sites (before-expose + after-expose); mark_replication_ready only reached after all required steps succeeded. - configure_replication_from_primary_service_once: enter-time set_replica_read_only checked via `if !` and returns 1 on failure. - reconcile_sql_listener_for_syncer_secondary_once: same; mark_replication_ready only after rc=0. preStop double-failure: - Replace `lock_local_root_for_prestop "prestop" "socket" || lock_local_root_for_prestop "prestop" "tcp" || true` with explicit `if ! ...; then if ! ...; then prestop_log "prestop_lock_failed_both fail_closed=true tier=required"; fi; fi` block. Live-gate runtime negative gate watches for this token. 
Tier annotation auditable list (per Jack 10:38 review-checkpoint 3):
- Every allowed `lock_(local|remote)_root_writes ... || true` callsite carries an inline `# tier=startup-defensive|error-recovery|fail-path-defensive|monitor-best-effort` comment. Total 16 annotated callsites.

ShellSpec increments (+12 net, 249 examples / 0 failures):
- New Describe `alpha.64 v2 cmpd-semisync Tier B caller propagation contract` with 12 examples covering tier annotation list (3) + tier annotation count invariant (1) + per-function rc propagation pattern (6) + preStop fail-closed token (1) + preStop block uses the `if ! ...` pattern (1).

Static checks: bash -n / dash -n / helm lint / helm template all pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
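The difference between the v1 and v2 caller shapes can be shown with a small sketch. All three functions are hypothetical stand-ins: `lock_root_writes` simulates a Tier B helper that always fails, and `publish_v1` / `publish_v2` mimic a caller that publishes a ready marker.

```shell
#!/bin/sh
# lock_root_writes: simulated Tier B required LOCK helper that fails
# (for example because the grant hit Error 1227).
lock_root_writes() { return 1; }

# v1 caller shape: the failure is swallowed and the marker is published.
publish_v1() {
  lock_root_writes || true
  echo "ready"
}

# v2 caller shape: the failure propagates and the marker is never reached.
publish_v2() {
  if ! lock_root_writes; then
    echo "fail_closed=true tier=required" >&2
    return 1
  fi
  echo "ready"
}

v1_rc=0
v1_out=$(publish_v1) || v1_rc=$?
v2_rc=0
v2_out=$(publish_v2 2>/dev/null) || v2_rc=$?

echo "v1: rc=$v1_rc out=$v1_out"
echo "v2: rc=$v2_rc out=$v2_out"
```

With the v1 shape the caller reports success and emits `ready` even though the lock failed; with the v2 shape the caller returns 1 and emits nothing, so an outer `if !` can stop role publication.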

… v3) Closes Jack 11:14 install/script live-gate RED on alpha.64 v2: the optional-monitor priv loop iterated CMPD_OPTIONAL_MONITOR_PRIVS via unquoted parameter expansion (`for privilege in ${CMPD_OPTIONAL_MONITOR_PRIVS}`), which POSIX `for` splits on IFS into 4 single-word tokens (BINLOG / MONITOR / SLAVE / MONITOR). `GRANT BINLOG ON *.* ...` is invalid SQL, so root never acquired SLAVE MONITOR; every `SHOW SLAVE STATUS` returned 1227, breaking roleProbe / promote / demote and leaving Component condition Healthy=False reason=RoleProbeNotDone in fresh install. v3 changes (cmpd-semisync.yaml only; alpha.64 v1+v2 contracts retained): - Both grant_optional_local_root_privileges and grant_optional_remote_root_privileges now iterate inline quoted list `for privilege in "BINLOG MONITOR" "SLAVE MONITOR"`, preserving multi-word priv name semantics. - The CMPD_OPTIONAL_MONITOR_PRIVS constant is retained for documentation and ShellSpec strong-bind, with an extensive root-cause comment block warning that the constant is for documentation only and MUST NOT be iterated via unquoted parameter expansion. - Per-callsite docstrings updated with a v3 note pointing to the constant block and explaining the inline-quoted rationale. ShellSpec increments (+6 net, 255 examples / 0 failures): - New Describe `alpha.64 v3 cmpd-semisync multi-word MONITOR priv loop` with 6 examples in 3 contexts: - "no unquoted CMPD_OPTIONAL_MONITOR_PRIVS for-loop residual" (2): negative grep for both `${CMPD_OPTIONAL_MONITOR_PRIVS}` (braced) and `$CMPD_OPTIONAL_MONITOR_PRIVS` (no-brace) variants in ACTIVE code (comment lines stripped so the documentation block is allowed). - "inline quoted MONITOR list at both callsites" (2): per-function grep asserts the v3 inline-quoted-list pattern is present and the v1/v2 unquoted-loop pattern is absent. 
- "live-gate runtime negative gate documentation" (2): a documentation marker check confirms the v3 root-cause comment is present, and a contract no-regression spot-check verifies that the CMPD_EXPLICIT_PRIMARY_GRANT_BODY + CMPD_SECONDARY_FENCE_GRANT_BODY constants, 4 `if ! set_replica_read_only` callsites, the prestop_lock_failed_both fail-closed token, and 16 tier-annotated swallow lines all remain present.

Static checks: bash -n / dash -n / helm lint / helm template all pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
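The word-splitting root cause is reproducible in any POSIX shell. The sketch below only counts loop iterations and prints placeholder GRANT text; it does not talk to a database, and the trailing `...` in the output is a deliberate elision rather than real SQL.

```shell
#!/bin/sh
CMPD_OPTIONAL_MONITOR_PRIVS="BINLOG MONITOR SLAVE MONITOR"

# v1/v2 shape: unquoted expansion is split on IFS into four single words.
unquoted_count=0
for privilege in ${CMPD_OPTIONAL_MONITOR_PRIVS}; do
  unquoted_count=$((unquoted_count + 1))
  echo "unquoted: GRANT ${privilege} ON *.* ..."
done

# v3 shape: an inline quoted list keeps each multi-word privilege intact.
quoted_count=0
for privilege in "BINLOG MONITOR" "SLAVE MONITOR"; do
  quoted_count=$((quoted_count + 1))
  echo "quoted:   GRANT ${privilege} ON *.* ..."
done

echo "unquoted iterations: $unquoted_count"
echo "quoted iterations:   $quoted_count"
```

The unquoted loop runs four times and would attempt statements such as `GRANT BINLOG ON *.*`, while the quoted loop runs twice with the intended privilege names.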

Closes Jack 11:35 install/script live-gate RED on alpha.64 v3 commit c686788. KubeBlocks treats the existing ComponentDefinition spec as immutable: alpha.64 v2 + v3 mutated cmpd-semisync.yaml under the same chart version 1.1.1-alpha.64, so KB controller rejected the update with "ComponentDefinition mariadb-semisync-1.1.1-alpha.64 ... immutable fields cant be updated" and the CmpD stayed Unavailable. Helm upgrade applied the manifest but the live cluster never saw the v3 multi-word MONITOR fix (no fresh namespace was even started — the live gate blocked at the CmpD-Available check). alpha.65 v1 changes (cmpd-semisync.yaml unchanged from alpha.64 v3): - Chart.yaml version 1.1.1-alpha.64 -> 1.1.1-alpha.65 with an extensive comment block explaining the CmpD immutability rationale and the rule that any future patch within an alpha cycle that mutates cmpd-*.yaml MUST bump the chart version. Patches that only touch versioned ConfigMap contents (replication-switchover.sh) can keep the same chart version because the script CM is not immutable; that is why alpha.61 v2/v3, alpha.62 v2, alpha.63 v2 could all reuse the same chart version. - Chart.yaml appVersion remains 11.4.10 (mariadb engine version unchanged; this bump is packaging-contract only). - cmpd-semisync.yaml content preserved verbatim from alpha.64 v3 (sha 237eddbc42acc662329fd5b6a654633a80dce94756de4331af48db3c23d3999a). All alpha.64 v1 grant body alignment, v2 caller propagation + tier annotation + preStop fail-closed token, and v3 multi-word inline quoted MONITOR list remain in place. 
ShellSpec increments (+4 net, 259 examples / 0 failures):
- New Describe `alpha.65 v1 chart version bump for CmpD immutability` with 4 examples: chart version is exactly 1.1.1-alpha.65 (positive literal); appVersion remains 11.4.10; Chart.yaml documentation marker references both Jack's live-gate RED and the immutability rationale; cmpd-semisync.yaml retains the alpha.64 v3 root-cause comment marker (proves CmpD spec content is preserved verbatim).

Live-gate v3 RED scene at helm revision 61 with mariadb-semisync-1.1.1-alpha.64 Unavailable is left as evidence; the fresh alpha.65 install will create a new mariadb-semisync-1.1.1-alpha.65 CmpD without touching the alpha.64 orphan.

Static checks: bash -n / dash -n / helm lint / helm template all pass. helm template confirms the new CmpD names render as mariadb-{semisync,replication,galera}-1.1.1-alpha.65.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ization compat (alpha.65 v2) Closes Jack 11:45 alpha.65 v1 HOLD msg 721ad0a3. The v1 doc-marker test greps for the literal alpha.65 v1 / Jack 11:35 / live-gate RED comment in Chart.yaml. The test passed in source-tree but failed when ShellSpec was rerun inside an extracted package, because helm package canonicalizes Chart.yaml (alphabetizes keys + removes blank lines/comments + strips quotes). The comment was therefore not in the package-installed Chart.yaml. v2 changes (ShellSpec only; Chart.yaml + cmpd-semisync.yaml unchanged): - Drops the It block alpha.65 v1: Chart.yaml documents the CmpD immutability rationale (1 example removed; 258 examples / 0 failures). - Adds a comment block above the Describe explaining why the doc-marker test was removed and where the rationale documentation now lives (source Chart.yaml comment block, this Describe leading comment, PR body, Slock handoff thread, sediment doc backlog). - The 3 hard contracts retained: chart version is exactly 1.1.1-alpha.65; appVersion still contains 11.4.10; cmpd-semisync.yaml retains the alpha.64 v3 root-cause comment marker (proves CmpD content preserved). ShellSpec deltas: - Total: 259 -> 258 examples (-1 net, 0 failures). - Source-tree run: 258/0 confirmed. - Package-extraction run will now also pass because the dropped test was the only one that depended on Chart.yaml comment text. Going forward package-extraction ShellSpec rerun is added to the ship checklist. Static checks: bash -n / dash -n / helm lint / helm template all pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… v1) Closes Jack 12:18 alpha.65 v2 install/script live-gate RED + Jack 12:34 alpha.66 v1 design HOLD + Jack 12:39 design ACCEPT with 3 tightening.

alpha.65 v2 RED first-blocker: a contract gap in the division of responsibilities between the product and addon executors. alpha.64 v1 correctly removed admin-bypass privileges from the user-facing root, but syncer (cmpd-semisync.yaml line 1854 KB_SERVICE_USER binds to MARIADB_ROOT_USER) used the user-facing root for HA promote/demote SQL that needs REPLICATION SLAVE ADMIN (SET GLOBAL rpl_semi_sync_slave_enabled) and READ_ONLY ADMIN (SET GLOBAL read_only=ON). Result: 1227 errors, Take the leader failed, demote failed, fresh bootstrap stuck in RoleProbeNotDone for the entire bounded gate window.

Investigation of apecloud/syncer source (worktree syncer-pr142):
- 3-tier credential model in engines/mysql/config.go: Root (KB_SERVICE_USER), Admin (MYSQL_ADMIN_USER, falls back to Root), Replication (falls back to Admin).
- Auto-switch in engines/mysql/manager.go IsRunning(): once IsAdminCreated returns true (mysql.user query WHERE host='%' AND user LIKE 'kb%'), mgr.DB swaps to AdminDB.
- HA Promote/Demote (semi_sync.go + slave.go) all use mgr.DB.Exec(), so once the swap happens, all HA SQL automatically uses the admin executor.

alpha.66 v1 fix (chart-only; syncer source untouched):

cmpd-semisync.yaml env block additions (Jack 12:34 design HOLD blockers 1 + 2 closed):
- MYSQL_ADMIN_USER: literal "kb_internal_root", NOT a $(MARIADB_INTERNAL_ROOT_USER) env-substitution because K8s env expansion order is not guaranteed.
- MYSQL_ADMIN_PASSWORD: $(MARIADB_ROOT_PASSWORD), shared with root password per the existing ensure_internal_local_admin pattern.

cmpd-semisync.yaml ensure_internal_local_admin SQL additions (Jack 12:39 design tightening 3 closed):
- Existing kb_internal_root@localhost + @127.0.0.1 paths preserved verbatim (full GRANT ALL PRIVILEGES ... WITH GRANT OPTION; this is what syncer's AdminDB connection from 127.0.0.1:3306 actually authenticates against).
- New detection-only kb_internal_root@'%' record: CREATE USER ... @'%' IDENTIFIED BY <pwd>; ALTER USER ... @'%' ACCOUNT LOCK; intentionally zero GRANT statements. Required so syncer's IsAdminCreated() (which queries mysql.user WHERE host='%') can detect kb_internal_root and trigger the AdminDB swap, without expanding the remote attack surface (LOCK rejects remote auth; even if LOCK is somehow bypassed, the @'%' record has no privileges).

Chart.yaml bump 1.1.1-alpha.65 -> 1.1.1-alpha.66 with documentation of the CmpD immutability rule (per alpha.64 v3 -> alpha.65 lesson). The existing alpha.65 v1 comment block is preserved; new alpha.66 v1 block appended.

ShellSpec increments (+10 net, 268 examples / 0 failures):
- Renamed Describe "alpha.65 v1 chart version bump for CmpD immutability" to cover both alpha.65 + alpha.66 chart bump rule (test asserts current literal alpha.66).
- New Describe "alpha.66 v1 syncer HA executor + chart bump" with 10 examples in 4 contexts:
  - chart bump for CmpD immutability (2 examples: version=alpha.66, appVersion=11.4.10).
  - syncer executor contract (3 examples: MYSQL_ADMIN_USER literal kb_internal_root, MYSQL_ADMIN_PASSWORD MARIADB_ROOT_PASSWORD, KB_SERVICE_USER unchanged MARIADB_ROOT_USER).
  - detection-only @'%' record contract (4 examples: ensure_internal_local_admin body creates @'%' with IDENTIFIED BY, body locks @'%' via ACCOUNT LOCK, body has zero GRANT to @'%' (negative scan), body retains GRANT ALL PRIVILEGES to @localhost AND @127.0.0.1 internal exception).
  - alpha.64 v1+v2+v3 + alpha.65 contract no-regression spot-check (1 example: invariant counts equal "1 1 4 1 16 2").

Static checks: bash -n / dash -n / helm lint / helm template all pass. helm template confirms 3 new alpha.66 CmpD names (semisync/replication/galera).
Live-gate runtime acceptance (per Jack 12:39): - new CmpD mariadb-semisync-1.1.1-alpha.66 Available - syncer log hits "switch to admin db" - local AdminDB CURRENT_USER() == kb_internal_root@127.0.0.1 - SHOW CREATE USER 'kb_internal_root'@'%' contains ACCOUNT LOCK - SHOW GRANTS FOR 'kb_internal_root'@'%' returns USAGE only / no admin priv - user-facing root @%, @localhost, @127.0.0.1 still no admin bypass - stable window 0 hit on Error 1227 (REPLICATION SLAVE ADMIN / READ_ONLY ADMIN), Take the leader failed, demote failed, RoleProbeNotDone - vcluster-only execution in idc/idc1/idc2/idc4 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…lpha.67 v1) Closes Jack 12:56 alpha.66 v1 package-level review HOLD msg 97e74e30. alpha.66 v1 introduced a detection-only kb_internal_root@'%' record so syncer's IsAdminCreated (which queries mysql.user WHERE host='%') can detect the admin user and trigger the AdminDB swap. The security contract said this @'%' record carries ACCOUNT LOCK + zero privileges, so even if LOCK is somehow bypassed, the account cannot run any SQL. But the contract was only declarative: CREATE USER IF NOT EXISTS does not clear an existing account's privileges, and ACCOUNT LOCK is not a revoke. If kb_internal_root@'%' happened to pre-exist with grants (misconfigured prior install or upgrade), alpha.66 v1 would lock the account but leave the privileges intact, violating the security contract. alpha.67 v1 changes (chart-only; cmpd-semisync.yaml + Chart.yaml + spec): cmpd-semisync.yaml ensure_internal_local_admin SQL: insert an explicit REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${user}'@'%' between CREATE USER IF NOT EXISTS '${user}'@'%' and ALTER USER '${user}'@'%' ACCOUNT LOCK. This pattern matches the alpha.64 v1 LOCK paths (set_local/remote_root_account_state LOCK and lock_local_root_for_prestop) which already use the same REVOKE statement before re-applying the non-bypass grant body. Chart.yaml bump 1.1.1-alpha.66 -> 1.1.1-alpha.67 (KB CmpD immutability rule, alpha.65 lesson). Cumulative comment block preserved (alpha.65 v1 + alpha.66 v1 + alpha.67 v1 rationale all retained for audit history). ShellSpec increments (+4 net, 272 examples / 0 failures): - Renamed alpha.65 v1 chart-version-bump regression test literal to alpha.67. - Renamed alpha.66 v1 chart-version literal test to alpha.67 (same chart bump rule applies). - New Describe `alpha.67 v1 ensure_internal_local_admin write-site zero-priv enforcement` with 4 examples in 3 contexts: - chart bump literal alpha.67 (1 example). 
- write-site REVOKE step (2 examples: REVOKE ALL PRIVILEGES, GRANT OPTION FROM @'%' present in function body; ordering CREATE @'%' before REVOKE @'%' before ALTER @'%' ACCOUNT LOCK). - alpha.66 v1 negative + alpha.64+.65 invariants preserved (1 example: zero GRANT to @'%' negative scan retained). Static checks: bash -n / dash -n / helm lint / helm template all pass. helm template confirms 3 alpha.67 CmpD names (semisync/replication/galera). Live-gate runtime acceptance unchanged from alpha.66 v1: - new CmpD mariadb-semisync-1.1.1-alpha.67 Available - syncer log hits "switch to admin db" - local AdminDB CURRENT_USER() == kb_internal_root@127.0.0.1 - SHOW CREATE USER 'kb_internal_root'@'%' contains ACCOUNT LOCK - SHOW GRANTS FOR 'kb_internal_root'@'%' returns USAGE only / no admin priv (alpha.67 v1 strengthens this — REVOKE enforces zero-priv even on pre-existing record) - user-facing root @%, @localhost, @127.0.0.1 still no admin bypass - stable window 0 hit on Error 1227, Take the leader failed, demote failed, RoleProbeNotDone - vcluster-only execution in idc/idc1/idc2/idc4 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
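The CREATE, REVOKE, LOCK ordering that the write-site tests assert can be sketched as a plain text check. Both helpers are hypothetical: `emit_detection_only_sql` only prints the three statements in the alpha.67 v1 order with a placeholder password, and `line_of` is an invented utility; neither reflects the actual body of `ensure_internal_local_admin` or the real ShellSpec assertions.

```shell
#!/bin/sh
# emit_detection_only_sql: prints the alpha.67 v1 statement order for the
# @'%' record. The password is a placeholder, not a real value.
emit_detection_only_sql() {
  user=$1
  cat <<EOF
CREATE USER IF NOT EXISTS '${user}'@'%' IDENTIFIED BY '<pwd>';
REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${user}'@'%';
ALTER USER '${user}'@'%' ACCOUNT LOCK;
EOF
}

# line_of: print the line number of the first line containing a fixed string.
line_of() {
  printf '%s\n' "$1" | grep -nF -- "$2" | head -n 1 | cut -d: -f1
}

sql=$(emit_detection_only_sql kb_internal_root)
create_ln=$(line_of "$sql" "CREATE USER IF NOT EXISTS")
revoke_ln=$(line_of "$sql" "REVOKE ALL PRIVILEGES, GRANT OPTION FROM")
lock_ln=$(line_of "$sql" "ACCOUNT LOCK")

# The REVOKE must sit between CREATE and LOCK so a pre-existing account
# is stripped of privileges before it is locked.
if [ "$create_ln" -lt "$revoke_ln" ] && [ "$revoke_ln" -lt "$lock_ln" ]; then
  order_ok=yes
else
  order_ok=no
fi
echo "order_ok=$order_ok"
```

The REVOKE matters because `CREATE USER IF NOT EXISTS` is a no-op on an existing account and `ACCOUNT LOCK` does not remove grants, so only the explicit REVOKE makes the zero-privilege claim hold for a pre-existing record.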

…llowlist (alpha.68 v2) Closes Jack 15:39 alpha.67 v1 install/script live-gate RED + 15:45 alpha.68 v1 design HOLD + 15:58 alpha.68 v2 Direction B ACCEPT with refined checkpoint #3. alpha.67 v1 LOCKED+zero-priv kb_internal_root@'%' detection-only record correctly satisfied syncer IsAdminCreated host='%' detection, but broke cross-member syncer auth: syncer GetMemberConnection uses Admin credential (= kb_internal_root via MYSQL_ADMIN_USER) for cross-pod TCP, which authenticates via @'%'. LOCKED leads to Error 4151 Access denied; secondary cannot poll leader health; cluster stays in RoleProbeNotDone forever (1064 instances observed in alpha.67 v1 live gate). Helen 15:53 SQL matrix audit established the cross-member exact grant requirement: - IsReadonly (slave.go): SELECT global vars — USAGE only - IsMemberLagging / ReadCheck (manager.go): SELECT on kubeblocks.kb_health_check - IsMemberHealthy leader-only WriteCheck: INSERT/UPDATE on kubeblocks.kb_health_check (CREATE fallback handled by primary_local_root_write_ready local bootstrap; cross-pod path reaches this only after table pre-exists) - setSemiSyncSourceTimeout (semi_sync.go, Follow secondary -> leader): REPLICATION MASTER ADMIN (admin-bypass class) Jack 15:58 refined checkpoint #3: no NEW net capability vs root@'%' which already has REPLICATION MASTER ADMIN via alpha.64 v1 contract; root and kb_internal_root share MARIADB_ROOT_PASSWORD, so net attack- surface delta = 0 for REPLICATION MASTER ADMIN. Still hard-forbidden: ALL PRIVILEGES / SUPER / READ_ONLY ADMIN / CONNECTION ADMIN / BINLOG ADMIN / REPLICATION SLAVE ADMIN / DELETE / DROP / CREATE USER / schema-wide DML / CREATE on kubeblocks.*. alpha.68 v2 changes (chart-only; syncer source untouched): cmpd-semisync.yaml ensure_internal_local_admin SQL: @'%' from LOCKED+REVOKE to UNLOCK+REVOKE+3 grants. Ordering: CREATE -> UNLOCK -> REVOKE -> GRANT REPLICATION CLIENT -> GRANT REPLICATION MASTER ADMIN -> GRANT SELECT, INSERT, UPDATE ON kubeblocks.kb_health_check. 
No CREATE grant on kubeblocks.* because primary_local_root_write_ready pre-creates the table during local primary bootstrap before role publish.

Chart.yaml bump 1.1.1-alpha.67 -> 1.1.1-alpha.68 (KB CmpD immutability rule).

Cumulative comment block preserves alpha.65 v1 + alpha.66 v1 + alpha.67 v1 rationale; the alpha.68 v2 block documents the security boundary trade-off explicitly (Direction B risk acceptance, NOT zero risk), plus alpha.69 mandatory blocking debt (syncer source change to restore the alpha.67 v1 LOCKED+zero-priv boundary).

ShellSpec increments (+8 net, 279 examples / 0 failures):
- Renamed alpha.65 v1 / alpha.66 v1 chart-version regression tests to assert current literal 1.1.1-alpha.68.
- Renamed alpha.67 v1 chart-version test to alpha.68 with the same immutability-rule rationale.
- alpha.66 v1 SUPERSEDED tests: @'%' LOCK assertion -> "not LOCK" (UNLOCK is the new state); zero-GRANT @'%' -> allowlist exact-match (only the 3 expected GRANTs).
- Removed alpha.67 v1 ordering test (was CREATE -> REVOKE -> LOCK; alpha.68 v2 introduces new ordering).
- New Describe `alpha.68 v2 ensure_internal_local_admin cross-member health grant allowlist` with 8 examples in 3 contexts:
  - "@'%' UNLOCK and 3 cross-member grants present" (5): ACCOUNT UNLOCK present + not LOCK; REPLICATION CLIENT on *.*; REPLICATION MASTER ADMIN on *.*; SELECT/INSERT/UPDATE on kubeblocks.kb_health_check; ordering CREATE < UNLOCK < REVOKE < GRANT REPLICATION CLIENT.
  - "@'%' forbidden-priv negative hard gate" (2): no ALL PRIVILEGES / SUPER / READ_ONLY ADMIN / CONNECTION ADMIN / BINLOG ADMIN / REPLICATION SLAVE ADMIN / DELETE / DROP / CREATE USER to @'%'; no CREATE on kubeblocks.* to @'%'.
  - "alpha.64+.65+.66+.67 contract no-regression spot-check" (1): @localhost / @127.0.0.1 internal exception preserved; alpha.64 v3 root-cause comment marker still present.

Static checks: bash -n / dash -n / helm lint / helm template all pass.
helm template confirms 3 alpha.68 CmpD names (semisync/replication/galera).

Live-gate runtime acceptance (per Jack 15:58 + 16:00):
- new CmpD mariadb-semisync-1.1.1-alpha.68 Available
- syncer log hits "switch to admin db"
- local AdminDB CURRENT_USER() == kb_internal_root@127.0.0.1
- cross-pod (pod1 -> pod0 IP) AdminDB CURRENT_USER() == kb_internal_root@%
- SHOW GRANTS FOR 'kb_internal_root'@'%' exact allowlist match: USAGE + REPLICATION CLIENT + REPLICATION MASTER ADMIN + SELECT/INSERT/UPDATE on kubeblocks.kb_health_check; no forbidden privileges
- user-facing root @%, @localhost, @127.0.0.1 still no admin bypass
- stable window 0 hit on Error 4151 / 1142 / 1044 / 1049 / 1146 / Take the leader failed / demote failed / RoleProbeNotDone / Tier B fail_closed token
- vcluster-only execution in idc/idc1/idc2/idc4

alpha.69 mandatory blocking debt (9th sediment sample, queued): syncer source change so cross-member GetDBConnWithAddr uses a dedicated lower-priv credential (or a syncer-side mechanism replaces direct cross-pod admin SQL). alpha.69 goal state restores kb_internal_root@'%' to alpha.67 v1 LOCKED + zero-priv. alpha.68 v2 is a bounded short-term unblock, NOT a final design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
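The forbidden-privilege negative gate described in this commit could be spot-checked against live SHOW GRANTS output with something like the sketch below. It is not the chart's ShellSpec code: `grants_have_forbidden_priv` is a hypothetical helper, and it assumes the output of SHOW GRANTS FOR 'kb_internal_root'@'%' arrives on stdin with one GRANT statement per line. It covers only the named privileges from the hard-forbidden list; "schema-wide DML" and "CREATE on kubeblocks.*" would need separate checks.

```shell
# Hypothetical helper (not part of the chart): exit 0 when the SHOW GRANTS
# text on stdin contains any privilege from the hard-forbidden list.
grants_have_forbidden_priv() {
  # Privilege names taken from the "Still hard-forbidden" list in the commit message.
  local forbidden='ALL PRIVILEGES|SUPER|READ_ONLY ADMIN|CONNECTION ADMIN|BINLOG ADMIN|REPLICATION SLAVE ADMIN|DELETE|DROP|CREATE USER'
  # A privilege is matched only as a whole list item: it must follow
  # "GRANT " or ", " and be followed by "," or " ON ".
  grep -Eq "(^GRANT |, )(${forbidden})(,| ON )"
}
```

Matching whole list items keeps `BINLOG MONITOR` and `REPLICATION MASTER ADMIN` from being confused with `BINLOG ADMIN` and `REPLICATION SLAVE ADMIN`.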


…arrow mysql.user grant (alpha.69 v1)

Closes Jack 17:57 alpha.68 v2 install/script live-gate RED 3-evidence-chains closeout: Error 1146 SQL ordering / Error 1044 mysql DB access / Error 2002 downstream cross-pod listener. Jack 18:20 alpha.69 v1 design ACCEPT with runtime-acceptance tightening (MariaDB 11.4 SHOW GRANTS BINLOG MONITOR normalization).

alpha.68 v2 had a bootstrap precondition gap: ensure_internal_local_admin runs from wait_for_internal_local_admin "startup-before-role-decision", which executes BEFORE primary_local_root_write_ready can create kubeblocks.kb_health_check. Fresh boots therefore hit Error 1146 (table missing) on the @'%' GRANT, wait_for_internal_local_admin loops forever, role decision is never reached, expose_sql_listener_for_*_role is never called, mariadbd stays bound to 127.0.0.1, and cross-pod TCP sees Error 2002.

alpha.69 v1 closes this by adding CREATE DATABASE + CREATE TABLE inside ensure_internal_local_admin SQL BEFORE the @'%' GRANT block (idempotent; ROOT_LOCAL via socket has GRANT ALL PRIVILEGES locally).

alpha.69 v1 also closes Error 1044 (Access denied for kb_internal_root@'%' to database mysql). syncer's connection URL (apecloud/syncer engines/mysql/config.go line 71) includes `/mysql` as the default DB, so go-sql-driver issues `init_db = mysql` at handshake. The alpha.68 v2 grants on @'%' did not include any priv on the `mysql` schema; the init_db handshake failed with 1044, and cross-pod auth was never established. GRANT SELECT ON mysql.user is the narrow table-specific privilege that satisfies init_db. Net attack-surface delta = 0 vs root@'%', which already has SELECT on *.* via alpha.64 v1 CMPD_EXPLICIT_PRIMARY_GRANT_BODY (root and kb_internal_root share MARIADB_ROOT_PASSWORD).
alpha.69 v1 changes (chart-only; syncer source untouched):

cmpd-semisync.yaml ensure_internal_local_admin SQL @'%' section now:

    CREATE DATABASE IF NOT EXISTS kubeblocks;
    CREATE TABLE IF NOT EXISTS kubeblocks.kb_health_check(type INT, check_ts BIGINT, PRIMARY KEY(type));
    CREATE USER IF NOT EXISTS '${user}'@'%' IDENTIFIED BY '${password}';
    ALTER USER '${user}'@'%' ACCOUNT UNLOCK;
    REVOKE ALL PRIVILEGES, GRANT OPTION FROM '${user}'@'%';
    GRANT REPLICATION CLIENT ON *.* TO '${user}'@'%';
    GRANT REPLICATION MASTER ADMIN ON *.* TO '${user}'@'%';
    GRANT SELECT, INSERT, UPDATE ON kubeblocks.kb_health_check TO '${user}'@'%';
    GRANT SELECT ON mysql.user TO '${user}'@'%';
    FLUSH PRIVILEGES;

Chart.yaml bump 1.1.1-alpha.68 -> 1.1.1-alpha.69 (KB CmpD immutability rule).

Cumulative comment block preserved (alpha.65 + alpha.66 v1 + alpha.67 v1 + alpha.68 v2 + alpha.69 v1). The alpha.69 v1 block documents the bootstrap precondition closure + narrow init_db grant rationale + MariaDB 11.4 SHOW GRANTS normalization + alpha.70+ mandatory blocking debt rename (was alpha.69, renamed because the chart-only short-term fix ships alongside as alpha.69).

MariaDB 11.4 SHOW GRANTS normalization (Jack 18:20 runtime-acceptance tightening): `GRANT REPLICATION CLIENT ON *.*` is the backward-compatible source syntax; SHOW GRANTS displays the normalized form `BINLOG MONITOR ON *.*` (MariaDB 11.4 split REPLICATION CLIENT into BINLOG MONITOR + SLAVE MONITOR). `BINLOG MONITOR` in SHOW GRANTS output is the positive normalized form of our REPLICATION CLIENT grant and is allowed on kb_internal_root@'%'; this is DIFFERENT from `BINLOG ADMIN`, which remains in the forbidden admin-bypass list. Source-side ShellSpec tests check the literal source SQL (GRANT REPLICATION CLIENT); runtime live-gate SHOW GRANTS acceptance uses semantic-equivalent matching (accept BINLOG MONITOR as the normalized form).
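The semantic-equivalent matching described above can be illustrated with a small sketch. Both function names are hypothetical and not taken from the chart or its specs; the only mapping applied is the one this commit message states (BINLOG MONITOR shown by SHOW GRANTS stands for the REPLICATION CLIENT written in the source SQL). Input is assumed to be SHOW GRANTS output on stdin, one statement per line.

```shell
# Hypothetical helper: rewrite SHOW GRANTS output into the source-side
# spelling and drop identifier backticks so plain-text matching works.
normalize_show_grants() {
  sed -e 's/BINLOG MONITOR/REPLICATION CLIENT/g' -e 's/`//g'
}

# Hypothetical helper: exit 0 when the grants on stdin include the given
# global (ON *.*) privilege after normalization.
has_global_priv() {
  local priv="$1"
  # Match the privilege as a whole list item, either followed by another
  # item or directly by the "ON *.*" scope.
  normalize_show_grants | grep -Eq "(^GRANT |, )${priv}(,| ON \*\.\*)"
}
```

With this shape, a runtime check can ask for `REPLICATION CLIENT` and still pass on an 11.4 server that prints `BINLOG MONITOR`, while `BINLOG ADMIN` is never rewritten and so stays detectable as forbidden.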
alpha.70+ mandatory blocking debt (renamed; was alpha.69 in earlier planning): syncer source change so cross-member GetDBConnWithAddr uses a dedicated lower-priv credential AND removes `/mysql` from the connection DSN (or a syncer-side mechanism replaces direct cross-pod admin SQL such as setSemiSyncSourceTimeout). alpha.70+ goal state restores kb_internal_root@'%' to alpha.67 v1 LOCKED + zero-priv (clean security boundary). alpha.69 v1 is a bounded short-term unblock, NOT a final design.

ShellSpec increments (164 examples / 0 failures in replication_switchover_spec.sh; baseline was 156 + 8 alpha.68 v2 = 164 after rename/SUPERSEDED merge in alpha.69 v1):
- 3 chart-version regression tests (alpha.65/.66/.67 chart-bump) updated to assert literal 1.1.1-alpha.69.
- alpha.66 v1 SUPERSEDED test allowlist regex extended to include the 4th grant (GRANT SELECT ON mysql.user); alpha.68 v2 only had 3 grants on @'%', alpha.69 v1 adds the narrow mysql.user grant.
- New Describe `alpha.69 v1 ensure_internal_local_admin bootstrap SQL ordering + mysql.user narrow grant` with 3 contexts:
  - "1146 fix — CREATE DATABASE/TABLE before @'%' GRANT" (2 examples).
  - "1044 fix — narrow GRANT SELECT ON mysql.user" (2 examples; positive + no broader mysql.* grants).
  - "1146/1044 fix SQL ordering" (1 example; awk-scoped 9-step ordering inside ensure_internal_local_admin function body, comment lines filtered out).

Static checks: bash -n / dash -n / helm lint / helm template all pass. helm template confirms 3 alpha.69 CmpD names (semisync/replication/galera) and -pcr variants.

Pre-existing ShellSpec failures (unchanged from alpha.68 v2 baseline, NOT caused by alpha.69 v1): 58/60 in semisync_rejoin_fence_template_spec.sh + 1/1 path-bug in standalone_template_mapping_spec.sh (cwd double-path awk failure). Carried as a separate alpha.70+ cleanup item.
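The SQL-ordering assertion mentioned in the ShellSpec list can be approximated as follows. This is a grep-based stand-in written for illustration, not the awk-scoped check from the spec file: `sql_in_order` is a hypothetical name, it assumes each SQL statement sits on its own line, and it treats lines beginning with `#` or `--` as comments.

```shell
# Hypothetical helper: read SQL on stdin and succeed only if every fixed
# string given as an argument first appears on a later line than the one
# before it. Comment lines are dropped before matching.
sql_in_order() {
  local sql last=0 pat line
  # Drop comment lines; "|| true" keeps an all-comment input from aborting.
  sql="$(grep -Ev '^[[:space:]]*(#|--)' || true)"
  for pat in "$@"; do
    # Line number of the first occurrence of this pattern, empty if absent.
    line="$(printf '%s\n' "$sql" | grep -nF -- "$pat" | head -n 1 | cut -d: -f1)"
    [ -n "$line" ] && [ "$line" -gt "$last" ] || return 1
    last="$line"
  done
}
```

A call such as `sql_in_order 'CREATE DATABASE' 'CREATE TABLE' 'CREATE USER' 'ACCOUNT UNLOCK' 'FLUSH PRIVILEGES' < rendered.sql` (where `rendered.sql` is a placeholder file name) fails as soon as one step is missing or out of sequence, which is the failure mode behind the Error 1146 bootstrap gap.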
Live-gate runtime acceptance (per Jack 18:20 ACCEPT + 18:24 clarification):
- new CmpD mariadb-semisync-1.1.1-alpha.69 Available
- syncer log hits "switch to admin db"
- local AdminDB CURRENT_USER() == kb_internal_root@127.0.0.1
- cross-pod (pod1 -> pod0 IP) AdminDB CURRENT_USER() == kb_internal_root@%
- SHOW GRANTS FOR 'kb_internal_root'@'%' contains positive normalized forms only: USAGE + (BINLOG MONITOR == REPLICATION CLIENT positive normalized) + REPLICATION MASTER ADMIN + SELECT/INSERT/UPDATE on kubeblocks.kb_health_check + SELECT on mysql.user; no BINLOG ADMIN (forbidden admin bypass), no ALL PRIVILEGES, no SUPER, no READ_ONLY ADMIN, no CONNECTION ADMIN, no REPLICATION SLAVE ADMIN
- user-facing root @%, @localhost, @127.0.0.1 still no admin bypass
- stable window 0 hit on Error 4151 / 1142 / 1044 / 1049 / 1146 / 2002 / Take the leader failed / demote failed / RoleProbeNotDone / Tier B fail_closed token
- vcluster-only execution in idc/idc1/idc2/idc4

alpha.70+ mandatory blocking debt (10th sediment sample, queued): syncer source change so cross-member GetDBConnWithAddr uses a dedicated lower-priv credential + removes `/mysql` from the DSN. alpha.70+ goal state restores kb_internal_root@'%' to alpha.67 v1 LOCKED + zero-priv. alpha.69 v1 is a bounded short-term unblock, NOT a final design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
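The "stable window 0 hit" criterion amounts to counting failure tokens in collected logs. The sketch below shows one way to do that; `count_failure_hits` is a hypothetical helper, and it assumes the error codes appear in the logs as `Error <code>`, as they are written in this commit message. The literal text of the Tier B fail_closed token is not given here, so it is left out.

```shell
# Hypothetical helper: print how many log lines on stdin contain any of the
# failure tokens from the acceptance list (the Tier B token is omitted
# because its literal text is not stated in the commit message).
count_failure_hits() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps callers
  # running under "set -e" from aborting on a clean window.
  grep -Ec 'Error (4151|1142|1044|1049|1146|2002)|Take the leader failed|demote failed|RoleProbeNotDone' || true
}
```

A gate would then compare the printed count against 0 for the chosen observation window.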

weicao and others added 2 commits May 9, 2026 15:05

chore: auto generated files

b4672e3

weicao requested review from a team and leon-ape as code owners May 9, 2026 07:06

fix(mariadb): close switchover write window under load

43f645d

weicao added 23 commits May 9, 2026 17:06

fix(mariadb): gate primary role on peer-reachable listener

941e02e

fix(mariadb): fence semisync rejoin before replica catchup

32cfb76

fix(mariadb): fence local root during semisync rejoin

e6465ec

fix(mariadb): persist local root fence before pod stop

97fcf82

fix(mariadb): prevent dual primary during fresh semisync bootstrap

2b675e5

fix(mariadb): clear local health table before fresh catchup

2ad5e12

fix(mariadb): start fresh replica SQL after health cleanup

8464808

fix(mariadb): wait for internal admin before role publish

60ead54

fix(mariadb): verify semisync internal admin privileges

05418b7

fix(mariadb): gate semisync admin privileges before role publish

43a0e8e

fix(mariadb): configure semisync secondary follow at runtime

a3ac52a

fix(mariadb): use internal admin for role probe status

ae1e42c

fix(mariadb): reconcile semisync primary promotion

4f134e4

fix(mariadb): extend switchover convergence window

e764876

fix(mariadb): repair switchover follow convergence

38af392

fix(mariadb): gate primary publish on local writes

567448e

fix(mariadb): close switchover action invariants

f1817ee

fix(mariadb): strengthen switchover primary closure

1b4979a

weicao and others added 18 commits May 11, 2026 00:05

feat(mariadb): add 11.4 replication and semisync hardening #2633
weicao wants to merge 44 commits into main from feat/mariadb-alpha37-semisync-fencing-pr

weicao commented May 9, 2026 (edited)

Summary

Local validation

Current retest package

alpha.69 v1 fix scope (closes alpha.68 v2 install/script live-gate RED 3 evidence chains: 1146 + 1044 + 2002)

alpha.68 v2 fix scope (superseded by alpha.69 v1 above for the @'%' bootstrap precondition + narrow init_db grant; alpha.68 v2 UNLOCK + cross-member grant allowlist preserved unchanged)

alpha.68 v2 fix scope (closes alpha.67 v1 install/script live-gate RED on cross-member syncer auth via LOCKED @'%')

alpha.67 v1 fix scope (superseded by alpha.68 v2 above for cross-member syncer auth; alpha.67 v1 write-site REVOKE pattern preserved at lower priority — alpha.68 v2 still REVOKE + UNLOCK + 3 grants, alpha.69 v1 adds 4th grant)

alpha.67 v1 fix scope (closes alpha.66 v1 package-level review HOLD on @'%' zero-priv write-site contract gap)

alpha.66 v1 fix scope (superseded by alpha.67 v1 above for the @'%' zero-priv write-site enforcement; alpha.66 v1 syncer HA executor swap + chart bump path preserved)

alpha.66 v1 fix scope (closes alpha.65 v2 install/script live-gate RED on syncer HA executor privilege mismatch)

alpha.65 v2 fix scope (superseded by alpha.66 v1 above for the syncer HA executor privilege mismatch; alpha.65 v1+v2 chart bump path + Chart.yaml comment-grep removal preserved)

alpha.65 v2 fix scope (vs v1 commit ea4e7aa0)

alpha.65 v1 fix scope (superseded by v2 above for the Chart.yaml comment-grep ShellSpec example removal; Chart.yaml chart version bump + cmpd-semisync.yaml content all preserved unchanged)

alpha.65 v1 fix scope (closes alpha.64 v3 install/script live-gate RED on KubeBlocks ComponentDefinition immutability)

alpha.64 v3 fix scope (superseded by alpha.65 v1 above for the chart-version bump triggered by KubeBlocks ComponentDefinition immutability; alpha.64 v1+v2+v3 cmpd-semisync.yaml content all preserved verbatim in alpha.65)

alpha.64 v3 fix scope (vs v2 commit 73072452)

alpha.64 v2 fix scope (superseded by v3 above for the multi-word MONITOR priv shell-splitting; v2 caller-side rc propagation + tier annotation + preStop fail-closed token all preserved)

alpha.64 v2 fix scope (vs v1 commit 222d36bf)

alpha.64 v1 fix scope (superseded by v2 above for caller-side rc propagation; v1 grant body alignment preserved unchanged)

alpha.64 v1 fix scope (closes alpha.63 fresh-gatefix switchover N=1 RED root cause)

alpha.63 v2 fix scope (superseded by alpha.64 v1 above for the cmpd-side runtime grant body alignment; alpha.63 v1+v2 verifier impl + grant_option_residual contract retained — STILL UNVALIDATED at runtime)

alpha.63 v2 fix scope (vs v1 commit 423703eb)

alpha.63 v1 fix scope (superseded by v2 above for the GRANT OPTION token semantic; v1 I-1 + I-2 fixes retained)

alpha.63 v1 fix scope (closes alpha.62 switchover N=1 RED)

alpha.62 v2 fix scope (superseded by alpha.63 v1 above for the verifier implementation; alpha.62 v2 design contract retained)

alpha.62 v2 fix scope (vs v1 commit 675f5371)

alpha.62 v1 fix scope (superseded by v2 above for comment-level cleanup + grants_sha format; runtime contract unchanged)

alpha.62 v1 fix scope (closes alpha.61 switchover N=1 RED)

alpha.61 v3 fix scope (superseded by alpha.62 v1 above for the switchover-fence contract drift; alpha.61 v3 deadline/timeout/POSIX semantics retained)

alpha.61 v3 fix scope (vs v2 commit 44f55dea)

alpha.61 v2 fix scope (superseded by v3 above)

alpha.61 v2 fix scope (vs v1 commit 63f91d18)

alpha.61 fix scope (v1 — superseded by v2 above)

alpha.60 v3 fix scope

alpha.60 v2 fix scope

alpha.60 review-fix scope

alpha.59 v2 review-fix scope

Latest focused evidence

Boundary

codecov-commenter commented May 9, 2026 (edited)

Codecov Report

