DoubleML · SvenKlaassen · Jan 31, 2026 · Feb 1, 2026
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
@@ -0,0 +1,85 @@
+# DoubleML for Python
+
+DoubleML is a Python package implementing Double/Debiased Machine Learning (DML) methods for causal inference:
+- Partially Linear Models (PLR, PLIV, PLPR, LPLR)
+- Interactive Regression Models (IRM, IIVM, APO, QTE, CVAR, SSM)
+- Difference-in-Differences estimators (DID, DIDCSBinary, DIDMulti)
+- Regression Discontinuity Design (RDD)
+
+**Docs**: https://docs.doubleml.org | **Source**: https://github.com/DoubleML/doubleml-for-py
+
+**Branch status & TODOs**: `.claude/STATUS.md`
+
+## Architecture
+
+### Class Hierarchy
+```
+DoubleMLBase (ABC)
+└─> DoubleMLScalar (ABC) - single-parameter models
+    ├─> LinearScoreMixin - closed-form solver (θ = -E[ψ_b]/E[ψ_a])
+    │   ├─> DoubleMLPLR
+    │   ├─> DoubleMLIRM
+    │   ├─> DoubleMLPLIV
+    │   ├─> DoubleMLIIVM
+    │   └─> DoubleML DID variants
+    └─> NonLinearScoreMixin - numerical solver (planned)
+
+DoubleML - multi-parameter estimation (extends DoubleMLScalar)
+```
+
+### Design Patterns
+- **Template Method**: `fit()` orchestrates; subclasses implement `_nuisance_est()`, `_get_score_elements()`
+- **Mixin Pattern**: `LinearScoreMixin` provides closed-form coefficient estimation
+- **Delegation**: `DoubleMLBase` delegates inference to `DoubleMLFramework`
+
+### Core Files
+| File | Purpose |
+|------|---------|
+| `doubleml/double_ml_base.py` | Abstract base with properties (coef, se, summary) and inference |
+| `doubleml/double_ml_scalar.py` | Single-parameter estimation orchestrator |
+| `doubleml/double_ml.py` | Multi-parameter estimation with sample splitting |
+| `doubleml/double_ml_framework.py` | Statistical inference (confint, bootstrap, sensitivity) |
+| `doubleml/double_ml_linear_score.py` | Linear score mixin |
+
+### Package Structure
+```
+doubleml/
+├── data/          # Data containers (DoubleMLData, DoubleMLDIDData, etc.)
+├── plm/           # Partially Linear Models (PLR, PLIV, PLPR, LPLR)
+├── irm/           # Interactive Regression Models (IRM, IIVM, APO, QTE, etc.)
+├── did/           # Difference-in-Differences estimators
+├── rdd/           # Regression Discontinuity Design
+├── utils/         # Helpers (_checks, _estimation, resampling, tuning)
+└── tests/         # Main test directory
+```
+
+## Key Dependencies
+
+- **Core**: numpy>=2.0.0, pandas>=2.0.0, scipy>=1.7.0, scikit-learn>=1.6.0, statsmodels>=0.14.0
+- **ML/Tuning**: optuna>=4.6.0, joblib>=1.2.0
+- **Visualization**: matplotlib>=3.9.0, seaborn>=0.13, plotly>=5.0.0
+- **Dev**: pytest>=8.3.0, black>=25.1.0, ruff>=0.11.1, mypy>=1.18.0, xgboost>=2.1.0, lightgbm>=4.6.0
+
+## Git Workflow
+
+- **Main branch**: `main`
+- **Commits**: Conventional Commits — `feat:`, `fix:`, `docs:`, `refactor:`, `test:`, `chore:`
+
+## Verification
+
+Before completing any task:
+```bash
+black .                    # Format
+ruff check --fix .         # Lint
+mypy doubleml              # Type check
+pytest -m ci               # Tests
+```
+
+## Coding Standards
+
+Detailed conventions are in `.claude/rules/`:
+- **py-code-conventions.md** — Formatting, type hints, docstrings, naming, DML-specific patterns
+- **error-handling.md** — Exception types, validation patterns, warnings vs. errors
+- **performance-guidelines.md** — Vectorization, pre-allocation, DML computation patterns
+- **testing-conventions.md** — Markers, fixtures, assertion patterns
+- **dml-scalar-test-structure.md** — Mandatory 5-file test structure for scalar models
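The Template Method and mixin patterns described in `.claude/CLAUDE.md` can be illustrated with a minimal, stdlib-only sketch. All class names here (`LinearScoreSketch`, `ScalarSketch`, `ToyPLRSketch`) are hypothetical and are not the package's classes; only the hook names `_nuisance_est()` and `_get_score_elements()` and the formula θ = -E[ψ_b]/E[ψ_a] come from the file above. The score elements use the textbook partialling-out form, which is an assumption about what a PLR subclass would return.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from statistics import fmean


class LinearScoreSketch:
    """Hypothetical mixin: closed-form solution for a score linear in theta."""

    def _solve(self, psi_a: list[float], psi_b: list[float]) -> float:
        # E[psi_a] * theta + E[psi_b] = 0  =>  theta = -E[psi_b] / E[psi_a]
        return -fmean(psi_b) / fmean(psi_a)


class ScalarSketch(ABC):
    """Hypothetical template: fit() fixes the order, subclasses fill in hooks."""

    def fit(self) -> float:
        self._nuisance_est()
        elements = self._get_score_elements()
        return self._solve(elements["psi_a"], elements["psi_b"])

    @abstractmethod
    def _nuisance_est(self) -> None: ...

    @abstractmethod
    def _get_score_elements(self) -> dict[str, list[float]]: ...

    @abstractmethod
    def _solve(self, psi_a: list[float], psi_b: list[float]) -> float: ...


# The mixin is listed first so its _solve() satisfies the abstract hook.
class ToyPLRSketch(LinearScoreSketch, ScalarSketch):
    def __init__(self, y: list[float], d: list[float],
                 l_hat: list[float], m_hat: list[float]) -> None:
        self.y, self.d, self.l_hat, self.m_hat = y, d, l_hat, m_hat
        self.u_hat: list[float] = []
        self.v_hat: list[float] = []

    def _nuisance_est(self) -> None:
        # Predictions are supplied here, so "estimation" is just residualizing.
        self.u_hat = [yi - li for yi, li in zip(self.y, self.l_hat)]
        self.v_hat = [di - mi for di, mi in zip(self.d, self.m_hat)]

    def _get_score_elements(self) -> dict[str, list[float]]:
        return {
            "psi_a": [-v * v for v in self.v_hat],
            "psi_b": [v * u for v, u in zip(self.v_hat, self.u_hat)],
        }


model = ToyPLRSketch(
    y=[0.5, -0.5, 1.0, -1.0],
    d=[1.0, -1.0, 2.0, -2.0],
    l_hat=[0.0] * 4,
    m_hat=[0.0] * 4,
)
theta = model.fit()
```

In this toy data `y` is exactly half of `d`, so the solver should recover a coefficient of one half. The real classes also handle cross-fitting and repeated splits, which this sketch leaves out.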
diff --git a/.claude/STATUS.md b/.claude/STATUS.md
@@ -0,0 +1,75 @@
+# Branch Status & TODOs
+
+> Tracked in git so it syncs across machines. Update this file as work progresses.
+> Reference: `CLAUDE.md` points to this file in its **Branch status & TODOs** line.
+
+---
+
+## Branch: `sk-refactoring`
+
+**Goal**: Introduce a new `DoubleMLScalar` / `DoubleMLVector` hierarchy alongside
+the existing `DoubleML` API — cleaner design, better testability, explicit tuning,
+nuisance evaluation, and sensitivity analysis.
+
+### Completed
+
+- [x] **Claude tooling** — `.claude/` dir, `CLAUDE.md`, `rules/`, `agents/`, `skills/`
+- [x] **Architecture docs** — `doc/diagrams/architecture.md`, `doc/diagrams/testing_structure.md`
+- [x] **`DoubleMLBase`** — abstract base with shared properties (`coef`, `se`, `summary`) and inference delegation (`doubleml/double_ml_base.py`)
+- [x] **`LinearScoreMixin`** — closed-form θ = −E[ψ_b]/E[ψ_a] solver (`doubleml/double_ml_linear_score.py`)
+- [x] **`DoubleMLScalar`** — single-parameter orchestrator (`doubleml/double_ml_scalar.py`) with:
+  - `fit()` → `draw_sample_splitting()` + `fit_nuisance_models()` + `estimate_causal_parameters()`
+  - `tune_ml_models()` via Optuna (`_LEARNER_PARAM_ALIASES`, `_get_tuning_data()` hook)
+  - `nuisance_targets`, `nuisance_loss`, `evaluate_learners()`
+  - `_sensitivity_element_est()` hook + full sensitivity analysis pipeline
+- [x] **`DoubleMLPLRScalar`** — PLR scalar (`doubleml/plm/plr_scalar.py`) with `test_plr_scalar.py` plus 7 suffixed test files:
+  - `_return_types`, `_exceptions`, `_vs_plr`, `_external_predictions`, `_tune_ml_models`, `_evaluate_learners`, `_sensitivity`
+- [x] **`DoubleMLIRMScalar`** — IRM scalar (`doubleml/irm/irm_scalar.py`) with the same test-file structure
+- [x] **`cate()` + `gate()` for IRM scalar** — `doubleml/irm/irm_scalar.py` + `test_irm_scalar_cate_gate.py`
+- [x] **`cate()` + `gate()` + `_partial_out()` for PLR scalar** — `doubleml/plm/plr_scalar.py` + `test_plr_scalar_cate_gate.py`. Multi-rep × multi-column basis fully supported.
+- [x] **`DoubleMLBLP` per-rep basis API** — `basis` may be a single `pd.DataFrame` (shared) or a `list[pd.DataFrame]` of length `n_rep`. Also fixes the legacy `DoubleMLPLR.cate()` multi-rep bug (`basis * D_tilde` mis-broadcast for `n_rep>1` and `d_basis>1`).
+- [x] **`DoubleMLVector`** — multi-treatment base class first iteration (`doubleml/double_ml_vector.py`)
+- [x] **BLP multi-rep support** — `doubleml/utils/blp.py`
+- [x] **`PLRVector`** — first concrete `DoubleMLVector` subclass (`doubleml/plm/plr_vector.py`) with 5 test files: `test_plr_vector.py`, `_return_types`, `_exceptions`, `_vs_plr`, `_external_predictions`. Validates exact equivalence with legacy `DoubleMLPLR` for multi-treatment.
+
+### In Progress
+
+_(none)_
+
+### Feature Gaps vs Legacy Classes
+
+Missing from `PLR` / `IRM` scalar compared to `DoubleMLPLR` / `DoubleMLIRM`:
+
+| Feature | Legacy location | Applies to | Notes |
+|---------|----------------|-----------|-------|
+| `cate()` | `plr.py:447`, `irm.py:564` | — | ✅ ported for both IRM and PLR |
+| `gate()` | `plr.py:485`, `irm.py:598` | — | ✅ ported for both IRM and PLR |
+| `_partial_out()` | `plr.py:522` | — | ✅ ported for PLR scalar |
+| `policy_tree()` | `irm.py:635` | IRM only | Not planned yet |
+
+Weighted effects in IRM (`weights` as array or dict):
+- Array weights: ✅ supported
+- Dict weights with `weights_bar`: ✅ supported — init defers the `n_rep` column check; `DoubleMLScalar._check_smpls_dependent_inputs()` hook validates `weights_bar.shape == (n_obs, n_rep)` from inside both `draw_sample_splitting()` and `set_sample_splitting()`. `fit(n_folds=..., n_rep=...)` re-draws splits with a `UserWarning` when args conflict with existing splits.
+
+Intentionally **not ported**:
+- Callable score — design decision
+- `trimming_rule` / `trimming_threshold` deprecated props — use `ps_processor_config`
+
+### Planned
+
+| Item | Files | Notes |
+|------|-------|-------|
+| `DoubleMLIRMVector` | `doubleml/irm/irm_vector.py` + tests | Next concrete Vector subclass |
+| `DoubleMLPLIVScalar` | `doubleml/plm/pliv_scalar.py` + 7 test files | Next scalar model |
+| `DoubleMLPLPRScalar` | `doubleml/plm/plpr_scalar.py` + 7 test files | |
+| DID scalar variants | `doubleml/did/*_scalar.py` | DID, DIDCSBinary, DIDMulti |
+| `DoubleMLVector` tests | `doubleml/tests/test_vector_*.py` | Base class tests |
+
+---
+
+## How to Update This File
+
+- Mark items `[x]` when complete
+- Move items between sections as work progresses
+- Add new planned items as they are identified
+- Commit this file with the relevant code changes so the status stays in sync
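The deferred `weights_bar` check noted in `.claude/STATUS.md` can be sketched as follows. This is a stdlib-only illustration with a hypothetical class (`DeferredCheckSketch`) and nested lists standing in for arrays; only the hook name `_check_smpls_dependent_inputs()`, the method name `draw_sample_splitting()`, and the expected shape `(n_obs, n_rep)` come from the status file. The point is that the constructor cannot validate the column count, because `n_rep` is only known once splits are drawn.

```python
from __future__ import annotations


class DeferredCheckSketch:
    """Hypothetical model that validates weights_bar once n_rep is known."""

    def __init__(self, n_obs: int, weights_bar: list[list[float]]) -> None:
        self.n_obs = n_obs
        # Rows are observations, columns are repetitions (assumed layout).
        self.weights_bar = weights_bar
        self.n_rep: int | None = None  # unknown until splits are drawn

    def draw_sample_splitting(self, n_rep: int) -> None:
        self.n_rep = n_rep
        self._check_smpls_dependent_inputs()

    def _check_smpls_dependent_inputs(self) -> None:
        n_rows = len(self.weights_bar)
        n_cols = len(self.weights_bar[0]) if self.weights_bar else 0
        expected = (self.n_obs, self.n_rep)
        if (n_rows, n_cols) != expected:
            raise ValueError(
                f"weights_bar must have shape {expected}, got {(n_rows, n_cols)}."
            )


sketch = DeferredCheckSketch(n_obs=3, weights_bar=[[1.0, 1.0] for _ in range(3)])
sketch.draw_sample_splitting(n_rep=2)  # shape (3, 2) matches, no error
```

Calling `draw_sample_splitting(n_rep=3)` on the same object would raise a `ValueError`, since the stored weights only have two columns.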
diff --git a/.claude/agents/py-general-reviewer.md b/.claude/agents/py-general-reviewer.md
@@ -0,0 +1,59 @@
+---
+name: py-general-reviewer
+description: Professional Python code reviewer focusing on logic, performance, and best practices. Uses a debate-driven approach to minimize false positives.
+tools: Read, Grep, Glob, Bash
+model: inherit
+---
+
+Review Python code changes for functional correctness and industry-standard best practices. Report issues only — never edit source files.
+
+## Workflow
+
+1. **Identify Changes**: Run `git diff --name-only HEAD~1 -- '*.py'` to list changed `.py` files.
+2. **Read**: Read the content of each modified file.
+3. **Internal Debate**: For each file, simulate a dialogue:
+   - **@Auditor**: Finds potential bugs, edge cases, and "code smells."
+   - **@Author**: Defends the implementation (e.g., "This is a temporary shim" or "Performance requires this complexity").
+   - **@Resolution**: Agree on the final list of actionable improvements.
+4. **Output**: Use the "Final Review" format specified below.
+
+## Review Checklist
+
+### 🔴 Critical (Bug Risk / Logic)
+- **Edge Cases**: Unhandled `None` values, empty lists, or `0` divisors.
+- **Resource Leaks**: Files or network sockets opened without `with` blocks.
+- **Mutable Defaults**: Using mutable objects such as `[]` or `{}` as default argument values in functions.
+- **Concurrency**: Thread-safety issues or race conditions in shared state.
+- **Logic Errors**: Off-by-one errors or incorrect boolean logic in complex conditionals.
+
+### 🟡 Warning (Best Practices / Clean Code)
+- **Complexity**: Functions longer than 50 lines or nesting deeper than 3 levels.
+- **DRY (Don't Repeat Yourself)**: Significant logic duplication that should be a helper function.
+- **Error Handling**: Using "bare" `except:` blocks instead of specific exceptions.
+- **Type Hinting**: Public APIs missing type annotations for parameters or return values.
+- **Hardcoding**: URLs, credentials, or magic numbers that should be constants/config.
+
+### 🟢 Suggestion (Style / Optimization)
+- **Vectorization**: Using Python loops where vectorized NumPy or pandas operations would be significantly faster (same asymptotic cost, far lower per-element overhead).
+- **Built-ins**: Re-implementing logic that exists in `itertools`, `collections`, or `pathlib`.
+- **Docstrings**: Missing or outdated descriptions of function intent.
+
+## Output Format
+
+```markdown
+## Final Review: `<filename>`
+
+### ⚖️ The Debate Summary
+[1-2 sentences on what was debated between the Auditor and Author.]
+
+### 🚫 Resolved Issues (Blocking)
+- **line N**: [issue]. **Fix**: `<concrete_code_fix>`
+
+### ⚠️ Resolved Warnings
+- **line N**: [issue]. **Consider**: `<suggestion>`
+
+### ✅ Dismissed (False Positives)
+- **line N**: [Original concern] -> [Reason for dismissal]
+
+### Summary
+[Final assessment: e.g., "3 issues found (1 critical, 2 warnings)"]
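The "Mutable Defaults" item in the checklist of `.claude/agents/py-general-reviewer.md` refers to a standard Python pitfall: a default value is evaluated once, when the function is defined, so a mutable default is shared across calls. A short sketch with hypothetical function names shows the bug and the usual `None` sentinel fix.

```python
from __future__ import annotations


def append_shared(item: int, bucket: list[int] = []) -> list[int]:
    # Bug: the same default list object is reused on every call.
    bucket.append(item)
    return bucket


def append_fresh(item: int, bucket: list[int] | None = None) -> list[int]:
    # Fix: use None as the sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Two successive calls to `append_shared` without a `bucket` argument accumulate items in one list, while `append_fresh` starts from an empty list each time.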
diff --git a/.claude/agents/py-reviewer.md b/.claude/agents/py-reviewer.md
@@ -0,0 +1,66 @@
+---
+name: py-reviewer
+description: Python code reviewer for DoubleML. Checks type safety, learner handling, score contracts, and test coverage. Use after writing or modifying Python files.
+tools: Read, Grep, Glob, Bash
+model: inherit
+---
+
+Review Python code changes against DoubleML project conventions. Report issues only — never edit source files.
+
+## Workflow
+
+1. Run `git diff --name-only HEAD~1` to identify changed files (use Bash)
+2. Read each changed `.py` file
+3. Review against the checklist below
+4. Output findings in the format specified
+
+## Review Checklist
+
+### Critical (must fix — blocks merge)
+- **Type hints**: All functions have parameter types and return types. A missing `-> None` counts as a violation.
+- **`from __future__ import annotations`**: Present when class methods reference their own type (forward refs)
+- **Learner validation**: `_check_learner()` called for every user-provided learner
+- **Learner cloning**: `clone(learner)` before `.fit()` — learners are mutable
+- **Score contract**: `_get_score_elements()` returns `{'psi_a': ..., 'psi_b': ...}`, each value an array of shape `(n_obs,)`
+- **Sample splitting**: Uses `DoubleMLResampling`, never raw `KFold`
+- **Test markers**: Every test function has `@pytest.mark.ci`
+- **Exception messages**: Include expected vs. actual values (`got {value}`)
+
+### Warnings (should fix)
+- **Module docstring**: File starts with `"""..."""` describing the module
+- **NumPy-style docstrings**: Public functions/classes have Parameters + Returns sections
+- **Naming**: Classes use `DoubleML` prefix, score elements use `psi_a`/`psi_b`, stats use `theta`/`se`/`n_obs`
+- **Magic numbers**: Unexplained numeric literals (should be named constants)
+- **Vectorization**: Python loops over `n_obs`-sized arrays (should be NumPy ops)
+- **Error handling**: `_check_*` helpers from `doubleml/utils/_checks.py` used where applicable
+
+### Suggestions (nice to have)
+- **Property vs. method**: Cheap computed attributes should be `@property`; operations with side effects should be methods
+- **Decorator usage**: `@staticmethod` for `_check_data()`, `@abstractmethod` for template hooks
+- **Class vs. instance variables**: `_LEARNER_SPECS`/`_VALID_SCORES` should be class-level
+
+### Intentionally Acceptable (do NOT flag)
+- `Any` type for scikit-learn estimators and learner objects
+- `E721` type comparisons (`type(x) == Y`) — intentionally allowed by ruff config
+- Test files without type annotations — excluded from mypy
+- `# type: ignore` when suppressing third-party library issues (not own code)
+
+## Output Format
+
+```markdown
+## Code Review: `<filename>`
+
+### Critical
+- **line N**: [issue description]. Fix: `<concrete code fix>`
+
+### Warnings
+- **line N**: [issue description]. Consider: `<suggestion>`
+
+### Suggestions
+- **line N**: [issue description]
+
+### Summary
+[1-2 sentences: overall assessment, number of issues by severity]
+```
+
+Review each changed file separately. If no issues found, state "No issues found" for that file.
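Two checklist items in `.claude/agents/py-reviewer.md`, the score contract and exception messages that state expected vs. actual values, can be combined in one small validator. This is a stdlib-only sketch: `check_score_elements` is a hypothetical helper, not a function from `doubleml/utils/_checks.py`, and plain sequences stand in for arrays of shape `(n_obs,)`.

```python
from __future__ import annotations

from collections.abc import Sequence


def check_score_elements(
    elements: dict[str, Sequence[float]], n_obs: int
) -> None:
    """Raise ValueError unless elements has exactly psi_a and psi_b of length n_obs."""
    expected_keys = ["psi_a", "psi_b"]
    if sorted(elements) != expected_keys:
        raise ValueError(
            f"Score elements must have keys {expected_keys}, got {sorted(elements)}."
        )
    for name, values in elements.items():
        if len(values) != n_obs:
            raise ValueError(
                f"{name} must have length n_obs={n_obs}, got {len(values)}."
            )


# A conforming dictionary passes silently.
check_score_elements({"psi_a": [-1.0, -4.0], "psi_b": [0.5, 2.0]}, n_obs=2)
```

Each message names both the expected and the received value, which is what the "got {value}" convention in the checklist asks for.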