feature_selection: add top_k to EFS.get_metric_dict (#610) by jbbqqf · Pull Request #1165 · rasbt/mlxtend

jbbqqf · 2026-05-09T18:43:15Z

Code of Conduct

I have read the project's Code of Conduct.

Description

ExhaustiveFeatureSelector evaluates every combination of features in [min_features, max_features], which can produce a very large number of subsets. The recurring user complaint in #610 is that calling get_metric_dict() and turning the result into a DataFrame materialises every evaluated subset in memory.
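To get a feel for how quickly the subset count grows, the number of evaluated combinations can be estimated with a few lines of arithmetic. This is only an illustrative sketch: `n_subsets` is a hypothetical helper, not part of mlxtend, and it assumes every combination of size `min_features` through `max_features` is evaluated exactly once.

```python
from math import comb


def n_subsets(n_features, min_features, max_features):
    """Count feature combinations of every size in [min_features, max_features].

    Hypothetical helper for illustration; not an mlxtend function.
    """
    return sum(comb(n_features, k) for k in range(min_features, max_features + 1))


# iris has 4 features; sizes 1..4 give 4 + 6 + 4 + 1 combinations
iris_count = n_subsets(4, 1, 4)
print(iris_count)  # 15

# with 30 features and no upper bound the count is 2**30 - 1
wide_count = n_subsets(30, 1, 30)
print(wide_count)  # 1073741823
```

The iris figure matches the `evaluated subsets: 15` line printed by the reproducer in this PR.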

This PR adds an optional top_k keyword to get_metric_dict() that keeps only the top-K subsets ranked by avg_score (descending). top_k=None (default) preserves the historical behaviour exactly — same keys, same shape — so existing callers are unaffected.

The original iteration keys are kept rather than re-numbered, so downstream code can still cross-reference subsets_ using the same keys.
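The ranking and key-preservation behaviour described above can be sketched on a plain dict. This is a minimal illustration under stated assumptions, not the code merged in this PR: `keep_top_k` and the sample `metrics` values are hypothetical, and only the `avg_score` and `feature_idx` field names are taken from the PR.

```python
def keep_top_k(metric_dict, top_k=None):
    """Return the top_k entries ranked by avg_score (descending).

    Hypothetical stand-in for the behaviour described in the PR; the
    original iteration keys are kept rather than re-numbered.
    """
    if top_k is None:
        # Default path: same keys and values as the input.
        return dict(metric_dict)
    ranked = sorted(
        metric_dict.items(),
        key=lambda item: item[1]["avg_score"],
        reverse=True,
    )
    # Slicing past the end is harmless, so a large top_k returns everything.
    return dict(ranked[:top_k])


# Made-up scores, keyed by iteration index as in get_metric_dict()
metrics = {
    0: {"feature_idx": (0,), "avg_score": 0.70},
    1: {"feature_idx": (1,), "avg_score": 0.55},
    2: {"feature_idx": (0, 1), "avg_score": 0.93},
    3: {"feature_idx": (2,), "avg_score": 0.96},
}

top2 = keep_top_k(metrics, top_k=2)
print(list(top2))  # [3, 2]
```

Because the keys `3` and `2` survive unchanged, they can still be used to look up the matching entries elsewhere, which is the cross-referencing property the description mentions.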

Related issues or pull requests

Fixes #610 (labelled easy by the maintainer)

Pull Request Checklist

Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
Added appropriate unit test functions in the ./mlxtend/feature_selection/tests/ directory (4 new tests)
Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (deferred — the docstring covers the new parameter; happy to add a notebook example if desired)
Ran PYTHONPATH='.' pytest ./mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -sv — 27/27 passed
Checked for style issues by running flake8 ./mlxtend (clean) and black --check (clean after auto-format)

Reproduce BEFORE/AFTER yourself (copy-paste)

# --- one-time setup ---
git clone https://github.com/rasbt/mlxtend.git /tmp/repro-610 && cd /tmp/repro-610
python -m venv .venv && source .venv/bin/activate
pip install -e . pytest scikit-learn

# --- BEFORE (origin/master) ---
git checkout origin/master
python - <<'PY'
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
X, y = load_iris(return_X_y=True)
efs = EFS(KNeighborsClassifier(n_neighbors=4),
          min_features=1, max_features=4, scoring="accuracy",
          cv=3, print_progress=False, n_jobs=1).fit(X, y)
print("evaluated subsets:", len(efs.get_metric_dict()))
try:
    efs.get_metric_dict(top_k=3)
except TypeError as e:
    print("TypeError:", e)
PY
# Expected (BEFORE):
#   evaluated subsets: 15
#   TypeError: ExhaustiveFeatureSelector.get_metric_dict() got an unexpected keyword argument 'top_k'

# --- AFTER (this PR) ---
git fetch https://github.com/jbbqqf/mlxtend.git feat/610-efs-topk
git checkout FETCH_HEAD
python - <<'PY'
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
X, y = load_iris(return_X_y=True)
efs = EFS(KNeighborsClassifier(n_neighbors=4),
          min_features=1, max_features=4, scoring="accuracy",
          cv=3, print_progress=False, n_jobs=1).fit(X, y)
print("evaluated subsets:", len(efs.get_metric_dict()))
top3 = efs.get_metric_dict(top_k=3)
print("top_k=3 size:", len(top3))
for k, v in top3.items():
    print(f"  iter={k} avg_score={v['avg_score']:.4f} feature_idx={v['feature_idx']}")
PY
# Expected (AFTER):
#   evaluated subsets: 15
#   top_k=3 size: 3
#   <three lines, sorted by avg_score descending>

# --- Regression tests (same on both refs; fail BEFORE, pass AFTER) ---
PYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q -k "issue_610"
# Expected (BEFORE): 4 failed
# Expected (AFTER):  4 passed

What I ran locally

PYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q → 27/27 passed
PYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q -k issue_610 → 4/4 passed
Same 4 tests run against origin/master's exhaustive_feature_selector.py: 4/4 fail with TypeError: get_metric_dict() got an unexpected keyword argument 'top_k'
flake8 + black --check on touched files → clean (after applying black)

Edge cases tested

| # | Scenario | Input | Expected | Verified by |
|---|---|---|---|---|
| 1 | `top_k` returns the K highest-scoring subsets | `top_k=3` after fitting on iris with 1..3 features | exactly 3 entries; identical to the first 3 entries of the full dict sorted by `avg_score` descending | `test_get_metric_dict_top_k_returns_top_subsets_issue_610` |
| 2 | `top_k=None` = unchanged behaviour | unset / `None` | identical keys + values to default call | `test_get_metric_dict_top_k_none_preserves_default_behavior_issue_610` |
| 3 | Invalid `top_k` raises `ValueError` | `0`, `-2`, `1.5` | `ValueError: top_k must be a positive integer or None` | `test_get_metric_dict_top_k_invalid_raises_issue_610` |
| 4 | `top_k` larger than evaluated count | `top_k=1_000_000` | returns all subsets without error | `test_get_metric_dict_top_k_larger_than_total_returns_all_issue_610` |
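The validation rule in row 3 could look roughly like the sketch below. `validate_top_k` is a hypothetical helper written for illustration; only the error message is taken from the table, and rejecting `bool` values is an assumption of this sketch rather than something the PR states.

```python
import numbers


def validate_top_k(top_k):
    """Accept None or a positive integer; raise ValueError otherwise.

    Hypothetical helper; the merged implementation may differ.
    """
    if top_k is None:
        return None
    is_int = isinstance(top_k, numbers.Integral) and not isinstance(top_k, bool)
    if not is_int or top_k < 1:
        raise ValueError("top_k must be a positive integer or None")
    return int(top_k)


# The three invalid inputs listed in the table
rejected = []
for bad in (0, -2, 1.5):
    try:
        validate_top_k(bad)
    except ValueError:
        rejected.append(bad)

print(rejected)  # [0, -2, 1.5]
```

Row 4 needs no extra check in a sort-and-slice approach, since slicing a list beyond its length simply returns the whole list.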

Risk / blast radius

Minimal. The new argument defaults to None, which is a no-op; every existing call site behaves identically. The validation only fires when top_k is provided, and the sort+slice is O(N log N) over the same dict that was already being constructed — no new memory pressure for the default path.

Release note

Add a `top_k` argument to `ExhaustiveFeatureSelector.get_metric_dict()` so callers can keep only the highest-scoring subsets instead of receiving every evaluated combination in the returned dict.

PR drafted with assistance from Claude Code. The change was reviewed manually against rasbt/mlxtend's source. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.

ExhaustiveFeatureSelector can evaluate a very large number of feature subsets. Users hitting the recurring memory issue when calling `get_metric_dict()` (and immediately turning it into a DataFrame) had to re-implement the ranking themselves to keep only the top scorers.

Adds an optional `top_k` keyword argument that, when set, returns only the top-K subsets ranked by `avg_score` descending. The default `top_k=None` preserves the historical behaviour exactly. Original iteration keys are kept so downstream code can still cross-reference `subsets_`.

Issue rasbt#610 was labelled `easy` by the maintainer.

Co-Authored-By: Claude Code <noreply@anthropic.com>

rasbt

LGTM!

jbbqqf and others added 2 commits May 9, 2026 20:39

Merge origin/master into feat/610-efs-topk

80a82ee

rasbt approved these changes Jun 6, 2026

rasbt merged commit fd7bec2 into rasbt:master Jun 6, 2026
2 checks passed
