feature_selection: add top_k to EFS.get_metric_dict (#610) #1165

Merged
rasbt merged 2 commits into rasbt:master from jbbqqf:feat/610-efs-topk on Jun 6, 2026
@jbbqqf jbbqqf commented May 9, 2026

Contributor

Code of Conduct

I have read the project's Code of Conduct.

Description

ExhaustiveFeatureSelector evaluates every combination of features in [min_features, max_features], which can produce a very large number of subsets. The recurring user complaint in #610 is that calling get_metric_dict() and turning the result into a DataFrame materialises every evaluated subset in memory.

This PR adds an optional top_k keyword to get_metric_dict() that keeps only the top-K subsets ranked by avg_score (descending). top_k=None (default) preserves the historical behaviour exactly — same keys, same shape — so existing callers are unaffected.

The original iteration keys are kept rather than re-numbered, so downstream code can still cross-reference subsets_ using the same keys.
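The selection rule described above can be sketched in a few lines. This is an illustration only, not the code added by this PR: `keep_top_k` is a hypothetical helper, and the toy dictionary carries just the two fields the sketch needs, with made-up scores.

```python
from typing import Any, Dict, Optional


def keep_top_k(metric_dict: Dict[int, Dict[str, Any]],
               top_k: Optional[int] = None) -> Dict[int, Dict[str, Any]]:
    """Hypothetical helper: keep the top_k entries ranked by avg_score."""
    if top_k is None:
        # Default path: every entry is returned, keys and values unchanged.
        return dict(metric_dict)
    # Rank the original iteration keys by avg_score, best first.
    ranked = sorted(metric_dict,
                    key=lambda k: metric_dict[k]["avg_score"],
                    reverse=True)
    # Slicing past the end is safe, so top_k > len(metric_dict) returns all.
    return {k: metric_dict[k] for k in ranked[:top_k]}


# Toy data shaped like a get_metric_dict() result (values are made up).
full = {
    0: {"feature_idx": (0,), "avg_score": 0.70},
    1: {"feature_idx": (1,), "avg_score": 0.55},
    2: {"feature_idx": (0, 1), "avg_score": 0.90},
    3: {"feature_idx": (2,), "avg_score": 0.95},
}
top2 = keep_top_k(full, top_k=2)
print(list(top2))  # [3, 2]
```

The keys `3` and `2` are the original iteration keys, not `0` and `1`, which is what lets a caller look the same subsets up in `subsets_`.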

Related issues or pull requests

Fixes #610 (labelled easy by the maintainer)

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file
  • Added appropriate unit test functions in the ./mlxtend/feature_selection/tests/ directory (4 new tests)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (deferred — the docstring covers the new parameter; happy to add a notebook example if desired)
  • Ran PYTHONPATH='.' pytest ./mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -sv — 27/27 passed
  • Checked for style issues by running flake8 ./mlxtend (clean) and black --check (clean after auto-format)

Reproduce BEFORE/AFTER yourself (copy-paste)

# --- one-time setup ---
git clone https://github.com/rasbt/mlxtend.git /tmp/repro-610 && cd /tmp/repro-610
python -m venv .venv && source .venv/bin/activate
pip install -e . pytest scikit-learn

# --- BEFORE (origin/master) ---
git checkout origin/master
python - <<'PY'
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
X, y = load_iris(return_X_y=True)
efs = EFS(KNeighborsClassifier(n_neighbors=4),
          min_features=1, max_features=4, scoring="accuracy",
          cv=3, print_progress=False, n_jobs=1).fit(X, y)
print("evaluated subsets:", len(efs.get_metric_dict()))
try:
    efs.get_metric_dict(top_k=3)
except TypeError as e:
    print("TypeError:", e)
PY
# Expected (BEFORE):
#   evaluated subsets: 15
#   TypeError: ExhaustiveFeatureSelector.get_metric_dict() got an unexpected keyword argument 'top_k'

# --- AFTER (this PR) ---
git fetch https://github.com/jbbqqf/mlxtend.git feat/610-efs-topk
git checkout FETCH_HEAD
python - <<'PY'
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
X, y = load_iris(return_X_y=True)
efs = EFS(KNeighborsClassifier(n_neighbors=4),
          min_features=1, max_features=4, scoring="accuracy",
          cv=3, print_progress=False, n_jobs=1).fit(X, y)
print("evaluated subsets:", len(efs.get_metric_dict()))
top3 = efs.get_metric_dict(top_k=3)
print("top_k=3 size:", len(top3))
for k, v in top3.items():
    print(f"  iter={k} avg_score={v['avg_score']:.4f} feature_idx={v['feature_idx']}")
PY
# Expected (AFTER):
#   evaluated subsets: 15
#   top_k=3 size: 3
#   <three lines, sorted by avg_score descending>

# --- Regression tests (same on both refs; fail BEFORE, pass AFTER) ---
PYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q -k "issue_610"
# Expected (BEFORE): 4 failed
# Expected (AFTER):  4 passed

What I ran locally

  • PYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q → 27/27 passed
  • PYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q -k issue_610 → 4/4 passed
  • Same 4 tests run against origin/master's exhaustive_feature_selector.py: 4/4 fail with TypeError: get_metric_dict() got an unexpected keyword argument 'top_k'
  • flake8 + black --check on touched files → clean (after applying black)

Edge cases tested

| # | Scenario | Input | Expected | Verified by |
|---|----------|-------|----------|-------------|
| 1 | top_k returns the K highest-scoring subsets | top_k=3 after fitting on iris with 1..3 features | exactly 3 entries; identical to sorted(full, key=avg_score, reverse=True)[:3] | test_get_metric_dict_top_k_returns_top_subsets_issue_610 |
| 2 | top_k=None = unchanged behaviour | unset / None | identical keys + values to default call | test_get_metric_dict_top_k_none_preserves_default_behavior_issue_610 |
| 3 | Invalid top_k raises ValueError | 0, -2, 1.5 | ValueError: top_k must be a positive integer or None | test_get_metric_dict_top_k_invalid_raises_issue_610 |
| 4 | top_k larger than evaluated count | top_k=1_000_000 | returns all subsets without error | test_get_metric_dict_top_k_larger_than_total_returns_all_issue_610 |
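For readers who want to see what scenario 3 implies, here is a minimal sketch of a validator that accepts and rejects the inputs listed in the table. `check_top_k` is a hypothetical name, and rejecting `bool` is an assumption of this sketch; the PR's actual check may be written differently.

```python
import numbers
from typing import Optional


def check_top_k(top_k: Optional[int]) -> None:
    """Hypothetical validator mirroring the inputs listed in the table."""
    if top_k is None:
        return  # None means "no limit" and is always valid
    # Assumption of this sketch: bool is rejected even though it subclasses int.
    is_int = isinstance(top_k, numbers.Integral) and not isinstance(top_k, bool)
    if not is_int or top_k < 1:
        raise ValueError("top_k must be a positive integer or None")


# The three invalid inputs from the table each raise ValueError.
for bad in (0, -2, 1.5):
    try:
        check_top_k(bad)
    except ValueError as err:
        print(f"{bad!r} rejected: {err}")

# None and an oversized positive integer both pass validation.
check_top_k(None)
check_top_k(1_000_000)
```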

Risk / blast radius

Minimal. The new argument defaults to None, which is a no-op; every existing call site behaves identically. The validation only fires when top_k is provided, and the sort+slice is O(N log N) over the same dict that was already being constructed — no new memory pressure for the default path.
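As a side note on the O(N log N) figure: if N ever becomes large relative to K, a bounded heap picks the same keys in O(N log K). The sketch below uses the standard library's `heapq.nlargest`; it is an alternative shown for comparison, not what this PR implements, and `top_k_keys` plus the toy scores are made up for the example.

```python
import heapq
from typing import Dict, List


def top_k_keys(avg_scores: Dict[int, float], k: int) -> List[int]:
    """Hypothetical helper: keys of the k largest scores, best first."""
    # Equivalent to sorted(avg_scores, key=avg_scores.get, reverse=True)[:k],
    # but only k candidates are held in the heap at any time.
    return heapq.nlargest(k, avg_scores, key=avg_scores.get)


# Toy iteration-key -> avg_score mapping (values are made up).
scores = {0: 0.70, 1: 0.55, 2: 0.90, 3: 0.95, 4: 0.60}
print(top_k_keys(scores, 3))  # [3, 2, 0]
```

For the subset counts in the reproducer (15 on iris) the difference is negligible, so the plain sort+slice is a reasonable choice.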

Release note

Add a `top_k` argument to `ExhaustiveFeatureSelector.get_metric_dict()` so callers can request only the highest-scoring subsets instead of receiving every evaluated combination.

PR drafted with assistance from Claude Code. The change was reviewed manually against rasbt/mlxtend's source. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.

jbbqqf and others added 2 commits May 9, 2026 20:39
ExhaustiveFeatureSelector can evaluate a very large number of feature
subsets. Users hitting the recurring memory issue when calling
`get_metric_dict()` (and immediately turning it into a DataFrame) had to
re-implement the ranking themselves to keep only the top scorers.

Adds an optional `top_k` keyword argument that, when set, returns only
the top-K subsets ranked by `avg_score` descending. The default
`top_k=None` preserves the historical behaviour exactly. Original
iteration keys are kept so downstream code can still cross-reference
`subsets_`.

Issue rasbt#610 was labelled `easy` by the maintainer.

Co-Authored-By: Claude Code <noreply@anthropic.com>

@rasbt rasbt left a comment

Owner

LGTM!

@rasbt rasbt merged commit fd7bec2 into rasbt:master Jun 6, 2026
2 checks passed
Development

Successfully merging this pull request may close these issues.

Adding a topn parameter to the Exhaustive Feature Selector

2 participants