feature_selection: add top_k to EFS.get_metric_dict (#610)#1165
Merged
Conversation
ExhaustiveFeatureSelector can evaluate a very large number of feature subsets. Users hitting the recurring memory issue when calling `get_metric_dict()` (and immediately turning it into a DataFrame) had to re-implement the ranking themselves to keep only the top scorers. Adds an optional `top_k` keyword argument that, when set, returns only the top-K subsets ranked by `avg_score` descending. The default `top_k=None` preserves the historical behaviour exactly. Original iteration keys are kept so downstream code can still cross-reference `subsets_`. Issue rasbt#610 was labelled `easy` by the maintainer. Co-Authored-By: Claude Code <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Code of Conduct
I have read the project's Code of Conduct.
Description
ExhaustiveFeatureSelectorevaluates every combination of features in[min_features, max_features], which can produce a very large number of subsets. The recurring user complaint in #610 is that callingget_metric_dict()and turning the result into a DataFrame materialises every evaluated subset in memory.This PR adds an optional
top_kkeyword toget_metric_dict()that keeps only the top-K subsets ranked byavg_score(descending).top_k=None(default) preserves the historical behaviour exactly — same keys, same shape — so existing callers are unaffected.The original iteration keys are kept rather than re-numbered, so downstream code can still cross-reference
subsets_using the same keys.Related issues or pull requests
Fixes #610 (labelled
easyby the maintainer)Pull Request Checklist
./docs/sources/CHANGELOG.mdfile./mlxtend/feature_selection/tests/directory (4 new tests)mlxtend/docs/sources/(deferred — the docstring covers the new parameter; happy to add a notebook example if desired)PYTHONPATH='.' pytest ./mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -sv— 27/27 passedflake8 ./mlxtend(clean) andblack --check(clean after auto-format)Reproduce BEFORE/AFTER yourself (copy-paste)
What I ran locally
PYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q→ 27/27 passedPYTHONPATH=. pytest mlxtend/feature_selection/tests/test_exhaustive_feature_selector.py -q -k issue_610→ 4/4 passedorigin/master'sexhaustive_feature_selector.py: 4/4 fail withTypeError: get_metric_dict() got an unexpected keyword argument 'top_k'flake8+black --checkon touched files → clean (after applyingblack)Edge cases tested
top_kreturns the K highest-scoring subsetstop_k=3after fitting on iris with 1..3 featuressorted(full, key=avg_score)[:3]test_get_metric_dict_top_k_returns_top_subsets_issue_610top_k=None= unchanged behaviourNonetest_get_metric_dict_top_k_none_preserves_default_behavior_issue_610top_kraisesValueError0,-2,1.5ValueError: top_k must be a positive integer or Nonetest_get_metric_dict_top_k_invalid_raises_issue_610top_klarger than evaluated counttop_k=1_000_000test_get_metric_dict_top_k_larger_than_total_returns_all_issue_610Risk / blast radius
Minimal. The new argument defaults to
None, which is a no-op; every existing call site behaves identically. The validation only fires whentop_kis provided, and the sort+slice is O(N log N) over the same dict that was already being constructed — no new memory pressure for the default path.Release note
PR drafted with assistance from Claude Code. The change was reviewed manually against rasbt/mlxtend's source. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.