Description
Problem Description
The silhouette_avg() metric in tidyclust is currently defined with direction = "zero". As a result, show_best() and select_best() from the tune package rank models by how close their silhouette is to zero, so they return poorly performing models instead of the best ones.
Current Behavior
When show_best() is used with silhouette_avg, results are sorted by abs(mean), so values closest to zero come first. That ordering is wrong for the silhouette metric: silhouette values of [0.8, 0.1, -0.2, 0.05] are returned as [0.05, 0.1, -0.2, 0.8], which ranks the most ambiguous clustering first and the best one (0.8) last.
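The ordering difference can be reproduced without tidyclust or tune. The base R sketch below applies the two sort rules to the four example values; it only imitates the ranking step and is not the tune implementation.

```r
# Example silhouette values from the description above
sil <- c(0.8, 0.1, -0.2, 0.05)

# Ranking used for direction = "zero": closest to zero first
by_zero <- sil[order(abs(sil))]

# Ranking used for direction = "maximize": largest first
by_max <- sort(sil, decreasing = TRUE)

print(by_zero) # 0.05 0.10 -0.20 0.80
print(by_max)  # 0.80 0.10 0.05 -0.20
```

With the "zero" rule the best value (0.8) ends up last; with the "maximize" rule it comes first.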
Expected Behavior
The silhouette coefficient ranges from -1 to 1, where:
- Higher values (close to 1) = better clustering
- Lower values (close to -1) = worse clustering
- Values close to 0 = ambiguous clustering
Therefore, silhouette_avg() should use direction = "maximize" to ensure show_best() correctly sorts by desc(mean), showing the best models first.
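To illustrate why higher is better, here is a minimal base R sketch of the average silhouette on toy data. The function name avg_silhouette, the toy points, and the convention of scoring singleton clusters as 0 are assumptions made for this example; this is not the tidyclust implementation.

```r
# Average silhouette: s(i) = (b - a) / max(a, b), where
#   a = mean distance from point i to the other points in its own cluster
#   b = smallest mean distance from point i to the points of another cluster
avg_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))
  ids <- unique(cluster)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cluster == cluster[i]
    own[i] <- FALSE
    if (!any(own)) {
      s[i] <- 0 # assumed convention for singleton clusters
      next
    }
    a <- mean(d[i, own])
    b <- min(vapply(
      setdiff(ids, cluster[i]),
      function(k) mean(d[i, cluster == k]),
      numeric(1)
    ))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Two well-separated groups on a line (hypothetical data)
x <- matrix(c(0, 0.1, 0.2, 5, 5.1, 5.2), ncol = 1)

good_s <- avg_silhouette(x, c(1, 1, 1, 2, 2, 2)) # matches the true groups
bad_s  <- avg_silhouette(x, c(1, 2, 1, 2, 1, 2)) # mixes the two groups

good_s > bad_s # TRUE
```

The assignment that matches the real groups scores close to 1, while the mixed assignment scores near zero, so selecting by the largest value picks the better clustering.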
Root Cause
In R/metric-silhouette.R at line 106:
silhouette_avg <- new_cluster_metric(
  silhouette_avg,
  direction = "zero" # ← This should be "maximize"
)
Evidence from tune Package
Looking at tune::show_best(), the sorting logic is:
if (metric_info$direction == "maximize") {
  summary_res <- summary_res |> dplyr::arrange(dplyr::desc(mean))
} else if (metric_info$direction == "minimize") {
  summary_res <- summary_res |> dplyr::arrange(mean)
} else if (metric_info$direction == "zero") {
  summary_res <- summary_res |> dplyr::arrange(abs(mean)) # ← Problem for silhouette
}
Comparison with Other Metrics
Other clustering metrics in tidyclust can safely use direction = "zero" because they are error/distance measures. Their values are never negative, so sorting by abs(mean) gives the same order as minimizing:
- sse_within_total() - lower is better (minimize sum of squared errors)
- sse_total() - lower is better
- sse_ratio() - lower ratio is better
But silhouette_avg() is fundamentally different - it's a quality measure where higher values indicate better clustering.
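The difference between the two kinds of metric can be checked with a few lines of base R. The SSE values below are made up for illustration; the silhouette values are the ones used earlier in this report.

```r
# Hypothetical non-negative error values (for example, SSE for three candidate models)
sse <- c(120.5, 80.2, 95.0)

# For non-negative values, "closest to zero" and "smallest" give the same ranking
sse_same <- identical(order(abs(sse)), order(sse))

# Silhouette values can be negative, and larger is better
sil <- c(0.8, 0.1, -0.2, 0.05)

# "Closest to zero" does not match "largest first"
sil_same <- identical(order(abs(sil)), order(sil, decreasing = TRUE))

print(sse_same) # TRUE
print(sil_same) # FALSE
```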
Proposed Solution
Change line 106 in R/metric-silhouette.R from:
direction = "zero"
to:
direction = "maximize"
Reproducible Example
library(tidymodels)
library(tidyclust)
library(dplyr)
# Prepare the iris data
iris_numeric <- iris %>%
  select(where(is.numeric))
# Create a recipe to scale the data
iris_recipe <- recipe(~., data = iris_numeric) %>%
  step_normalize(all_numeric_predictors())
# Prepare and bake the recipe
prepared_recipe <- prep(iris_recipe)
iris_scaled <- bake(prepared_recipe, new_data = NULL)
# Create cross-validation folds
set.seed(123)
iris_folds <- rsample::vfold_cv(iris_scaled, v = 5)
# Tunable hierarchical model specification
hc_spec_tuned <- hier_clust(
  num_clusters = tune(),
  linkage_method = tune()
) %>%
  set_engine("stats")
hc_grid <- grid_regular(
  num_clusters(range = c(2, 5)),
  linkage_method(values = c("complete", "average", "ward.D2")),
  levels = c(4, 3)
) %>%
  # For some reason, dials incorrectly names this parameter. I'm going to report that separately.
  rename(linkage_method = activation)
hc_tuning_results <- tune_cluster(
  hc_spec_tuned,
  ~ .,
  resamples = iris_folds,
  grid = hc_grid,
  metrics = cluster_metric_set(silhouette_avg)
)
show_best(hc_tuning_results, metric = "silhouette_avg")
Impact
This bug affects any workflow that uses tune_cluster() or tune_grid() with silhouette_avg as the optimization metric, leading to the selection of suboptimal clustering models.