silhouette_avg() metric has incorrect direction: should be "maximize" instead of "zero" #212

@dnldelarosa

Problem Description

The silhouette_avg() metric in tidyclust is currently defined with direction = "zero". As a result, show_best() and select_best() from the tune package rank models by how close their silhouette score is to zero instead of by how high it is, so poorly performing models are returned ahead of the best ones.

Current Behavior

When using show_best() with silhouette_avg, the function sorts results by abs(mean) (values closest to zero first), which is incorrect for the silhouette metric. For example, silhouette values of [0.8, 0.1, -0.2, 0.05] are returned in the order [0.05, 0.1, -0.2, 0.8], so the best model (0.8) is listed last.
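
The ordering can be checked in base R without tidyclust or tune. The sketch below uses the illustrative scores from this issue and only mimics the two sort orders; it does not call show_best() itself.

```r
# Illustrative mean silhouette scores for four candidate models
scores <- c(0.8, 0.1, -0.2, 0.05)

# Ordering applied for direction = "zero": closest to zero first
zero_order <- scores[order(abs(scores))]
print(zero_order)

# Ordering applied for direction = "maximize": largest value first
max_order <- scores[order(scores, decreasing = TRUE)]
print(max_order)
```

With the "zero" ordering the 0.8 model lands in last place; with the "maximize" ordering it comes first.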

Expected Behavior

The silhouette coefficient ranges from -1 to 1, where:

  • Higher values (close to 1) = better clustering
  • Lower values (close to -1) = worse clustering
  • Values close to 0 = ambiguous clustering

Therefore, silhouette_avg() should use direction = "maximize" to ensure show_best() correctly sorts by desc(mean), showing the best models first.

Root Cause

In R/metric-silhouette.R at line 106:

  silhouette_avg <- new_cluster_metric(
    silhouette_avg,
    direction = "zero"  # ← This should be "maximize"
  )

Evidence from tune Package

Looking at tune::show_best(), the sorting logic is:

  if (metric_info$direction == "maximize") {
    summary_res <- summary_res |> dplyr::arrange(dplyr::desc(mean))
  } else if (metric_info$direction == "minimize") {
    summary_res <- summary_res |> dplyr::arrange(mean)
  } else if (metric_info$direction == "zero") {
    summary_res <- summary_res |> dplyr::arrange(abs(mean))  # ← Problem for silhouette
  }

Comparison with Other Metrics

Other clustering metrics in tidyclust also use direction = "zero". That works for them because they are non-negative error/distance measures, so sorting by abs(mean) gives the same order as minimizing:

  • sse_within_total() - lower is better (minimize sum of squared errors)
  • sse_total() - lower is better
  • sse_ratio() - lower ratio is better

But silhouette_avg() is fundamentally different - it's a quality measure where higher values indicate better clustering.
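
Sums of squares are never negative, so for the SSE metrics an abs(mean) sort is indistinguishable from an ascending sort. Silhouette scores can be negative and are better when larger, so the same shortcut does not hold. A minimal base R check, using made-up values rather than real tidyclust output:

```r
# Hypothetical within-cluster SSE values (non-negative by construction)
sse <- c(12.4, 3.1, 7.9, 0.6)

# For non-negative values, "zero" ordering and "minimize" ordering coincide
sse_same <- identical(order(abs(sse)), order(sse))
print(sse_same)

# Illustrative silhouette scores, including a negative one
sil <- c(0.8, 0.1, -0.2, 0.05)

# "zero" ordering does not match the "maximize" ordering silhouette needs
sil_same <- identical(order(abs(sil)), order(sil, decreasing = TRUE))
print(sil_same)
```

The first comparison is TRUE and the second is FALSE, which is why "zero" is harmless for the SSE metrics but not for silhouette_avg().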

Proposed Solution

Change line 106 in R/metric-silhouette.R from:

direction = "zero"

to:

direction = "maximize"

Reproducible Example

library(tidymodels)
library(tidyclust)
library(dplyr)

# Prepare the iris data
iris_numeric <- iris %>%
  select(where(is.numeric))

# Create a recipe to scale the data
iris_recipe <- recipe(~., data = iris_numeric) %>%
  step_normalize(all_numeric_predictors())

# Prepare and bake the recipe
prepared_recipe <- prep(iris_recipe)
iris_scaled <- bake(prepared_recipe, new_data = NULL)

# Create cross-validation folds
set.seed(123)
iris_folds <- rsample::vfold_cv(iris_scaled, v = 5)

# Tunable hierarchical model specification
hc_spec_tuned <- hier_clust(
    num_clusters = tune(),
    linkage_method = tune()
  ) %>%
  set_engine("stats")

hc_grid <- grid_regular(
  num_clusters(range = c(2, 5)),
  linkage_method(values = c("complete", "average", "ward.D2")),
  levels = c(4, 3)
) %>% 
  rename(linkage_method = activation) # For some reason, dials incorrectly names this parameter. I'm going to report that separately.

  
hc_tuning_results <- tune_cluster(
  hc_spec_tuned,
  ~ .,
  resamples = iris_folds,
  grid = hc_grid,
  metrics = cluster_metric_set(silhouette_avg)
)

show_best(hc_tuning_results, metric = "silhouette_avg")

Impact

This bug affects any workflow using tune_cluster() or tune_grid() with silhouette_avg as the optimization metric, leading to selection of suboptimal clustering models.
