Fix sorting order based on order_column condition #46

Closed
jonaslandsgesell wants to merge 1 commit into sherbold:master from jonaslandsgesell:jonaslandsgesell-change-sorting


@jonaslandsgesell commented May 6, 2026

Disclaimer: Claude-generated code.

Observation: sorting under the "higher raw values are better" setting produces an unexpected ordering in LaTeX tables: the entity with the worst rank is placed at the top and used as the baseline for the effect size calculation.

The autorank library currently couples the input metric direction with the output dataframe sorting, leading to incorrect baseline selection in _util.py.

Current Mechanism: _create_result_df_skeleton uses the asc boolean (derived from the order parameter) to sort the final rankdf.

The Conflict: When order='descending' (higher-is-better), asc is set to False. The code then executes .sort_values(by='meanrank', ascending=False).

Result: The model with the highest mean rank (the worst performer) is placed at the first index.

Consequence: Because the first index serves as the control group, effect sizes and post-hoc tests are incorrectly calculated relative to the worst model instead of the best.

Proposed Fix:
Always sort meanrank in ascending order (lowest rank first), regardless of the raw metric's direction.
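
The difference between the two sorting behaviors can be illustrated with a minimal sketch. It uses a toy `pandas.Series` of mean ranks and two hypothetical helpers (`sort_coupled`, `sort_decoupled`); it is an assumption-laden illustration of the description above, not autorank's actual code in `_util.py`.

```python
import pandas as pd

# Toy mean ranks (1 = best). Illustrative values, not autorank output.
meanrank = pd.Series({"middle": 2.0, "worst": 3.0, "best": 1.0}, name="meanrank")


def sort_coupled(ranks: pd.Series, order: str) -> pd.Series:
    """Hypothetical: sort direction follows the metric direction (current behavior as described)."""
    asc = order == "ascending"
    return ranks.sort_values(ascending=asc)


def sort_decoupled(ranks: pd.Series, order: str) -> pd.Series:
    """Hypothetical: lowest mean rank always first; `order` no longer affects the sort."""
    return ranks.sort_values(ascending=True)


coupled = sort_coupled(meanrank, "descending")
decoupled = sort_decoupled(meanrank, "descending")

print("coupled baseline:  ", coupled.index[0])    # worst
print("decoupled baseline:", decoupled.index[0])  # best
```

In the decoupled variant the `order` argument has no influence on the row order, which is the trade-off of this approach.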


Minimal script that passes with the proposed fix and fails without it:

```
import numpy as np
import pandas as pd
from autorank import autorank

rng = np.random.default_rng(42)
n_datasets = 20

# Test 1: Higher-is-better (like R²)
print("=" * 60)
print("TEST 1: Higher-is-better (R²) with order='descending'")
print("=" * 60)
data_hib = pd.DataFrame({
    "best":   rng.uniform(0.85, 1.00, n_datasets),
    "middle": rng.uniform(0.60, 0.80, n_datasets),
    "worst":  rng.uniform(0.20, 0.45, n_datasets),
})

print("\nInput: 3 models, higher value = better (like R²)")
print(f"  best   median ≈ {data_hib['best'].median():.3f}")
print(f"  middle median ≈ {data_hib['middle'].median():.3f}")
print(f"  worst  median ≈ {data_hib['worst'].median():.3f}")

result = autorank(data_hib, alpha=0.05, order='descending')
rdf = result.rankdf

# The central-tendency column is named 'mean' or 'median' depending on the result
central = 'mean' if 'mean' in rdf.columns else 'median'
print("\nautorank rankdf (order='descending'):")
print(rdf[['meanrank', central, 'effect_size', 'magnitude']].to_string())

print()
# The first row acts as the baseline for effect sizes, so it should be the best model
best_first = rdf.index[0]
print(f"rankdf.index[0] (baseline for effect size): '{best_first}'")

if best_first == 'best' and rdf.at['best', 'effect_size'] == 0.0:
    print("✓ CORRECT — best model is first, effect_size=0 for 'best'")
else:
    print("✗ BUG     — wrong model is first; effect sizes relative to wrong baseline!")
    print()
    print("Expected order : best  → middle → worst")
    print("Actual order   :", " → ".join(rdf.index.tolist()))


```

@sherbold (Owner)

Thanks for the PR. I do not have time for a review right now, but wanted to let you know that I have seen this and will hopefully get around to it next week.

@sherbold (Owner)

Yes, this is indeed undesirable behavior in the LaTeX table generation. However, the suggested fix is a good demonstration that Claude often cannot understand broader considerations. It breaks the complete sorting logic, effectively rendering the sorting parameter useless for meanrank-based sorting, without even documenting this.

A better solution would be to emit a warning of some sort in the LaTeX generation, possibly suggesting ascending order instead. Additionally, a new parameter could be introduced in the LaTeX table generation to allow descending sorting while still comparing effect sizes to the best model, e.g., a parameter that specifies the row relative to which the effect sizes etc. should be reported (an integer index in Python logic, such that -1 is the last row).
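
A rough sketch of how such a parameter and warning might fit together follows. The helper `rank_gap_to_baseline` and its `baseline_row` argument are hypothetical and do not exist in autorank. The mean-rank gap is only a simple stand-in for a real effect size.

```python
import warnings

import pandas as pd


def rank_gap_to_baseline(meanrank: pd.Series, ascending: bool, baseline_row: int = 0) -> pd.Series:
    """Hypothetical helper: sort by mean rank, then report each row's mean-rank
    gap to the row selected by `baseline_row` (Python indexing, -1 = last row)."""
    ordered = meanrank.sort_values(ascending=ascending)
    baseline_value = ordered.iloc[baseline_row]
    # Warn when the chosen baseline is not the best-ranked (lowest mean rank) row.
    if baseline_value > ordered.min():
        warnings.warn(
            f"Baseline '{ordered.index[baseline_row]}' is not the best-ranked row; "
            "consider ascending order or a different baseline_row.",
            UserWarning,
        )
    return ordered - baseline_value


# Toy mean ranks (1 = best). Illustrative values only.
meanrank = pd.Series({"best": 1.0, "middle": 2.0, "worst": 3.0})

# Descending display order, but gaps are reported relative to the last row.
gaps = rank_gap_to_baseline(meanrank, ascending=False, baseline_row=-1)
print(gaps)
```

With `ascending=False` and the default `baseline_row=0`, the same call would emit the warning, because the first row is then the worst-ranked entry.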
