
Optimization of the OptiSim algorithm (For review)#289

Closed
dhruvDev23 wants to merge 1 commit into theochem:main from dhruvDev23:intial_optimization_OptiSim

Conversation

@dhruvDev23
Contributor

This PR optimizes the algorithm() method of the OptiSim class in selector/methods/distance.py.

Original Implementation

  • In the original implementation, the kd-tree was rebuilt from all selected points on every iteration of the selection loop.
  • For a selection of k candidates, the kd-tree was rebuilt k times, making the operation increasingly expensive as more points were selected.

After Optimization

  • Removed the kd-tree rebuild from the loop.
  • Introduced a min_dists array that stores, for each point, the minimum distance to its nearest selected point.
  • Each iteration selects and appends the candidate farthest from its nearest selected point using the min_dists array.
  • After each selection, the min_dists array is updated against the new candidate.
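The steps above can be sketched as a small standalone function. This is an illustrative sketch of the min_dists idea, not the actual selector code, and the function name is hypothetical:

```python
import numpy as np

def farthest_point_selection(x, size, ref_index=0):
    """Incremental farthest-point selection using a min_dists array."""
    selected = [ref_index]
    # distance from every point to the only selected point so far
    min_dists = np.linalg.norm(x - x[ref_index], axis=1)
    while len(selected) < size:
        # candidate farthest from its nearest selected point
        idx = int(np.argmax(min_dists))
        selected.append(idx)
        # the new candidate may now be some points' nearest selected point,
        # so take the element-wise minimum instead of rebuilding a kd-tree
        min_dists = np.minimum(min_dists, np.linalg.norm(x - x[idx], axis=1))
    return selected
```

Each iteration costs O(n), so selecting k points costs O(nk) overall, versus rebuilding a kd-tree over the selected set on every iteration.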

All test cases passed.

Analysis of different input data shapes

Before Optimization
[screenshot: runtime measurements before optimization]

After Optimization
[screenshot: runtime measurements after optimization]

dhruvDev23 force-pushed the intial_optimization_OptiSim branch from ea6063b to e41516f on February 18, 2026
@marco-2023
Collaborator

Hi @dhruvDev23 thank you very much for your PR.

I ran a slightly bigger example:

import time
import numpy as np
from selector.methods.distance import DISE
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = DISE(ref_index=0, p=2)

# number of runs
n_runs = 10
times = []

for _ in range(n_runs):
    start = time.perf_counter()
    collector.select(X_dist, size=25)
    end = time.perf_counter()
    times.append(end - start)

times = np.array(times)
print(f"Mean runtime: {times.mean():.4f} s")
print(f"Std dev:      {times.std():.4f} s")
print(f"Min runtime:  {times.min():.4f} s")
print(f"Max runtime:  {times.max():.4f} s")

For the original code, the times were:

Std dev:      0.1512 s
Min runtime:  10.5704 s
Max runtime:  11.0648 s

for this PR:

Std dev:      0.1375 s
Min runtime:  10.5764 s
Max runtime:  10.9690 s

Unfortunately, this PR does not improve performance in a meaningful way, nor does it enhance the readability or simplicity of the current codebase. I profiled the example with cProfile:

import cProfile
import pstats
from selector.methods.distance import DISE
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = DISE(ref_index=0, p=2)

# profile the select function
pr = cProfile.Profile()
pr.enable()

collector.select(X_dist, size=25)

pr.disable()

# print stats sorted by cumulative time
stats = pstats.Stats(pr)
stats.strip_dirs()
stats.sort_stats("cumtime")  # can also use "tottime"
stats.print_stats(30)        # top 30 functions

The results indicate that most of the time is spent in the optimize_radius function in utils. I would try to optimize the code there instead.


   Ordered by: cumulative time
   List reduced from 97 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   11.039    5.519 interactiveshell.py:3663(run_code)
        2    0.000    0.000   11.039    5.519 {built-in method builtins.exec}
        1    0.000    0.000   11.039   11.039 3200550339.py:1(<module>)
        1    0.000    0.000   11.039   11.039 base.py:38(select)
        1    0.000    0.000   11.039   11.039 distance.py:601(select_from_cluster)
        1    0.000    0.000   11.039   11.039 utils.py:35(optimize_radius)
       10    0.004    0.000   11.032    1.103 distance.py:519(algorithm)
       10    0.000    0.000   10.134    1.013 distance.py:1816(pdist)
       10    0.000    0.000   10.134    1.013 _lazy.py:57(lazy_apply)
       10    0.000    0.000   10.133    1.013 _lazy.py:333(wrapper)
       10    0.000    0.000   10.133    1.013 distance.py:2110(_np_pdist)
       10   10.133    1.013   10.133    1.013 {built-in method scipy.spatial._distance_pybind.pdist_minkowski}
      220    0.662    0.003    0.663    0.003 _kdtree.py:486(query_ball_point)
       10    0.130    0.013    0.160    0.016 _kdtree.py:359(__init__)
       10    0.000    0.000    0.069    0.007 distance.py:2154(squareform)
       10    0.063    0.006    0.063    0.006 {built-in method scipy.spatial._distance_wrap.to_squareform_from_vector_wrap}
      270    0.037    0.000    0.037    0.000 {method 'reduce' of 'numpy.ufunc' objects}
       20    0.000    0.000    0.027    0.001 fromnumeric.py:66(_wrapreduction)
       10    0.000    0.000    0.014    0.001 fromnumeric.py:3127(amax)
       10    0.000    0.000    0.014    0.001 fromnumeric.py:3265(amin)
        1    0.000    0.000    0.007    0.007 fromnumeric.py:2921(ptp)
        1    0.000    0.000    0.007    0.007 _methods.py:231(_ptp)
       20    0.006    0.000    0.006    0.000 {built-in method numpy.zeros}
      230    0.000    0.000    0.003    0.000 _methods.py:64(_all)
       10    0.000    0.000    0.001    0.000 distance.py:633(get_initial_selection)
       10    0.000    0.000    0.001    0.000 fromnumeric.py:1100(argsort)
       10    0.000    0.000    0.001    0.000 fromnumeric.py:48(_wrapfunc)
       10    0.001    0.000    0.001    0.000 {method 'argsort' of 'numpy.ndarray' objects}
       18    0.000    0.000    0.000    0.000 fromnumeric.py:2436(any)
       18    0.000    0.000    0.000    0.000 fromnumeric.py:86(_wrapreduction_any_all)

@marco-2023
Collaborator

Because of the previous reasons, I will be closing this PR.

marco-2023 closed this on Feb 25, 2026
@dhruvDev23
Contributor Author

dhruvDev23 commented Feb 25, 2026

Hi @marco-2023 ,

Thank you for reviewing my PR and for profiling with the example. I’d like to clarify a point regarding the benchmarking.
Your benchmarks tested DISE, not OptiSim. In this PR, I have optimized the algorithm method specifically for the OptiSim class, which is why the benchmarks showed similar results in both cases.

I have used the same example to measure the runtime for the OptiSim class

import time
import numpy as np
from selector.methods.distance import OptiSim
from sklearn.datasets import make_blobs

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

# number of runs
n_runs = 10

def bench(collector, data, label):
    times = []
    for _ in range(n_runs):
        collector.r = collector.r0  # reset the radius between runs
        start = time.perf_counter()
        collector.select(data, size=25)
        end = time.perf_counter()
        times.append(end - start)
    times = np.array(times)

    print(f"{label}:")
    print(f"Mean runtime: {times.mean():.4f} s")
    print(f"Std dev:      {times.std():.4f} s")
    print(f"Min runtime:  {times.min():.4f} s")
    print(f"Max runtime:  {times.max():.4f} s")


bench(OptiSim(ref_index=0, p=2), X, "OptiSim")

Before Optimization

Mean runtime: 0.0318 s
Std dev:      0.0044 s
Min runtime:  0.0291 s
Max runtime:  0.0442 s

After Optimization

Mean runtime: 0.0068 s
Std dev:      0.0002 s
Min runtime:  0.0067 s
Max runtime:  0.0074 s

Please let me know if this approach seems reasonable.

As your profiling results show, most of the time is spent in the optimize_radius function in utils. I have been working on strategies to optimize that function as well, and I am also interested in improving the performance of the DISE algorithm, as the stretch goal of GSoC 2026 ("Improve efficiency of OptiSim selection method") suggests.

I would appreciate any strategies or ideas you could suggest for the optimization.

Thanks again for your help,
Dhruv

@marco-2023
Collaborator

Hi @dhruvDev23, thank you for pointing that out. At the same time, I find your results interesting.
When I run:

import time
import numpy as np
from selector.methods.distance import DISE, OptiSim
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = OptiSim(ref_index=0, p=2)
# number of runs
n_runs = 10
times = []

for _ in range(n_runs):
    start = time.perf_counter()
    collector.select(X_dist, size=25)
    end = time.perf_counter()
    times.append(end - start)

times = np.array(times)
print(f"Mean runtime: {times.mean():.4f} s")
print(f"Std dev:      {times.std():.4f} s")
print(f"Min runtime:  {times.min():.4f} s")
print(f"Max runtime:  {times.max():.4f} s")

The times I get are:
Original:

Mean runtime: 1.3750 s
Std dev:      0.0347 s
Min runtime:  1.3377 s
Max runtime:  1.4689 s

This branch

Mean runtime: 3.2195 s
Std dev:      0.1107 s
Min runtime:  3.1004 s
Max runtime:  3.3963 s

which is the opposite trend. This remains the case if I select a sample of 50 instead.
original time:

Mean runtime: 2.3346 s
Std dev:      0.0556 s
Min runtime:  2.2473 s
Max runtime:  2.4303 s

This branch:

Std dev:      0.0575 s
Min runtime:  5.5108 s
Max runtime:  5.7057 s

@marco-2023
Collaborator

I ran the same profile for OptiSim (although with bigger data and sample sizes):

import cProfile
import pstats
from selector.methods.distance import DISE, OptiSim
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

# generate sample data
n = 5000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = OptiSim(ref_index=0, p=2)

# profile the select function
pr = cProfile.Profile()
pr.enable()

collector.select(X_dist, size=200)

pr.disable()

# print stats sorted by cumulative time
stats = pstats.Stats(pr)
stats.strip_dirs()
stats.sort_stats("cumtime")  # can also use "tottime"
stats.print_stats(30)        # top 30 functions

The results point to the query_ball_point calls as the time-consuming step. I would recommend trying to refactor the algorithm method to reduce the number of query_ball_point calls or the size of their input.

        1274341 function calls (1272157 primitive calls) in 37.041 seconds

   Ordered by: cumulative time
   List reduced from 176 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   37.041   18.521 interactiveshell.py:3663(run_code)
        2    0.000    0.000   37.041   18.521 {built-in method builtins.exec}
        1    0.000    0.000   37.041   37.041 base.py:38(select)
        1    0.000    0.000   37.041   37.041 distance.py:411(select_from_cluster)
        1    0.000    0.000   37.041   37.041 utils.py:35(optimize_radius)
       11    0.904    0.082   37.022    3.366 distance.py:339(algorithm)
     2195   28.383    0.013   28.398    0.013 _kdtree.py:486(query_ball_point)
     2204    3.321    0.002    4.251    0.002 _kdtree.py:359(__init__)
     2193    0.205    0.000    3.193    0.001 _kdtree.py:369(query)
    21871    0.039    0.000    2.307    0.000 threading.py:938(start)
   109356    2.230    0.000    2.230    0.000 {method 'acquire' of '_thread.lock' objects}
    21871    0.031    0.000    1.985    0.000 threading.py:604(wait)
    21871    0.044    0.000    1.933    0.000 threading.py:288(wait)
    15388    0.953    0.000    0.953    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     8794    0.026    0.000    0.912    0.000 fromnumeric.py:66(_wrapreduction)
     2204    0.004    0.000    0.505    0.000 fromnumeric.py:3127(amax)
    21871    0.017    0.000    0.427    0.000 threading.py:1080(join)
    21872    0.014    0.000    0.400    0.000 threading.py:1118(_wait_for_tstate_lock)
     2204    0.003    0.000    0.365    0.000 fromnumeric.py:3265(amin)
...
     6579    0.009    0.000    0.044    0.000 fromnumeric.py:48(_wrapfunc)
     2193    0.004    0.000    0.042    0.000 fromnumeric.py:3287(prod)
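One direction for the refactor suggested above, under the assumption that the cost is dominated by repeated single-point queries: SciPy's cKDTree.query_ball_point also accepts an (m, k) array of query points, so many neighborhoods can be collected in one vectorized call instead of one Python-level call per point. A sketch, not the OptiSim code:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.random((1000, 2))
tree = cKDTree(pts)

# one Python-level call per query point (what a per-point loop does):
per_point = [tree.query_ball_point(p, r=0.1) for p in pts[:5]]

# a single vectorized call over an array of query points:
batched = tree.query_ball_point(pts[:5], r=0.1)

# both forms return the same neighbor indices
assert all(sorted(a) == sorted(b) for a, b in zip(per_point, batched))
```

Whether this helps here depends on how the algorithm method structures its queries; batching removes per-call Python overhead but not the underlying tree traversal work.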

@dhruvDev23
Contributor Author

Hi @marco-2023 ,

You're right. Sorry, I benchmarked only with the raw feature data, not with the pairwise distance matrix. I see now that the base class SelectionBase.select() explicitly supports inputs in both formats.

Looking at the profiling results, there are two issues:

  • np.linalg.norm(x - x[idx]) over a (5000 × 5000) distance matrix treats each row as a 5000-dimensional vector, making it very slow.
  • query_ball_point is consuming a lot of time.
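The first point can be illustrated directly: with a precomputed distance matrix, the distances to a selected point are already stored as a row, so a norm over row differences both computes a different quantity and costs O(n²) per lookup. A small illustration with hypothetical variable names, not the selector code:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random((100, 2))  # raw feature data
# precomputed pairwise Euclidean distance matrix
x_dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

idx = 0
# with raw features, this is the intended distance to point idx:
feat_dists = np.linalg.norm(x - x[idx], axis=1)

# with the distance matrix, the same expression treats each
# 100-element row as a vector and computes something else entirely:
not_distances = np.linalg.norm(x_dist - x_dist[idx], axis=1)

# the correct (and O(n)) lookup is simply the matrix row:
row = x_dist[idx]

assert np.allclose(feat_dists, row)
assert not np.allclose(not_distances, row)
```

So a refactor likely needs to branch on the input format rather than applying the feature-space expression to both.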

I'll work on both of these issues and share a solution in a few days.

Thanks for your patience!

Dhruv
