
Optimization of the OptiSim algorithm (For review)#289

Closed
dhruvDev23 wants to merge 1 commit into theochem:main from dhruvDev23:intial_optimization_OptiSim

Conversation

@dhruvDev23
Contributor

This PR optimizes the algorithm() method of the OptiSim class in selector/methods/distance.py.

Original Implementation

  • In the original implementation, the kd-tree was rebuilt from all selected points on every iteration of the selection loop.
  • For a selection of k candidates, the kd-tree was rebuilt k times, making the operation increasingly expensive as more points were selected.

After Optimization

  • Removed the kd-tree rebuild from the loop.
  • Introduced a min_dists array that stores, for each point, the minimum distance to its nearest selected point.
  • Each iteration selects and appends the candidate farthest from its nearest selected point using the min_dists array.
  • After each selection, the min_dists array is updated against the new candidate.
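The steps above can be sketched as a small standalone function. This is an illustrative sketch of the min_dists idea, not the actual selector code, and the function name is hypothetical:

```python
import numpy as np

def farthest_point_selection(x, size, ref_index=0):
    """Incremental farthest-point selection using a min_dists array."""
    selected = [ref_index]
    # distance from every point to the only selected point so far
    min_dists = np.linalg.norm(x - x[ref_index], axis=1)
    while len(selected) < size:
        # candidate farthest from its nearest selected point
        idx = int(np.argmax(min_dists))
        selected.append(idx)
        # the new candidate may now be some points' nearest selected point,
        # so take the element-wise minimum instead of rebuilding a kd-tree
        min_dists = np.minimum(min_dists, np.linalg.norm(x - x[idx], axis=1))
    return selected
```

Each iteration costs O(n), so selecting k points costs O(nk) overall, versus rebuilding a kd-tree over the selected set on every iteration.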

All test cases passed.

Analysis of different input data shapes

Before Optimization
[screenshot: runtime measurements before optimization]

After Optimization
[screenshot: runtime measurements after optimization]

dhruvDev23 force-pushed the intial_optimization_OptiSim branch from ea6063b to e41516f on February 18, 2026
@marco-2023
Collaborator

Hi @dhruvDev23 thank you very much for your PR.

I ran a slightly bigger example:

import time
import numpy as np
from selector.methods.distance import DISE
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = DISE(ref_index=0, p=2)

# number of runs
n_runs = 10
times = []

for _ in range(n_runs):
    start = time.perf_counter()
    collector.select(X_dist, size=25)
    end = time.perf_counter()
    times.append(end - start)

times = np.array(times)
print(f"Mean runtime: {times.mean():.4f} s")
print(f"Std dev:      {times.std():.4f} s")
print(f"Min runtime:  {times.min():.4f} s")
print(f"Max runtime:  {times.max():.4f} s")

For the original code, the times were:

Std dev:      0.1512 s
Min runtime:  10.5704 s
Max runtime:  11.0648 s

for this PR:

Std dev:      0.1375 s
Min runtime:  10.5764 s
Max runtime:  10.9690 s

Unfortunately, this PR does not improve performance in a meaningful way, nor does it enhance the readability or simplicity of the current codebase. I profiled the example with cProfile:

import cProfile
import pstats
from selector.methods.distance import DISE
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = DISE(ref_index=0, p=2)

# profile the select function
pr = cProfile.Profile()
pr.enable()

collector.select(X_dist, size=25)

pr.disable()

# print stats sorted by cumulative time
stats = pstats.Stats(pr)
stats.strip_dirs()
stats.sort_stats("cumtime")  # can also use "tottime"
stats.print_stats(30)        # top 30 functions

The results indicate that most of the time is spent in the optimize_radius function in utils. I would try to optimize the code there instead.


   Ordered by: cumulative time
   List reduced from 97 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   11.039    5.519 interactiveshell.py:3663(run_code)
        2    0.000    0.000   11.039    5.519 {built-in method builtins.exec}
        1    0.000    0.000   11.039   11.039 3200550339.py:1(<module>)
        1    0.000    0.000   11.039   11.039 base.py:38(select)
        1    0.000    0.000   11.039   11.039 distance.py:601(select_from_cluster)
        1    0.000    0.000   11.039   11.039 utils.py:35(optimize_radius)
       10    0.004    0.000   11.032    1.103 distance.py:519(algorithm)
       10    0.000    0.000   10.134    1.013 distance.py:1816(pdist)
       10    0.000    0.000   10.134    1.013 _lazy.py:57(lazy_apply)
       10    0.000    0.000   10.133    1.013 _lazy.py:333(wrapper)
       10    0.000    0.000   10.133    1.013 distance.py:2110(_np_pdist)
       10   10.133    1.013   10.133    1.013 {built-in method scipy.spatial._distance_pybind.pdist_minkowski}
      220    0.662    0.003    0.663    0.003 _kdtree.py:486(query_ball_point)
       10    0.130    0.013    0.160    0.016 _kdtree.py:359(__init__)
       10    0.000    0.000    0.069    0.007 distance.py:2154(squareform)
       10    0.063    0.006    0.063    0.006 {built-in method scipy.spatial._distance_wrap.to_squareform_from_vector_wrap}
      270    0.037    0.000    0.037    0.000 {method 'reduce' of 'numpy.ufunc' objects}
       20    0.000    0.000    0.027    0.001 fromnumeric.py:66(_wrapreduction)
       10    0.000    0.000    0.014    0.001 fromnumeric.py:3127(amax)
       10    0.000    0.000    0.014    0.001 fromnumeric.py:3265(amin)
        1    0.000    0.000    0.007    0.007 fromnumeric.py:2921(ptp)
        1    0.000    0.000    0.007    0.007 _methods.py:231(_ptp)
       20    0.006    0.000    0.006    0.000 {built-in method numpy.zeros}
      230    0.000    0.000    0.003    0.000 _methods.py:64(_all)
       10    0.000    0.000    0.001    0.000 distance.py:633(get_initial_selection)
       10    0.000    0.000    0.001    0.000 fromnumeric.py:1100(argsort)
       10    0.000    0.000    0.001    0.000 fromnumeric.py:48(_wrapfunc)
       10    0.001    0.000    0.001    0.000 {method 'argsort' of 'numpy.ndarray' objects}
       18    0.000    0.000    0.000    0.000 fromnumeric.py:2436(any)
       18    0.000    0.000    0.000    0.000 fromnumeric.py:86(_wrapreduction_any_all)

@marco-2023
Collaborator

Because of the previous reasons, I will be closing this PR.

marco-2023 closed this on Feb 25, 2026
@dhruvDev23
Contributor Author

dhruvDev23 commented Feb 25, 2026

Hi @marco-2023 ,

Thank you for reviewing my PR and for profiling with the example. I’d like to clarify a point regarding the benchmarking.
Your benchmarks tested DISE, not OptiSim. In this PR, I have optimized the algorithm method specifically for the OptiSim class, which is why the benchmarks showed similar results in both cases.

I have used the same example to measure the runtime for the OptiSim class

import time
import numpy as np
from selector.methods.distance import OptiSim
from sklearn.datasets import make_blobs

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

# number of runs
n_runs = 10

def bench(collector, data, label):
    times = []
    for _ in range(n_runs):
        collector.r = collector.r0  # reset the radius between runs
        start = time.perf_counter()
        collector.select(data, size=25)
        end = time.perf_counter()
        times.append(end - start)
    times = np.array(times)

    print(f"{label}:")
    print(f"Mean runtime: {times.mean():.4f} s")
    print(f"Std dev:      {times.std():.4f} s")
    print(f"Min runtime:  {times.min():.4f} s")
    print(f"Max runtime:  {times.max():.4f} s")


bench(OptiSim(ref_index=0, p=2), X, "OptiSim")

Before Optimization

Mean runtime: 0.0318 s
Std dev:      0.0044 s
Min runtime:  0.0291 s
Max runtime:  0.0442 s

After Optimization

Mean runtime: 0.0068 s
Std dev:      0.0002 s
Min runtime:  0.0067 s
Max runtime:  0.0074 s

Please let me know if this approach seems reasonable.

As your profiling results show, most of the time is spent in the optimize_radius function in utils. I have been working on strategies to optimize that function as well, and I am also interested in improving the performance of the DISE algorithm, as the stretch goal of GSoC 2026 ("Improve efficiency of OptiSim selection method") suggests.

I would appreciate any strategies or ideas you could suggest for the optimization.

Thanks again for your help,
Dhruv

@marco-2023
Collaborator

Hi @dhruvDev23, thank you for pointing that out. At the same time, I find your results interesting.
When I run:

import time
import numpy as np
from selector.methods.distance import DISE, OptiSim
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = OptiSim(ref_index=0, p=2)
# number of runs
n_runs = 10
times = []

for _ in range(n_runs):
    start = time.perf_counter()
    collector.select(X_dist, size=25)
    end = time.perf_counter()
    times.append(end - start)

times = np.array(times)
print(f"Mean runtime: {times.mean():.4f} s")
print(f"Std dev:      {times.std():.4f} s")
print(f"Min runtime:  {times.min():.4f} s")
print(f"Max runtime:  {times.max():.4f} s")

The times I get are:
Original:

Mean runtime: 1.3750 s
Std dev:      0.0347 s
Min runtime:  1.3377 s
Max runtime:  1.4689 s

This branch

Mean runtime: 3.2195 s
Std dev:      0.1107 s
Min runtime:  3.1004 s
Max runtime:  3.3963 s

which is the opposite trend. This remains the case if I select a sample of 50 instead.
original time:

Mean runtime: 2.3346 s
Std dev:      0.0556 s
Min runtime:  2.2473 s
Max runtime:  2.4303 s

This branch:

Std dev:      0.0575 s
Min runtime:  5.5108 s
Max runtime:  5.7057 s

@marco-2023
Collaborator

I ran the same profile for OptiSim (although with bigger data and sample sizes):

import cProfile
import pstats
from selector.methods.distance import DISE, OptiSim
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances
import numpy as np

# generate sample data
n = 5000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

X_dist = pairwise_distances(X, metric="euclidean")

collector = OptiSim(ref_index=0, p=2)

# profile the select function
pr = cProfile.Profile()
pr.enable()

collector.select(X_dist, size=200)

pr.disable()

# print stats sorted by cumulative time
stats = pstats.Stats(pr)
stats.strip_dirs()
stats.sort_stats("cumtime")  # can also use "tottime"
stats.print_stats(30)        # top 30 functions

The results point to the query_ball_point calls as the time-consuming step. I would recommend trying to refactor the algorithm method to reduce the number of query_ball_point calls or the size of their input.

        1274341 function calls (1272157 primitive calls) in 37.041 seconds

   Ordered by: cumulative time
   List reduced from 176 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000   37.041   18.521 interactiveshell.py:3663(run_code)
        2    0.000    0.000   37.041   18.521 {built-in method builtins.exec}
        1    0.000    0.000   37.041   37.041 base.py:38(select)
        1    0.000    0.000   37.041   37.041 distance.py:411(select_from_cluster)
        1    0.000    0.000   37.041   37.041 utils.py:35(optimize_radius)
       11    0.904    0.082   37.022    3.366 distance.py:339(algorithm)
     2195   28.383    0.013   28.398    0.013 _kdtree.py:486(query_ball_point)
     2204    3.321    0.002    4.251    0.002 _kdtree.py:359(__init__)
     2193    0.205    0.000    3.193    0.001 _kdtree.py:369(query)
    21871    0.039    0.000    2.307    0.000 threading.py:938(start)
   109356    2.230    0.000    2.230    0.000 {method 'acquire' of '_thread.lock' objects}
    21871    0.031    0.000    1.985    0.000 threading.py:604(wait)
    21871    0.044    0.000    1.933    0.000 threading.py:288(wait)
    15388    0.953    0.000    0.953    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     8794    0.026    0.000    0.912    0.000 fromnumeric.py:66(_wrapreduction)
     2204    0.004    0.000    0.505    0.000 fromnumeric.py:3127(amax)
    21871    0.017    0.000    0.427    0.000 threading.py:1080(join)
    21872    0.014    0.000    0.400    0.000 threading.py:1118(_wait_for_tstate_lock)
     2204    0.003    0.000    0.365    0.000 fromnumeric.py:3265(amin)
...
     6579    0.009    0.000    0.044    0.000 fromnumeric.py:48(_wrapfunc)
     2193    0.004    0.000    0.042    0.000 fromnumeric.py:3287(prod)
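One direction for the refactor suggested above, under the assumption that the cost is dominated by repeated single-point queries: SciPy's cKDTree.query_ball_point also accepts an (m, k) array of query points, so many neighborhoods can be collected in one vectorized call instead of one Python-level call per point. A sketch, not the OptiSim code:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.random((1000, 2))
tree = cKDTree(pts)

# one Python-level call per query point (what a per-point loop does):
per_point = [tree.query_ball_point(p, r=0.1) for p in pts[:5]]

# a single vectorized call over an array of query points:
batched = tree.query_ball_point(pts[:5], r=0.1)

# both forms return the same neighbor indices
assert all(sorted(a) == sorted(b) for a, b in zip(per_point, batched))
```

Whether this helps here depends on how the algorithm method structures its queries; batching removes per-call Python overhead but not the underlying tree traversal work.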

@dhruvDev23
Contributor Author

Hi @marco-2023 ,

You're right. Sorry, I benchmarked only with the raw feature data, not with the pairwise distance matrix. I see now that the base class SelectionBase.select() explicitly supports inputs in both formats.

Looking at the profiling results, there are two issues:

  • np.linalg.norm(x - x[idx]) over a (5000 × 5000) distance matrix treats each row as a 5000-dimensional vector, making it very slow.
  • query_ball_point is consuming a lot of time.
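The first point can be illustrated directly: with a precomputed distance matrix, the distances to a selected point are already stored as a row, so a norm over row differences both computes a different quantity and costs O(n²) per lookup. A small illustration with hypothetical variable names, not the selector code:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random((100, 2))  # raw feature data
# precomputed pairwise Euclidean distance matrix
x_dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

idx = 0
# with raw features, this is the intended distance to point idx:
feat_dists = np.linalg.norm(x - x[idx], axis=1)

# with the distance matrix, the same expression treats each
# 100-element row as a vector and computes something else entirely:
not_distances = np.linalg.norm(x_dist - x_dist[idx], axis=1)

# the correct (and O(n)) lookup is simply the matrix row:
row = x_dist[idx]

assert np.allclose(feat_dists, row)
assert not np.allclose(not_distances, row)
```

So a refactor likely needs to branch on the input format rather than applying the feature-space expression to both.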

I'll work on both of these issues and share a solution in a few days.

Thanks for your patience!

Dhruv
