Optimization of the OptiSim algorithm (For review) #289

dhruvDev23 wants to merge 1 commit into theochem:main
Conversation
Hi @dhruvDev23, thank you very much for your PR. I ran a slightly bigger example:

```python
import time
import numpy as np
from selector.methods.distance import DISE
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)
X_dist = pairwise_distances(X, metric="euclidean")

collector = DISE(ref_index=0, p=2)

# number of runs
n_runs = 10
times = []
for _ in range(n_runs):
    start = time.perf_counter()
    collector.select(X_dist, size=25)
    end = time.perf_counter()
    times.append(end - start)

times = np.array(times)
print(f"Mean runtime: {times.mean():.4f} s")
print(f"Std dev: {times.std():.4f} s")
print(f"Min runtime: {times.min():.4f} s")
print(f"Max runtime: {times.max():.4f} s")
```

For the original code, the times were: […] For this PR: […]

Unfortunately, this PR does not improve performance in a meaningful way, nor does it enhance the readability or simplicity of the current codebase.

I profiled the example with:

```python
import cProfile
import pstats
import numpy as np
from selector.methods.distance import DISE
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)
X_dist = pairwise_distances(X, metric="euclidean")

collector = DISE(ref_index=0, p=2)

# profile the select function
pr = cProfile.Profile()
pr.enable()
collector.select(X_dist, size=25)
pr.disable()

# print stats sorted by cumulative time
stats = pstats.Stats(pr)
stats.strip_dirs()
stats.sort_stats("cumtime")  # can also use "tottime"
stats.print_stats(30)  # top 30 functions
```

The results indicate that most of the time is spent by the […] function.
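As a side note on the profiling workflow above: `pstats.Stats.print_stats` also accepts string arguments that are treated as regular-expression filters, which makes it easy to narrow the report to a suspected hotspot instead of scanning the top 30 entries. A self-contained sketch (the `busy` workload here is just a stand-in for `collector.select`):

```python
import cProfile
import io
import pstats

def busy():
    # trivial workload so the profile has something to report
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
busy()
pr.disable()

# capture the report into a string buffer instead of stdout
buf = io.StringIO()
stats = pstats.Stats(pr, stream=buf)
stats.strip_dirs().sort_stats("tottime")
# string arguments to print_stats act as regex filters on the
# "file:line(function)" descriptor, so only matching rows are shown
stats.print_stats("busy")
report = buf.getvalue()
```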
For the reasons above, I will be closing this PR.
Hi @marco-2023, thank you for reviewing my PR and for profiling with the example. I'd like to clarify a point regarding the benchmarking. I used the same example to measure the runtime of the OptiSim class:

```python
import time
import numpy as np
from selector.methods.distance import OptiSim
from sklearn.datasets import make_blobs

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)

# number of runs
n_runs = 10

def bench(collector, data, label):
    times = []
    for _ in range(n_runs):
        collector.r = collector.r0  # reset r to its initial value between runs
        start = time.perf_counter()
        collector.select(data, size=25)
        end = time.perf_counter()
        times.append(end - start)
    times = np.array(times)
    print(f"Mean runtime: {times.mean():.4f} s")
    print(f"Std dev: {times.std():.4f} s")
    print(f"Min runtime: {times.min():.4f} s")
    print(f"Max runtime: {times.max():.4f} s")

bench(OptiSim(ref_index=0, p=2), X, "OptiSim")
```

Before optimization: […] After optimization: […]

Please let me know if this approach seems reasonable. Since the profiling results show that most of the time is spent by the […] function, I would appreciate any strategies/ideas you would suggest for the optimization.

Thanks again for your help.
Hi @dhruvDev23, thank you for pointing that out. At the same time, I find your results interesting.

```python
import time
import numpy as np
from selector.methods.distance import DISE, OptiSim
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import pairwise_distances

# generate sample data
n = 2000
X, labels = make_blobs(
    n_samples=n,
    n_features=2,
    centers=np.array([[0.0, 0.0]]),
    random_state=42,
)
X_dist = pairwise_distances(X, metric="euclidean")

collector = OptiSim(ref_index=0, p=2)

# number of runs
n_runs = 10
times = []
for _ in range(n_runs):
    start = time.perf_counter()
    collector.select(X_dist, size=25)
    end = time.perf_counter()
    times.append(end - start)

times = np.array(times)
print(f"Mean runtime: {times.mean():.4f} s")
print(f"Std dev: {times.std():.4f} s")
print(f"Min runtime: {times.min():.4f} s")
print(f"Max runtime: {times.max():.4f} s")
```

The times I get are: […] This branch: […], which is the opposite trend. This remains the case if I select a sample of 50 instead. This branch: […]
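Since the two benchmarks in this thread pass different inputs (raw features in one, a precomputed distance matrix in the other), it may help to time both cases with a single harness. Below is a minimal sketch; `median_time` is a hypothetical helper, and the commented usage assumes the `selector` package and the `X`/`X_dist` arrays from the example above:

```python
import time
import numpy as np

def median_time(fn, *args, repeats=5, **kwargs):
    """Median wall-clock runtime of fn(*args, **kwargs) over `repeats` calls."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        samples.append(time.perf_counter() - t0)
    return float(np.median(samples))

# usage (assumes selector is installed and X / X_dist are defined):
# collector = OptiSim(ref_index=0, p=2)
# t_raw = median_time(collector.select, X, size=25)
# t_dist = median_time(collector.select, X_dist, size=25)
```

Using the median rather than the mean makes the comparison less sensitive to one-off slow runs (e.g. first-call caching effects).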
I ran the same profile for OptiSim (although with bigger data and sample sizes): […] The results point to the […] function.
Hi @marco-2023, you're right. Sorry, I didn't benchmark with the pairwise distance matrix; I only tested with the raw feature data. I can see that the base class […]. Looking at the profiling results, there are two issues: […]

I'll work on both of these issues and share a solution in a few days. Thanks for your patience!

Dhruv
This PR optimizes the `algorithm()` method of the `OptiSim` class in `selector/methods/distance.py`.

**Original Implementation**

[…]

**After Optimization**

- Maintains a `min_dists` array that stores the minimum distances between each point and the nearest selected point.
- Updates the `min_dists` array after selection of the new candidate.

**All Test Cases Passed**
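The incremental bookkeeping described above can be sketched as follows. This is a minimal illustration of the `min_dists` idea, not the actual `OptiSim.algorithm()` implementation; the function `greedy_select` and its signature are hypothetical:

```python
import numpy as np

def greedy_select(dist_matrix, size, ref_index=0):
    """Pick `size` points, always taking the one farthest from the
    current selection, using an incrementally updated `min_dists`
    array instead of recomputing distances to the whole selected
    set on every iteration."""
    selected = [ref_index]
    # distance from every point to its nearest selected point;
    # initially that is just the distance to the reference point
    min_dists = dist_matrix[ref_index].copy()
    while len(selected) < size:
        candidate = int(np.argmax(min_dists))
        selected.append(candidate)
        # O(n) update: only the new candidate can lower a minimum
        np.minimum(min_dists, dist_matrix[candidate], out=min_dists)
    return selected
```

Each selection then costs O(n) rather than O(n·k) for a selected set of size k, which is the source of the expected speed-up.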
**Analysis of different input data shapes**

Before Optimization: *(runtime plot image not captured)*

After Optimization: *(runtime plot image not captured)*