The kCenterGreedy step takes ~180s on my setup; it utilizes only 16% of the CPU during this step, with one core at 100% utilization (top-right subgraph in the screenshot; the spikes in the other graphs are due to the CNN step executed right before this).

That is a lot of time for my use case, and I am looking for ways to make this step faster (ideally <75s). Since each iteration requires access to the previous iteration's result, multiprocessing isn't really helping: it does bring the time down from 180s to 75s, but the result isn't desirable since it then only takes the distance of a single image into account.

Do you have any insights on how this step can be made faster?
```python
import numpy as np
import rich.progress
import sklearn.metrics


class KCenterGreedy:
    def __init__(self, X, y, seed, metric="euclidean"):
        self.X = X
        self.y = y
        self.name = "kcenter"
        self.metric = metric
        self.min_distances = None
        self.n_obs = self.X.shape[0]
        self.already_selected = []

    def update_distances(self, cluster_centers, only_new=True, reset_dist=False):
        if reset_dist:
            self.min_distances = None
        if only_new:
            cluster_centers = [d for d in cluster_centers if d not in self.already_selected]
        if cluster_centers:
            # Update min_distances for all examples given the new cluster centers.
            x = self.features[cluster_centers]
            dist = sklearn.metrics.pairwise_distances(self.features, x, metric=self.metric)
            if self.min_distances is None:
                self.min_distances = np.min(dist, axis=1).reshape(-1, 1)
            else:
                self.min_distances = np.minimum(self.min_distances, dist)

    def select_batch(self, model, already_selected, N, **kwargs):
        # Assumes that the transform function takes in original data and not flattened data.
        if model is not None:
            self.features = model.transform(self.X)
        else:
            self.features = self.X.reshape((self.X.shape[0], -1))
        # Compute distances to the centers that are already selected.
        self.update_distances(already_selected, only_new=False, reset_dist=True)
        # Initialize sampling results.
        new_batch = []
        for _ in rich.progress.track(range(N), description="Sampling..."):
            if self.already_selected is None:
                # Initialize centers with a randomly selected datapoint.
                ind = np.random.choice(np.arange(self.n_obs))
            else:
                # Otherwise, pick the point farthest from all current centers.
                ind = np.argmax(self.min_distances)
            # New examples should not be in already_selected since those points
            # have a min_distance of zero to a cluster center.
            assert ind not in already_selected
            self.update_distances([ind], only_new=True, reset_dist=False)
            new_batch.append(ind)
        # Memorize the already selected indices.
        self.already_selected = already_selected
        return new_batch
```
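For reference, this is roughly how I drive the class (the array sizes below are placeholders for illustration, not my real data; `model=None` just means the raw feature array is used as-is):

```python
import numpy as np

# Placeholder feature matrix: one embedding vector per image/patch.
features = np.random.rand(20000, 1536).astype(np.float32)

sampler = KCenterGreedy(X=features, y=None, seed=0)

# Seed the selection with one random index, then greedily pick N more centers.
initial = [int(np.random.choice(len(features)))]
selected = sampler.select_batch(model=None, already_selected=initial, N=1000)
```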
Thank you for opening the issue! Yes, your point is a good one. As for KCenterGreedy, I just used Google's original implementation and did not apply any improvements.
> Since each iteration requires access to the previous iteration's result, multiprocessing isn't helping.
Yes, you're right. But I think we can still parallelize the distance update after a cluster center has been selected. To be specific:
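A rough sketch of that idea, assuming we pass the `n_jobs` argument of `sklearn.metrics.pairwise_distances` (which dispatches the work through joblib) inside `update_distances`:

```python
def update_distances(self, cluster_centers, only_new=True, reset_dist=False):
    if reset_dist:
        self.min_distances = None
    if only_new:
        cluster_centers = [d for d in cluster_centers if d not in self.already_selected]
    if cluster_centers:
        x = self.features[cluster_centers]
        # n_jobs=4 splits the pairwise distance computation across 4 joblib workers;
        # only this per-center update is parallelized, the greedy loop stays sequential.
        dist = sklearn.metrics.pairwise_distances(
            self.features, x, metric=self.metric, n_jobs=4
        )
        if self.min_distances is None:
            self.min_distances = np.min(dist, axis=1).reshape(-1, 1)
        else:
            self.min_distances = np.minimum(self.min_distances, dist)
```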
In the above code, I specified 4 as the degree of parallelism, but you seem to have a lot of CPU cores, so you can increase that number. If you have already configured joblib's parallelism somewhere else, the modification above may have no effect, sorry...