Make kCenterGreedy faster #8

Open
DHAiRYA2048 opened this issue Mar 20, 2024 · 1 comment
The kCenterGreedy step takes ~180s given:

  • a training set of 84 images of size 150x150
  • an i7 processor with 20 logical cores
  • 16% overall CPU utilization during this step, with 1 core at 100% (top-right subgraph; the spikes in the other graphs come from the CNN step executed right before this)

[screenshot: CPU utilization graphs during the kCenterGreedy step]

which is a lot of time for my use case. I am looking for ways to make this step faster (ideally <75s). Since each iteration requires access to the previous iteration's result, multiprocessing isn't helping: it does bring the runtime from 180s down to 75s, but the result isn't usable, since each worker only takes into account the distance to a single image.

Do you have any insights on how this step could be made faster?

```python
import numpy as np
import rich.progress
import sklearn.metrics


class KCenterGreedy:

    def __init__(self, X, y, seed, metric="euclidean"):
        self.X = X
        self.y = y
        self.name = "kcenter"
        self.metric = metric
        self.min_distances = None
        self.n_obs = self.X.shape[0]
        self.already_selected = []

    def update_distances(self, cluster_centers, only_new=True, reset_dist=False):

        if reset_dist:
            self.min_distances = None

        if only_new:
            cluster_centers = [d for d in cluster_centers if d not in self.already_selected]

        if cluster_centers:

            # Update min_distances for all examples given the new cluster centers.
            x = self.features[cluster_centers]
            dist = sklearn.metrics.pairwise_distances(self.features, x, metric=self.metric)

            if self.min_distances is None:
                self.min_distances = np.min(dist, axis=1).reshape(-1, 1)
            else:
                self.min_distances = np.minimum(self.min_distances, dist)

    def select_batch(self, model, already_selected, N, **kwargs):

        # Assumes that the transform function takes in original data and not flattened data.
        if model is not None:
            self.features = model.transform(self.X)
        else:
            self.features = self.X.reshape((self.X.shape[0], -1))

        # Compute distances to the already selected centers.
        self.update_distances(already_selected, only_new=False, reset_dist=True)

        # Initialize sampling results.
        new_batch = []

        for _ in rich.progress.track(range(N), description="Sampling..."):

            # If no center has been selected yet, start from a random datapoint.
            if self.min_distances is None:
                ind = np.random.choice(np.arange(self.n_obs))

            # Otherwise, pick the point farthest from its nearest center.
            else:
                ind = np.argmax(self.min_distances)

            # New examples should not be in already_selected since those points
            # have a min_distance of zero to a cluster center.
            assert ind not in already_selected

            self.update_distances([ind], only_new=True, reset_dist=False)

            new_batch.append(ind)

        # Memorize the already selected indices.
        self.already_selected = already_selected

        return new_batch
```
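For context, each iteration only adds a single center, so the update only needs the distances from that one center to every point. A minimal NumPy-only sketch of such an update for the euclidean metric (the function name and signature below are my own, not from the repo) would be:

```python
import numpy as np

def update_min_distances(features, min_distances, center_ind):
    """Update each point's distance to its nearest center after adding one center.

    features:      (n_obs, n_dim) array of flattened or transformed samples.
    min_distances: (n_obs,) array of distances to the nearest center so far,
                   or None if no center has been selected yet.
    center_ind:    index of the newly selected center.
    """
    # Distances from every point to the single new center: O(n_obs * n_dim).
    diff = features - features[center_ind]
    dist = np.sqrt(np.einsum("ij,ij->i", diff, diff))

    if min_distances is None:
        return dist
    return np.minimum(min_distances, dist)
```

This would avoid sklearn's per-call overhead inside the loop, though I don't know how much it helps in practice.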

@tiskw tiskw self-assigned this Mar 22, 2024
tiskw commented Mar 22, 2024

@DHAiRYA2048,

Thank you for opening the issue! You make a very good point. As for KCenterGreedy, I just used Google's original implementation and did not apply any improvements.

> Since each iteration requires access to the previous iteration's result, multiprocessing isn't helping.

Yes, you're right. However, I think we can still parallelize the distance update that runs after each cluster center is selected. To be specific:

Before (L.84 of knnsearch.py):

```python
dist = sklearn.metrics.pairwise_distances(self.features, x, metric=self.metric)
```

After:

```python
dist = sklearn.metrics.pairwise_distances(self.features, x, metric=self.metric, n_jobs=4)
```

In the code above I set the degree of parallelism to 4, but since you have many CPU cores, you can increase that number. Note that if you have already configured joblib's parallelism elsewhere, this modification may have no effect, sorry...
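If you want to check whether `n_jobs` actually helps on your machine, a rough standalone timing like the following might be useful (toy data shaped roughly like your 84 flattened 150x150 images; the values of `n_jobs` are illustrative only):

```python
import time

import numpy as np
import sklearn.metrics

# Toy data roughly matching the setup in this issue:
# 84 images of 150x150 pixels, flattened into feature vectors.
features = np.random.rand(84, 150 * 150)
x = features[:1]  # one newly selected cluster center

for n_jobs in (1, 4, 16):
    start = time.perf_counter()
    for _ in range(100):  # repeat the call to get a stable measurement
        sklearn.metrics.pairwise_distances(features, x, metric="euclidean", n_jobs=n_jobs)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.3f}s")
```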
