Mini-batch training on GMM #19

Open
Daisy-GENG opened this issue May 24, 2022 · 3 comments
Comments

@Daisy-GENG

Hi,

I want to implement mini-batch training for a GMM as discussed in #7. However, I am a little confused by the code gmm.reset_parameters(torch.Tensor(fvectors[:500].astype(np.float32))). I am not sure whether the problem is related to my version of pycave or whether I have misunderstood the code in #7, but my code doesn't work.

My code is as follows:

import torch
from pycave.bayes.gmm import GaussianMixture as GM
from dataloader.gmm_dataset import gmm_dataset

train_gmm_dataset = gmm_dataset(data_path)
train_dataset_loader = torch.utils.data.DataLoader(dataset=train_gmm_dataset,
                                                        batch_size=train_dataloader_config["batch_size"],
                                                        shuffle=train_dataloader_config["shuffle"],
                                                        num_workers=train_dataloader_config["num_workers"])

for i, data in enumerate(train_dataset_loader):  # data:[1, pt, 3]
    data = torch.squeeze(data, 0)
    gmm = GM(num_components=2, covariance_type="diag", init_strategy="kmeans")
    gmm.model_.reset_parameters(data)  
    history = gmm.fit(train_dataset_loader)

And the error is:

`GaussianMixture` has not been fitted yet

Thank you so much!

Best regards,
Daisy

@borchero
Owner

borchero commented May 24, 2022

Issue #7 still referred to PyCave version 2. In PyCave v3, you don't need to call gmm.model_.reset_parameters: the model_ attribute will only be available once fit has returned without error.

I believe this is the line that causes your error.

@Daisy-GENG
Author

So is there a similar way to implement batch training in PyCave version 3 using a dataloader? My dataset is large, so I cannot load all of it into memory at once.

Thank you so much!

Best regards,
Daisy

@borchero
Owner

Ah, sorry! Yes, you can simply set the batch size when initializing the GMM. In your case, you might, for example, use:

gmm = GM(..., batch_size=8192)

This will automatically take care of loading the data in batches, both for initialization and for GMM training. Note that you might be better off with init_strategy='kmeans++', since kmeans is quite costly to run. You'll need PyCave 3.1.3 for that, though (kmeans++ initialization had a bug before that version).
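
For intuition about why batched loading need not change the result of GMM training, here is a minimal sketch in plain Python. It is a generic illustration and not PyCave's code: it assumes that one EM pass accumulates responsibility-weighted sums batch by batch and only then updates the parameters, which is a common way to run EM on data that does not fit in memory. Whether PyCave does exactly this internally is not stated in this thread. All names (accumulate, m_step, em_epoch) and the 1-D toy data are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def accumulate(batch, weights, means, variances, stats):
    """E-step on one batch: add responsibility-weighted sums to stats in place."""
    for x in batch:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        for k, like in enumerate(likes):
            r = like / total          # responsibility of component k for x
            stats[k][0] += r          # sum of responsibilities
            stats[k][1] += r * x      # weighted sum of x
            stats[k][2] += r * x * x  # weighted sum of x^2

def m_step(stats, num_points):
    """M-step: turn the accumulated sums into new parameters."""
    weights = [s[0] / num_points for s in stats]
    means = [s[1] / s[0] for s in stats]
    variances = [max(s[2] / s[0] - (s[1] / s[0]) ** 2, 1e-6) for s in stats]
    return weights, means, variances

def em_epoch(data, params, batch_size):
    """One EM pass that only ever looks at batch_size points at a time."""
    weights, means, variances = params
    stats = [[0.0, 0.0, 0.0] for _ in weights]
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        accumulate(batch, weights, means, variances, stats)
    return m_step(stats, len(data))

# Toy data: two clusters around -2 and +2 (made up for illustration).
data = [-2.1, -1.9, -2.0, -1.8, 2.0, 2.2, 1.9, 2.1]
init = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])

full = em_epoch(data, init, batch_size=len(data))  # everything in one batch
batched = em_epoch(data, init, batch_size=3)       # same pass in small batches
print(full[1], batched[1])
```

Under this assumption, the batch size only controls how much data is processed at once. The parameter update is computed from the same totals either way, so the batch size can be chosen by memory budget alone.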
