
Whitening Update #79

Merged · 21 commits · Dec 2, 2024
Conversation

HarrisonSantiago
Collaborator

  • Whiten.py operates on data at the [N, D] level
  • Images.py is a high-level wrapper with additional support for whitening images
  • Renamed Patches.py -> Images.py
  • Updated paths in misc example notebooks
  • Created a new example notebook showing how to use whitening

@HarrisonSantiago HarrisonSantiago added the enhancement New feature or request label Nov 23, 2024
This was linked to issues Nov 23, 2024
@belsten
Collaborator

belsten commented Nov 27, 2024

I've noticed an issue with how the ZCA/PCA whitening is done.

As of right now, compute_whitening_stats masks out the eigenvalues/vectors based on n_components. That masking should occur after the inversion of the eigenvalues (i.e. in whiten), and the mask only needs to be applied to the eigenvalues. Right now, zeros are being inverted in whiten, resulting in NaNs.

One way to solve this is to pass n_components to whiten, which would also be better because it would then sit alongside epsilon, a parameter with a similar function. whiten would then perform the logic of keeping however many PCs via masking.

Here is a minimal working example that demonstrates the issue:

import numpy as np
import torch

A = torch.from_numpy(np.asarray([
    [1, 0.2, 0.4],
    [0.2, 1, 0.2],
    [0.4, 0.2, 1],
])).float()
data = torch.randn(10000, 3) @ A.T

# set n_components less than 3 and epsilon=0
whitened_stats = compute_whitening_stats(data, n_components=2)
whiten(data, algorithm="zca", stats=whitened_stats)

Output:

tensor([[nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan],
        ...,
        [nan, nan, nan],
        [nan, nan, nan],
        [nan, nan, nan]])
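A minimal sketch of the proposed fix (names and signatures here are illustrative, not the library's actual API): keep the full eigendecomposition in compute_whitening_stats, and apply the n_components mask to the already-inverted eigenvalues inside whiten, so dropped components become zeros instead of divisions by zero:

```python
import torch

def compute_whitening_stats(X):
    # Keep the full eigendecomposition; defer any truncation to whiten().
    eigvals, eigvecs = torch.linalg.eigh(torch.cov(X.T))
    # eigh returns ascending eigenvalues; flip to descending PC order.
    return {"mean": X.mean(0),
            "eigenvalues": eigvals.flip(0),
            "eigenvectors": eigvecs.flip(1)}

def whiten(X, stats, n_components=None, epsilon=0.0):
    vals, vecs = stats["eigenvalues"], stats["eigenvectors"]
    inv_sqrt = 1.0 / torch.sqrt(vals + epsilon)
    if n_components is not None:
        # Mask only the (already inverted) eigenvalues: discarded
        # components become 0 instead of 1/0 = inf, so no NaNs appear.
        mask = torch.zeros_like(inv_sqrt)
        mask[:n_components] = 1.0
        inv_sqrt = inv_sqrt * mask
    Xc = X - stats["mean"]
    # ZCA: rotate into the eigenbasis, rescale, rotate back.
    return Xc @ vecs @ torch.diag(inv_sqrt) @ vecs.T
```

With this ordering, the n_components=2 example above returns finite values: the dropped component is projected out rather than turned into NaNs.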

Collaborator

@belsten belsten left a comment

See latest comment regarding PCA/ZCA whitening.

@HarrisonSantiago
Collaborator Author

Thanks for catching that! Moved the masking into whiten, and it now masks only the eigenvalues. Made corresponding changes in the accompanying notebook and images.py.

Collaborator

I think this could be unstable if the covariance matrix is ill-conditioned? Since L is triangular, could you use https://pytorch.org/docs/stable/generated/torch.linalg.solve_triangular.html?

Collaborator Author

@HarrisonSantiago HarrisonSantiago Nov 27, 2024

Is switching to cholesky-stable what you were thinking? https://colab.research.google.com/drive/1W8FdF-K6hYE3W0wOmRSGVQv9kPLaRH18?usp=sharing

Collaborator

Thanks for making the notebook, I should've been a good reviewer like Alex and given you an example to start with 😅.

What I had in mind initially was
X_whitened = torch.linalg.solve_triangular(L, X.T, upper=False).T
to solve for the whitened data directly, but I think what you have works as well, plus it's consistent with the other implementations. And maybe it's nice to compute W explicitly if we want to return it in the future (e.g. to whiten other similar data).

Looks like the errors are actually not too far off for the two methods, but using solve_triangular is ~2x faster (added some timing code): https://colab.research.google.com/drive/1CIfNaWEbOfU9R8WeZYEQ4XepJU6ZWIRy?usp=sharing

And here's a nice writeup I found about matrix inversions in practice: https://gregorygundersen.com/blog/2020/12/09/matrix-inversion/
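For reference, the solve_triangular approach can be sketched end to end (a standalone demo using the covariance from the earlier example, not the package's API):

```python
import torch

torch.manual_seed(0)
A = torch.tensor([[1.0, 0.2, 0.4], [0.2, 1.0, 0.2], [0.4, 0.2, 1.0]])
X = torch.randn(10000, 3) @ A.T
X = X - X.mean(0)

# Cholesky factor of the covariance: cov = L @ L.T, with L lower-triangular.
L = torch.linalg.cholesky(torch.cov(X.T))

# Whiten by solving L @ Y = X.T rather than forming L^-1 explicitly;
# a triangular solve is cheaper and avoids an explicit matrix inverse.
X_whitened = torch.linalg.solve_triangular(L, X.T, upper=False).T

# The covariance of the whitened data should be close to the identity.
cov_w = torch.cov(X_whitened.T)
```

Since the whitener is built from the same empirical covariance it is applied to, cov_w matches the identity up to floating-point error.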

Collaborator

I also agree that an option to return the whitening matrix would be useful. It's helpful if you're converting between whitened/unwhitened data or want to visualize the whitening transform.

Collaborator Author

Thanks for sharing the write-up! I went with solve_triangular, but not solving for X_whitened directly; I like having the option of returning the whitening matrix for all of the methods.
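One way to get the whitening matrix explicitly while still avoiding a general matrix inverse (an illustrative sketch, not necessarily the merged implementation) is to run the triangular solve against the identity:

```python
import torch

torch.manual_seed(0)
A = torch.tensor([[1.0, 0.2, 0.4], [0.2, 1.0, 0.2], [0.4, 0.2, 1.0]])
X = torch.randn(5000, 3) @ A.T
X = X - X.mean(0)
L = torch.linalg.cholesky(torch.cov(X.T))

# Solve L @ W = I with a triangular solve to get W = L^-1 without
# calling torch.linalg.inv; W can be returned and reused on new data.
W = torch.linalg.solve_triangular(L, torch.eye(3), upper=False)
X_whitened = X @ W.T
```

This keeps the numerical benefit of the triangular solve while making W available to callers, e.g. to whiten other similar data or to invert the transform.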

def whiten(X: torch.Tensor,
           algorithm: str = 'zca',
           stats: Dict = None,
           n_components=None,
Collaborator

nit

Suggested change:
- n_components=None,
+ n_components = None,

Collaborator Author

Done!

----------
Whitened data of shape [N, D]

Notes
Collaborator

I think it would be helpful for documentation to

  • add a few words (super short) on why someone would choose PCA or Cholesky over the default ZCA, then direct them to the paper or stackexchange for more details.
  • add a short example usage in a code block that calls compute_whitening_stats and then whiten (not sure how to format this for Sphinx @belsten). Maybe this is excessive, but I think we should be clear that these are separate functions. Or we could mention first calling compute_whitening_stats in the stats parameter field.

What do you think?
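On the Sphinx question: one common option is a NumPy-style Examples section in the docstring, which sphinx.ext.napoleon renders directly (the signature below is assumed from this thread, and the body is omitted):

```python
def whiten(X, algorithm="zca", stats=None, n_components=None):
    """Whiten data of shape [N, D].

    Examples
    --------
    >>> stats = compute_whitening_stats(data)
    >>> whitened = whiten(data, algorithm="zca", stats=stats)
    """
```

The doctest-style lines make the two-step call sequence explicit right in the rendered docs.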

Collaborator Author

I cover all of that in the whitening notebook that'll be added to the documentation. What about adding that notebook to the references to encourage people to check it out?

Collaborator

A pointer to the notebook would work instead IMO! And it avoids bloat in the docstring.

Collaborator Author

Done!

Collaborator

@9q9q 9q9q Nov 27, 2024

Please also add tests for compute_whitening_stats and whitening using PCA and Cholesky.
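A property such tests might check (a standalone sketch that inlines a reference Cholesky whitener rather than importing the package): the covariance of whitened data should be close to the identity.

```python
import torch

def cholesky_whiten(X):
    # Reference whitener used only for this sketch; the real tests would
    # import compute_whitening_stats/whiten from the package instead.
    Xc = X - X.mean(0)
    L = torch.linalg.cholesky(torch.cov(Xc.T))
    return torch.linalg.solve_triangular(L, Xc.T, upper=False).T

def test_whitened_covariance_is_identity():
    torch.manual_seed(0)
    A = torch.tensor([[1.0, 0.2, 0.4], [0.2, 1.0, 0.2], [0.4, 0.2, 1.0]])
    X = torch.randn(10000, 3) @ A.T
    Xw = cholesky_whiten(X)
    assert torch.allclose(torch.cov(Xw.T), torch.eye(3), atol=1e-4)
```

The same assertion works for ZCA and PCA whitening, since all three should produce unit covariance (PCA up to the retained components).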

Collaborator Author

Done!

@HarrisonSantiago HarrisonSantiago merged commit 08068b8 into main Dec 2, 2024
2 checks passed
@HarrisonSantiago HarrisonSantiago deleted the harrison/whiten branch December 2, 2024 23:57
Labels
enhancement New feature or request

Successfully merging this pull request may close these issues:
Image Whitening implement torchvision data transform