
Inconsistent GPU usage #100

Open · alpergel opened this issue Jan 7, 2025 · 10 comments

alpergel commented Jan 7, 2025

Hey @yzslab, I'm still having the issue of GPU usage spiking and causing non-optimal performance. Do you have any hypotheses that I can explore to try to fix this?
[attached screenshot of GPU usage]

yzslab commented Jan 7, 2025

Have you met the same issue on graphdeco-inria/gaussian-splatting?

alpergel commented Jan 7, 2025

nope!

alpergel commented Jan 7, 2025

If it helps, I think it's wherever tqdm is looping over the epoch iterations, since whenever it freezes, tqdm's current time and estimated time show 00:00<00:00.

yzslab commented Jan 7, 2025

I really have no idea about it currently, since I cannot reproduce it.

Or you can try upgrading PyTorch and Lightning:

conda create -yn gspl-torch25 python=3.12 pip
conda activate gspl-torch25

pip install -r requirements/pyt251_cu124.txt
# make sure you have installed CUDA 12.4, and the `nvcc -V` prints the right version
pip install -r requirements/lightning25.txt
pip install -r requirements/gsplat.txt
pip install -r requirements/fused-ssim.txt
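
For reference, a quick sanity check (just a sketch, not part of the repo) to confirm the new environment sees the GPU and the expected CUDA build:

import torch
print(torch.__version__)              # expect a 2.5.1 build
print(torch.version.cuda)             # expect "12.4"
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # the GPU that will be used for training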

yzslab commented Jan 7, 2025

Have you tried the simplest command, i.e. without specifying any extra options except the experiment name: python main.py fit --data.path ... -n ...?

alpergel commented Jan 7, 2025

> Have you tried the simplest command, i.e. without specifying any extra options except the experiment name: python main.py fit --data.path ... -n ...?

Still happening, unfortunately.

yzslab commented Jan 8, 2025

Try turning off the logger to see if it works:

  1. Comment out lines 367-370 here:
     self.logger.log_metrics(
         metrics_to_log,
         step=self.trainer.global_step,
     )
  2. Start training with --logger none
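
If you still want some metrics, a lighter-weight alternative (only a sketch against the snippet above; LOG_EVERY_N_STEPS is a made-up name, not an existing option) is to flush them every N steps instead of every step:

# Sketch only: throttle metric logging instead of removing it entirely.
# `metrics_to_log` comes from the surrounding training step, as in the snippet above.
LOG_EVERY_N_STEPS = 100  # hypothetical interval

if self.trainer.global_step % LOG_EVERY_N_STEPS == 0:
    self.logger.log_metrics(
        metrics_to_log,
        step=self.trainer.global_step,
    )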

@alpergel

Yes! That was it!!!

@alpergel

It's nearly 10 minutes faster! I highly suggest making that change an option by default, if that's possible.

yzslab commented Jan 16, 2025

Normally the logger should not have such a huge overhead.
The bottleneck may be your storage. Check whether the `wa` value shown by the `top` command is high.
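
For example, a small way to watch I/O wait from Python (a sketch assuming Linux and the psutil package, which is not a dependency of this repo):

import psutil  # assumed to be installed separately: pip install psutil

# Sample CPU I/O wait for ~10 seconds; a persistently high value means the
# run is stalling on storage rather than on the GPU.
for _ in range(10):
    cpu = psutil.cpu_times_percent(interval=1.0)  # blocks for 1 second
    print(f"iowait: {cpu.iowait:.1f}%")           # corresponds to `wa` in top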
