
Inconsistent GPU usage #100

Open · alpergel opened this issue Jan 7, 2025 · 10 comments

alpergel commented Jan 7, 2025

Hey @yzslab, I'm still having the issue of GPU usage spiking and causing non-optimal performance. Do you have any hypotheses that I can explore to try to fix this?
[attached screenshot of GPU usage]

yzslab commented Jan 7, 2025

Have you met the same issue on graphdeco-inria/gaussian-splatting?

alpergel commented Jan 7, 2025

nope!

alpergel commented Jan 7, 2025

If it helps, I think it's wherever tqdm is looping over the epoch iterations, since whenever it freezes, tqdm's current time and estimated time show 00:00<00:00.

yzslab commented Jan 7, 2025

I really have no idea about it currently, since I cannot reproduce it.

Or you can try upgrading PyTorch and Lightning:

conda create -yn gspl-torch25 python=3.12 pip
conda activate gspl-torch25

pip install -r requirements/pyt251_cu124.txt
# make sure you have installed CUDA 12.4, and the `nvcc -V` prints the right version
pip install -r requirements/lightning25.txt
pip install -r requirements/gsplat.txt
pip install -r requirements/fused-ssim.txt
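
For reference, a quick sanity check (just a sketch, not part of the repo) to confirm the new environment sees the GPU and the expected CUDA build:

import torch
print(torch.__version__)              # expect a 2.5.1 build
print(torch.version.cuda)             # expect "12.4"
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # the GPU that will be used for training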

yzslab commented Jan 7, 2025

Have you tried the simplest command, i.e. without specifying any extra options except the experiment name: python main.py fit --data.path ... -n ...?

alpergel commented Jan 7, 2025

> Have you tried the simplest command, i.e. without specifying any extra options except the experiment name: python main.py fit --data.path ... -n ...?

Still happening, unfortunately.

yzslab commented Jan 8, 2025

Try turning off the logger to see if it works:

  1. Comment out lines 367-370 here:
     self.logger.log_metrics(
         metrics_to_log,
         step=self.trainer.global_step,
     )
  2. Start training with --logger none
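
If you still want some metrics, a lighter-weight alternative (only a sketch against the snippet above; LOG_EVERY_N_STEPS is a made-up name, not an existing option) is to flush them every N steps instead of every step:

# Sketch only: throttle metric logging instead of removing it entirely.
# `metrics_to_log` comes from the surrounding training step, as in the snippet above.
LOG_EVERY_N_STEPS = 100  # hypothetical interval

if self.trainer.global_step % LOG_EVERY_N_STEPS == 0:
    self.logger.log_metrics(
        metrics_to_log,
        step=self.trainer.global_step,
    )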

@alpergel

Yes! That was it!!!

@alpergel

It's nearly 10 minutes faster! I highly suggest making that change an option by default, if that's possible.

yzslab commented Jan 16, 2025

Normally the logger should not have such a huge overhead.
The bottleneck may be your storage. Check whether the `wa` value shown by the `top` command is high.
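
For example, a small way to watch I/O wait from Python (a sketch assuming Linux and the psutil package, which is not a dependency of this repo):

import psutil  # assumed to be installed separately: pip install psutil

# Sample CPU I/O wait for ~10 seconds; a persistently high value means the
# run is stalling on storage rather than on the GPU.
for _ in range(10):
    cpu = psutil.cpu_times_percent(interval=1.0)  # blocks for 1 second
    print(f"iowait: {cpu.iowait:.1f}%")           # corresponds to `wa` in top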
