
NaN loss during training #28

@herilalaina

Hi. The current training code produces NaN losses. The issue can be reproduced with the tutorial Colab notebook from the repo. Below is the output from the ridge regression example (the other examples fail the same way):

init dist
Not using distributed
ALL: Using device cuda.
DataLoader.__dict__ {'get_batch_method': functools.partial(get_batch_sequence(
	<__main__.get_batch_for_ridge_regression (batch_size=2, seq_len=100, num_features=1, hyperparameters=None, device='cpu', **kwargs)
), num_features=1, hyperparameters=None), 'num_steps': 100, 'batch_shape_sampler_function': <bound method BatchShapeSamplerConfig.sample_batch_shape of BatchShapeSamplerConfig(batch_size=256, min_single_eval_pos=0, max_seq_len=20, min_num_features=1, max_num_features=1, fixed_num_test_instances=None, seed=42)>, 'num_workers': 0, 'persistent_workers': True, 'get_batch_kwargs': {'device': 'cuda', 'n_targets_per_input': 1}, 'epoch_count': 0, 'importance_sampling_infos': None}
DataLoader.__dict__ {'get_batch_method': functools.partial(get_batch_sequence(
	<__main__.get_batch_for_ridge_regression (batch_size=2, seq_len=100, num_features=1, hyperparameters=None, device='cpu', **kwargs)
), num_features=1, hyperparameters=None), 'num_steps': 100, 'batch_shape_sampler_function': <bound method BatchShapeSamplerConfig.sample_batch_shape of BatchShapeSamplerConfig(batch_size=256, min_single_eval_pos=0, max_seq_len=20, min_num_features=1, max_num_features=1, fixed_num_test_instances=None, seed=42)>, 'num_workers': 0, 'persistent_workers': False, 'get_batch_kwargs': {'device': 'cuda', 'n_targets_per_input': 1}, 'epoch_count': 0, 'importance_sampling_infos': None}
Using linear y encoder, as no y_encoder was provided.
Using a Transformer with 14.14 M parameters
Checkpoint file None not found or load/save paths are identical and file doesn't exist. Starting from scratch.
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 11.80s | mean loss   nan | lr 0.0 | data time  0.00 step time  0.11 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share  0.10 ignore share (for classification tasks)   nan 
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 13.49s | mean loss   nan | lr 0.00015 | data time  0.00 step time  0.13 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share  0.04 ignore share (for classification tasks)   nan 
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch   3 | time: 12.87s | mean loss   nan | lr 0.0003 | data time  0.00 step time  0.12 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 85.0 %| nan share  0.05 ignore share (for classification tasks)   nan 
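
To help narrow this down, here is a small diagnostic sketch that is not part of the repo. It assumes the reported mean loss is a plain average over per-step losses, which the log does not confirm; under that assumption a single NaN step is enough to turn the epoch mean into NaN. The helper name `summarize_losses` and the sample values are hypothetical, and the NaN fraction it computes is only an illustration, not necessarily what the trainer reports as "nan share".

```python
import math


def summarize_losses(step_losses):
    """Summarize per-step losses (hypothetical helper, not from the repo).

    Returns (plain_mean, nan_fraction, finite_mean).
    """
    n = len(step_losses)
    if n == 0:
        raise ValueError("step_losses must not be empty")

    # A plain average propagates NaN: one NaN step makes the whole mean NaN.
    plain_mean = sum(step_losses) / n

    # Fraction of steps whose loss is NaN.
    nan_fraction = sum(math.isnan(x) for x in step_losses) / n

    # Average over finite steps only, to see whether the remaining steps look sane.
    finite = [x for x in step_losses if math.isfinite(x)]
    finite_mean = sum(finite) / len(finite) if finite else float("nan")

    return plain_mean, nan_fraction, finite_mean


# Made-up values for illustration only, not taken from the log above.
losses = [0.9, 0.8, float("nan"), 0.7]
plain, frac, finite = summarize_losses(losses)
print(plain, frac, finite)
```

If the finite-only mean looks reasonable while the plain mean is NaN, only some steps are affected, which would point to specific batches rather than a fully diverged model.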
