Hi. The current training code produces NaN losses. The issue can be reproduced with the tutorial Colab notebook from the repo. Below is the output from the ridge regression example (it also happens for the others):
init dist
Not using distributed
ALL: Using device cuda.
DataLoader.__dict__ {'get_batch_method': functools.partial(get_batch_sequence(
<__main__.get_batch_for_ridge_regression (batch_size=2, seq_len=100, num_features=1, hyperparameters=None, device='cpu', **kwargs)
), num_features=1, hyperparameters=None), 'num_steps': 100, 'batch_shape_sampler_function': <bound method BatchShapeSamplerConfig.sample_batch_shape of BatchShapeSamplerConfig(batch_size=256, min_single_eval_pos=0, max_seq_len=20, min_num_features=1, max_num_features=1, fixed_num_test_instances=None, seed=42)>, 'num_workers': 0, 'persistent_workers': True, 'get_batch_kwargs': {'device': 'cuda', 'n_targets_per_input': 1}, 'epoch_count': 0, 'importance_sampling_infos': None}
DataLoader.__dict__ {'get_batch_method': functools.partial(get_batch_sequence(
<__main__.get_batch_for_ridge_regression (batch_size=2, seq_len=100, num_features=1, hyperparameters=None, device='cpu', **kwargs)
), num_features=1, hyperparameters=None), 'num_steps': 100, 'batch_shape_sampler_function': <bound method BatchShapeSamplerConfig.sample_batch_shape of BatchShapeSamplerConfig(batch_size=256, min_single_eval_pos=0, max_seq_len=20, min_num_features=1, max_num_features=1, fixed_num_test_instances=None, seed=42)>, 'num_workers': 0, 'persistent_workers': False, 'get_batch_kwargs': {'device': 'cuda', 'n_targets_per_input': 1}, 'epoch_count': 0, 'importance_sampling_infos': None}
Using linear y encoder, as no y_encoder was provided.
Using a Transformer with 14.14 M parameters
Checkpoint file None not found or load/save paths are identical and file doesn't exist. Starting from scratch.
-----------------------------------------------------------------------------------------
| end of epoch 1 | time: 11.80s | mean loss nan | lr 0.0 | data time 0.00 step time 0.11 forward time 0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share 0.10 ignore share (for classification tasks) nan
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch 2 | time: 13.49s | mean loss nan | lr 0.00015 | data time 0.00 step time 0.13 forward time 0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share 0.04 ignore share (for classification tasks) nan
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch 3 | time: 12.87s | mean loss nan | lr 0.0003 | data time 0.00 step time 0.12 forward time 0.01 | max gpu mem 1.0 GiB | gpu utilization 85.0 %| nan share 0.05 ignore share (for classification tasks) nan
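To narrow down where the NaNs come from, I would first check whether the generated batches themselves already contain non-finite values before they ever reach the model. A rough sketch, assuming the notebook's get_batch_for_ridge_regression can be called directly with the keyword arguments shown in the log above (the way I iterate over the returned batch's fields is a guess, since I don't know the exact batch type the repo returns):

```python
import torch

def report_nonfinite(name, t):
    """Print the share of NaN and Inf entries in a floating-point tensor."""
    if torch.is_tensor(t) and t.is_floating_point():
        nan_share = torch.isnan(t).float().mean().item()
        inf_share = torch.isinf(t).float().mean().item()
        print(f"{name}: nan share {nan_share:.4f}, inf share {inf_share:.4f}")

# Pull one batch the same way the tutorial does (arguments copied from the
# log above; get_batch_for_ridge_regression is defined in the notebook).
batch = get_batch_for_ridge_regression(batch_size=2, seq_len=100,
                                       num_features=1, hyperparameters=None,
                                       device='cpu')

# Field names are unknown here, so iterate over whatever the object exposes:
# a dataclass-like object via vars(), otherwise treat it as a plain tuple.
fields = vars(batch) if hasattr(batch, '__dict__') else dict(enumerate(batch))
for name, value in fields.items():
    report_nonfinite(str(name), value)
```

If the batches look clean, enabling torch.autograd.set_detect_anomaly(True) before the training loop should raise an error at the first operation that produces a NaN in the backward pass, which would at least point to where the loss goes bad.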