NaN loss during training #28

Hi. The current training code produces NaN losses. The issue can be reproduced with the tutorial [Colab notebook](https://colab.research.google.com/drive/12YpI99LkuFeWcuYHt_idl142DqX7AaJf) from the repo. Below is the output from the ridge regression example (the other examples show the same behavior):

```text
init dist
Not using distributed
ALL: Using device cuda.
DataLoader.__dict__ {'get_batch_method': functools.partial(get_batch_sequence(
	<__main__.get_batch_for_ridge_regression (batch_size=2, seq_len=100, num_features=1, hyperparameters=None, device='cpu', **kwargs)
), num_features=1, hyperparameters=None), 'num_steps': 100, 'batch_shape_sampler_function': <bound method BatchShapeSamplerConfig.sample_batch_shape of BatchShapeSamplerConfig(batch_size=256, min_single_eval_pos=0, max_seq_len=20, min_num_features=1, max_num_features=1, fixed_num_test_instances=None, seed=42)>, 'num_workers': 0, 'persistent_workers': True, 'get_batch_kwargs': {'device': 'cuda', 'n_targets_per_input': 1}, 'epoch_count': 0, 'importance_sampling_infos': None}
DataLoader.__dict__ {'get_batch_method': functools.partial(get_batch_sequence(
	<__main__.get_batch_for_ridge_regression (batch_size=2, seq_len=100, num_features=1, hyperparameters=None, device='cpu', **kwargs)
), num_features=1, hyperparameters=None), 'num_steps': 100, 'batch_shape_sampler_function': <bound method BatchShapeSamplerConfig.sample_batch_shape of BatchShapeSamplerConfig(batch_size=256, min_single_eval_pos=0, max_seq_len=20, min_num_features=1, max_num_features=1, fixed_num_test_instances=None, seed=42)>, 'num_workers': 0, 'persistent_workers': False, 'get_batch_kwargs': {'device': 'cuda', 'n_targets_per_input': 1}, 'epoch_count': 0, 'importance_sampling_infos': None}
Using linear y encoder, as no y_encoder was provided.
Using a Transformer with 14.14 M parameters
Checkpoint file None not found or load/save paths are identical and file doesn't exist. Starting from scratch.
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 11.80s | mean loss   nan | lr 0.0 | data time  0.00 step time  0.11 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share  0.10 ignore share (for classification tasks)   nan 
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch   2 | time: 13.49s | mean loss   nan | lr 0.00015 | data time  0.00 step time  0.13 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share  0.04 ignore share (for classification tasks)   nan 
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch   3 | time: 12.87s | mean loss   nan | lr 0.0003 | data time  0.00 step time  0.12 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 85.0 %| nan share  0.05 ignore share (for classification tasks)   nan 
```
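In case it helps with triage, here is a minimal sketch for checking a captured log for the first epoch with a NaN mean loss. It only assumes the per-epoch summary format shown in the output above; `parse_epoch_line` and `first_nan_epoch` are hypothetical helper names of mine, not part of the repo.

```python
import math
import re

# Regex for the per-epoch summary lines in the log above.
# Assumption: the field order and labels match the pasted output.
EPOCH_LINE = re.compile(
    r"end of epoch\s+(?P<epoch>\d+)\s*\|.*?"
    r"mean loss\s+(?P<loss>\S+)\s*\|\s*"
    r"lr\s+(?P<lr>\S+)\s*\|.*?"
    r"nan share\s+(?P<nan_share>\S+)"
)


def parse_epoch_line(line):
    """Return the epoch stats as a dict, or None for non-summary lines."""
    match = EPOCH_LINE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "mean_loss": float(match["loss"]),  # float() accepts the string "nan"
        "lr": float(match["lr"]),
        "nan_share": float(match["nan_share"]),
    }


def first_nan_epoch(lines):
    """Return the number of the first epoch whose mean loss is NaN, else None."""
    for line in lines:
        stats = parse_epoch_line(line)
        if stats is not None and math.isnan(stats["mean_loss"]):
            return stats["epoch"]
    return None


# Two summary lines copied from the output above, plus one unrelated line.
sample_log = [
    "Using a Transformer with 14.14 M parameters",
    "| end of epoch   1 | time: 11.80s | mean loss   nan | lr 0.0 | data time  0.00 step time  0.11 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share  0.10 ignore share (for classification tasks)   nan ",
    "| end of epoch   2 | time: 13.49s | mean loss   nan | lr 0.00015 | data time  0.00 step time  0.13 forward time  0.01 | max gpu mem 1.0 GiB | gpu utilization 90.0 %| nan share  0.04 ignore share (for classification tasks)   nan ",
]

print(first_nan_epoch(sample_log))
```

On the log above this reports the first epoch, so the mean loss is already NaN in the first reported epoch summary.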
