Runtime cuDNN error when training custom non-Latin-character model #1345

Open
S0mbre opened this issue Dec 8, 2024 · 1 comment

S0mbre commented Dec 8, 2024

When training a custom model with the provided training script in a Google Colab environment, I constantly get the following cuDNN error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
[<ipython-input-9-2d2174729bbd>](https://localhost:8080/#) in <cell line: 13>()
     11 
     12 # force_cudnn_initialization()
---> 13 train(opt, amp=False)

5 frames
[/content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/train.py](https://localhost:8080/#) in train(opt, show_number, amp)
    233                 with torch.no_grad():
    234                     valid_loss, current_accuracy, current_norm_ED, preds, confidence_score, labels,\
--> 235                     infer_time, length_of_data = validation(model, criterion, valid_loader, converter, opt, device)
    236                 model.train()
    237 

[/content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/test.py](https://localhost:8080/#) in validation(model, criterion, evaluation_loader, converter, opt, device)
     41             preds_size = torch.IntTensor([preds.size(1)] * batch_size)
     42             # permute 'preds' to use CTCloss format
---> 43             cost = criterion(preds.log_softmax(2).permute(1, 0, 2), text_for_loss, preds_size, length_for_loss)
     44 
     45             if opt.decode == 'greedy':

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _wrapped_call_impl(self, *args, **kwargs)
   1734             return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735         else:
-> 1736             return self._call_impl(*args, **kwargs)
   1737 
   1738     # torchrec tests the code consistency with the following code

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *args, **kwargs)
   1745                 or _global_backward_pre_hooks or _global_backward_hooks
   1746                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747             return forward_call(*args, **kwargs)
   1748 
   1749         result = None

[/usr/local/lib/python3.10/dist-packages/torch/nn/modules/loss.py](https://localhost:8080/#) in forward(self, log_probs, targets, input_lengths, target_lengths)
   1978         target_lengths: Tensor,
   1979     ) -> Tensor:
-> 1980         return F.ctc_loss(
   1981             log_probs,
   1982             targets,

[/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py](https://localhost:8080/#) in ctc_loss(log_probs, targets, input_lengths, target_lengths, blank, reduction, zero_infinity)
   3067             zero_infinity=zero_infinity,
   3068         )
-> 3069     return torch.ctc_loss(
   3070         log_probs,
   3071         targets,

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Software versions

Python 3.10
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvcc-cu12==12.6.85
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.6.0.74
nvidia-cufft-cu12==11.3.0.4
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-nccl-cu12==2.23.4
nvidia-nvjitlink-cu12==12.6.85
torch @ https://download.pytorch.org/whl/cu121_full/torch-2.5.1%2Bcu121-cp310-cp310-linux_x86_64.whl
torchvision @ https://download.pytorch.org/whl/cu121/torchvision-0.20.1%2Bcu121-cp310-cp310-linux_x86_64.whl

Model train config

number: 0123456789
symbol: .,: 
lang_char: АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя
experiment_name: ru_filtered
train_data: /content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/all_data
valid_data: /content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/all_data/ru_val
manualSeed: 1111
workers: 1
batch_size: 32
num_iter: 30000
valInterval: 200
saved_model: /content/gdrive/My Drive/Colab Notebooks/easyocr/trainer/saved_models/ru_filtered/cyrillic_g2.pth
FT: False
optim: False
lr: 1.0
beta1: 0.9
rho: 0.95
eps: 1e-08
grad_clip: 5
select_data: ['train', 'val']
batch_ratio: ['0.8', '0.2']
total_data_usage_ratio: 1.0
batch_max_length: 34
imgH: 64
imgW: 600
rgb: False
contrast_adjust: 0.0
sensitive: True
PAD: True
data_filtering_off: False
Transformation: None
FeatureExtraction: VGG
SequenceModeling: BiLSTM
Prediction: CTC
num_fiducial: 20
input_channel: 1
output_channel: 256
hidden_size: 256
decode: greedy
new_prediction: True    # !!! HAD TO SET TO TRUE BECAUSE OF SIZE MISMATCH ERROR
freeze_FeatureFxtraction: False
freeze_SequenceModeling: False
character: 0123456789.,: АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя
num_class: 80

Please also note the "new_prediction" setting in the config above: when training from the pretrained model ("cyrillic_g2.pth") with "new_prediction" set to False, I get the following error:

size mismatch for module.Prediction.weight: copying a param with shape torch.Size([208, 256]) from checkpoint, the shape in current model is torch.Size([80, 256])
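
For context on the two shapes in that message, here is a minimal sketch that recomputes the class count from the config above. It assumes the CTC label converter reserves one extra index for the blank token, which is consistent with the reported "num_class: 80" but is not confirmed anywhere in this thread. The variable `checkpoint_classes` is just a name used here for the 208 taken from the error message.

```python
# Character groups copied from the training config in this issue.
number = "0123456789"
symbol = ".,: "  # note the trailing space character
lang_char = "АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя"

character = number + symbol + lang_char

# Assumption: one extra output index is reserved for the CTC blank token.
num_class = len(character) + 1

# First dimension of module.Prediction.weight reported for the checkpoint.
checkpoint_classes = 208

print(len(character), num_class)        # 79 80
print(num_class == checkpoint_classes)  # False
```

Under that assumption, the new character set yields an 80-output prediction layer, while the checkpoint was saved with 208 outputs, so the prediction weights cannot be copied over as-is.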

amin-aa commented Dec 26, 2024

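I had the same problem. The workaround is to compute the 'cost' without cuDNN: disable cuDNN right before computing the 'cost' and enable it again afterward in the 'test.py' file (line 43). With cuDNN disabled, PyTorch falls back to its own CTC loss implementation; the tensors are not moved to the CPU: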

torch.backends.cudnn.enabled = False
cost = criterion(preds.log_softmax(2).permute(1, 0, 2), text_for_loss, preds_size, length_for_loss)
torch.backends.cudnn.enabled = True
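
A slightly more defensive variant of the three lines above, as a minimal sketch rather than a tested patch for the trainer. The helper name `cudnn_disabled` is hypothetical (it is not part of EasyOCR or PyTorch). It restores whatever value the flag had before, even if the loss computation raises. The small tensors are toy stand-ins for `preds`, `text_for_loss`, `preds_size` and `length_for_loss`, so the example also runs on a machine without a GPU.

```python
from contextlib import contextmanager

import torch


@contextmanager
def cudnn_disabled():
    """Temporarily disable cuDNN, then restore the previous setting."""
    previous = torch.backends.cudnn.enabled
    torch.backends.cudnn.enabled = False
    try:
        yield
    finally:
        torch.backends.cudnn.enabled = previous


# Toy stand-ins for the tensors used in validation().
batch_size, time_steps, num_class = 2, 5, 4
preds = torch.randn(batch_size, time_steps, num_class)
text_for_loss = torch.IntTensor([[1, 2], [3, 0]])  # padded targets, 0 is the blank index
length_for_loss = torch.IntTensor([2, 1])          # true target lengths
preds_size = torch.IntTensor([preds.size(1)] * batch_size)

criterion = torch.nn.CTCLoss()
flag_before = torch.backends.cudnn.enabled

with cudnn_disabled():
    # Same call as line 43 of test.py, wrapped in the context manager.
    cost = criterion(preds.log_softmax(2).permute(1, 0, 2),
                     text_for_loss, preds_size, length_for_loss)

print(torch.backends.cudnn.enabled == flag_before)
```

In 'test.py' only the `with cudnn_disabled():` wrapper around the existing `cost = criterion(...)` line would be needed; the helper itself can live at the top of the file.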
