HOW TO: Training in Google Colab (Single T4) and "NotImplementedError" #39

@LeMosquitar

Hello, I am trying out this project. Thank you for your efforts, by the way!

  1. I ran the project in Google Colab: cloned the repo, installed the requirements, and ran inference.
  2. Inference produced output, which tells me everything is installed properly.
  3. I then prepared for training:
    -> Followed the folder structure and dataset format
    -> Set remap_mscoco_category to False in custom_detection.yml
    -> Also changed the other parameters in custom_detection.yml, as shown below:
task: detection

evaluator:
  type: CocoEvaluator
  iou_types: ['bbox', ]

num_classes: 3 # your dataset classes
remap_mscoco_category: False

train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /content/drive/MyDrive/v9-v1_augmented.coco/images/train
    ann_file: /content/drive/MyDrive/v9-v1_augmented.coco/annotations/instances_train.json
    return_masks: False
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  num_workers: 4
  drop_last: True
  collate_fn:
    type: BatchImageCollateFunction


val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /content/drive/MyDrive/v9-v1_augmented.coco/images/val
    ann_file: /content/drive/MyDrive/v9-v1_augmented.coco/annotations/instances_val.json
    return_masks: False
    transforms:
      type: Compose
      ops: ~
  shuffle: False
  num_workers: 4
  drop_last: False
  collate_fn:
    type: BatchImageCollateFunction

I also changed my dataloader.yml (to reduce the batch size):

train_dataloader:
  dataset:
    transforms:
      ops:
        - {type: RandomPhotometricDistort, p: 0.5}
        - {type: RandomZoomOut, fill: 0}
        - {type: RandomIoUCrop, p: 0.8}
        - {type: SanitizeBoundingBoxes, min_size: 1}
        - {type: RandomHorizontalFlip}
        - {type: Resize, size: [640, 640], }
        - {type: SanitizeBoundingBoxes, min_size: 1}
        - {type: ConvertPILImage, dtype: 'float32', scale: True}
        - {type: ConvertBoxes, fmt: 'cxcywh', normalize: True}
      policy:
        name: stop_epoch
        epoch: 72 # epoch in [72, ~) stop `ops`
        ops: ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']

  collate_fn:
    type: BatchImageCollateFunction
    base_size: 640
    base_size_repeat: 3
    stop_epoch: 72 # epoch in [72, ~) stop `multiscales`

  shuffle: True
  total_batch_size: 8 # reduced to fit a single T4
  num_workers: 4


val_dataloader:
  dataset:
    transforms:
      ops:
        - {type: Resize, size: [640, 640], }
        - {type: ConvertPILImage, dtype: 'float32', scale: True}
  shuffle: False
  total_batch_size: 8
  num_workers: 4
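
As a sanity check on the custom_detection.yml values above, a minimal sketch like the following could confirm that each annotation file and image folder exists and list the category ids the annotations define, so they can be compared against num_classes by hand. The function name `summarize_coco_split` is made up for illustration. It uses only the standard library and assumes the annotation files follow the usual COCO layout with top-level `images` and `categories` lists. It does not check how the repo maps category ids to labels when remap_mscoco_category is False.

```python
import json
from pathlib import Path


def summarize_coco_split(img_folder, ann_file):
    """Summarize one dataset split without importing any training code.

    Hypothetical helper: reports whether the paths exist, which category ids
    the annotation file defines, and which listed images are missing on disk.
    """
    img_dir, ann_path = Path(img_folder), Path(ann_file)
    summary = {
        "ann_file_exists": ann_path.is_file(),
        "img_folder_exists": img_dir.is_dir(),
        "category_ids": [],
        "num_images": 0,
        "missing_images": [],
    }
    if not summary["ann_file_exists"]:
        return summary

    with ann_path.open() as f:
        coco = json.load(f)

    summary["category_ids"] = sorted(c["id"] for c in coco.get("categories", []))
    images = coco.get("images", [])
    summary["num_images"] = len(images)

    # Only look for image files when the folder itself exists.
    if summary["img_folder_exists"]:
        summary["missing_images"] = [
            im["file_name"]
            for im in images
            if not (img_dir / im["file_name"]).is_file()
        ]
    return summary


if __name__ == "__main__":
    # Paths taken from the custom_detection.yml shown above.
    root = Path("/content/drive/MyDrive/v9-v1_augmented.coco")
    for split in ("train", "val"):
        print(split, summarize_coco_split(
            root / "images" / split,
            root / "annotations" / f"instances_{split}.json",
        ))
```

If `category_ids` has a different length than num_classes, or `ann_file_exists` is False, that would be worth resolving before starting a training run.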
  4. I did not modify anything else and started training with this command:
!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 train.py -c "/content/DEIM/configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml" --use-amp --seed=0 -t "/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth"

I then got the following output:

2025-02-28 09:13:07.162205: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740733987.183540   13770 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740733987.190107   13770 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-28 09:13:07.211146: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Initialized distributed mode...
cfg:  {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'epoches': 120, 'last_epoch': -1, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'no_aug_epoch': 3, 'warmup_iter': 2000, 'flat_epoch': 64, 'use_amp': True, 'use_ema': True, 'ema_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_freq': 4, 'output_dir': './output/deim_rtdetrv2_r18vd_120e_coco', 'summary_dir': None, 'device': '', 'yaml_cfg': {'task': 'detection', 'evaluator': {'type': 'CocoEvaluator', 'iou_types': ['bbox']}, 'num_classes': 80, 'remap_mscoco_category': False, 'train_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/train2017/', 'ann_file': '/datassd/COCO/annotations/instances_train2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Mosaic', 'output_size': 320, 'rotation_range': 10, 'translation_range': [0.1, 0.1], 'scaling_range': [0.5, 1.5], 'probability': 1.0, 'fill_value': 0, 'use_cache': False, 'max_cached_images': 50, 'random_pop': True}, {'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'RandomZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 'Resize', 'size': [640, 640]}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 
'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 'policy': {'name': 'stop_epoch', 'epoch': [4, 64, 117], 'ops': ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']}, 'mosaic_prob': 0.5}}, 'shuffle': True, 'num_workers': 4, 'drop_last': True, 'collate_fn': {'type': 'BatchImageCollateFunction', 'base_size': 640, 'base_size_repeat': 3, 'stop_epoch': 117, 'scales': None, 'mixup_prob': 0.5, 'mixup_epochs': [4, 64]}, 'total_batch_size': 16}, 'val_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/val2017/', 'ann_file': '/datassd/COCO/annotations/instances_val2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'num_workers': 4, 'drop_last': False, 'collate_fn': {'type': 'BatchImageCollateFunction'}, 'total_batch_size': 8}, 'print_freq': 100, 'output_dir': './output/deim_rtdetrv2_r18vd_120e_coco', 'checkpoint_freq': 4, 'sync_bn': True, 'find_unused_parameters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups': 2000, 'start': 0}, 'epoches': 120, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*(?:norm|bn)).*$', 'weight_decay': 0.0}], 'lr': 0.0002, 'betas': [0.9, 0.999], 'weight_decay': 0.0001}, 'lr_scheduler': {'type': 'MultiStepLR', 'milestones': [1000], 'gamma': 0.1}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 2000}, 'model': 'DEIM', 'criterion': 'DEIMCriterion', 'postprocessor': 'PostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DEIM': {'backbone': 'PResNet', 'encoder': 'HybridEncoder', 'decoder': 'RTDETRTransformerv2'}, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'warmup_iter': 2000, 'flat_epoch': 64, 'no_aug_epoch': 3, 'PResNet': {'depth': 18, 'variant': 'd', 
'freeze_at': -1, 'return_idx': [1, 2, 3], 'num_stages': 4, 'freeze_norm': False, 'pretrained': True, 'local_model_dir': '../RT-DETR-main/rtdetrv2_pytorch/INK1k/'}, 'HybridEncoder': {'in_channels': [128, 256, 512], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_encoder_layers': 1, 'nhead': 8, 'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 1, 'act': 'silu', 'version': 'rt_detrv2'}, 'RTDETRTransformerv2': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 'num_queries': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'eval_idx': -1, 'num_points': [4, 4, 4], 'cross_attn_method': 'default', 'query_select_method': 'default', 'query_pos_method': 'as_reg', 'activation': 'silu', 'mlp_act': 'silu'}, 'PostProcessor': {'num_top_queries': 300}, 'DEIMCriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2, 'loss_mal': 1}, 'losses': ['mal', 'boxes'], 'alpha': 0.75, 'gamma': 1.5, 'use_uni_set': False, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['./rtdetrv2_r18vd_120e_coco.yml', '../base/rt_deim.yml'], 'config': '/content/DEIM/configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml', 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
/content/DEIM/engine/backbone/presnet.py:227: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(model_path, map_location='cpu')
Loaded PResNet18 from local file@../RT-DETR-main/rtdetrv2_pytorch/INK1k/ResNet18_vd_pretrained_from_paddle.pth.
Load PResNet18 state_dict
     ### Query Position Embedding@as_reg ###     
Tuning checkpoint from /content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth
/content/DEIM/engine/solver/_solver.py:169: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state = torch.load(path, map_location='cpu')
Load model.state_dict, {'missed': [], 'unmatched': []}
/content/DEIM/engine/core/workspace.py:180: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  return module(**module_kwargs)
Initial lr: [0.0002, 0.0002]
building train_dataloader with batch_size=16...
     ### Transform @Mosaic ###    
     ### Transform @RandomPhotometricDistort ###    
     ### Transform @RandomZoomOut ###    
     ### Transform @RandomIoUCrop ###    
     ### Transform @SanitizeBoundingBoxes ###    
     ### Transform @RandomHorizontalFlip ###    
     ### Transform @Resize ###    
     ### Transform @SanitizeBoundingBoxes ###    
     ### Transform @ConvertPILImage ###    
     ### Transform @ConvertBoxes ###    
     ### Mosaic with Prob.@0.5 and ZoomOut/IoUCrop existed ### 
     ### ImgTransforms Epochs: [4, 64, 117] ### 
     ### Policy_ops@['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop'] ###
[rank0]: Traceback (most recent call last):
[rank0]:   File "/content/DEIM/train.py", line 84, in <module>
[rank0]:     main(args)
[rank0]:   File "/content/DEIM/train.py", line 54, in main
[rank0]:     solver.fit()
[rank0]:   File "/content/DEIM/engine/solver/det_solver.py", line 25, in fit
[rank0]:     self.train()
[rank0]:   File "/content/DEIM/engine/solver/_solver.py", line 87, in train
[rank0]:     self.cfg.train_dataloader, shuffle=self.cfg.train_dataloader.shuffle
[rank0]:     ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/yaml_config.py", line 76, in train_dataloader
[rank0]:     self._train_dataloader = self.build_dataloader('train_dataloader')
[rank0]:                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/yaml_config.py", line 172, in build_dataloader
[rank0]:     loader = create(name, global_cfg, batch_size=bs)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/workspace.py", line 119, in create
[rank0]:     return create(name, global_cfg)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/workspace.py", line 167, in create
[rank0]:     module_kwargs[k] = create(name, global_cfg)
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/core/workspace.py", line 180, in create
[rank0]:     return module(**module_kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/content/DEIM/engine/data/dataset/coco_dataset.py", line 33, in __init__
[rank0]:     super(CocoDetection, self).__init__(img_folder, ann_file)
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/torchvision/datasets/coco.py", line 37, in __init__
[rank0]:     self.coco = COCO(annFile)
[rank0]:                 ^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/coco.py", line 57, in __init__
[rank0]:     self.dataset = self.load_json(annotation_file, self.use_deepcopy)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/coco.py", line 302, in load_json
[rank0]:     with open(json_file) as io:
[rank0]:          ^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/datassd/COCO/annotations/instances_train2017.json'
E0228 09:13:17.895000 13755 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 13770) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-02-28_09:13:17
  host      : 2c5ae9ce8b33
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13770)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Is my training procedure correct? I followed the steps, but I seem to be missing something. Also, why does training look for '/datassd/COCO/annotations/instances_train2017.json' when I intend to use a custom dataset?
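
The cfg dump above shows an `__include__` list for the config passed with `-c`, so one way to narrow this down might be to walk that chain and print every file that sets `train_dataloader.dataset.ann_file`. The sketch below is only a diagnostic, not the repo's own loader: `find_key_in_includes` is a hypothetical helper, it assumes relative `__include__` entries are resolved against the including file's directory, and it reads files with PyYAML's `yaml.safe_load` unless another loader is passed in.

```python
from pathlib import Path


def _default_load(handle):
    import yaml  # PyYAML; imported lazily so the helper works with other loaders too
    return yaml.safe_load(handle)


def find_key_in_includes(config_path, dotted_key, load=_default_load, _seen=None):
    """Return [(file, value), ...] for each file in the include chain that sets dotted_key.

    Included files are listed before the file that includes them.
    """
    _seen = set() if _seen is None else _seen
    path = Path(config_path).resolve()
    if path in _seen:  # skip files already visited (also guards against cycles)
        return []
    _seen.add(path)

    with path.open() as handle:
        cfg = load(handle) or {}

    hits = []
    for include in cfg.get("__include__", []):
        # Assumption: relative includes are relative to the including file.
        hits += find_key_in_includes(path.parent / include, dotted_key, load, _seen)

    # Walk the dotted key; record a hit only if every part is present.
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            break
        node = node[part]
    else:
        hits.append((str(path), node))
    return hits


if __name__ == "__main__":
    cfg_file = Path("/content/DEIM/configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml")
    if cfg_file.is_file():
        key = "train_dataloader.dataset.ann_file"
        for file, value in find_key_in_includes(cfg_file, key):
            print(file, "->", value)
```

If none of the printed files is custom_detection.yml, the edits made there would not be part of the config that this training command loads.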
