Hello, I am experimenting with this project. Thank you for your efforts, by the way!
- I ran the project in Google Colab: I cloned the repo, installed the requirements, and ran inference.
- The inference output indicates that everything is installed properly.
- I then prepared for training:
-> I followed the required folder structure and dataset format.
-> In custom_detection.yml, I set remap_mscoco_category to False.
-> I also changed the other parameters in custom_detection.yml as shown below:
task: detection
evaluator:
  type: CocoEvaluator
  iou_types: ['bbox', ]
num_classes: 3 # your dataset classes
remap_mscoco_category: False
train_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /content/drive/MyDrive/v9-v1_augmented.coco/images/train
    ann_file: /content/drive/MyDrive/v9-v1_augmented.coco/annotations/instances_train.json
    return_masks: False
    transforms:
      type: Compose
      ops: ~
  shuffle: True
  num_workers: 4
  drop_last: True
  collate_fn:
    type: BatchImageCollateFunction
val_dataloader:
  type: DataLoader
  dataset:
    type: CocoDetection
    img_folder: /content/drive/MyDrive/v9-v1_augmented.coco/images/val
    ann_file: /content/drive/MyDrive/v9-v1_augmented.coco/annotations/instances_val.json
    return_masks: False
    transforms:
      type: Compose
      ops: ~
  shuffle: False
  num_workers: 4
  drop_last: False
  collate_fn:
    type: BatchImageCollateFunction
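Before training, it can help to confirm that the annotation file agrees with the `num_classes` value set above. The following is a minimal sketch, not part of DEIM: `summarize_coco_annotations` is a hypothetical helper name, and it assumes only the standard COCO JSON layout with top-level `images`, `annotations`, and `categories` lists.

```python
import json
from collections import Counter


def summarize_coco_annotations(ann_file, expected_num_classes):
    """Summarize a COCO-style annotation file and compare it to num_classes.

    Hypothetical helper for illustration; it only reads the JSON file.
    """
    with open(ann_file) as f:
        data = json.load(f)

    # Category IDs declared in the file, sorted for readability.
    category_ids = sorted(c["id"] for c in data.get("categories", []))
    # How many annotations reference each category ID.
    used = Counter(a["category_id"] for a in data.get("annotations", []))

    return {
        "num_images": len(data.get("images", [])),
        "num_annotations": len(data.get("annotations", [])),
        "category_ids": category_ids,
        "annotations_per_category": dict(used),
        # IDs referenced by annotations but never declared under "categories".
        "undeclared_ids": sorted(set(used) - set(category_ids)),
        "matches_num_classes": len(category_ids) == expected_num_classes,
    }
```

Calling `summarize_coco_annotations(ann_file, expected_num_classes=3)` with the `ann_file` path from the config above would show whether the file declares three categories and which IDs they use.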
I also changed my dataloader.yml as follows (to reduce the batch size):
train_dataloader:
  dataset:
    transforms:
      ops:
        - {type: RandomPhotometricDistort, p: 0.5}
        - {type: RandomZoomOut, fill: 0}
        - {type: RandomIoUCrop, p: 0.8}
        - {type: SanitizeBoundingBoxes, min_size: 1}
        - {type: RandomHorizontalFlip}
        - {type: Resize, size: [640, 640], }
        - {type: SanitizeBoundingBoxes, min_size: 1}
        - {type: ConvertPILImage, dtype: 'float32', scale: True}
        - {type: ConvertBoxes, fmt: 'cxcywh', normalize: True}
      policy:
        name: stop_epoch
        epoch: 72 # epoch in [71, ~) stop `ops`
        ops: ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']
  collate_fn:
    type: BatchImageCollateFunction
    base_size: 640
    base_size_repeat: 3
    stop_epoch: 72 # epoch in [72, ~) stop `multiscales`
  shuffle: True
  total_batch_size: 8 # total batch size equals to 32 (4 * 8)
  num_workers: 4
val_dataloader:
  dataset:
    transforms:
      ops:
        - {type: Resize, size: [640, 640], }
        - {type: ConvertPILImage, dtype: 'float32', scale: True}
  shuffle: False
  total_batch_size: 8
  num_workers: 4
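The `total_batch_size` value is summed over all training processes. As a rough sketch of the arithmetic (the helper name `per_process_batch_size` is hypothetical, and the even split across processes is an assumption about how the trainer divides the batch, not something confirmed in this thread):

```python
def per_process_batch_size(total_batch_size, world_size):
    """Split a total batch size evenly across `world_size` processes.

    Assumes an even split; raises if the total does not divide cleanly.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if total_batch_size % world_size != 0:
        raise ValueError("total_batch_size must be divisible by world_size")
    return total_batch_size // world_size
```

Under that assumption, a single process (`--nproc_per_node=1`) with `total_batch_size: 8` would get a batch size of 8. The log in this report prints `batch_size=16`, which matches the `'total_batch_size': 16` in the dumped config rather than the edited value of 8.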
- I did not modify anything else and started training with the command:
!CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 train.py -c "/content/DEIM/configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml" --use-amp --seed=0 -t "/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth"
I then got the following output:
2025-02-28 09:13:07.162205: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1740733987.183540 13770 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1740733987.190107 13770 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-28 09:13:07.211146: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Initialized distributed mode...
cfg: {'task': 'detection', '_model': None, '_postprocessor': None, '_criterion': None, '_optimizer': None, '_lr_scheduler': None, '_lr_warmup_scheduler': None, '_train_dataloader': None, '_val_dataloader': None, '_ema': None, '_scaler': None, '_train_dataset': None, '_val_dataset': None, '_collate_fn': None, '_evaluator': None, '_writer': None, 'num_workers': 0, 'batch_size': None, '_train_batch_size': None, '_val_batch_size': None, '_train_shuffle': None, '_val_shuffle': None, 'resume': None, 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'epoches': 120, 'last_epoch': -1, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'no_aug_epoch': 3, 'warmup_iter': 2000, 'flat_epoch': 64, 'use_amp': True, 'use_ema': True, 'ema_decay': 0.9999, 'ema_warmups': 2000, 'sync_bn': True, 'clip_max_norm': 0.1, 'find_unused_parameters': False, 'seed': 0, 'print_freq': 100, 'checkpoint_freq': 4, 'output_dir': './output/deim_rtdetrv2_r18vd_120e_coco', 'summary_dir': None, 'device': '', 'yaml_cfg': {'task': 'detection', 'evaluator': {'type': 'CocoEvaluator', 'iou_types': ['bbox']}, 'num_classes': 80, 'remap_mscoco_category': False, 'train_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/train2017/', 'ann_file': '/datassd/COCO/annotations/instances_train2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Mosaic', 'output_size': 320, 'rotation_range': 10, 'translation_range': [0.1, 0.1], 'scaling_range': [0.5, 1.5], 'probability': 1.0, 'fill_value': 0, 'use_cache': False, 'max_cached_images': 50, 'random_pop': True}, {'type': 'RandomPhotometricDistort', 'p': 0.5}, {'type': 'RandomZoomOut', 'fill': 0}, {'type': 'RandomIoUCrop', 'p': 0.8}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'RandomHorizontalFlip'}, {'type': 'Resize', 'size': [640, 640]}, {'type': 'SanitizeBoundingBoxes', 'min_size': 1}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}, {'type': 
'ConvertBoxes', 'fmt': 'cxcywh', 'normalize': True}], 'policy': {'name': 'stop_epoch', 'epoch': [4, 64, 117], 'ops': ['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']}, 'mosaic_prob': 0.5}}, 'shuffle': True, 'num_workers': 4, 'drop_last': True, 'collate_fn': {'type': 'BatchImageCollateFunction', 'base_size': 640, 'base_size_repeat': 3, 'stop_epoch': 117, 'scales': None, 'mixup_prob': 0.5, 'mixup_epochs': [4, 64]}, 'total_batch_size': 16}, 'val_dataloader': {'type': 'DataLoader', 'dataset': {'type': 'CocoDetection', 'img_folder': '/datassd/COCO/val2017/', 'ann_file': '/datassd/COCO/annotations/instances_val2017.json', 'return_masks': False, 'transforms': {'type': 'Compose', 'ops': [{'type': 'Resize', 'size': [640, 640]}, {'type': 'ConvertPILImage', 'dtype': 'float32', 'scale': True}]}}, 'shuffle': False, 'num_workers': 4, 'drop_last': False, 'collate_fn': {'type': 'BatchImageCollateFunction'}, 'total_batch_size': 8}, 'print_freq': 100, 'output_dir': './output/deim_rtdetrv2_r18vd_120e_coco', 'checkpoint_freq': 4, 'sync_bn': True, 'find_unused_parameters': False, 'use_amp': True, 'scaler': {'type': 'GradScaler', 'enabled': True}, 'use_ema': True, 'ema': {'type': 'ModelEMA', 'decay': 0.9999, 'warmups': 2000, 'start': 0}, 'epoches': 120, 'clip_max_norm': 0.1, 'optimizer': {'type': 'AdamW', 'params': [{'params': '^(?=.*(?:norm|bn)).*$', 'weight_decay': 0.0}], 'lr': 0.0002, 'betas': [0.9, 0.999], 'weight_decay': 0.0001}, 'lr_scheduler': {'type': 'MultiStepLR', 'milestones': [1000], 'gamma': 0.1}, 'lr_warmup_scheduler': {'type': 'LinearWarmup', 'warmup_duration': 2000}, 'model': 'DEIM', 'criterion': 'DEIMCriterion', 'postprocessor': 'PostProcessor', 'use_focal_loss': True, 'eval_spatial_size': [640, 640], 'DEIM': {'backbone': 'PResNet', 'encoder': 'HybridEncoder', 'decoder': 'RTDETRTransformerv2'}, 'lrsheduler': 'flatcosine', 'lr_gamma': 0.5, 'warmup_iter': 2000, 'flat_epoch': 64, 'no_aug_epoch': 3, 'PResNet': {'depth': 18, 'variant': 'd', 
'freeze_at': -1, 'return_idx': [1, 2, 3], 'num_stages': 4, 'freeze_norm': False, 'pretrained': True, 'local_model_dir': '../RT-DETR-main/rtdetrv2_pytorch/INK1k/'}, 'HybridEncoder': {'in_channels': [128, 256, 512], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'use_encoder_idx': [2], 'num_encoder_layers': 1, 'nhead': 8, 'dim_feedforward': 1024, 'dropout': 0.0, 'enc_act': 'gelu', 'expansion': 0.5, 'depth_mult': 1, 'act': 'silu', 'version': 'rt_detrv2'}, 'RTDETRTransformerv2': {'feat_channels': [256, 256, 256], 'feat_strides': [8, 16, 32], 'hidden_dim': 256, 'num_levels': 3, 'num_layers': 3, 'num_queries': 300, 'num_denoising': 100, 'label_noise_ratio': 0.5, 'box_noise_scale': 1.0, 'eval_idx': -1, 'num_points': [4, 4, 4], 'cross_attn_method': 'default', 'query_select_method': 'default', 'query_pos_method': 'as_reg', 'activation': 'silu', 'mlp_act': 'silu'}, 'PostProcessor': {'num_top_queries': 300}, 'DEIMCriterion': {'weight_dict': {'loss_vfl': 1, 'loss_bbox': 5, 'loss_giou': 2, 'loss_mal': 1}, 'losses': ['mal', 'boxes'], 'alpha': 0.75, 'gamma': 1.5, 'use_uni_set': False, 'matcher': {'type': 'HungarianMatcher', 'weight_dict': {'cost_class': 2, 'cost_bbox': 5, 'cost_giou': 2}, 'alpha': 0.25, 'gamma': 2.0}}, '__include__': ['./rtdetrv2_r18vd_120e_coco.yml', '../base/rt_deim.yml'], 'config': '/content/DEIM/configs/deim_rtdetrv2/deim_r18vd_120e_coco.yml', 'tuning': '/content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth', 'seed': 0, 'test_only': False, 'print_method': 'builtin', 'print_rank': 0}}
/content/DEIM/engine/backbone/presnet.py:227: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(model_path, map_location='cpu')
Loaded PResNet18 from local file@../RT-DETR-main/rtdetrv2_pytorch/INK1k/ResNet18_vd_pretrained_from_paddle.pth.
Load PResNet18 state_dict
### Query Position Embedding@as_reg ###
Tuning checkpoint from /content/DEIM/deim_rtdetrv2_r18vd_coco_120e.pth
/content/DEIM/engine/solver/_solver.py:169: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(path, map_location='cpu')
Load model.state_dict, {'missed': [], 'unmatched': []}
/content/DEIM/engine/core/workspace.py:180: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
return module(**module_kwargs)
Initial lr: [0.0002, 0.0002]
building train_dataloader with batch_size=16...
### Transform @Mosaic ###
### Transform @RandomPhotometricDistort ###
### Transform @RandomZoomOut ###
### Transform @RandomIoUCrop ###
### Transform @SanitizeBoundingBoxes ###
### Transform @RandomHorizontalFlip ###
### Transform @Resize ###
### Transform @SanitizeBoundingBoxes ###
### Transform @ConvertPILImage ###
### Transform @ConvertBoxes ###
### Mosaic with Prob.@0.5 and ZoomOut/IoUCrop existed ###
### ImgTransforms Epochs: [4, 64, 117] ###
### Policy_ops@['Mosaic', 'RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop'] ###
[rank0]: Traceback (most recent call last):
[rank0]: File "/content/DEIM/train.py", line 84, in <module>
[rank0]: main(args)
[rank0]: File "/content/DEIM/train.py", line 54, in main
[rank0]: solver.fit()
[rank0]: File "/content/DEIM/engine/solver/det_solver.py", line 25, in fit
[rank0]: self.train()
[rank0]: File "/content/DEIM/engine/solver/_solver.py", line 87, in train
[rank0]: self.cfg.train_dataloader, shuffle=self.cfg.train_dataloader.shuffle
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/DEIM/engine/core/yaml_config.py", line 76, in train_dataloader
[rank0]: self._train_dataloader = self.build_dataloader('train_dataloader')
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/DEIM/engine/core/yaml_config.py", line 172, in build_dataloader
[rank0]: loader = create(name, global_cfg, batch_size=bs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/DEIM/engine/core/workspace.py", line 119, in create
[rank0]: return create(name, global_cfg)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/DEIM/engine/core/workspace.py", line 167, in create
[rank0]: module_kwargs[k] = create(name, global_cfg)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/DEIM/engine/core/workspace.py", line 180, in create
[rank0]: return module(**module_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/content/DEIM/engine/data/dataset/coco_dataset.py", line 33, in __init__
[rank0]: super(CocoDetection, self).__init__(img_folder, ann_file)
[rank0]: File "/usr/local/lib/python3.11/dist-packages/torchvision/datasets/coco.py", line 37, in __init__
[rank0]: self.coco = COCO(annFile)
[rank0]: ^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/coco.py", line 57, in __init__
[rank0]: self.dataset = self.load_json(annotation_file, self.use_deepcopy)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/usr/local/lib/python3.11/dist-packages/faster_coco_eval/core/coco.py", line 302, in load_json
[rank0]: with open(json_file) as io:
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: '/datassd/COCO/annotations/instances_train2017.json'
E0228 09:13:17.895000 13755 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 13770) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 10, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-28_09:13:17
host : 2c5ae9ce8b33
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 13770)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Is my training procedure correct? I followed the steps, but I seem to be missing something. Also, why does training look for '/datassd/COCO/annotations/instances_train2017.json' when I intend to train on a custom dataset?
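The `cfg:` dump in the log shows the settings the trainer actually resolved, including `'num_classes': 80` and the `/datassd/COCO/...` paths, so the config passed with `-c` does not appear to pick up the edited custom_detection.yml. A minimal sketch for catching this before launch: the function below walks a resolved config dictionary (such as the `yaml_cfg` entry in the dump) and reports every `img_folder` and `ann_file` value that does not exist on disk. `find_missing_dataset_paths` is a hypothetical helper, not part of DEIM.

```python
import os


def find_missing_dataset_paths(cfg, keys=("img_folder", "ann_file"), prefix=""):
    """Return (dotted_key, path) pairs for dataset paths that do not exist.

    Hypothetical helper: recursively walks nested dicts and lists and checks
    string values stored under any of the given key names.
    """
    missing = []
    if isinstance(cfg, dict):
        for name, value in cfg.items():
            dotted = f"{prefix}.{name}" if prefix else str(name)
            if name in keys and isinstance(value, str):
                if not os.path.exists(value):
                    missing.append((dotted, value))
            else:
                missing.extend(find_missing_dataset_paths(value, keys, dotted))
    elif isinstance(cfg, (list, tuple)):
        for i, item in enumerate(cfg):
            missing.extend(
                find_missing_dataset_paths(item, keys, f"{prefix}[{i}]")
            )
    return missing
```

Run against the dumped config, a check like this would flag `train_dataloader.dataset.ann_file` whenever it still points at the COCO default instead of the custom annotation file.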