Multi-GPU Validation fails #931

@abhiagwl4262

Description

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

NCCL timeout in multi-GPU training

I trained for 2 epochs on a single-GPU machine and then resumed training on a multi-GPU machine to speed things up. On the multi-GPU machine, the run hangs for a long time at the end of validation and then fails with an NCCL timeout:

Epoch 2: 100%|████████████████████| 2289/2289 [19:11<00:00,  1.99it/s, train/lr=0.0001, train/lr_min=4.04e-6, train/lr_max=0.0001
Validation DataLoader 0: 100%|████████████████████| 303/303 [02:01<00:00,  2.49it/s]

[rank1]:[E407 10:02:26.551756518 ProcessGroupNCCL.cpp:683] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49209, OpType=ALLGATHER, NumelIn=43, NumelOut=172, Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out.
[rank1]:[E407 10:02:26.551997332 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 49209 PG status: last enqueued work: 49209, last completed work: 49208
[rank1]:[E407 10:02:26.552039573 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank1]:[E407 10:02:26.552111664 ProcessGroupNCCL.cpp:2573] [PG ID 0 PG GUID 0(default_pg) Rank 1] First PG on this rank to signal dumping.
[rank3]:[E407 10:02:26.564157461 ProcessGroupNCCL.cpp:683] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=49210, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800069 milliseconds before timing out.
[rank3]:[E407 10:02:26.564359555 ProcessGroupNCCL.cpp:2241] [PG ID 0 PG GUID 0(default_pg) Rank 3]  failure detected by watchdog at work sequence id: 49210 PG status: last enqueued work: 49210, last completed work: 49209
[rank3]:[E407 10:02:26.564374605 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value
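The two watchdog messages report different collectives: rank 1 timed out on an ALLGATHER at sequence 49209, while rank 3 timed out on an ALLREDUCE at sequence 49210. That may hint that the ranks took different code paths at the end of validation, though the log alone does not prove it. Below is a minimal sketch for extracting those fields from an unwrapped log so the ranks can be compared side by side. The helper names `parse_watchdog_timeouts` and `ranks_disagree` are hypothetical, and the regular expression is only matched against the message format seen in this report.

```python
import re

# Pattern for the watchdog timeout message as it appears in this report.
# Assumption: each log record is on a single line (no terminal wrapping).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+)"
)


def parse_watchdog_timeouts(log_text):
    """Map each rank to the collective it was waiting on when it timed out."""
    timeouts = {}
    for match in WATCHDOG_RE.finditer(log_text):
        timeouts[int(match["rank"])] = {
            "seq": int(match["seq"]),
            "op": match["op"],
            "numel_in": int(match["numel_in"]),
            "numel_out": int(match["numel_out"]),
        }
    return timeouts


def ranks_disagree(timeouts):
    """True if the ranks did not all time out on the same (seq, op) pair."""
    return len({(t["seq"], t["op"]) for t in timeouts.values()}) > 1


# The two timeout records from this report, unwrapped.
SAMPLE_LOG = (
    "[rank1]:[E407 10:02:26.551756518 ProcessGroupNCCL.cpp:683] [Rank 1] "
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=49209, OpType=ALLGATHER, NumelIn=43, NumelOut=172, "
    "Timeout(ms)=1800000) ran for 1800056 milliseconds before timing out.\n"
    "[rank3]:[E407 10:02:26.564157461 ProcessGroupNCCL.cpp:683] [Rank 3] "
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=49210, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, "
    "Timeout(ms)=1800000) ran for 1800069 milliseconds before timing out.\n"
)

if __name__ == "__main__":
    parsed = parse_watchdog_timeouts(SAMPLE_LOG)
    for rank, info in sorted(parsed.items()):
        print(rank, info)
    print("ranks disagree:", ranks_disagree(parsed))
```

Note that for the ALLGATHER, NumelOut (172) is exactly 4 times NumelIn (43), which is consistent with a gather across the 4 GPUs listed under Environment.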

Environment

  • rfdetr=latest
  • num-gpus = 4
  • Driver Version: 580.95.05
  • CUDA Version: 13.0
  • torch: 2.9.1+cu128

Minimal Reproducible Example

from rfdetr import RFDETRBase

model = RFDETRBase()

model.train(
    dataset_dir="IE_Z22_043/images",
    epochs=200,
    batch_size=8,
    grad_accum_steps=4,
    task="segment",
    resolution=624,
    output_dir="rfdetr_ie_z22_043_finetuning",
    progress_bar=True,
    # fp16_eval=True,
    early_stopping=True,
    devices="auto",  # required (see note below)
    strategy="ddp_find_unused_parameters_true",
    persistent_workers=True,
    device="cuda",
    checkpoint_interval=1,
    num_workers=4,
    warmup_epochs=5,
    pin_memory=True,
    resume="rfdetr_ie_z22_043_finetuning/checkpoint_1.ckpt",
)

I ran with torchrun --nproc_per_node=4 train_rfdetr.py
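Since the log says FlightRecorder is disabled, a rerun with extra diagnostics enabled might show which collective each rank was in when it hung. The sketch below sets the relevant environment variables at the very top of the training script, before any process group is created. The helper name `enable_nccl_diagnostics` is hypothetical, and the buffer size of 2000 is an arbitrary non-zero value chosen for illustration, not a recommendation from RF-DETR.

```python
import os


def enable_nccl_diagnostics(env=os.environ, trace_buffer_size=2000):
    """Set PyTorch/NCCL debug variables unless the caller already set them.

    Must run before the distributed process group is initialized,
    i.e. before model.train(...) is called in the training script.
    """
    settings = {
        # Non-zero value enables the FlightRecorder ring buffer (see log message).
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(trace_buffer_size),
        # Ask PyTorch to dump the recorded trace when the watchdog times out.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "1",
        # Verbose NCCL logging.
        "NCCL_DEBUG": "INFO",
        # Extra consistency checks on collectives in torch.distributed.
        "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    }
    for key, value in settings.items():
        env.setdefault(key, value)  # keep values exported in the shell
    return {key: env[key] for key in settings}
```

Calling `enable_nccl_diagnostics()` as the first statement of `train_rfdetr.py` would apply the settings in every process started by torchrun. Exporting the same variables in the shell before the torchrun command should have the same effect.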

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!

Labels

  • bug: Something isn't working
  • question: Further information is requested
