How can I use the full VRAM of 2 GPUs? (SlowFast model)

I am currently using the SlowFast model from Facebook.
I am training on two RTX 4060 Ti (16GB) GPUs, and I have successfully parallelized the computation with NCCL on Ubuntu.

However, when the memory consumption exceeds the capacity of one GPU (16GB), the training does not proceed. It seems that only parallel computation within 16GB is happening, instead of utilizing the full 32GB (16+16).
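
As a rough way to reason about this: in data-parallel training (which is what DDP does), each process keeps a full copy of the weights, gradients, and optimizer state on its own GPU, and only the batch is split, so the two 16GB cards are not pooled into one 32GB device. The sketch below is illustrative arithmetic only. The function name and all numbers are hypothetical (not measured from SlowFast), and it assumes activation memory grows linearly with the per-GPU batch size.

```python
def per_gpu_memory_gb(fixed_gb, activation_gb_per_sample, global_batch, num_gpus):
    """Rough per-GPU footprint under data parallelism (illustrative model only).

    fixed_gb: weights + gradients + optimizer state, replicated on every GPU.
    activation_gb_per_sample: activation memory for one sample (assumed linear).
    """
    if global_batch % num_gpus != 0:
        raise ValueError("global_batch must be divisible by num_gpus")
    local_batch = global_batch // num_gpus  # each GPU sees only its share
    return fixed_gb + activation_gb_per_sample * local_batch


# Hypothetical numbers, chosen only to show the scaling.
single = per_gpu_memory_gb(4.0, 1.5, global_batch=8, num_gpus=1)   # 16.0
double = per_gpu_memory_gb(4.0, 1.5, global_batch=8, num_gpus=2)   # 10.0
doubled_batch = per_gpu_memory_gb(4.0, 1.5, global_batch=16, num_gpus=2)  # 16.0
print(single, double, doubled_batch)
```

Under these assumptions, a second GPU lowers the per-GPU footprint for the same global batch, but anything that does not fit on one card (the replicated part plus one GPU's share of the batch) still will not fit.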

For example, when I train the same model on a single GPU on Windows, it consumes 15.6GB of VRAM and training proceeds without any issues. However, when I use two GPUs in an Ubuntu environment, an “out of memory” (OOM) error occurs:

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Reducing the size of the training data allows the training to continue.

It is unclear whether this comes from GPU 0 exceeding its 16GB VRAM limit, or from insufficient system RAM allocated to Ubuntu causing the OOM.
I have heard that approaches such as distributed training and DDP (DistributedDataParallel) exist for this purpose.
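
One way to narrow down which GPU fills up is to poll `nvidia-smi` while training runs. The following is a minimal sketch, not part of SlowFast: the helper names are hypothetical, the sample values are made up, and it assumes `nvidia-smi` is on the PATH when `query_gpu_memory()` is called.

```python
import subprocess

# Ask nvidia-smi for one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn nvidia-smi CSV output into a list of per-GPU dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({"index": index, "used_mib": used, "total_mib": total})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


def fullest_gpu(rows):
    """Return the row of the GPU with the highest used/total ratio."""
    return max(rows, key=lambda r: r["used_mib"] / r["total_mib"])


# Made-up sample in the same format, so the parsing can be tried without a GPU.
sample = "0, 15980, 16380\n1, 9120, 16380\n"
rows = parse_gpu_memory(sample)
print(fullest_gpu(rows)["index"])
```

If GPU 0 is consistently much fuller than GPU 1 just before the crash, the limit being hit is VRAM on that card rather than system RAM.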

In the SlowFast model, there is a setting gpu_id = None, which seems like it can be changed. However, when I set gpu_id=[0,1], it doesn’t work and throws an error.

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File “/home/hookkiring/.venv/lib/python3.10/site-packages/torch/multiprocessing/”, line 69, in _wrap
fn(i, *args)
File “/home/hookkiring/slowfast/slowfast/utils/”, line 60, in run
ret = func(cfg)
File “/home/hookkiring/slowfast/tools/”, line 594, in train
model = build_model(cfg)
File “/home/hookkiring/slowfast/slowfast/models/”, line 67, in build_model
model = model.cuda(device=cur_device)
File “/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/”, line 689, in cuda
return self._apply(lambda t: t.cuda(device))
File “/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/”, line 579, in _apply
File “/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/”, line 579, in _apply
File “/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/”, line 579, in _apply
File “/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/”, line 602, in _apply
param_applied = fn(param)
File “/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/”, line 689, in <lambda>
return self._apply(lambda t: t.cuda(device))
TypeError: cuda(): argument ‘device’ (position 1) must be torch.device, not list
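
The TypeError above says that `cuda()` received a list where a single device was expected, which matches the `gpu_id (Optional[int])` note in the docstring of `build_model`. As a minimal sketch, a check like the following would reject a list early with a clearer message. `resolve_device_index` and its `local_rank` parameter are hypothetical and not part of SlowFast, where the fallback is `torch.cuda.current_device()`.

```python
def resolve_device_index(gpu_id, local_rank):
    """Return one integer GPU index for the current process.

    gpu_id: None or a single int, as the build_model docstring describes.
    local_rank: index of this process on the node (hypothetical parameter,
        standing in for torch.cuda.current_device()).
    """
    if gpu_id is None:
        return local_rank
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(gpu_id, bool) or not isinstance(gpu_id, int):
        raise TypeError(
            f"gpu_id must be an int or None, got {type(gpu_id).__name__}"
        )
    return gpu_id


print(resolve_device_index(None, 1))  # falls back to the process's own rank
print(resolve_device_index(0, 1))     # an explicit single index is used as-is
```

With one process per GPU, each process moves its replica to exactly one device, so passing `[0, 1]` cannot make a single replica span both cards.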

The registered object will be called with `obj(cfg)`.
The call should return a `torch.nn.Module` object.

def build_model(cfg, gpu_id=None):
    """
    Builds the video model.
    Args:
        cfg (configs): configs that contains the hyper-parameters to build the
        backbone. Details can be seen in slowfast/config/
        gpu_id (Optional[int]): specify the gpu index to build model.
    """
    if torch.cuda.is_available():
        assert (
            cfg.NUM_GPUS <= torch.cuda.device_count()
        ), "Cannot use more GPU devices than available"
    else:
        assert (
            cfg.NUM_GPUS == 0
        ), "Cuda is not available. Please set `NUM_GPUS: 0 for running on CPUs."

    # Construct the model
    name = cfg.MODEL.MODEL_NAME
    model = MODEL_REGISTRY.get(name)(cfg)

    if cfg.BN.NORM_TYPE == "sync_batchnorm_apex":
        try:
            import apex
        except ImportError:
            raise ImportError("APEX is required for this model, please install")

        logger.info("Converting BN layers to Apex SyncBN")
        process_group = apex.parallel.create_syncbn_process_group(
            group_size=cfg.BN.NUM_SYNC_DEVICES
        )
        model = apex.parallel.convert_syncbn_model(
            model, process_group=process_group
        )
    if cfg.NUM_GPUS:
        if gpu_id is None:
            # Determine the GPU used by the current process
            cur_device = torch.cuda.current_device()
        else:
            # Use the single GPU index passed in by the caller
            cur_device = gpu_id
        # Transfer the model to the current GPU device
        model = model.cuda(device=cur_device)
    # Use multi-process data parallel model in the multi-gpu setting
    if cfg.NUM_GPUS > 1:
        # Make model replica operate on the current device
        model = torch.nn.parallel.DistributedDataParallel(
            module=model,
            device_ids=[cur_device],
            output_device=cur_device,
            find_unused_parameters=True
            if cfg.MODEL.DETACH_FINAL_FC
            or cfg.MODEL.MODEL_NAME == "ContrastiveModel"
            else False,
        )
        if cfg.MODEL.FP16_ALLREDUCE:
            model.register_comm_hook(
                state=None, hook=comm_hooks_default.fp16_compress_hook
            )
    return model
