I am currently using Facebook's SlowFast.
I am training on two RTX 4060 Ti (16GB) GPUs, and I have successfully parallelized the computation using NCCL on Ubuntu.
However, when memory consumption exceeds the capacity of one GPU (16GB), training does not proceed. It seems that each GPU is limited to its own 16GB, rather than the two being combined into a full 32GB (16+16).
For example, when training the same model on a single GPU on Windows, it consumes 15.6GB of VRAM and training proceeds without any issues. However, when using two GPUs on Ubuntu, an “out of memory” (OOM) error occurs:
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Reducing the size of the training data allows the training to continue.
I am not sure whether this is caused by GPU 0 exceeding its 16GB of VRAM, or by insufficient system RAM allocated to Ubuntu.
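To narrow down which of the two it is, something like the following could be run in the training process just before the failing step. This is only a rough sketch: `gpu_memory_report` is a hypothetical helper name, not something from SlowFast, and it returns an empty list when PyTorch or CUDA is not available.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def gpu_memory_report():
    """Return one dict per visible GPU with free/total/allocated memory in GiB."""
    if torch is None or not torch.cuda.is_available():
        return []
    gib = 1024 ** 3
    report = []
    for idx in range(torch.cuda.device_count()):
        # Free and total memory on the device, as seen by the CUDA driver
        free_bytes, total_bytes = torch.cuda.mem_get_info(idx)
        report.append(
            {
                "gpu": idx,
                "name": torch.cuda.get_device_name(idx),
                "free_gib": round(free_bytes / gib, 2),
                "total_gib": round(total_bytes / gib, 2),
                # Memory held by tensors allocated from this process only
                "allocated_gib": round(torch.cuda.memory_allocated(idx) / gib, 2),
            }
        )
    return report


if __name__ == "__main__":
    for row in gpu_memory_report():
        print(row)
```

If `free_gib` on one GPU is close to zero right before the crash while system RAM is still available, that would point to VRAM on that GPU rather than host memory.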
I have heard about approaches such as distributed training and DDP (DistributedDataParallel) for this purpose.
In SlowFast's build_model, there is a parameter gpu_id=None, which looks like it can be changed. However, when I set gpu_id=[0,1], it does not work and throws the following error:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/hookkiring/.venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/hookkiring/slowfast/slowfast/utils/multiprocessing.py", line 60, in run
    ret = func(cfg)
  File "/home/hookkiring/slowfast/tools/train_net.py", line 594, in train
    model = build_model(cfg)
  File "/home/hookkiring/slowfast/slowfast/models/build.py", line 67, in build_model
    model = model.cuda(device=cur_device)
  File "/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 689, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
    module._apply(fn)
  File "/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 602, in _apply
    param_applied = fn(param)
  File "/home/hookkiring/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 689, in <lambda>
    return self._apply(lambda t: t.cuda(device))
TypeError: cuda(): argument 'device' (position 1) must be torch.device, not list
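The TypeError suggests that the device argument has to be a single device rather than a list, which matches the `gpu_id (Optional[int])` note in the build_model docstring. A minimal sketch of a check that makes this explicit before the model is moved; `normalize_gpu_id` is a hypothetical helper for illustration, not part of SlowFast or PyTorch:

```python
from typing import Optional


def normalize_gpu_id(gpu_id) -> Optional[int]:
    """Return gpu_id as a single non-negative int (or None), rejecting lists."""
    if gpu_id is None:
        # None means "use the current device of this process"
        return None
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(gpu_id, bool) or not isinstance(gpu_id, int):
        raise TypeError(
            f"gpu_id must be a single int or None, got {type(gpu_id).__name__}"
        )
    if gpu_id < 0:
        raise ValueError(f"gpu_id must be non-negative, got {gpu_id}")
    return gpu_id


if __name__ == "__main__":
    print(normalize_gpu_id(1))
    try:
        normalize_gpu_id([0, 1])
    except TypeError as err:
        print(f"rejected: {err}")
```

With a check like this, gpu_id=[0,1] would fail early with a clearer message instead of deep inside `model.cuda(...)`.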
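For reference, this is the relevant part of slowfast/models/build.py (the first lines are the tail of the preceding docstring):
The registered object will be called with `obj(cfg)`.
The call should return a `torch.nn.Module` object.
"""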
def build_model(cfg, gpu_id=None):
    """
    Builds the video model.
    Args:
        cfg (configs): configs that contains the hyper-parameters to build the
        backbone. Details can be seen in slowfast/config/defaults.py.
        gpu_id (Optional[int]): specify the gpu index to build model.
    """
    if torch.cuda.is_available():
        assert (
            cfg.NUM_GPUS <= torch.cuda.device_count()
        ), "Cannot use more GPU devices than available"
    else:
        assert (
            cfg.NUM_GPUS == 0
        ), "Cuda is not available. Please set `NUM_GPUS: 0` for running on CPUs."
    # Construct the model
    name = cfg.MODEL.MODEL_NAME
    model = MODEL_REGISTRY.get(name)(cfg)
    if cfg.BN.NORM_TYPE == "sync_batchnorm_apex":
        try:
            import apex
        except ImportError:
            raise ImportError("APEX is required for this model, please install")
        logger.info("Converting BN layers to Apex SyncBN")
        process_group = apex.parallel.create_syncbn_process_group(
            group_size=cfg.BN.NUM_SYNC_DEVICES
        )
        model = apex.parallel.convert_syncbn_model(
            model, process_group=process_group
        )
    if cfg.NUM_GPUS:
        if gpu_id is None:
            # Determine the GPU used by the current process
            cur_device = torch.cuda.current_device()
        else:
            cur_device = gpu_id
        # Transfer the model to the current GPU device
        model = model.cuda(device=cur_device)
    # Use multi-process data parallel model in the multi-gpu setting
    if cfg.NUM_GPUS > 1:
        # Make model replica operate on the current device
        model = torch.nn.parallel.DistributedDataParallel(
            module=model,
            device_ids=[cur_device],
            output_device=cur_device,
            find_unused_parameters=True
            if cfg.MODEL.DETACH_FINAL_FC
            or cfg.MODEL.MODEL_NAME == "ContrastiveModel"
            else False,
        )
        if cfg.MODEL.FP16_ALLREDUCE:
            model.register_comm_hook(
                state=None, hook=comm_hooks_default.fp16_compress_hook
            )
    return model
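In the code above, each process moves the whole model to its own device and wraps it in DistributedDataParallel with a single entry in device_ids. DDP keeps a full model replica per process and splits only the data, so the two 16GB cards are not merged into one 32GB pool. The sketch below illustrates the batch arithmetic under the assumption that a global batch size is divided evenly across GPUs; `per_gpu_batch_size` and the numbers are hypothetical and not taken from SlowFast.

```python
def per_gpu_batch_size(global_batch_size: int, num_gpus: int) -> int:
    """Split a global batch evenly across data-parallel replicas."""
    if global_batch_size <= 0:
        raise ValueError("global_batch_size must be positive")
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    # With 0 GPUs (CPU run) a single process handles the whole batch
    replicas = max(1, num_gpus)
    if global_batch_size % replicas != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by {replicas} replicas"
        )
    return global_batch_size // replicas


if __name__ == "__main__":
    # Hypothetical numbers: each GPU holds a full model copy plus the
    # activations for its own share of the batch.
    for gpus in (1, 2):
        print(gpus, "GPU(s):", per_gpu_batch_size(8, gpus), "samples per GPU")
```

Under that assumption, per-GPU memory only drops when the global batch size stays the same as in the single-GPU run; if the per-GPU batch is unchanged, each card needs about as much memory as before, plus some overhead for DDP's communication buffers.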