I recently came across some code that was written for training on multiple GPUs. However, since I don't have access to multiple GPUs (on GCP or AWS), my only option is to train on Colab.
So, is it possible to convert code that was written for multi-GPU training into single-GPU code?
I know that the standard practice for multi-GPU training is to change the sampler from a random or sequential sampler to a DistributedSampler.
But after looking at the distributed training code, it seems there are other nuances involved beyond the sampler.
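For the sampler part at least, my understanding is that on a single GPU the DistributedSampler is simply dropped and the DataLoader shuffles on its own, roughly like this (a minimal sketch with a dummy dataset, not code from the repo):

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset just for illustration (the repo builds its own dataset)
train_dataset = TensorDataset(torch.randn(16, 3, 224, 224), torch.randint(0, 2, (16,)))

# multi-GPU version: DistributedSampler gives each process its own shard
# from torch.utils.data.distributed import DistributedSampler
# sampler = DistributedSampler(train_dataset, shuffle=True)
# train_loader = DataLoader(train_dataset, batch_size=4, sampler=sampler)

# single-GPU version: drop the sampler and let the DataLoader shuffle
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)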
For reference, here is the link to the repo that uses distributed training, as well as the code section which I believe might need to be changed.
Repository that uses multi-GPU / distributed training:
https://github.com/DerrickWang005/CRIS.pytorch
Code block which I think might need to be changed so that I can train on a single GPU:
def main_worker(gpu, args):
    args.output_dir = os.path.join(args.output_folder, args.exp_name)
    # local rank & global rank
    args.gpu = gpu
    args.rank = args.rank * args.ngpus_per_node + gpu
    torch.cuda.set_device(args.gpu)
    # logger
    setup_logger(args.output_dir,
                 distributed_rank=args.gpu,
                 filename="train.log",
                 mode="a")
    # dist init
    dist.init_process_group(backend=args.dist_backend,
                            init_method=args.dist_url,
                            world_size=args.world_size,
                            rank=args.rank)
    # wandb
    if args.rank == 0:
        wandb.init(job_type="training",
                   mode="online",
                   config=args,
                   project="CRIS",
                   name=args.exp_name,
                   tags=[args.dataset, args.clip_pretrain])
    dist.barrier()
    # build model
    model, param_list = build_segmenter(args)
    if args.sync_bn:
        model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
    logger.info(model)
    model = nn.parallel.DistributedDataParallel(model.cuda(),
                                                device_ids=[args.gpu],
                                                find_unused_parameters=True)
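Based on my (possibly wrong) understanding, the single-GPU version of that block would drop the process-group init, the barrier, SyncBatchNorm, and the DistributedDataParallel wrapper, and look roughly like the sketch below. setup_logger, build_segmenter, logger and wandb come from the repo; the function name main_worker_single_gpu and everything else is just my guess:

def main_worker_single_gpu(args):
    # hypothetical single-GPU rewrite of main_worker (my own sketch, not from the repo)
    args.output_dir = os.path.join(args.output_folder, args.exp_name)

    # only one process, so no local/global rank bookkeeping
    args.gpu = 0
    torch.cuda.set_device(args.gpu)

    # logger (this is the only process, so it acts as rank 0)
    setup_logger(args.output_dir,
                 distributed_rank=0,
                 filename="train.log",
                 mode="a")

    # no dist.init_process_group() and no dist.barrier() on a single GPU

    # wandb (no rank check needed, there is only one process)
    wandb.init(job_type="training",
               mode="online",
               config=args,
               project="CRIS",
               name=args.exp_name,
               tags=[args.dataset, args.clip_pretrain])

    # build model; SyncBatchNorm only matters across processes, so skip the conversion,
    # and move the model to the GPU without wrapping it in DistributedDataParallel
    model, param_list = build_segmenter(args)
    logger.info(model)
    model = model.cuda()

    # ... the rest of main_worker (optimizer, data loaders, training loop) would follow here

Would something like this work, or are there other distributed-specific pieces elsewhere in the code (besides the DistributedSampler in the data loading) that I would also need to remove?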