I am running this repo on my system (a MIG device with 80GB of memory) to train a classifier. I want to pass a list of class weights to the loss function in train.py, as below:
if __name__ == '__main__':
    args = parse_args()
    os.makedirs(args.model_dir, exist_ok=True)
    os.makedirs(args.log_dir, exist_ok=True)
    os.environ['CUDA_VISIBLE_DEVICES'] = args.CUDA_VISIBLE_DEVICES

    if args.enet_type == 'resnest101':
        ModelClass = Resnest_Melanoma
    elif args.enet_type == 'seresnext101':
        ModelClass = Seresnext_Melanoma
    elif 'efficientnet' in args.enet_type:
        ModelClass = Effnet_Melanoma
    else:
        raise NotImplementedError()

    DP = len(os.environ['CUDA_VISIBLE_DEVICES']) > 1

    set_seed()
    device = torch.device('cuda')
    c_weights = torch.tensor([0.954, 0.8274, 0.852, 0.987, 0.967, 0.986, 0.735, 0.687]).float().to(device)

    # loss function with per-class weights
    criterion = nn.CrossEntropyLoss(weight=c_weights)

    main()
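For reference, this is how I intend the weights to be used in the loss; this part works as long as the tensor stays on the CPU (the logits/targets here are just dummy placeholders to show the call):

```python
import torch
import torch.nn as nn

# Per-class weights for the 8 output classes (out-dim = 8)
c_weights = torch.tensor([0.954, 0.8274, 0.852, 0.987,
                          0.967, 0.986, 0.735, 0.687])

# Weighted loss; on GPU, c_weights must be on the same device as the logits
criterion = nn.CrossEntropyLoss(weight=c_weights)

logits = torch.randn(4, 8)             # dummy batch of 4 predictions
targets = torch.randint(0, 8, (4,))    # dummy class labels
loss = criterion(logits, targets)
print(loss.item())
```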
When I create the tensor on the CPU it works just fine (but I need to use the GPU because of my images). However, when I send the tensor c_weights to the GPU, the training process gets stuck without returning any errors. I tried different methods of creating the tensor and different dtypes; none of them worked. This is what it shows in the output (the process gets stuck at 2 and does not progress):
I tried smaller batch sizes as well. No improvement.
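To separate the problem from the rest of the repo, the transfer can be isolated in a minimal snippet like the one below (plain torch only; the MIG UUID is the one I pass via --CUDA_VISIBLE_DEVICES). If this also stalls, the issue is the CUDA/MIG setup rather than train.py:

```python
import os
import torch

# Restrict torch to the MIG slice, as train.py does
os.environ['CUDA_VISIBLE_DEVICES'] = 'MIG-cdc1351f-1b7a-554c-a273-f7643f99523f'

w = torch.tensor([0.954, 0.8274, 0.852, 0.987,
                  0.967, 0.986, 0.735, 0.687])
print(w.dtype, w.shape)        # float32 tensor of shape [8] on CPU

# The transfer that stalls in train.py
if torch.cuda.is_available():
    w = w.to(torch.device('cuda'))
    print(w.device)
```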
Here is my GPU information:
GPU Driver Version: 510.85.02
CUDA Version: 11.6
Memory: 81069MiB
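For completeness, a quick way to confirm that torch sees the MIG instance at all, and which CUDA version the installed wheel was built against (it should be compatible with the 11.6 driver above):

```python
import torch

print(torch.__version__)
print(torch.version.cuda)           # CUDA version the wheel was built against
print(torch.cuda.device_count())    # expect 1 for a single MIG slice
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```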
How can I solve this issue?
Edit:
My args are:
--kernel-type = 8c_b3_768_512_18ep
--data-folder = 512
--image-size = 512
--enet-type = efficientnet_b3
--batch-size = 32
--num-workers = 32
--out-dim = 8
--CUDA_VISIBLE_DEVICES = MIG-cdc1351f-1b7a-554c-a273-f7643f99523f
--fold = 0