Difficulty using multiprocessing / num_workers

Hi all,

I’ve been running my code with num_workers=0 until recently. To see if I can speed things up, I’ve been experimenting with a higher number of worker processes. To get num_workers>0 to run at all, I had to call torch.multiprocessing.set_start_method('spawn').
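
Stripped down, the pattern I’m using looks roughly like this (a minimal sketch with a placeholder in-memory dataset, just to show the call order; my real dataset is the ImageFolder further down):

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

if __name__ == '__main__':
    # 'spawn' has to be set before the first DataLoader with num_workers > 0 is created
    mp.set_start_method('spawn')

    # placeholder dataset, only to illustrate the call pattern
    dataset = TensorDataset(torch.randn(16, 1, 256, 256))
    loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=4)

    for (batch,) in loader:
        pass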

Now I can train my model, but between every batch I get the following warning:
[W CudaIPCTypes.cpp:22] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Unfortunately I haven’t been able to find much information about this warning, but I suspect I’m violating some multiprocessing rule somewhere.

My dataset is an ImageFolder, and my loading code is below:

import torch
from torchvision import datasets, transforms

def load_dataset(size_batch, size):
    data_path = "/home/bledc/dataset/test_set/kodak"

    transformations = transforms.Compose([
        transforms.Grayscale(num_output_channels=1),
        transforms.RandomCrop(size, pad_if_needed=True, padding_mode='reflect'),
        transforms.Resize(size),
        transforms.ToTensor()
        ])

    train_dataset = datasets.ImageFolder(
        root=data_path,
        transform=transformations
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=size_batch,
        shuffle=True,
        num_workers=0  # set to >0 for the multi-worker runs described above
    )
    return train_loader

Below is my main:

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
    device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
    if (device.type == "cuda"):
        torch.set_default_tensor_type('torch.cuda.FloatTensor')
    now = datetime.now()
    current_time = now.strftime("%H_%M_%S")

    path = "/home/bledc/my_remote_folder/denoiser/models/res18_broad_win16_mar29_noise_map{}".format(current_time)
    text_path = path+"/"+current_time+".txt"

    os.mkdir(path)
    txt_data = open(text_path, "w+")
    txt_data.close()

    width = 256
    # height = 256
    num_epochs = 1000
    batch_size = 1
    learning_rate = 0.001

    data_loader = load_dataset(batch_size, width)
    print(device)

    model = UNetWithResnet50Encoder()
    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model = nn.DataParallel(model)
    model.to(device)
    criterion = MS_SSIM_L1_LOSS()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)

    loss = 9
    start_time = time.perf_counter()
    for i in range(0, num_epochs+1):
        train_loss = train_gray(i, data_loader, device, model, criterion, optimizer, i, path)

And lastly, my training loop:

def train_gray(epoch, data_loader, device, model, criterion, optimizer, i, path):
    train_loss = 0.0
    start_time = time.perf_counter()
    # torch.autograd.set_detect_anomaly(True)

    for data in data_loader:
        # torch.cuda.empty_cache()
        img, _ = data
        img = img.to(device)

        stand_dev = 0.0196
        noisy_img = add_noise(img, stand_dev, device)
        output = model(noisy_img, stand_dev)

        loss = criterion(output, img)

        optimizer.zero_grad()

        loss.backward()
        nn.utils.clip_grad_value_(model.parameters(), 50)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 50)
        optimizer.step()

        train_loss += loss.item()

        end_time = time.perf_counter()
        report_timer = end_time - start_time
        if(report_timer > 600):
            print('last_loss: {:.6f}, \t current epoch: {}'.format(
                loss.item(),
                epoch
                ))
            start_time = time.perf_counter()
    train_loss = train_loss/len(data_loader)

    return train_loss

If anyone could point me in the right direction, that would be greatly appreciated! Thank you.

I guess this line of code:

torch.set_default_tensor_type('torch.cuda.FloatTensor')

might be problematic, as it would create CUDA tensors inside the Dataset and thus inside each worker process, which can then fail because multiprocessing ends up with multiple CUDA contexts.
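
A minimal sketch of the alternative, reusing the names from your code: leave the default tensor type on the CPU and create or move tensors on the device explicitly.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
stand_dev = 0.0196  # same value as in your training loop

# with no torch.set_default_tensor_type('torch.cuda.FloatTensor') anywhere, the
# Dataset and the DataLoader workers only ever create CPU tensors
for img, _ in data_loader:
    img = img.to(device)  # the host-to-device copy happens here instead
    noise = torch.randn(img.size(), device=device) * stand_dev
    noisy_img = img + noise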

Thanks @ptrblck, that seems to have done the trick. Sadly there are no huge time savings, but my curiosity has been satisfied. I think I need to create a new dataset with my transformations already applied (rough idea sketched below). Thanks in general for all the useful info on the forum.
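
In case it’s useful to anyone else, what I have in mind is roughly this (untested sketch, the class name is just a placeholder): run the transforms once, cache the resulting CPU tensors, and serve them from a plain Dataset so the workers have no per-item work left to do.

from torch.utils.data import Dataset

class PreTransformedDataset(Dataset):
    """Runs the ImageFolder transforms once up front and serves cached CPU tensors."""
    def __init__(self, image_folder_dataset):
        # indexing the ImageFolder applies its transform, so each cached item
        # is already a (tensor, label) pair
        self.samples = [image_folder_dataset[i] for i in range(len(image_folder_dataset))]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

One caveat: caching also freezes the RandomCrop, so each image keeps a single fixed crop.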