Segmentation Fault/Timeout using Dataloaders

Hi all,

I’m currently working on a Pix2Pix GAN and I’m running into unexpected problems on my Ubuntu Linux machine (24 GB GPU + 16-core CPU).
My dataset class does nothing more than load images and masks from disk. If a mask is not found, an empty one of a fixed size is generated. The other parameters of the class are various transformation objects for augmenting the data.
However, my problem is that after the validation DataLoader has run once, the training DataLoader gets stuck and sometimes throws a segmentation fault. I am using PyTorch 1.12.1 with num_workers=4 for the training DataLoader and num_workers=0 for the validation DataLoader.
If I also set num_workers to 0 for the training DataLoader, everything works as it should.
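For context, the dataset class is essentially the following (a simplified sketch — the folder layout and file matching are assumptions for illustration, and the input/target-only transforms are left out):

import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset

class Pix2PixDataset(Dataset):
    def __init__(self, root_dir, init_transform_train, init_transform_test,
                 transform_only_input, transform_only_target, target_image_size):
        self.root_dir = root_dir
        # hypothetical layout: <root_dir>/images/<name> and <root_dir>/masks/<name>
        self.files = sorted(os.listdir(os.path.join(root_dir, "images")))
        self.init_transform_train = init_transform_train
        self.init_transform_test = init_transform_test
        self.transform_only_input = transform_only_input
        self.transform_only_target = transform_only_target
        self.target_image_size = target_image_size

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        name = self.files[index]
        image = np.array(Image.open(os.path.join(self.root_dir, "images", name)).convert("RGB"))
        mask_path = os.path.join(self.root_dir, "masks", name)
        if os.path.exists(mask_path):
            mask = np.array(Image.open(mask_path).convert("RGB"))
        else:
            # if no mask exists on disk, generate an empty one of the target size
            mask = np.zeros((self.target_image_size, self.target_image_size, 3), dtype=np.uint8)
        # joint augmentation via albumentations (additional_targets={'image0': 'image'})
        augmented = self.init_transform_train(image=image, image0=mask)
        return augmented["image"], augmented["image0"]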

import torch
from torch.utils.data import DataLoader
from sklearn.model_selection import ShuffleSplit
from tqdm import tqdm

init_transform_train, init_transform_test, transform_only_input, transform_only_target = get_transforms()

# split between training and validation
total_dataset = Pix2PixDataset(root_dir=args.data_dir, init_transform_train=init_transform_train,
                               init_transform_test=init_transform_test,
                               transform_only_input=transform_only_input,
                               transform_only_target=transform_only_target,
                               target_image_size=256)

rs = ShuffleSplit(n_splits=1, test_size=.1, random_state=123)

train_indices, val_indices = next(rs.split(total_dataset))
train_dataset = torch.utils.data.Subset(total_dataset, train_indices)
val_dataset = torch.utils.data.Subset(total_dataset, val_indices)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False)

# first pass over the training data - works
loop = tqdm(train_loader, leave=True)
for idx, (x, y) in enumerate(loop):
    x, y = x.to(device), y.to(device)

# second pass over the training data - works
loop = tqdm(train_loader, leave=True)
for idx, (x, y) in enumerate(loop):
    x, y = x.to(device), y.to(device)

# validation pass (num_workers=0) - works
loop = tqdm(val_loader, leave=True)
for idx, (x, y) in enumerate(loop):
    x, y = x.to(device), y.to(device)

# third pass over the training data
loop = tqdm(train_loader, leave=True)
for idx, (x, y) in enumerate(loop):  # <- seems to get stuck here
    x, y = x.to(device), y.to(device)

As you can see in the image below, the fourth loop is never executed. I am not using cv2 directly (which is sometimes mentioned as a root cause for such problems); for image loading I am using PIL.

On my Windows machine I am not able to reproduce the problem. Do you have an idea what the problem might be? When I switch the roles of the train and val loaders I get a similar problem (I am not allowed to post another media item, but the following output should give the idea):
Val-Loader successful
Val-Loader successful
Train-Loader gets stuck

Edit:
The problem also occurs when the training and validation indices are identical!

Edit2:
After some hours of debugging I might have identified the problem.
One of my image transformations stops working:

import cv2
import albumentations as A
from albumentations.pytorch import ToTensorV2

init_transform_train = A.Compose(
    [
        A.Resize(width=default_config['TARGET_IMAGE_SIZE'], height=default_config['TARGET_IMAGE_SIZE']),
        A.PadIfNeeded(min_height=default_config['ZERO_PADDED_IMAGE_SIZE'],
                      min_width=default_config['ZERO_PADDED_IMAGE_SIZE'],
                      value=0,
                      p=1.0,
                      border_mode=cv2.BORDER_CONSTANT),
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        ToTensorV2()
    ], additional_targets={'image0': 'image'}
)

I have no clue why, but this transformation stops working the first time it is used in the second epoch. Might be a bug.
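If you want to narrow something like this down yourself, a small timing wrapper around the transforms (just a debugging sketch, not part of the actual training code) makes a hanging call visible per worker process:

import os
import time

class TimedTransform:
    # wraps an albumentations transform and logs how long each call takes,
    # so a call that hangs inside a DataLoader worker is easy to spot
    def __init__(self, transform, name):
        self.transform = transform
        self.name = name

    def __call__(self, **kwargs):
        start = time.time()
        result = self.transform(**kwargs)
        print(f"[pid {os.getpid()}] {self.name}: {time.time() - start:.3f}s", flush=True)
        return result

# e.g. init_transform_train = TimedTransform(init_transform_train, "init_transform_train")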

The discussion in Albumentations blocks CPU by spawning multiple threads · Issue #1246 · albumentations-team/albumentations · GitHub has “solved” the problem.
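In case someone else runs into this: the kind of workaround discussed in that direction is to keep OpenCV from spawning its own thread pool inside the DataLoader workers, along these lines (a sketch of the commonly suggested mitigation, not a verified fix for every setup):

import cv2
from torch.utils.data import DataLoader

# keep OpenCV single-threaded so its internal thread pool does not
# deadlock inside the forked DataLoader worker processes
cv2.setNumThreads(0)
cv2.ocl.setUseOpenCL(False)

# or apply it per worker when the workers are started
def worker_init_fn(worker_id):
    cv2.setNumThreads(0)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=4, worker_init_fn=worker_init_fn)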