Image Tensors Return As Zero When num_workers > 0

jobayer · May 5, 2025, 5:05am

Hello, I am facing an issue with multiprocessing. I am trying to load my .pt data through DataLoaders. Everything works fine when I set num_workers = 0, but when I set it to a value greater than 0, the tensor values become zero. Below is my code to load the data. The tensor shapes are all correct; only the value range of the images/tensors is [0, 0]. Could you please suggest what is wrong here? Thank you in advance.

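import torch
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader, random_split

class SRDataset(Dataset):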
    def __init__(self, hr_tensors, lr_tensors, transform=None):
        self.hr_tensors = hr_tensors
        self.lr_tensors = lr_tensors
        self.transform = transform

        assert len(self.hr_tensors) == len(self.lr_tensors), \
            "Number of HR and LR images must be equal"

    def __len__(self):
        return len(self.hr_tensors)

    def __getitem__(self, idx):
        hr_img = self.hr_tensors[idx]
        lr_img = self.lr_tensors[idx]

        if self.transform:
            hr_img = self.transform(hr_img)
            lr_img = self.transform(lr_img)

        return {'lr': lr_img, 'hr': hr_img}

def create_sr_datasets_and_loaders(hr_tensors, lr_tensors, batch_size=32, transform=None):

    full_dataset = SRDataset(hr_tensors, lr_tensors, transform)

    total_size = len(full_dataset)
    train_size = int(0.8 * total_size)
    val_size = int(0.1 * total_size)
    test_size = total_size - train_size - val_size

    train_dataset, val_dataset, test_dataset = random_split(
        full_dataset,
        [train_size, val_size, test_size],
        generator=torch.Generator().manual_seed(seed)  # assumes `seed` is defined earlier (not shown)
    )

    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=1,
        drop_last=True,
    )

    val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=1,
        drop_last=True,
    )

    test_loader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=1,
        drop_last=True,
    )

    return {
        'datasets': {
            'train': train_dataset,
            'val': val_dataset,
            'test': test_dataset
        },
        'loaders': {
            'train': train_loader,
            'val': val_loader,
            'test': test_loader
        }
    }

data = create_sr_datasets_and_loaders(
    hr_tensors,
    lr_tensors,
    batch_size=64,
    transform=None
)

train_loader = data['loaders']['train']
val_loader = data['loaders']['val']
test_loader = data['loaders']['test']

print(f"Total images: {len(hr_tensors)}")
print(f"Training set: {len(data['datasets']['train'])} images")
print(f"Validation set: {len(data['datasets']['val'])} images")
print(f"Test set: {len(data['datasets']['test'])} images")

train_batch = next(iter(train_loader))
print(f"\nTrain batch LR shape: {train_batch['lr'].shape}")
print(f"Train batch HR shape: {train_batch['hr'].shape}")

lr_sample = train_batch['lr'][0]
hr_sample = train_batch['hr'][0]

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.imshow(lr_sample.squeeze().cpu().numpy(), cmap='gray')
plt.title(f"Low Resolution Image (Shape: {lr_sample.shape})")
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(hr_sample.squeeze().cpu().numpy(), cmap='gray')
plt.title(f"High Resolution Image (Shape: {hr_sample.shape})")
plt.axis('off')

plt.tight_layout()
plt.show()

print(f"LR Image - Min: {lr_sample.min().item():.4f}, Max: {lr_sample.max().item():.4f}")
print(f"HR Image - Min: {hr_sample.min().item():.4f}, Max: {hr_sample.max().item():.4f}")

Initially, I had problems loading the data when I set num_workers > 0. LLMs then suggested adding the following lines of code, after which it ran successfully. But the images are still all black.

import dill
import pickle
import torch.multiprocessing as mp
import multiprocessing.reduction as reduction

reduction.ForkingPickler.dumps = dill.dumps
mp.set_start_method("spawn", force=True)
torch.multiprocessing.set_sharing_strategy("file_system")
pickle.HIGHEST_PROTOCOL = 5

For your information, I am using Ubuntu 22.04.5 LTS under WSL on Windows 11, and I am testing the code in a Jupyter notebook, not a Python file.

ptrblck · May 5, 2025, 1:32pm

What kind of errors were you running into before adding these lines suggested by the LLM?

jobayer · May 5, 2025, 1:36pm

This error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

ptrblck · May 5, 2025, 6:29pm

It seems you are trying to load CUDATensors inside your Dataset. Did you try to load and process the data on the CPU first and move it to the GPU inside the training loop?
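
A minimal sketch of that suggestion, assuming the tensors can be held in host memory: the Dataset returns CPU tensors only, and each batch is moved to the GPU inside the loop. The names `CpuSRDataset`, `make_cpu_loader`, and `batch_value_range` are hypothetical and do not come from the original code.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class CpuSRDataset(Dataset):
    """Hypothetical variant of SRDataset that stores CPU copies only."""

    def __init__(self, hr_tensors, lr_tensors):
        assert len(hr_tensors) == len(lr_tensors), \
            "Number of HR and LR images must be equal"
        # .cpu() returns the tensor unchanged if it already lives on the CPU.
        self.hr_tensors = [t.detach().cpu() for t in hr_tensors]
        self.lr_tensors = [t.detach().cpu() for t in lr_tensors]

    def __len__(self):
        return len(self.hr_tensors)

    def __getitem__(self, idx):
        return {"lr": self.lr_tensors[idx], "hr": self.hr_tensors[idx]}


def make_cpu_loader(hr_tensors, lr_tensors, batch_size=32, num_workers=0):
    """Build a DataLoader whose workers never touch CUDA."""
    dataset = CpuSRDataset(hr_tensors, lr_tensors)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        # Pinned host memory only helps when a GPU is present.
        pin_memory=torch.cuda.is_available(),
    )


def batch_value_range(loader, device):
    """Move one batch to `device`, as a training loop would, and report its range."""
    batch = next(iter(loader))
    lr = batch["lr"].to(device, non_blocking=True)
    hr = batch["hr"].to(device, non_blocking=True)
    return lr.min().item(), lr.max().item(), hr.min().item(), hr.max().item()
```

With the dataset holding only CPU tensors, `num_workers` can be raised without the worker processes initializing CUDA. Calling `batch_value_range(make_cpu_loader(hr_tensors, lr_tensors, num_workers=2), torch.device("cuda"))` from a script is one way to check whether the range is still [0, 0].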

jobayer · May 6, 2025, 4:08am

Yes, I tried the following code, which gave me the error below, and VS Code's Jupyter server keeps loading (and loading… forever).

Code:

  def __getitem__(self, idx):
      hr_img = self.hr_tensors[idx]
      lr_img = self.lr_tensors[idx]

      if self.transform:
          hr_img = self.transform(hr_img).cpu()
          lr_img = self.transform(lr_img).cpu()

Error:

Total images: 484
Training set: 387 images
Validation set: 48 images
Test set: 49 images
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/jobayer/miniforge3/envs/sisr/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jobayer/miniforge3/envs/sisr/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'SRDataset' on <module '__main__' (built-in)>
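
The AttributeError above is consistent with how the spawn start method works: the worker is a fresh interpreter that receives the dataset through pickle, and pickle stores a class by its module and name rather than by its code. A class defined in a notebook cell lives only in the parent's `__main__`, so the worker cannot look it up. The sketch below reproduces the same kind of failure with the standard library only; `notebook_main` and `ToyDataset` are hypothetical stand-ins, not names from the original code.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the notebook's __main__ module.
notebook_main = types.ModuleType("notebook_main")
sys.modules["notebook_main"] = notebook_main


class ToyDataset:
    def __init__(self, items):
        self.items = items


# Pretend the class was defined in that module, as a notebook cell would do.
ToyDataset.__module__ = "notebook_main"
ToyDataset.__qualname__ = "ToyDataset"
notebook_main.ToyDataset = ToyDataset

# The payload carries the module and class names, not the class body.
payload = pickle.dumps(ToyDataset([1, 2, 3]))

# Simulate the spawned worker: the module exists, but the class is not in it.
del notebook_main.ToyDataset


def try_unpickle(data):
    """Return 'ok' on success, or a short description of the lookup failure."""
    try:
        pickle.loads(data)
        return "ok"
    except AttributeError as exc:
        return f"AttributeError: {exc}"


print(try_unpickle(payload))
```

Moving the Dataset class into an importable .py file, so that the worker can import it by name, avoids this particular lookup failure.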

I suspected that this was because I was running the code in a notebook. So I wrote standalone Python files for the code, created a train.py file where I imported the datasets and other files inside the __name__ == "__main__" block, and finally tried to show the shape of the datasets as I did in the notebook. Then I ran train.py from the terminal, which gave me the following message (I am not sure whether this is an error; I googled it and saw it described as a warning somewhere):

Total images: 484
Training set: 387 images
Validation set: 48 images
Test set: 49 images
Train batch LR shape: torch.Size([32, 1, 80, 80])
Train batch HR shape: torch.Size([32, 1, 240, 240])
[W506 09:59:15.333625697 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

And most importantly, the tensor range was still the same: [0, 0].
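
The CudaIPCTypes warning suggests that CUDA tensors are still being passed between the worker and the main process, so it may be worth confirming where `hr_tensors` and `lr_tensors` live before they reach the Dataset. A minimal sketch, assuming each .pt file holds a single stacked tensor: `load_pt_on_cpu` and `describe` are hypothetical helpers, while `map_location` is the standard `torch.load` argument for remapping saved storages to another device.

```python
import torch


def load_pt_on_cpu(path):
    """Load a saved tensor file, remapping any CUDA storages to the CPU."""
    return torch.load(path, map_location="cpu")


def describe(name, tensors):
    """Return a one-line summary of the device and value range of a tensor."""
    return (
        f"{name}: device={tensors.device}, "
        f"min={tensors.min().item():.4f}, max={tensors.max().item():.4f}"
    )
```

If `describe("hr", hr_tensors)` reports a `cuda` device for the tensors currently passed to the Dataset, reloading them with `load_pt_on_cpu` before building the loaders would keep the workers on the CPU.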