Dataset creation for noisy data

Hi,

I am trying to create a noisy dataset for ML. Here’s what I did:

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
import matplotlib.pyplot as plt

mnist_train = MNIST('../data/MNIST', download=True,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                    ]), train=True)

mnist_test = MNIST('../data/MNIST', download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                   ]), train=False)

And this:

def add_noise_sp(image, sd, amount=0.2):
    # Salt-and-pepper style noise; deterministic for a given seed sd.
    np.random.seed(seed=sd)

    low_clip = 0.13
    std = 0.31

    image = np.asarray(image)
    out = image.copy()
    p = amount  # probability that a pixel is corrupted at all
    q = 0.5     # probability that a corrupted pixel is "salted"
    flipped = np.random.choice([True, False], size=image.shape,
                               p=[p, 1 - p])
    salted = np.random.choice([True, False], size=image.shape,
                              p=[q, 1 - q])
    out[flipped & salted] = low_clip - 2 * std

    return torch.tensor(out)

s = 1  # global seed counter for the training set
class SyntheticNoiseDatasetsalttr(Dataset):
    def __init__(self, data, mode='train'):
        self.mode = mode
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        global s
        img = self.data[index][0]
        s = s + 1  # advances on every access, independent of index
        return add_noise_sp(img, s), img
    
b = 3425  # global seed counter for the test set
class SyntheticNoiseDatasetsaltte(Dataset):
    def __init__(self, data, mode='test'):
        self.mode = mode
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        global b
        img = self.data[index][0]
        b = b + 1  # advances on every access, independent of index
        return add_noise_sp(img, b), img

This is how I create the dataset:

noisy_mnist_train = SyntheticNoiseDatasetsalttr(mnist_train, 'train')
noisy_mnist_test = SyntheticNoiseDatasetsaltte(mnist_test, 'test')
train_set, val_set = torch.utils.data.random_split(noisy_mnist_train, [55000, 5000], generator=torch.Generator().manual_seed(42))

I thought this would give me a constant dataset. Does this mean that when I train my neural net, it will see a different noisy version of the same input every time? When I run the following commands, I get the digit 5, but every time it has different noise on it. It seems like every time I call DataLoader, it's running my data-generation code and giving me different noisy versions of the same subject.

torch.manual_seed(123)
bt = DataLoader(val_set, batch_size=1, shuffle=True)
bt = next(iter(bt))
noisy, clean = bt
plt.imshow(noisy.squeeze(), cmap='gray')

The posted code doesn’t show the repeated calls, but I assume you are just executing the 5 lines of code in a REPL multiple times. If so, then the different noise levels would be expected, since you are using global variables for the seeds (s and b), which are updated in each call to __getitem__.
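The effect of those global counters can be reproduced in a minimal sketch (the names here are illustrative, not from the posted code): because the module-level seed advances on every access, asking for the same index twice yields different noise.

```python
import numpy as np

s = 0  # module-level counter, mimicking the global seed in the question

def noisy_value(sd):
    # Deterministic for a given seed, like add_noise_sp
    rng = np.random.RandomState(sd)
    return rng.rand()

def get_item(index):
    global s
    s += 1  # the counter advances on every access, regardless of index
    return noisy_value(s)

first_pass = get_item(0)   # uses seed 1
second_pass = get_item(0)  # same index, but the counter has moved to seed 2
print(first_pass == second_pass)  # False: same sample, different noise
```

Rerunning only the retrieval lines in a REPL keeps incrementing the counter, so each rerun continues from wherever the seed last stopped.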

Hi @ptrblck,

Thanks for the reply. I mean that if I execute the last block multiple times, I get a different output each time. I thought that by running the code below, my dataset would be generated.

noisy_mnist_train = SyntheticNoiseDatasetsalttr(mnist_train, 'train')
noisy_mnist_test = SyntheticNoiseDatasetsaltte(mnist_test, 'test')
train_set, val_set = torch.utils.data.random_split(noisy_mnist_train, [55000, 5000], generator=torch.Generator().manual_seed(42))

Then that dataset is fixed. When I run the block below any number of times, just by hitting Shift+Enter in a Jupyter notebook, I should get the same output.

torch.manual_seed(123)
bt = DataLoader(val_set, batch_size=1, shuffle=True)
bt = next(iter(bt))
noisy, clean = bt
plt.imshow(noisy.squeeze(), cmap='gray')

All I want is to generate noisy data controlled by the seed values s and b.

You are adding the noise while getting every new item in those lines. If you want fixed noise, one way is to add the noise to the tensors manually and use a TensorDataset.
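A minimal sketch of that precomputation approach (the noise function here mirrors add_noise_sp from earlier in the thread, and the tensors are illustrative stand-ins for MNIST):

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

def add_noise_sp(image, sd, amount=0.2):
    # Same salt-style corruption as in the thread, deterministic per seed.
    rng = np.random.RandomState(sd)
    out = np.asarray(image).copy()
    flipped = rng.choice([True, False], size=out.shape, p=[amount, 1 - amount])
    salted = rng.choice([True, False], size=out.shape, p=[0.5, 0.5])
    out[flipped & salted] = 0.13 - 2 * 0.31
    return torch.tensor(out)

# Illustrative stand-in for the MNIST images: 10 single-channel 28x28 images.
clean = torch.rand(10, 1, 28, 28)

# Precompute the noisy versions once, with one seed per sample,
# so the noise is frozen into the tensors themselves.
noisy = torch.stack([add_noise_sp(img, sd=i) for i, img in enumerate(clean)])

# TensorDataset just indexes into the precomputed tensors; nothing is
# regenerated in __getitem__, so every epoch sees identical noise.
fixed_dataset = TensorDataset(noisy, clean)
loader = DataLoader(fixed_dataset, batch_size=2, shuffle=True)
```

Since the noise is baked in once, shuffling, re-iterating, or recreating the DataLoader can no longer change which noise a given sample carries.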

Yes, your datasets will be recreated, but note that you are explicitly using global seeds, which will not be recreated.
As mentioned before, either reset the seed inside the datasets or use @InnovArul's approach, in case you want to add a fixed noise to each sample.

For each picture, I want different noise. However, every time my dataset is created, I want the same noise for each picture (hence I am trying to set a seed per picture). So the problem is the global seed, right? This global seed does not get reset, right? Thanks for the suggestions; I have never used TensorDataset, but I will look into it.

How can I reset the seed inside the datasets?

You could e.g. set the initial seed in the __init__ method and update it in the __getitem__.
This would make sure to initialize each new dataset with the same seed.
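One way to make this robust even across epochs and shuffled access is to derive the seed from the sample index rather than a running counter, so the noise becomes a pure function of the sample. A sketch under that assumption (FixedNoiseDataset and base_seed are illustrative names, and the noise is a simplified salt-style stand-in for add_noise_sp):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class FixedNoiseDataset(Dataset):  # illustrative name, not from the thread
    def __init__(self, data, base_seed=1):
        self.data = data
        self.base_seed = base_seed

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        img = self.data[index][0]
        # The seed depends only on the index, so sample `index` receives
        # exactly the same noise in every epoch and in any access order.
        rng = np.random.RandomState(self.base_seed + index)
        out = np.asarray(img).copy()
        mask = rng.choice([True, False], size=out.shape, p=[0.2, 0.8])
        out[mask] = 0.13 - 2 * 0.31  # same fill value as add_noise_sp
        return torch.tensor(out), img
```

A counter like self.s still drifts within one dataset object (it keeps incrementing across epochs), whereas base_seed + index gives the same noise for the same sample no matter how often or in what order __getitem__ is called.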

I did this:

class SyntheticNoiseDatasetsalttr(Dataset):
    def __init__(self, data, mode='train'):
        self.mode = mode
        self.data = data
        self.s = 1

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        img = self.data[index][0]
        self.s = self.s + 1
        return add_noise_sp(img, self.s), img

It seems like I am making a mistake here because I get the same result as before: every time I run the code I get different noise:
1st run: (screenshot of a noisy digit)
2nd run: (screenshot of the same digit with different noise)

All I want is different noise in different images (controlled by the seed) but it should be the same every time I execute the code.

I cannot reproduce the issue using:

def add_noise_sp(image, sd, amount=0.2):
    # Salt-and-pepper style noise; deterministic for a given seed sd.
    np.random.seed(seed=sd)

    low_clip = 0.13
    std = 0.31

    image = np.asarray(image)
    out = image.copy()
    p = amount  # probability that a pixel is corrupted at all
    q = 0.5     # probability that a corrupted pixel is "salted"
    flipped = np.random.choice([True, False], size=image.shape,
                               p=[p, 1 - p])
    salted = np.random.choice([True, False], size=image.shape,
                              p=[q, 1 - q])
    out[flipped & salted] = low_clip - 2 * std

    return torch.tensor(out)

class SyntheticNoiseDatasetsalttr(torch.utils.data.Dataset):
    def __init__(self, data, mode='train'):
        self.mode = mode
        self.data = data
        self.s = 1

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        img = self.data[index][0]
        self.s = self.s + 1
        return add_noise_sp(img, self.s), img

mnist_train = torchvision.datasets.MNIST(
    root='./data', transform=torchvision.transforms.ToTensor())
noisy_mnist_train = SyntheticNoiseDatasetsalttr(mnist_train, 'train')
train_set, val_set = torch.utils.data.random_split(noisy_mnist_train, [55000, 5000], generator=torch.Generator().manual_seed(42))

torch.manual_seed(123)
bt = torch.utils.data.DataLoader(val_set, batch_size=1, shuffle=True)
bt = next(iter(bt))
noisy, clean = bt
plt.imshow(noisy.squeeze(), cmap='gray')

and always get the same output by rerunning the entire code.
I don't know which part you are rerunning, but make sure to reinitialize the dataset.