Loading input/output folders with .npy extension

Hi, I am new to PyTorch and to ML in general.

I am building an encoder-decoder architecture, similar to an autoencoder, but the input and output are different NumPy arrays.
I have two folders, “input” and “output”, where the input and output pairs are stored in .npy format.
At the moment, the two files of each pair have the same name.
So, for example, the input and output files of pair 1 have the same name but are stored in different folders.

I am not sure how to use DatasetFolder so that my dataset becomes X, Y, which I can then split into train/test and feed to the model.


Based on your description I think implementing a custom Dataset would be the best approach. In its __init__ method you could store the folder paths and load each pair in the __getitem__ using the corresponding index.
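A minimal sketch of that approach could look like the following. The class name `PairedNpyDataset`, the file names, and the array shape are placeholders I made up for illustration, and it assumes every input file has an output file with the same name in the other folder:

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset, random_split


class PairedNpyDataset(Dataset):
    """Hypothetical dataset: pairs .npy files that share a name in two folders."""

    def __init__(self, in_dir, out_dir):
        self.in_dir = in_dir
        self.out_dir = out_dir
        # Keep only the .npy names present in both folders, in a stable order.
        in_names = {f for f in os.listdir(in_dir) if f.endswith(".npy")}
        out_names = {f for f in os.listdir(out_dir) if f.endswith(".npy")}
        self.names = sorted(in_names & out_names)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        name = self.names[index]
        # Load the pair lazily and convert both arrays to float32 tensors.
        x = torch.from_numpy(np.load(os.path.join(self.in_dir, name))).float()
        y = torch.from_numpy(np.load(os.path.join(self.out_dir, name))).float()
        return x, y


# Small demo: write random placeholder arrays to temporary folders.
with tempfile.TemporaryDirectory() as root:
    in_dir = os.path.join(root, "input")
    out_dir = os.path.join(root, "output")
    os.makedirs(in_dir)
    os.makedirs(out_dir)
    for i in range(10):
        np.save(os.path.join(in_dir, f"pair_{i}.npy"), np.random.rand(1, 4, 5))
        np.save(os.path.join(out_dir, f"pair_{i}.npy"), np.random.rand(1, 4, 5))

    dataset = PairedNpyDataset(in_dir, out_dir)
    # Split into train/test subsets; the lengths must sum to len(dataset).
    train_set, test_set = random_split(dataset, [8, 2])
    x, y = train_set[0]
```

Building the list of file names once in `__init__` avoids depending on a particular naming pattern, and `random_split` then gives you the train/test subsets you can pass to a DataLoader.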

Thanks ptrblck!

I did as you advised, but I am getting an error that I could not figure out by googling, so let me ask in this post.

My NumPy arrays have the shape (1, 42, 58) and, as mentioned, are stored in two folders with identical names for each pair.
I am now getting the following error:
linear(): argument ‘input’ (position 1) must be Tensor, not list
Could you please advise what is causing it?

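import os

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class En_De_coder_dataset(Dataset):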
    def __init__(self, in_dir, out_dir):
        self.in_dir = in_dir
        self.out_dir = out_dir
    def __len__(self):
        return len(os.listdir(self.in_dir))

    def __getitem__(self, index):
        input = torch.from_numpy(np.load(self.in_dir + '/grid_aug_' + str(index) + '.npy')),float()
        output = torch.from_numpy(np.load(self.out_dir + '/grid_aug_' + str(index) + '.npy')),float()

        return (input, output)

dataset = En_De_coder_dataset(in_dir = 'augmented_inputs', out_dir = 'augmented_outputs')

train_set, test_set = torch.utils.data.random_split(dataset, [100, 44])

train_loader = DataLoader(dataset=train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_set, batch_size=64, shuffle=True)

class En_De_coder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128), # (N, 784) -> (N, 128)
            nn.Linear(128, 64),
            nn.Linear(64, 12),
            nn.Linear(12, 3) # -> N, 3
        )
        self.decoder = nn.Sequential(
            nn.Linear(3, 12),
            nn.Linear(12, 64),
            nn.Linear(64, 128),
            nn.Linear(128, 28 * 28),
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

model = En_De_coder()

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(),

num_epochs = 10
outputs = []
for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(train_loader):
        recon = model(data)
        loss = criterion(recon, targets)

        # Backpropagate and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch:{epoch+1}, Loss:{loss.item():.4f}')
    outputs.append((epoch, targets, recon))

my data looks like this

Your __getitem__ method seems to have a typo as I guess you want to transform the tensors to float32 and not create a tuple with a 0.0 as the second item:

    def __getitem__(self, index):
        input = torch.from_numpy(np.load(self.in_dir + '/grid_aug_' + str(index) + '.npy')),float()
        output = torch.from_numpy(np.load(self.out_dir + '/grid_aug_' + str(index) + '.npy')),float()

        return (input, output)

Replace the comma before float() with a dot, i.e. use .float(), and it should work I guess.
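To see why the comma matters, here is a small pure-Python sketch. The list `array_like` is only a stand-in for the loaded tensor (an assumption for illustration), since the behaviour comes from Python's tuple syntax rather than from PyTorch:

```python
array_like = [1, 2, 3]  # hypothetical stand-in for torch.from_numpy(np.load(...))

# A comma builds a tuple, and float() called with no arguments returns 0.0,
# so this line creates (array_like, 0.0) instead of converting anything.
sample = array_like, float()

print(type(sample).__name__)  # tuple
print(sample[1])              # 0.0
```

With `.float()` the expression instead calls the tensor's own method and returns a single float32 tensor. Since each sample was a tuple, the DataLoader's default collate function most likely handed the model a list per batch, which would explain the "must be Tensor, not list" message.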

Thanks @ptrblck, you are awesome!!!
It works now!!!
