Problem in building my own MNIST custom dataset

Feng · August 28, 2018, 4:43pm

Hi everyone! I am having trouble reading the MNIST training CSV file from my local computer. The file is 60000 × 785, with the label in the first column, and I downloaded it from https://pjreddie.com/projects/mnist-in-csv/ .

The details are below. Could somebody help me identify the cause? Thanks in advance!

I am running PyTorch 0.4.1 with Python 3.6.6 on Windows 10.
My source code:

# -*- coding: utf-8 -*-
import torch 
import torch.nn as nn
from skimage import transform
import torchvision.transforms as transforms
from torch.autograd import Variable
import numpy as np
from torch.utils.data import Dataset

''' Ignore warnings'''
import warnings
warnings.filterwarnings("ignore")

''' Hyper Parameters''' 
input_size = 784
hidden_size1 = 500
hidden_size2 = 300
num_classes = 10
num_epochs = 5
batch_size = 100
learning_rate = 0.001


class NosiyMNISTDataset(Dataset):
    
    def __init__(self):
        xy = np.loadtxt('./data_MNIST/mnist_train.csv',
                        delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, 1:])
        self.y_data = torch.from_numpy(xy[:, [0]])

    def __getitem__(self, index):
        return (self.x_data[index], self.y_data[index])

    def __len__(self):
        return self.len
    

''' MNIST Dataset '''
train_dataset = NosiyMNISTDataset()
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True,
                                           num_workers = 2)

''' Neural Network Model (2 hidden layer)'''
class Net(nn.Module):
    def __init__(self, input_size, hidden_size1, hidden_size2, num_classes):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size1) 
        self.fc2 = nn.Linear(hidden_size1, hidden_size2) 
        self.fc3 = nn.Linear(hidden_size2, num_classes)
        self.relu = nn.ReLU() 
        # self.dropout = nn.Dropout() 
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        # out = self.dropout(out) 
        out = self.fc2(out)
        out = self.relu(out)
        # out = self.dropout(out)
        out = self.fc3(out)
        return out


if __name__ == '__main__':
    net = Net(input_size, hidden_size1, hidden_size2, num_classes)
    net.cuda()

    # Loss and Optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

    # Train the Model
    for epoch in range(num_epochs):
        for i, (images, labels) in enumerate(train_loader, 0):
            # Convert torch tensor to Variable
            images = Variable(images.view(-1, 28*28).cuda())
            labels = Variable(labels.cuda())

            # Forward + Backward + Optimize
            optimizer.zero_grad()  # zero the gradient buffer
            outputs = net(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            if (i+1) % 100 == 0:
                # loss.item() returns the scalar loss as a Python float
                print('Epoch [%d/%d], Step [%d/%d], Loss: %.4f'
                      % (epoch+1, num_epochs, i+1, len(train_dataset)//batch_size, loss.item()))

The error I got is:

"Traceback (most recent call last):
  File "Local_MNIST.py", line 92, in <module>
    loss = criterion(outputs, labels)
 ................
RuntimeError: Expected object of type torch.cuda.LongTensor but found type torch.cuda.FloatTensor for argument #2 'target'"

I have tried:
<1> changing the “dtype” in my NosiyMNISTDataset class to ‘np.float’, ‘np.long’, and ‘np.int64’
<2> adding long() to torch.from_numpy(xy[:, [0]]), i.e. torch.from_numpy(xy[:, [0]]).long()
<3> removing cuda() from my code

None of these three approaches solved my problem.

P.S. I know there is a standard way to load the MNIST dataset. However, I would like to modify the dataset, which is why I am using a CSV file.

Looking forward to your comments.

Thanks!

ptrblck · August 28, 2018, 4:55pm

The second approach should have worked.
Could you add the .long() cast again and print the type of labels in your training loop?
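
A minimal sketch of that check, assuming a small hand-made array in place of the CSV contents (the values in `xy` are made up for illustration). It only shows how the dtype changes with the `.long()` cast:

```python
import numpy as np
import torch

# Stand-in for the loaded CSV (assumption): np.loadtxt(..., dtype=np.float32)
# yields float32 values, so the label column starts out as floats.
xy = np.array([[5.0, 0.0, 0.0],
               [0.0, 0.0, 0.0],
               [4.0, 0.0, 0.0]], dtype=np.float32)

labels_float = torch.from_numpy(xy[:, [0]])        # no cast
labels_long = torch.from_numpy(xy[:, [0]]).long()  # cast to 64-bit integers

print(labels_float.dtype)  # torch.float32
print(labels_long.dtype)   # torch.int64
print(labels_long.shape)   # still 2-D, one column per sample
```

Printing `labels.dtype` (or `labels.type()`) inside the training loop gives the same information for a real batch.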

PS: I’ve formatted your code, as it was a bit difficult to read. You can add code by wrapping it in three backticks.

Feng · August 28, 2018, 5:10pm

@ptrblck_de Thank you for formatting the code in my post.

I added .long() in my NosiyMNISTDataset class:

self.y_data = torch.from_numpy(xy[:, [0]]).long()

and added a print call inside the training loop:

labels = Variable(labels.cuda())
print(labels.shape)

It printed “torch.Size([100, 1])” (the first dimension matches my batch size).

However, I still get an error:
“RuntimeError: multi-target not supported at c:\programdata\miniconda3\conda-bld\pytorch_1533090623466\work\aten\src\thcunn\generic/ClassNLLCriterion.cu:15”

ptrblck · August 28, 2018, 5:13pm

The current error message seems to be unrelated to the type, but rather to the shape.
Try to remove dim1 with labels = labels.squeeze() and run it again.
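
For logits of shape [N, C], nn.CrossEntropyLoss expects class-index targets of shape [N] and dtype long. Here is a small sketch with made-up logits and labels. Note that it uses `squeeze(1)` rather than the bare `squeeze()`, a slight variation that drops only dim 1, so a batch of size 1 would keep its batch dimension:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Made-up data for illustration: a batch of 4 samples with 10 classes.
outputs = torch.randn(4, 10)                 # logits, shape [4, 10]
labels = torch.tensor([[3], [7], [0], [1]])  # shape [4, 1], dtype int64

# Drop only dim 1 so the targets become a 1-D vector of class indices.
labels_1d = labels.squeeze(1)                # shape [4]

loss = criterion(outputs, labels_1d)
print(labels.shape, labels_1d.shape, loss.item())
```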

Feng · August 28, 2018, 5:22pm

Yes, I printed the label shape when loading MNIST the standard way, and it is indeed
torch.Size([100]).

After I added “labels = labels.squeeze()”, it works!!!
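
For anyone reading later, the same shape could probably also be fixed inside the dataset by indexing the label column with `xy[:, 0]` instead of `xy[:, [0]]`. This is only a sketch under assumptions: the class name `CsvDigits` and the tiny in-memory CSV are made up for illustration and stand in for the real file:

```python
import io

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class CsvDigits(Dataset):  # hypothetical name, not from the original code
    def __init__(self, source):
        # source may be a file path or a file-like object
        xy = np.loadtxt(source, delimiter=',', dtype=np.float32, ndmin=2)
        self.x_data = torch.from_numpy(xy[:, 1:])
        # xy[:, 0] is 1-D, so each label is a scalar and a batch has shape [N]
        self.y_data = torch.from_numpy(xy[:, 0]).long()

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return len(self.y_data)


# Toy CSV (assumption): a label followed by 4 "pixels" per row.
toy_csv = io.StringIO("5,0,10,20,30\n0,1,2,3,4\n4,9,8,7,6\n")
dataset = CsvDigits(toy_csv)

images, labels = next(iter(DataLoader(dataset, batch_size=3)))
print(images.shape, labels.shape, labels.dtype)
```

With this version the labels arrive in the training loop already as a 1-D long tensor, so no squeeze is needed there.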

I really appreciate your help!!

A_9 · November 3, 2021, 3:08pm

Hi everyone,
I have the same problem and I can’t solve it. I want to use my own MNIST dataset.
I have two .csv files, one for training and one for testing, each containing both labels and data (in each .csv file, the first column holds the labels and columns 2 to 785 hold the image pixels).

import numpy as np
import pandas as pd
import torch
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader

# read the data
df_train = pd.read_csv('Train_Data_FS_with_Label.csv', header=None)
df_test = pd.read_csv('Test_Data_FS_with_Label.csv', header=None)

# get the image pixel values and labels
train_labels = df_train.iloc[:, 0]
train_images = df_train.iloc[:, 1:]
test_labels = df_test.iloc[:, 0]
test_images = df_test.iloc[:, 1:]

# define transforms
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomCrop(24),
    transforms.ToTensor(),
])

# custom dataset
class MNISTDataset(Dataset):
    def __init__(self, images, labels, transforms):
        self.X = images
        self.y = labels
        self.transforms = transforms

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        data = self.X.iloc[i, :]
        data = np.asarray(data).astype(np.uint8).reshape(24, 24, 1)
        if self.transforms:
            data = self.transforms(data)
        if self.y is not None:
            return (data, self.y[i])
        else:
            return data

train_data = MNISTDataset(train_images, train_labels, transform)
test_data = MNISTDataset(test_images, test_labels, transform)

# dataloaders
trainloader = DataLoader(train_data, batch_size=1, shuffle=True)
testloader = DataLoader(test_data, batch_size=1, shuffle=True)

I also need to split the test dataset into two parts:

test_ds, valid_ds_before = torch.utils.data.random_split(testloader, (9500, 500))
small_shared_dataset = create_shared_dataset(valid_ds_before, 200)

Inside the “create_shared_dataset” function:

def create_shared_dataset(valid_ds, size):
    data_loader = DataLoader(valid_ds, batch_size=1)
    for idx, (data, target) in enumerate(data_loader):  # the error occurs here
        …

My code fails at this point with: “'DataLoader' object is not subscriptable”

How can I solve this error?
I would be very grateful for any help you can give me.

ptrblck · November 3, 2021, 8:28pm

Use random_split on Dataset instances, not on the DataLoader object.
Once you’ve created the split datasets, wrap each of them in a DataLoader and it should work.
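
A rough sketch of that order of operations, using a TensorDataset of zeros as a stand-in for your `test_data` (the fake tensors and the loader names are assumptions for illustration, and the split sizes are copied from your post):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for test_data (assumption): 10,000 fake 24x24 images with labels.
fake_images = torch.zeros(10000, 1, 24, 24)
fake_labels = torch.zeros(10000, dtype=torch.long)
test_data = TensorDataset(fake_images, fake_labels)

# Split the Dataset itself; the lengths must sum to len(test_data).
test_ds, valid_ds_before = random_split(test_data, [9500, 500])

# Only now wrap each split in its own DataLoader.
test_loader = DataLoader(test_ds, batch_size=1, shuffle=False)
valid_loader = DataLoader(valid_ds_before, batch_size=1)

data, target = next(iter(valid_loader))
print(len(test_ds), len(valid_ds_before), data.shape, target.shape)
```

The objects returned by random_split are Dataset subsets, so passing `valid_ds_before` to a function that builds its own DataLoader, as `create_shared_dataset` does, should then iterate without the subscript error.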

A_9 · November 4, 2021, 4:13pm

It works.
I really appreciate your help.