Training a CNN with my own dataset

I want to train a CNN with my own dataset, and I followed the guidance to define my own dataset using torch.utils.data.Dataset. I created a txt file which contains the paths to all my training data, and in the __getitem__ function it opens an image file and returns the image. However, I find this way too slow, maybe because in each epoch it opens and reads each image again. So should I read all the images at once (e.g. define a global variable trainX shaped (trainingdata_num, feature, w, h))? Or is there some other clever way?

Below is my current code; the variable self.data contains all the paths to the training data:
```python
import os
import torch.utils.data as Data
from PIL import Image

class MyDataset(Data.Dataset):
    def __init__(self, txt_path, img_path, transform=None, have_label=True):
        self.img_path = img_path
        self.have_label = have_label
        self.transform = transform
        data = []
        with open(txt_path) as f:
            for line in f:
                if self.have_label:
                    # the label is encoded in the file name before the first "_"
                    data.append((line.strip(), int(line.split("_")[0])))
                else:
                    data.append(line.strip())
        self.data = data

    def __getitem__(self, index):
        if self.have_label:
            name, label = self.data[index]
        else:
            name = self.data[index]
        img = Image.open(os.path.join(self.img_path, name)).convert('RGB')
        img = img.resize((128, 128))
        if self.transform is not None:
            img = self.transform(img)
        if self.have_label:
            return img, label
        else:
            return img

    def __len__(self):
        return len(self.data)
```

If your dataset is small and fits into your RAM, you might of course pre-load it and store the images in e.g. a list or a tensor.
Note however, that this approach would add a startup latency, since you are loading all the files before the training can begin.
The lazy loading approach is used for larger datasets, which don't fit into your memory, or if you want to avoid the startup latency.
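
A pre-loading version of your dataset could look roughly like this (just a sketch: PreloadedDataset is a made-up name, and I'm reusing your assumption that the label is encoded before the first "_" in the file name):

```python
import os
import numpy as np
import torch
import torch.utils.data as Data
from PIL import Image

class PreloadedDataset(Data.Dataset):
    """Eagerly loads all images into RAM at construction time."""
    def __init__(self, txt_path, img_path, transform=None):
        self.transform = transform
        images, labels = [], []
        with open(txt_path) as f:
            for line in f:
                name = line.strip()
                img = Image.open(os.path.join(img_path, name)).convert('RGB')
                img = img.resize((128, 128))
                # keep the raw pixels as a uint8 tensor of shape (3, 128, 128)
                images.append(torch.from_numpy(np.array(img)).permute(2, 0, 1))
                labels.append(int(name.split("_")[0]))
        self.images = torch.stack(images)   # shape: (N, 3, 128, 128)
        self.labels = torch.tensor(labels)

    def __getitem__(self, index):
        img, label = self.images[index], self.labels[index]
        if self.transform is not None:
            img = self.transform(img)
        return img, label

    def __len__(self):
        return len(self.labels)
```

At 128x128x3 uint8 that's roughly 48kB per image, so e.g. ~20k images would take about 1GB of RAM; that back-of-the-envelope check tells you whether pre-loading is feasible.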

Also note that you would have to transform the tensors to PIL.Images and back to tensors, in case you are storing the inputs as tensors and want to apply torchvision transforms that work on PIL.Images.
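
With torchvision this round trip could look like the following sketch (assuming the images are stored as uint8 tensors as above):

```python
from torchvision import transforms

# Convert the stored tensor back to a PIL.Image, apply PIL-based
# augmentations, then convert to a float tensor in [0, 1] again.
transform = transforms.Compose([
    transforms.ToPILImage(),            # uint8 tensor (C, H, W) -> PIL.Image
    transforms.RandomHorizontalFlip(),  # PIL-based augmentation
    transforms.ToTensor(),              # PIL.Image -> float tensor in [0, 1]
])
```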