Hi,
I converted the ImageNet dataset into an LMDB database. During training, I find that GPU utilization is stable at around 99% at the beginning, but after a few hundred iterations it starts jumping between 99% and 0%; as a result, training slows down significantly.
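For concreteness, the utilization pattern can be logged with a small polling sketch like the following (an assumption on my part that nvidia-smi is on PATH and the GPU is at index 0):

import subprocess
import time

# Poll GPU utilization once per second and print it with a timestamp,
# so the 99% <-> 0% oscillation can be correlated with iteration counts.
while True:
    util = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu',
         '--format=csv,noheader,nounits', '-i', '0']).decode().strip()
    print(time.strftime('%H:%M:%S'), util + '%')
    time.sleep(1)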
In addition, I have run this code several times before and this phenomenon never occurred; the code has not been changed. The dataloader is shown below.
import lmdb
import numpy as np
import cv2
import torch
from io import BytesIO
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

# `normalize` was referenced but not defined in my snippet; these are the
# standard ImageNet statistics I use.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

class train_dataset(Dataset):
    def __init__(self):
        super(train_dataset, self).__init__()
        self.root = r'/data/imagenet/train/'
        # open the LMDB environment once and keep a read-only transaction
        self.env = lmdb.open(self.root)
        self.txn = self.env.begin(write=False)
        self.transforms = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ])
        # print(self.__len__())

    def __getitem__(self, i):
        # keys are stored as '<index>_img' / '<index>_label'
        image_bin = self.txn.get((str(i) + '_img').encode())
        # another way to open the binary data, but it shows the same phenomenon:
        # image = Image.open(BytesIO(image_bin))
        # if image.mode != 'RGB':
        #     image = image.convert('RGB')
        image_buf = np.frombuffer(image_bin, dtype=np.uint8)
        image = cv2.imdecode(image_buf, cv2.IMREAD_COLOR)  # decodes to BGR
        image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        image = self.transforms(image)
        label = int(self.txn.get((str(i) + '_label').encode()).decode())
        return {'image': image, 'label': label}

    def __len__(self):
        # each sample stores two entries (image and label)
        return self.txn.stat()['entries'] // 2


loader_train = torch.utils.data.DataLoader(
    train_dataset(), batch_size=bs_train, shuffle=True,
    num_workers=n_worker, pin_memory=True)
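If it helps, here is a minimal probe (a sketch; it reuses bs_train and n_worker from above) that iterates the loader without any model, to check whether per-batch loading time itself degrades after a few hundred iterations:

import time

# Iterate the loader alone (no model, no GPU work) and report the time
# taken by every 100 batches; if these times grow over the run, the
# stall is in the data pipeline rather than in the model.
start = time.time()
for it, batch in enumerate(loader_train):
    if it > 0 and it % 100 == 0:
        print('iters %d-%d: %.2fs' % (it - 100, it, time.time() - start))
        start = time.time()
    if it >= 1000:
        break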
What are the possible causes?
Sincerely