DataLoader freezes after loading a batch of tensors

I use torch.utils.data.DataLoader to iterate through the training set, and the program freezes after it loads a batch of data. Here is my code for the Dataset class:

import os

import numpy as np
import torch
from torch.utils.data import Dataset


class DeconvDataSet(Dataset):

    def __init__(self, gt_dir, tr_dir, start, length):
        self.gt_dir = gt_dir    # directory for ground truth
        self.tr_dir = tr_dir    # directory for the training set
        self.start = start      # start index (the directory contains far more samples than I want to use)
        self.length = length    # dataset length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        gtFile_path = os.path.join(self.gt_dir, 'gt%d.npy' % (idx + self.start))
        trFile_path = os.path.join(self.tr_dir, 'tr%d.npy' % (idx + self.start))
        gt = np.load(gtFile_path)  # each sample is a 3D numpy array
        tr = np.load(trFile_path)
        print('load%d' % idx)
        print(tr.shape)
        return {'gt': torch.from_numpy(gt), 'tr': torch.from_numpy(tr)}
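
For completeness, the dataset object is constructed before training along these lines (the directory paths and sizes below are just placeholders, not my real values):

dataSet = DeconvDataSet(gt_dir='/data/groundtruth',   # placeholder path
                        tr_dir='/data/training',      # placeholder path
                        start=0,
                        length=1000)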

This is where I use the DataLoader:

dataLoader = DataLoader(dataSet, batch_size=8, shuffle=True, num_workers=1)
print('start training...')
for epoch in range(2):
    print('start epoch %d' % epoch)
    for i_batch, sample in enumerate(dataLoader):
        print('read the data')
        input, target = sample['tr'].type(torch.FloatTensor), sample['gt'].type(torch.FloatTensor)
        if torch.cuda.is_available():
            input, target = input.unsqueeze(1).cuda(), target.unsqueeze(1).cuda()
        else:
            input, target = input.unsqueeze(1), target.unsqueeze(1)
        input, target = Variable(input), Variable(target)
        # feed the data into the net
        optimizer.zero_grad()
        print('put the data into net')
        output = net(input)
        # compute the loss
        loss = criterion(output, target)
        loss = loss * 1000
        print('back propagate')
        loss.backward()
        optimizer.step()
        print('iter %d,  mse %.3f' % (i_batch, loss))

And finally, this is the output I get:

loading dataset
start training…
start epoch 0
load656
(30, 200, 100)
load156
(30, 200, 100)
load800
(30, 200, 100)
load847
(30, 200, 100)
load807
(30, 200, 100)
load299
(30, 200, 100)
load415
(30, 200, 100)
load33
(30, 200, 100)

and it gets stuck here, right after the worker has loaded one full batch of 8 samples.

By the way, I run this inside a Docker container.

When I run ps aux, it shows quite high virtual memory usage:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
zhoutk 39451 1.7 0.9 75496188 1241632 pts/0 Sl+ 17:33 0:08 python deconv.py

It could be a shm (shared memory) problem. What PyTorch version are you using, and how much shm do you have inside the Docker container? Docker's default shm size is really low (64 MB).
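
You can check it from inside the container with df -h /dev/shm, or with a quick Python snippet along these lines (just a sketch; /dev/shm is where the shared-memory mount lives):

import os

# Rough check of the shared-memory mount inside the container.
# Docker gives /dev/shm only 64 MB by default unless you pass --shm-size to docker run.
st = os.statvfs('/dev/shm')
total_mb = st.f_frsize * st.f_blocks / 2.0 ** 20
free_mb = st.f_frsize * st.f_bavail / 2.0 ** 20
print('/dev/shm total: %.1f MB, free: %.1f MB' % (total_mb, free_mb))

If it turns out to be the 64 MB default, restarting the container with a larger shm (e.g. docker run --shm-size=8g ...) or with --ipc=host should give the data loader workers enough room.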

The version of PyTorch is 0.3.0.post4.
/proc/meminfo shows:

MemTotal: 131748088 kB
MemFree: 32858344 kB
MemAvailable: 124043064 kB
Buffers: 70512 kB
Cached: 89183512 kB
SwapCached: 77736 kB
Active: 43803028 kB
Inactive: 49002496 kB
Active(anon): 3116556 kB
Inactive(anon): 764844 kB
Active(file): 40686472 kB
Inactive(file): 48237652 kB
Unevictable: 112548 kB
Mlocked: 112548 kB
SwapTotal: 8387580 kB
SwapFree: 7533240 kB
Dirty: 12 kB
Writeback: 0 kB
AnonPages: 3438624 kB
Mapped: 1279120 kB
Shmem: 317048 kB
Slab: 4115636 kB
SReclaimable: 2598468 kB
SUnreclaim: 1517168 kB
KernelStack: 43664 kB
PageTables: 83284 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 74261624 kB
Committed_AS: 65001116 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1048844 kB
VmallocChunk: 34291571748 kB
HardwareCorrupted: 0 kB
AnonHugePages: 1517568 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 129799556 kB
DirectMap2M: 4304896 kB
DirectMap1G: 2097152 kB

What is more, when I change num_workers in the DataLoader from 1 to 0, it runs!
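
That is, loading now happens in the main process instead of a worker:

# num_workers=0 loads every batch in the main process, so no worker processes
# (and no shared-memory transfer between processes) are involved
dataLoader = DataLoader(dataSet, batch_size=8, shuffle=True, num_workers=0)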

The shared memory is indeed low. So is the problem with the shared memory?

Yeah, shared memory is used to transfer tensor fds in the data loader when num_workers > 0, so if it is too low the data loader workers might be killed by a bus error, which leaves the main process stuck. We added a warning mechanism for this case in 0.3.1.

Thank you for the explanation. BTW, what does "fds" stand for exactly?

file descriptors :slight_smile: sorry for not being clear
