.eval()? Async Feature Extraction? -- wrt Transfer Learning tutorial

In the Transfer Learning tutorial, the author shows how to use a pre-trained model as a feature extractor. I have a few questions about this tutorial.

  1. Could we call .eval() on the extractor so that dropout (although not used in the tutorial) and batch normalization behave consistently at inference time? More generally, would this help performance (e.g., accuracy)?

  2. Could we make this procedure asynchronous, i.e., the feature extractor processes the (n+i)-th batch on one GPU while the model trains on the n-th batch on another GPU? When we have a HUGE model to train on top of the feature extractor, this would be quite useful.
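For question 1, here is a minimal sketch of what calling .eval() does. The small stand-in extractor below is made up (the tutorial uses a pretrained torchvision model), and torch.no_grad() is the inference mode in current PyTorch, where code of the tutorial's era used Variable(..., volatile=True):

```python
import torch
import torch.nn as nn

# hypothetical stand-in for the pretrained extractor; layer sizes are illustrative
extractor = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),
    nn.Dropout(0.5),
    nn.ReLU(),
)

extractor.eval()  # dropout is disabled, batch norm uses its running statistics

with torch.no_grad():  # no autograd graph is built during extraction
    feats = extractor(torch.randn(2, 3, 8, 8))
```

In eval mode the extracted features are deterministic for a fixed input, which is what you want when the extractor is frozen.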

I’ve just tried to solve 2. with threading. It might not be the best solution.

  1. Yes, you can.

  2. It's better to solve this with multiprocessing than with threading.

In Python 2.7, calling some_input_data.cuda() in a sub-process raises an error. Is this specific to Python 2.7?

Yes. To use CUDA in a multiprocess setting you have to use Python 3, because we need the spawn start method of multiprocessing. This is a limitation of CUDA: forking is not compatible with CUDA initialization.
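A minimal sketch of that requirement, using the standard-library multiprocessing module (the worker body is illustrative; in the real setup the sub-process would hold the extractor and call .cuda() there):

```python
import multiprocessing as mp

def gpu_worker(batch_queue, result_queue, dev_id):
    # Sketch only: in the real pipeline this sub-process would do
    #   extractor.cuda(dev_id)
    #   for batch in iter(batch_queue.get, None):
    #       result_queue.put(extractor(batch).data.cpu())
    pass

# CUDA is incompatible with fork(), so sub-processes that touch the GPU
# must be created with the 'spawn' start method (Python 3 only).
ctx = mp.get_context("spawn")
start_method = ctx.get_start_method()
# processes would then be created via ctx.Process(target=gpu_worker, args=(...,))
```

With spawn, each child starts a fresh interpreter and initializes its own CUDA context instead of inheriting a forked (and broken) one from the parent.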

Currently I'm trying to do it as part of collate_fn, but it's not working well:
I use a lock to manage access to the GPU. However, after the first process releases the lock, the memory on the GPU is NOT released, which causes an OOM error when the second process acquires the lock and starts to work on the GPU.

I’m trying the following, but it does not release GPU memory after each _collate_fn2 call:

def collate_fn2(batch, dev_id, extractor, lock):
    imgs, targets = default_collate(batch)  # still on CPU
    # split the batch into mini-mini-batches small enough for the GPU
    input1 = [imgs[i:i + mini_mini_batch_size].clone()
              for i in range(0, len(imgs), mini_mini_batch_size)]
    out = []
    with lock:  # serialize GPU access across worker processes
        _collate_fn2(input1, out, dev_id, extractor)
    out = torch.cat(out, 0)
    input1 = None
    return out, targets

def _collate_fn2(input1, out, dev_id, extractor):
    # input1 = [mini-minibatch1, mini-minibatch2, ..., mini-minibatchN]
    for x in input1:
        x = Variable(x, volatile=True)  # inference only, no autograd graph
        x = x.cuda(dev_id)
        x = extractor(x)
        out.append(x.data.cpu())  # move features back so the GPU copy can be dropped

class Collator(object):
    def __init__(self, dev_id, extractor, lock):
        self.dev_id = dev_id
        self.extractor = extractor
        self.lock = lock

    def __call__(self, batch):
        return collate_fn2(batch, self.dev_id, self.extractor, self.lock)

The log basically goes like this:

THCudaCheck FAIL file=/py/conda-bld/pytorch_1493433037384/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory

Is there any way to force releasing the memory?