Is it possible to run a model on one GPU in parallel and then gather the Variables?

I’m trying to implement dynamic_decode with PyTorch 0.3. Because the number of recursive steps is not known in advance, I would like to run the model in parallel and then gather the outputs, like this:

from torch.multiprocessing import Process, Queue

def f(model, input, queue):
    # forward pass in a worker; the result is sent back through the queue
    queue.put(model(input))

queues = [Queue() for _ in range(3)]
processes = [Process(target=f, args=(model, inputs[i], queues[i]))
             for i in range(3)]

for p in processes:
    p.start()

# gather one output per worker
outputs = [q.get() for q in queues]

for p in processes:
    p.join()

However, this fails because Variable has no attribute share_memory_(), which the queue relies on to pass the outputs between processes.
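One workaround I’m considering (just a sketch, and it assumes the outputs may be detached from the graph, so gradients computed in the workers are lost) is to send the underlying tensors through the queue and rewrap them in the parent:

from torch.autograd import Variable

def f(model, input, queue):
    # .data extracts the underlying tensor, which, unlike a 0.3 Variable,
    # can go through the shared-memory queue
    queue.put(model(input).data)

# in the parent: rewrap each gathered tensor as a Variable
outputs = [Variable(q.get()) for q in queues]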

share_memory_() on Variables seems to be a feature introduced in version 0.4, but I still wonder whether there is any correct way to do this with 0.3.1. For comparison, TensorFlow's while_loop supports multiple parallel workers.
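For concreteness, this is roughly the TF 1.x construct I have in mind; the parallel_iterations argument lets independent iterations of the loop body run concurrently (the body here is just a counter standing in for one decode step):

import tensorflow as tf

i0 = tf.constant(0)
cond = lambda i: tf.less(i, 10)   # keep looping while i < 10
body = lambda i: tf.add(i, 1)     # stand-in for one decode step
# parallel_iterations allows independent iterations to run in parallel
result = tf.while_loop(cond, body, [i0], parallel_iterations=3)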