Unknown reason for GPU0 memory usage after load_state_dict

I am trying to load a model onto GPU 3, but after loading the weights with load_state_dict, the process unexpectedly also occupies memory on GPU 0.

Here is a fragment of the code. nvidia-smi is run after each step to pinpoint where the GPU 0 memory usage is introduced (gpu_ids[0] is 3):
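For reference, utils.print_nvidia_smi is essentially just a thin subprocess wrapper around nvidia-smi. A minimal sketch of such a helper (the real one in my utils module may differ slightly):

```python
import subprocess

def print_nvidia_smi() -> str:
    """Run nvidia-smi, print its output, and return the text.

    Sketch of the utils.print_nvidia_smi helper used below; on a machine
    without an NVIDIA driver it just reports that nvidia-smi is missing.
    """
    try:
        text = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True
        ).stdout
    except FileNotFoundError:  # no NVIDIA driver / CLI on this machine
        text = "nvidia-smi not found"
    print(text)
    return text
```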

print('Before')
utils.print_nvidia_smi()
model.cuda(gpu_ids[0])

print('middle')
sleep(1.)
utils.print_nvidia_smi()

print('Loading weights from experiment ') #, metadata['expid'])
if best:
    model.load_state_dict(torch.load(model_best_path))
else:
    model.load_state_dict(torch.load(model_path))

print('After')
utils.print_nvidia_smi()

model.eval()

Before
Fri Jan  5 17:04:12 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:05:00.0 Off |                  N/A |
|  1%   44C    P8    14W / 250W |     10MiB /  8112MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 00000000:06:00.0 Off |                  N/A |
| 62%   75C    P2    59W / 250W |   7893MiB /  8114MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 00000000:09:00.0 Off |                  N/A |
|  0%   44C    P8    14W / 250W |   3359MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   39C    P8    14W / 250W |     10MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      5274      C   python3                                     7883MiB |
|    2      9613      C   python3                                     3349MiB |
+-----------------------------------------------------------------------------+
middle
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      5274      C   python3                                     7883MiB |
|    2      9613      C   python3                                     3349MiB |
|    3     27365      C   python3                                      339MiB |
+-----------------------------------------------------------------------------+
Loading weights from experiment
After
...
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     27365      C   python3                                      339MiB |
|    1      5274      C   python3                                     7883MiB |
|    2      9613      C   python3                                     3349MiB |
|    3     27365      C   python3                                      339MiB |
+-----------------------------------------------------------------------------+

So after moving the model to GPU 3, nvidia-smi correctly reports memory usage only on GPU 3, but after loading the weights with torch.load and load_state_dict, the same process (PID 27365) also shows up on GPU 0.
What am I doing wrong, or is this a bug in PyTorch?

I also tried the map_location argument of torch.load, but unfortunately the memory usage is the same:

        model.l_out.load_state_dict(torch.load(model_path, map_location=lambda storage, loc: storage.cuda(gpu_ids[0])))
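For completeness, a variant I have not shown above: deserializing to CPU first (with a map_location that leaves storages where they are) and letting load_state_dict copy the tensors into the already-placed model. A minimal self-contained sketch of that pattern; the Linear module and checkpoint path here are stand-ins, not my actual model:

```python
import torch
import torch.nn as nn

# Stand-in model and checkpoint path for illustration only.
model = nn.Linear(4, 2)
torch.save(model.state_dict(), "checkpoint.pth")

# map_location keeps every storage on CPU, so torch.load never allocates
# on the GPU the checkpoint was saved from; load_state_dict then copies
# the CPU tensors into the model's existing parameters in place.
state = torch.load("checkpoint.pth", map_location=lambda storage, loc: storage)
model.load_state_dict(state)
print(next(model.parameters()).device)
```

Here the model stays on CPU, so the device printed is cpu; after model.cuda(gpu_ids[0]) the same load would land on that GPU. Whether this avoids the stray GPU 0 allocation in my setup is exactly what I am unsure about.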