I’m trying to train multiple models on the same dataset across multiple GPUs, all within one script. This is partly an exercise to help me understand parallel processing in PyTorch.
There is a complete script to reproduce this at the bottom of the post. I’m creating 15 networks and a bunch of copies of the dataset, moving them to different GPUs, and training them in parallel using torch.multiprocessing’s Pool class with 10 workers. I’ve intentionally assigned the models so that the lower-numbered GPUs are used first, to make things easier to follow as I run watch -x nvidia-smi.
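To make the fill order concrete: for 15 networks on 5 GPUs, the script’s make_gpu_assigment_list ends up putting 3 models on each GPU, filling from GPU 0 upward. A simplified sketch of that logic (only for the case where the model count divides evenly by the GPU count):

num_nets, num_gpus = 15, 5
per_gpu = num_nets // num_gpus  # 3 models per GPU
assignments = [g for g in range(num_gpus) for _ in range(per_gpu)]
print(assignments)  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]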
When I first run the script, I see a bunch of processes get created on all the GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce … On | 00000000:1E:00.0 Off | N/A |
| 30% 34C P2 65W / 250W | 802MiB / 11019MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce … On | 00000000:3D:00.0 Off | N/A |
| 29% 33C P2 45W / 250W | 802MiB / 11019MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce … On | 00000000:3E:00.0 Off | N/A |
| 29% 32C P2 64W / 250W | 730MiB / 11019MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce … On | 00000000:3F:00.0 Off | N/A |
| 29% 32C P2 63W / 250W | 540MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce … On | 00000000:40:00.0 Off | N/A |
| 29% 33C P2 66W / 250W | 540MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 54849 C python3 799MiB |
| 1 N/A N/A 54849 C python3 799MiB |
| 2 N/A N/A 54849 C python3 733MiB |
| 3 N/A N/A 54849 C python3 537MiB |
| 4 N/A N/A 54849 C python3 537MiB |
+-----------------------------------------------------------------------------+
The first batch of 10 starts running, and so far, so good:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce … On | 00000000:1E:00.0 Off | N/A |
| 29% 39C P2 117W / 250W | 3673MiB / 11019MiB | 75% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce … On | 00000000:3D:00.0 Off | N/A |
| 30% 38C P2 92W / 250W | 3673MiB / 11019MiB | 83% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce … On | 00000000:3E:00.0 Off | N/A |
| 29% 36C P2 113W / 250W | 3673MiB / 11019MiB | 78% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce … On | 00000000:3F:00.0 Off | N/A |
| 29% 34C P2 66W / 250W | 1759MiB / 11019MiB | 22% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce … On | 00000000:40:00.0 Off | N/A |
| 29% 35C P2 66W / 250W | 802MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 54849 C python3 799MiB |
| 0 N/A N/A 54909 C /usr/bin/python3 957MiB |
| 0 N/A N/A 54913 C /usr/bin/python3 957MiB |
| 0 N/A N/A 54914 C /usr/bin/python3 957MiB |
| 1 N/A N/A 54849 C python3 799MiB |
| 1 N/A N/A 54910 C /usr/bin/python3 957MiB |
| 1 N/A N/A 54912 C /usr/bin/python3 957MiB |
| 1 N/A N/A 54915 C /usr/bin/python3 957MiB |
| 2 N/A N/A 54849 C python3 799MiB |
| 2 N/A N/A 54906 C /usr/bin/python3 957MiB |
| 2 N/A N/A 54908 C /usr/bin/python3 957MiB |
| 2 N/A N/A 54911 C /usr/bin/python3 957MiB |
| 3 N/A N/A 54849 C python3 799MiB |
| 3 N/A N/A 54907 C /usr/bin/python3 957MiB |
| 4 N/A N/A 54849 C python3 799MiB |
+-----------------------------------------------------------------------------+
But when the next batch of 5 starts to run, the memory used by the first batch of 10 isn’t freed, and a number of extra processes show up on GPU 0 (the “main” GPU):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce … On | 00000000:1E:00.0 Off | N/A |
| 30% 42C P2 68W / 250W | 7406MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce … On | 00000000:3D:00.0 Off | N/A |
| 29% 41C P2 45W / 250W | 3673MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce … On | 00000000:3E:00.0 Off | N/A |
| 29% 40C P2 65W / 250W | 3673MiB / 11019MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce … On | 00000000:3F:00.0 Off | N/A |
| 29% 41C P2 117W / 250W | 3673MiB / 11019MiB | 77% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA GeForce … On | 00000000:40:00.0 Off | N/A |
| 29% 38C P2 123W / 250W | 3673MiB / 11019MiB | 95% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 54849 C python3 799MiB |
| 0 N/A N/A 54906 C /usr/bin/python3 533MiB |
| 0 N/A N/A 54907 C /usr/bin/python3 533MiB |
| 0 N/A N/A 54908 C /usr/bin/python3 533MiB |
| 0 N/A N/A 54909 C /usr/bin/python3 957MiB |
| 0 N/A N/A 54910 C /usr/bin/python3 533MiB |
| 0 N/A N/A 54911 C /usr/bin/python3 533MiB |
| 0 N/A N/A 54912 C /usr/bin/python3 533MiB |
| 0 N/A N/A 54913 C /usr/bin/python3 957MiB |
| 0 N/A N/A 54914 C /usr/bin/python3 957MiB |
| 0 N/A N/A 54915 C /usr/bin/python3 533MiB |
| 1 N/A N/A 54849 C python3 799MiB |
| 1 N/A N/A 54910 C /usr/bin/python3 957MiB |
| 1 N/A N/A 54912 C /usr/bin/python3 957MiB |
| 1 N/A N/A 54915 C /usr/bin/python3 957MiB |
| 2 N/A N/A 54849 C python3 799MiB |
| 2 N/A N/A 54906 C /usr/bin/python3 957MiB |
| 2 N/A N/A 54908 C /usr/bin/python3 957MiB |
| 2 N/A N/A 54911 C /usr/bin/python3 957MiB |
| 3 N/A N/A 54849 C python3 799MiB |
| 3 N/A N/A 54907 C /usr/bin/python3 957MiB |
| 3 N/A N/A 54909 C /usr/bin/python3 957MiB |
| 3 N/A N/A 54914 C /usr/bin/python3 957MiB |
| 4 N/A N/A 54849 C python3 799MiB |
| 4 N/A N/A 54907 C /usr/bin/python3 957MiB |
| 4 N/A N/A 54910 C /usr/bin/python3 957MiB |
| 4 N/A N/A 54913 C /usr/bin/python3 957MiB |
+-----------------------------------------------------------------------------+
I’m trying to figure out what those other processes are, where they come from, and how to prevent them from being created.
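To help narrow down which entries in nvidia-smi belong to which pool worker, I can log each worker’s PID and the device its model is on from inside do_training. A minimal sketch of that logging (log_worker is a hypothetical helper, not part of the script below):

import os
import torch

def log_worker(network: torch.nn.Module) -> None:
    # Hypothetical helper, called at the top of do_training(): report this
    # worker's PID and the device the model's parameters live on, so the
    # entries in nvidia-smi's process table can be matched to pool workers.
    device = next(network.parameters()).device
    print(f"worker pid={os.getpid()} training on {device}", flush=True)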
Code to recreate the problem:
import torch
import torch.multiprocessing as tmp
from torch.multiprocessing import Pool as MPPool
from typing import Union, Dict, Any, Optional


class EasyDS(torch.utils.data.Dataset):
    def __init__(self, x: torch.Tensor, y: torch.Tensor):
        self.x = x
        self.y = y

    def __len__(self):
        return self.y.shape[0] - 100

    def __getitem__(self, item):
        return self.x[item:item + 100].T, self.y[item + 100]


def move_to_cuda(obj: Union[torch.nn.Module, torch.Tensor],
                 to_str: str) -> Union[torch.nn.Module, torch.Tensor]:
    """
    Move to cuda.
    """
    return obj.to(to_str)


def interpret_use_cuda(use_cuda: Union[bool, int]):
    if torch.cuda.is_available():
        if isinstance(use_cuda, bool):
            if use_cuda:
                return 'cuda:0'  # use zero by default
            else:
                return 'cpu'  # False is an int!
        if isinstance(use_cuda, int) and use_cuda < torch.cuda.device_count():
            return 'cuda:{:d}'.format(use_cuda)
    return 'cpu'
class LSTMNet(torch.nn.Module):
    def __init__(self,
                 input_dim: int,
                 hidden_dim: int,
                 num_layers: int,
                 fc_size: int,
                 n_hidden_fc: int = 1,
                 output_dim: int = 1,
                 dropout: float = 0.,
                 use_cuda: Union[bool, int] = False
                 ):
        super().__init__()
        self.hidden = None  # to save the hidden state
        self.params: Dict[str, Any] = \
            {'input_dim': input_dim,    # number of input channels
             'hidden_dim': hidden_dim,  # number of hidden dimensions
             'num_layers': num_layers,
             'fc_size': fc_size,
             'n_hidden_fc': n_hidden_fc,
             'output_dim': output_dim,  # usually 1
             'dropout': dropout,
             'use_cuda': use_cuda
             }
        self.cuda_str = interpret_use_cuda(use_cuda)
        self.lstm = torch.nn.LSTM(input_size=self.params['input_dim'],
                                  hidden_size=self.params['hidden_dim'],
                                  num_layers=self.params['num_layers'],
                                  dropout=0.0,
                                  batch_first=True  # easy use of Dataset
                                  )
        # Linear layers that go from LSTM to the single output
        self.fc_layers = torch.nn.ModuleList()
        for i in range(self.params['n_hidden_fc'] + 2):
            inf = self.params['fc_size']
            outf = self.params['fc_size']
            if i == 0:
                inf = self.params['hidden_dim']
            if i == self.params['n_hidden_fc'] + 1:
                outf = self.params['output_dim']
            self.fc_layers.append(
                torch.nn.Linear(in_features=inf, out_features=outf, bias=False)
            )
        self.fc_dropout = torch.nn.Dropout(p=self.params['dropout'])
        self._move_to_cuda()

    def forward(self, x: torch.Tensor, h: Optional[torch.Tensor] = None):
        """ x is input, h is hidden state """
        x = torch.permute(x, [0, 2, 1])  # conv shape -> lstm shape
        if h is None:
            lstm_out, self.hidden = self.lstm(x)
        else:
            lstm_out, self.hidden = self.lstm(x, h)
        # y = self.fc_dropout(lstm_out)
        y = self.fc_layers[0](lstm_out)
        for fc_layer in self.fc_layers[1:]:
            y = torch.nn.functional.leaky_relu(y)  # baseline
            y = self.fc_dropout(y)
            y = fc_layer(y)
        return y

    def num_params(self):
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def _move_to_cuda(self):
        """
        Move to cuda if requested at class creation and cuda is available.
        """
        move_to_cuda(self, self.cuda_str)
def make_gpu_assigment_list(n_nets: int, n_gpus: int) -> list[int]:
    n_assigned = n_nets // n_gpus
    n_assigned += 1 if n_nets / n_gpus - n_assigned > 0. else 0
    gpu_assignments = []
    for g in range(n_gpus - 1):
        gpu_assignments.extend([g] * n_assigned)
    gpu_assignments.extend(
        [n_gpus - 1] * (n_nets - len(gpu_assignments))
    )
    return gpu_assignments


def spawn_default_network(n_features: int, use_cuda: bool = False):
    return LSTMNet(
        input_dim=n_features,
        hidden_dim=9,
        num_layers=1,
        fc_size=8,
        n_hidden_fc=1,
        output_dim=1,
        dropout=0.0,
        use_cuda=use_cuda)


def do_training(network: LSTMNet, dataset: EasyDS, n_epochs: int = 20):
    dl = torch.utils.data.DataLoader(
        dataset,
        batch_size=128,
        shuffle=False
    )
    loss_func = torch.nn.MSELoss(reduction='mean')
    optim = torch.optim.Adam(network.parameters(), lr=1e-3)
    for e in range(n_epochs):
        for x, y in dl:
            optim.zero_grad()
            outputs = network(x)[:, -1, :]
            loss = loss_func(outputs, y.reshape(-1, 1))
            loss.backward()
            optim.step()
    return loss.detach().cpu().item()


def pool_with_starmap(
        nets: list[LSTMNet],
        datas: list[EasyDS],
        num_mp_workers: int = 10
):
    things_to_train = [(a, b) for a, b in zip(nets, datas)]
    with MPPool(processes=num_mp_workers) as pool:
        res = pool.starmap(do_training, things_to_train, chunksize=1)
    return res
if __name__ == "__main__":
    tmp.set_start_method('spawn')  # required to copy over tensors (parallel)
    num_gpus = torch.cuda.device_count()
    num_nets = 15
    x = torch.rand(50000, 3)
    y = torch.rand(50000)
    gpu_assignments = ['cuda:{:d}'.format(a) for a in
                       make_gpu_assigment_list(num_nets, num_gpus)]
    print(gpu_assignments)
    datasets = [
        EasyDS(x.to(a), y.to(a)) for a in gpu_assignments
    ]
    networks = [spawn_default_network(n_features=x.shape[-1],
                                      use_cuda=True).to(a)
                for a in gpu_assignments]
    results = pool_with_starmap(networks, datasets)
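
One variation I’d consider (a sketch only; I haven’t verified whether it changes the behavior above) is recycling each worker after a single task via maxtasksperchild, so every model is trained by a fresh process and that process’s GPU memory should be released when it exits:

def pool_with_starmap_fresh_workers(
        nets: list[LSTMNet],
        datas: list[EasyDS],
        num_mp_workers: int = 10
):
    # Same as pool_with_starmap above, except each worker process exits
    # after handling one (network, dataset) pair, so whatever CUDA state
    # it accumulated should go away with it.
    things_to_train = [(a, b) for a, b in zip(nets, datas)]
    with MPPool(processes=num_mp_workers, maxtasksperchild=1) as pool:
        res = pool.starmap(do_training, things_to_train, chunksize=1)
    return res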