What is shared memory?

Hi,

I am trying to train a model using multiprocessing.

In the example below (Multiprocessing best practices — PyTorch 1.6.0 documentation), model.share_memory() is used.

import torch.multiprocessing as mp
from model import MyModel

def train(model):
    # Construct data_loader, optimizer, etc.
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters

if __name__ == '__main__':
    num_processes = 4
    model = MyModel()
    # NOTE: this is required for the ``fork`` method to work
    model.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

What exactly does shared memory mean?

Is a separate model created for each process?

Or does each process use a single shared model?

I don’t really understand the explanation in the document (IPC through shared memory - GeeksforGeeks).

Thank you.


The Wikipedia article on shared memory might be a bit easier to understand.
It’s basically a memory pool that multiple processes can use to exchange information and data.
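
As a rough toy example (my own, not from the linked article): a tensor moved to shared memory can be modified in a child process, and the parent sees the change, because both processes map the same underlying memory.

import torch
import torch.multiprocessing as mp

def child(t):
    # t refers to the same shared-memory buffer as in the parent,
    # so this in-place write is visible to the parent process.
    t += 1

if __name__ == '__main__':
    t = torch.zeros(3)
    t.share_memory_()        # move the underlying storage into shared memory
    print(t.is_shared())     # True

    p = mp.Process(target=child, args=(t,))
    p.start()
    p.join()
    print(t)                 # tensor([1., 1., 1.]) -- modified by the child

In the training example above, model.share_memory() does the same for every parameter of the model, so all four worker processes read and update one and the same set of weights (Hogwild-style training) rather than each process getting its own copy.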

Is there any PyTorch extension that supports GPU-based shared memory on CUDA?

I’m not sure if I understand the question correctly, but you can directly use shared memory in your custom CUDA kernels called in the extension.
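
For example, something along these lines (just a rough sketch of the idea; the block_sum kernel below is my own toy reduction compiled with torch.utils.cpp_extension.load_inline, not an official API):

import torch
from torch.utils.cpp_extension import load_inline

# CUDA source: a block-wise sum that stages data in kernel-level
# __shared__ memory before reducing it.
cuda_src = r"""
__global__ void block_sum_kernel(const float* __restrict__ x,
                                 float* __restrict__ out,
                                 int n) {
    __shared__ float buf[256];              // shared memory, per thread block
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;
    buf[tid] = (idx < n) ? x[idx] : 0.0f;
    __syncthreads();
    // Tree reduction inside shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

torch::Tensor block_sum(torch::Tensor x) {
    x = x.contiguous();
    const int n = x.numel();
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    auto out = torch::zeros({blocks}, x.options());
    block_sum_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""

cpp_src = "torch::Tensor block_sum(torch::Tensor x);"

ext = load_inline(name="block_sum_ext", cpp_sources=cpp_src,
                  cuda_sources=cuda_src, functions=["block_sum"])

x = torch.randn(1024, device="cuda")
print(ext.block_sum(x).sum(), x.sum())   # the two sums should match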

Does this mean I can do the following:

model.share_memory()
model.to(device)

p1 = create_process()
p2 = create_process()

p1.model_train(model)
p2.model_train(model)

No, the kernel-level shared memory is not the system shared memory used for IPC.
The former can be used in CUDA code as described here.

@ptrblck So model.share_memory() enables the model to be shared across different processes in a single GPU instead of multiple GPUs?

model1.to(device1)
model1.share_memory()

p1 = create_process()
p2 = create_process()

p1.model_train(model1)
p2.model_train(model1)

model2 = model1.clone().to(device2)
model2.share_memory()

p3 = create_process()
p4 = create_process()

p3.model_train(model2)
p4.model_train(model2)

May I ask if the above snippet works fine?

tensor.share_memory_() will move the tensor data to shared memory on the host so that it can be shared between multiple processes. It is a no-op for CUDA tensors as described in the docs. I don’t quite understand the “in a single GPU instead of multiple GPUs” as this type of shared memory is not used on the GPU (i.e. it’s not the CUDA kernel-level shared memory).
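
You can check this directly (assuming a machine with at least one CUDA device):

import torch

cpu_t = torch.zeros(4)
print(cpu_t.is_shared())    # False
cpu_t.share_memory_()       # moves the storage into host shared memory
print(cpu_t.is_shared())    # True

cuda_t = torch.zeros(4, device='cuda')
print(cuda_t.is_shared())   # True -- CUDA tensors always count as shared,
cuda_t.share_memory_()      # so this call is a no-op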

I am still a little confused. Let’s say I have 2 GPUs and would like to run 2 processes on each GPU. All 4 of the processes are independent.

May I ask some questions please:

  1. There is no need to use share_memory(), right?
  2. I just need to clone 2 models, one for each GPU, and run PyTorch multiprocessing. Is that correct?
  3. Is there anything else I need to do?

If all processes are independent, e.g., each process is training an independent model and is not using model sharding, data parallel, etc., then you should just launch each process on the desired device. Since GPU resources will be shared between processes running on the same GPU, you would most likely see a slowdown compared to a single process using a single GPU.
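
Something along these lines would do it (a minimal sketch; train_on_device and the tiny placeholder model are my own illustration):

import torch
import torch.multiprocessing as mp

def train_on_device(rank, device):
    # Each process builds its own independent model on its assigned GPU;
    # nothing is shared between processes here.
    model = torch.nn.Linear(10, 2).to(device)
    # ... construct data loader, optimizer, and train as usual ...

if __name__ == '__main__':
    mp.set_start_method('spawn')   # CUDA in subprocesses needs spawn/forkserver, not fork
    # Two GPUs, two independent processes per GPU -> four processes.
    devices = ['cuda:0', 'cuda:0', 'cuda:1', 'cuda:1']
    processes = []
    for rank, device in enumerate(devices):
        p = mp.Process(target=train_on_device, args=(rank, device))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()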


@ptrblck Hi, thank you for your kind reply. In my case, I need to calculate the Hessian matrix of the loss w.r.t. the model weights for each data point (using torch.autograd.grad()). There is no need to update the weights. May I ask whether I need to explicitly clone the model for each process to make sure the gradients will not be shared?
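
For context, what I am doing per data point looks roughly like this (a simplified sketch with a tiny placeholder model; my real model is larger):

import torch

model = torch.nn.Linear(3, 1)            # placeholder model
loss_fn = torch.nn.MSELoss()

x = torch.randn(1, 3)                    # a single data point
y = torch.randn(1, 1)

params = list(model.parameters())
loss = loss_fn(model(x), y)

# First derivatives, keeping the graph so we can differentiate again.
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

# Second derivatives: one row of the Hessian per gradient entry.
rows = []
for g in flat_grad:
    row = torch.autograd.grad(g, params, retain_graph=True)
    rows.append(torch.cat([r.reshape(-1) for r in row]))
hessian = torch.stack(rows)
print(hessian.shape)                     # (num_params, num_params)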