What is shared memory?

Hi,

I am trying to train a model using multiprocessing.

In the example below (Multiprocessing best practices — PyTorch 1.6.0 documentation), model.share_memory() is used.

import torch.multiprocessing as mp
from model import MyModel

def train(model):
    # Construct data_loader, optimizer, etc.
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters

if __name__ == '__main__':
    num_processes = 4
    model = MyModel()
    # NOTE: this is required for the ``fork`` method to work
    model.share_memory()
    processes = []
    for rank in range(num_processes):
        p = mp.Process(target=train, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

What exactly does "shared memory" mean here?

Is a separate model created for each process?

Or do all processes use a single shared model?

I don't really understand the explanation in the document (IPC through shared memory - GeeksforGeeks).

Thank you.


The Wikipedia article on shared memory might be a bit easier to understand.
It's basically a memory pool that multiple processes can use to exchange information and data.
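
To make this concrete, here is a minimal sketch (my own example, not from the linked page): a tensor whose storage lives in shared memory is the same storage in every process, so an in-place update made by a worker is visible in the parent.

import torch
import torch.multiprocessing as mp

def worker(t):
    t.add_(1)  # in-place update of the shared storage

if __name__ == '__main__':
    t = torch.zeros(3)
    t.share_memory_()  # move the underlying storage to shared memory
    p = mp.Process(target=worker, args=(t,))
    p.start()
    p.join()
    print(t)  # tensor([1., 1., 1.]) -- the parent sees the worker's update

model.share_memory() in the example above does the same for every parameter and buffer of the model, which is why all training processes end up updating the same weights.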

Is there any PyTorch extension that supports GPU-based shared memory on CUDA?

I’m not sure if I understand the question correctly, but you can directly use shared memory in your custom CUDA kernels called in the extension.


Does this mean I can call

model.share_memory()
model.to(device)

p1 = create_process()
p2 = create_process()

p1.model_train(model)
p2.model_train(model)

No, the kernel-level shared memory is not the system shared memory used for IPC.
The former can be used in CUDA code as described here.

@ptrblck So model.share_memory() enables the cached model to be shared across different processes on a single GPU rather than across multiple GPUs?

model1.to(device1)
model1.share_memory()

p1 = create_process()
p2 = create_process()

p1.model_train(model1)
p2.model_train(model1)

model2 = model1.clone().to(device2)
model2.share_memory()

p3 = create_process()
p4 = create_process()

p3.model_train(model2)
p4.model_train(model2)

May I ask if the above snippet works fine?


tensor.share_memory_() will move the tensor data to shared memory on the host so that it can be shared between multiple processes. It is a no-op for CUDA tensors as described in the docs. I don’t quite understand the “in a single GPU instead of multiple GPUs” as this type of shared memory is not used on the GPU (i.e. it’s not the CUDA kernel-level shared memory).
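
A quick way to see this (my own illustration) is to check Tensor.is_shared() before and after the call:

import torch

cpu_t = torch.randn(4)
print(cpu_t.is_shared())    # False
cpu_t.share_memory_()
print(cpu_t.is_shared())    # True -- the storage now lives in host shared memory

if torch.cuda.is_available():
    cuda_t = torch.randn(4, device='cuda')
    cuda_t.share_memory_()     # no-op for CUDA tensors
    print(cuda_t.is_shared())  # True -- is_shared() always returns True for CUDA tensors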


I am still a little confused. Let’s say I have 2 GPUs and would like to run 2 processes on each GPU. All 4 of the processes are independent.

May I ask some questions please:

  1. There is no need to use share_memory(), right?
  2. I just need to clone 2 models, one for each GPU, and run PyTorch multiprocessing. Is that correct?
  3. Is there anything else I need to do?

If all processes are independent, e.g. each process trains its own model and does not use model sharding, data parallelism, etc., then you should just launch the processes on the desired devices. Since the GPU resources will be shared between processes, you would most likely see a slowdown compared to a single process using a single GPU.
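
Something along these lines should work (a sketch with a placeholder model; it assumes two visible GPUs and two processes per GPU):

import torch
import torch.multiprocessing as mp

def train_independent(rank, device):
    # Each process builds its own model and optimizer, so nothing is shared.
    model = torch.nn.Linear(10, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(100):
        data = torch.randn(8, 10, device=device)
        labels = torch.randint(0, 2, (8,), device=device)
        optimizer.zero_grad()
        torch.nn.functional.cross_entropy(model(data), labels).backward()
        optimizer.step()

if __name__ == '__main__':
    mp.set_start_method('spawn')  # safer than fork when CUDA is involved
    devices = ['cuda:0', 'cuda:0', 'cuda:1', 'cuda:1']  # 2 processes per GPU
    processes = []
    for rank, device in enumerate(devices):
        p = mp.Process(target=train_independent, args=(rank, device))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

Note that share_memory() is not needed here, since nothing is shared between the processes.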

1 Like

@ptrblck Hi, thank you for your kind reply. In my case, I need to calculate the Hessian matrix of the loss w.r.t. the weights of the model for each data point (using torch.autograd.grad()). There is no need to update the weights. May I ask whether I need to explicitly clone the models for each process to make sure the gradients are not shared? Roughly, what I am doing per data point looks like the sketch below.
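
(Illustrative only, with a placeholder model and loss; the real code loops over my dataset.)

import torch

model = torch.nn.Linear(3, 1)
x, y = torch.randn(1, 3), torch.randn(1, 1)

params = list(model.parameters())
loss = torch.nn.functional.mse_loss(model(x), y)

# First-order gradients, keeping the graph so they can be differentiated again.
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

# One Hessian row per gradient entry.
rows = []
for g in flat_grad:
    row = torch.autograd.grad(g, params, retain_graph=True)
    rows.append(torch.cat([r.reshape(-1) for r in row]))
hessian = torch.stack(rows)
print(hessian.shape)  # (num_params, num_params), here (4, 4)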

If you don't want to share gradients, you need .detach().

@ptrblck

I am beginning my transition from TensorFlow to PyTorch!
I have a similar GPU question.
I am on Windows 10 with an NVIDIA RTX A2000 GPU (4 GB) and 64 GB of RAM (50% of which goes to shared GPU memory/VRAM), so I expected effectively 36 GB (32 + 4) of GPU memory in total.

  1. When I just create tensors and commit them to 'cuda', I can see them using the shared memory as well.
  2. Then I created a model and committed it to the GPU, so both my data and the model are on the GPU, but it does not seem to use the 36 GB of shared memory while training the model.

I am using a Conv3d model; the input is a 3D volume (1, 256, 256, 128) of uint8, training with batch_size=2.

Does PyTorch automatically use the shared memory if you commit your model as well as your data to CUDA? Is there anything I need to do, apart from using model.to() as well as tensor.to() to push everything to the GPU?

ERROR:
return F.conv3d(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 15.51 GiB. GPU 0 has a total capacty of 4.00 GiB of which 0 bytes is free. Of the allocated memory 4.25 GiB is allocated by PyTorch, and 1.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Could you describe what exactly you mean by “shared memory”?
Are you assuming the GPU will use its device memory as well as the host memory as unified memory?
If so, that’s not the case and you would need to use the device memory only (4GB) for your model parameters as well as GPU computation.
You could try to offload tensors to the CPU, but you would pay for the transfer each time.
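
For illustration, manual offloading could look like this rough sketch (my own example, assuming a CUDA device is available): keep large tensors in (pinned) host RAM and copy chunks to the device only when needed, paying the host-to-device transfer each time.

import torch

device = 'cuda'
# Large data stays in host RAM (pinned for faster async copies) instead of
# trying to fit into the 4 GB of device memory.
big_cpu = torch.randn(8, 256, 256, 128, pin_memory=True)

for i in range(big_cpu.size(0)):
    chunk = big_cpu[i:i + 1].to(device, non_blocking=True)  # pay the transfer
    out = chunk.mean()  # placeholder for the actual GPU computation
    del chunk           # release the device memory before the next chunk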

To start with, my bad: the PyTorch model/data is using dedicated GPU memory as well as shared memory under Windows 10 (and I am sure it will on Windows 10+).

On to shared memory: most Intel laptops have an onboard video/GPU chip, but it does not have any dedicated memory (what we call VRAM), so it depends on system RAM. Under Windows, by default, 50% of your RAM is reserved for it, i.e. if you have 16 GB of RAM, 8 GB is reserved as VRAM. You can see this in the Task Manager under Performance -> GPU.

So if you have a dedicated GPU with 4 GB of dedicated memory and a Windows laptop with 64 GB of RAM, your effective available GPU memory is 36 GB (32 GB + 4 GB), which you should be able to use for your model/data. I can confirm PyTorch uses it. On the other hand, TensorFlow does not support GPUs under Windows after v2.10 (hence my transition to PyTorch, as an aside).

PyTorch doesn’t perform any of this offloading and I would be careful calling it shared memory as it usually refers to the shared memory in a CUDA kernel when talking about GPUs.
In any case, this might be a Windows-specific driver feature, which starts with page offloading once a threshold is met, but I’m not using Windows so don’t know if and when it happens.
You should however note that it will come with a large performance penalty.

Yes, the 50% thing is Windows specific. My quest was to use that available memory to train on the GPU (RAM, VRAM, whatever; the resource is there, so I should be able to use it), maybe with a hit on performance. Mere mortals like me can't afford an NVIDIA A100 (with 40-80 GB of memory), so PyTorch is looking good with my computing resources as of now, and it looks to be right up my alley: control over every single line of code you write.

Here we go, screenshots of the PyTorch model/dataset memory usage:

  1. Before: Before.JPG - Google Drive
  2. Model training started: model_running.JPG - Google Drive