Access data of a GPU process from main

TT_YY · September 8, 2020, 8:40am

Hi.

I have a program that uses PyTorch’s DistributedDataParallel (DDP).
It’s working well, but I do not know how to access data in the GPU process from main().

In particular, a GPU process, train(), produces a list of loss values, and I want to plot it in main() after returning from spawn(). However, I do not know how to access the list built in train() on the GPU from main() on the CPU.

If I use a global variable, it should work, but it does not seem to be the best answer. I understand that printing the loss can be done by gpu[0], and maybe even plotting the graph too. But I want to do many tasks to analyze the results in main().

I appreciate any information or examples. Thank you.

TT_YY · September 10, 2020, 4:45am

I have checked communication using args and global variables.
Neither of them works for transmitting data from a GPU process to main() when I use DDP.

Usually, arguments hold references to variables, so main() and the functions it calls can share the same objects. But when the processes are spawned in DDP, it seems that args are deep-copied, and there are no common variables with the same ids between main() and the processes.

A global variable can be declared in the process, but it’s not shared with main().

Does everybody using DDP plot the loss charts in a spawned process on GPU?
I have no idea how to send the loss list from the process to main().

Please advise.

mrshenli · September 15, 2020, 3:18am

This is true, because Python global variables are a per-process concept.
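
A minimal sketch of that per-process behaviour, assuming the standard-library multiprocessing module as a stand-in for torch.multiprocessing (which wraps it). The names train and parent_view_after_child are hypothetical, chosen only for this illustration. The child appends to the list it received, but under the spawn start method the arguments are pickled and rebuilt in the child, so the parent's list should be unaffected.

```python
import multiprocessing as mp


def train(rank, losses):
    """Runs in the child process and appends to the list it was given."""
    # Under the spawn start method, `losses` here is a copy rebuilt from the
    # pickled arguments, not the object that lives in the parent process.
    losses.append(0.5)
    return losses


def parent_view_after_child():
    """Start one spawned child and return the parent's list afterwards."""
    ctx = mp.get_context("spawn")
    losses = []
    child = ctx.Process(target=train, args=(0, losses))
    child.start()
    child.join()
    # The child mutated its own copy; the parent's list was never touched.
    return losses
```

Because spawn re-imports the main module in each child, parent_view_after_child() should be called from under an `if __name__ == "__main__":` guard. Calling train(0, losses) directly in the same process does mutate the list, which is why the same code behaves differently once it is spawned.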

"Does everybody using DDP plot the loss charts in a spawned process on GPU?"

Sending the loss list to main() can be done using torch.multiprocessing.SimpleQueue. E.g., let the main process create the queue, pass it to the child process, and then let the child process put the loss object into the queue. The main process can then read it from the queue with get().

The test below can serve as an example:

https://github.com/pytorch/pytorch/blob/2c4b4aa81bc8dba8272e9c7190edcaa3e114ec15/test/test_multiprocessing.py#L580-L600

          def test_event_multiprocess(self):
              event = torch.cuda.Event(enable_timing=False, interprocess=True)
              self.assertTrue(event.query())
          
              ctx = mp.get_context('spawn')
              p2c = ctx.SimpleQueue()
              c2p = ctx.SimpleQueue()
              p = ctx.Process(
                  target=TestMultiprocessing._test_event_multiprocess_child,
                  args=(event, p2c, c2p))
              p.start()
          
              c2p.get()  # wait for until child process is ready
              torch.cuda._sleep(50000000)  # spin for about 50 ms
              event.record()
              p2c.put(0)  # notify child event is recorded
          
              self.assertFalse(event.query())
              c2p.get()  # wait for synchronization in child
              self.assertTrue(event.query())
              p.join()
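
Applied to the loss-list question, the same queue pattern might look like the sketch below. It is an illustration under assumptions, not code from the thread: it uses the standard-library multiprocessing module (which torch.multiprocessing wraps) so it runs without a GPU, the loss values are placeholders, and train, collect, and run are hypothetical names. With torch.multiprocessing.spawn, the queue would presumably be passed through its args tuple instead of building the processes by hand.

```python
import multiprocessing as mp


def train(rank, num_epochs, queue):
    """Stand-in for a DDP worker: builds a loss list and sends it to the parent."""
    losses = []
    for epoch in range(num_epochs):
        # Placeholder for a real training step. With PyTorch this would be
        # something like loss.item(), so only plain floats cross processes.
        losses.append(1.0 / (epoch + 1 + rank))
    queue.put((rank, losses))


def collect(queue, world_size):
    """Read one (rank, losses) message per worker and index them by rank."""
    results = {}
    for _ in range(world_size):
        rank, losses = queue.get()  # blocks until a worker has put its result
        results[rank] = losses
    return results


def run(world_size=2, num_epochs=3):
    """Spawn the workers, gather their loss lists in the parent, then join."""
    ctx = mp.get_context("spawn")
    queue = ctx.SimpleQueue()
    workers = [
        ctx.Process(target=train, args=(rank, num_epochs, queue))
        for rank in range(world_size)
    ]
    for worker in workers:
        worker.start()
    # Drain the queue before join() so no worker blocks on a full pipe.
    results = collect(queue, world_size)
    for worker in workers:
        worker.join()
    return results
```

As with any spawn-based code, run() should be called from under an `if __name__ == "__main__":` guard. The returned dictionary maps each rank to its loss list, so main() can plot or analyze the values on the CPU without touching the GPU processes again.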

seungjun · September 15, 2020, 8:10am

I’m not sure if my case is what you want, but I use GPU communication functions to plot the loss graph during training.

I use a custom Loss class that inherits from nn.modules.loss._Loss.
It calculates the loss, stores the record, and plots the loss graph.
The loss values are synchronized inside the Loss class.
Thus, I don’t have to send values to the main() scope.

Here’s my public GitHub code.

The all_reduce function performs the sync, and the trainer calls it at the end of each epoch.

TT_YY · September 16, 2020, 8:25am

Thank you, Shen Li.

This is what I was looking for. I will try the sample code and try to use SimpleQueue() in my program. I hope it works!

Thank you.

TT_YY · September 16, 2020, 8:46am

Thank you, seungjun.

I appreciate your code.
I understand that the all_reduce function transmits the loss between GPUs.

I am trying to evaluate performance of optimizers by changing many factors such as batch size and learning rate. Therefore, I have multiple loops to change the variables outside the optimization loop. Also, I have to collect many kinds of data along with loss and accuracy to analyze the details of the optimizers and plot them.

It felt a bit odd to run such looping tasks, which are not matrix multiplications, and all the plotting on an expensive GPU and its memory. So, I tried to implement all the outer loops, analysis, and plotting tasks in main(), which requires getting the information from the spawned GPU processes.

Thank you.