Run PyTorch on Jupyter notebook

ph0123 · April 29, 2020, 11:16am

Hi,

I am trying to run the example from the tutorial with the “gloo” backend and point-to-point communication.

"""run.py:"""
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        # Send the tensor to process 1
        dist.send(tensor=tensor, dst=1)
    else:
        # Receive tensor from process 0
        dist.recv(tensor=tensor, src=0)
    print('Rank ', rank, ' has data ', tensor[0])

def init_process(rank, size, fn, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
    print("done")

When I run it in a Jupyter notebook, only “done” is printed.
How can I run it with Python?
Thanks,

mrshenli · April 29, 2020, 2:21pm

I tried this with Colab, but cannot reproduce the problem. Sometimes there is weird behavior when using multiprocessing in a notebook. If you launch this program directly from the command line, are the outputs as expected?

ph0123 · April 29, 2020, 6:08pm

Yes, it works with Python.
But I want to ask about running it on Jupyter.
How does it work on Jupyter? That is my question.
Thanks,

mrshenli · April 29, 2020, 6:42pm

As I cannot reproduce the error in my Jupyter notebook, I can only guess why the messages from the subprocesses are not shown. Given that the main process prints “done”, I would assume the subprocesses are launched correctly. But since the subprocesses didn’t print their messages, it could be either 1) the subprocess crashed, or 2) the subprocess is not printing to stdout. For 1), you can check the exitcode of the subprocess; adding more logs will also help. For 2), you will need to check your local configuration to see whether stdout is redirected, or explicitly redirect the print output to a file.
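Both checks can be sketched with the standard library alone. The helper names below (`log_path`, `run_logged`, `failed_ranks`) and the `rank_<n>.log` file naming are made up for illustration and are not part of PyTorch; this is a minimal sketch, assuming each worker takes its rank as the first argument, as `init_process` does in the script above.

```python
import contextlib
import traceback


def log_path(rank):
    # Hypothetical naming scheme: one log file per rank in the working directory.
    return f"rank_{rank}.log"


def run_logged(fn, rank, *args):
    """Call fn(rank, *args) with print output redirected to a per-rank file.

    Any exception is written to the same file and then re-raised, so the
    subprocess still ends with a non-zero exit code.
    """
    with open(log_path(rank), "w") as log, \
            contextlib.redirect_stdout(log), contextlib.redirect_stderr(log):
        try:
            fn(rank, *args)
        except Exception:
            traceback.print_exc(file=log)
            raise


def failed_ranks(processes):
    """Return (rank, exitcode) pairs for processes that did not exit with 0.

    Expects objects with an `exitcode` attribute, in rank order, such as
    multiprocessing.Process instances after join(). A process that has not
    finished yet has exitcode None and is reported as well.
    """
    return [(rank, p.exitcode)
            for rank, p in enumerate(processes)
            if p.exitcode != 0]
```

With the script from the first post, the launch line would become `Process(target=run_logged, args=(init_process, rank, size, run))`, and `print(failed_ranks(processes))` could follow the `join()` loop. A non-empty list points to case 1), and the matching `rank_<n>.log` file should then contain the traceback. If the notebook starts subprocesses by spawning a fresh interpreter, these helpers may need to live in an importable `.py` file rather than in a notebook cell.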

ph0123 · April 29, 2020, 6:58pm

Hi,
Thank you so much!

ph0123 · May 5, 2020, 9:02am

Hi all,

I found the solution.
I ran Jupyter on a MacBook, and it worked.
On Windows, the program only printed “done”.

Thanks,

mrshenli · May 5, 2020, 2:10pm

The PyTorch distributed package does not support Windows yet. So most likely the subprocess crashed, as init_process_group is not available on Windows.
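One way to confirm this on a given machine before launching any subprocesses is to ask the installed build directly. The sketch below relies on `torch.distributed.is_available()`; the wrapper function `distributed_status` is a made-up name for illustration, and the result depends on the PyTorch version and platform in use.

```python
import platform


def distributed_status():
    """Return a short description of whether torch.distributed can be used here."""
    try:
        import torch.distributed as dist
    except ImportError:
        return "PyTorch is not installed in this environment"
    if dist.is_available():
        return f"torch.distributed is available on {platform.system()}"
    return f"torch.distributed is not available on {platform.system()}"


print(distributed_status())
```

If this reports that the package is not available, calling `init_process_group` in the subprocesses can be expected to fail, which would match the "only done is printed" symptom.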