Sample code in the distributed applications tutorial fails with errors I cannot fix

GGPYTORCH000 · May 28, 2024, 6:29am

I tried running the code in this tutorial, but the random() call raises an error:
https://pytorch.org/tutorials/intermediate/dist_tuto.html

(Search for random() on the page above and try executing it.)

I tried many ways to replace it with an equivalent but cannot find a substitute that makes the code run error-free. The tutorial is poorly written: it does not say which library the call should come from.

H-Huang · May 28, 2024, 2:43pm

I’m guessing it should be this? random — Generate pseudo-random numbers — Python 3.12.3 documentation

Updated the tutorials here: Include import for dist_tuto by H-Huang · Pull Request #2880 · pytorch/tutorials · GitHub. Feel free to send in PRs for other issues you find.

GGPYTORCH000 · May 28, 2024, 4:34pm

Do you own this tutorial, or are you just suggesting from the sidelines? If the former, you should test your code before publishing.

GGPYTORCH000 · May 28, 2024, 4:42pm

So how should I change the code? random() causes an error.

H-Huang · May 28, 2024, 4:50pm

from random import Random
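For context, a minimal sketch of how that import is typically used when partitioning a dataset: a private `Random` instance is seeded so every rank produces the same shuffle. The helper names `shuffled_indexes` and `split_indexes` are hypothetical and only illustrate the pattern; they are not taken from the tutorial.

```python
from random import Random  # the import the snippet was missing


def shuffled_indexes(data_len, seed=1234):
    """Return indexes 0..data_len-1 shuffled reproducibly (hypothetical helper)."""
    rng = Random()   # private generator, independent of the global random state
    rng.seed(seed)   # same seed on every rank gives the same shuffle on every rank
    indexes = list(range(data_len))
    rng.shuffle(indexes)
    return indexes


def split_indexes(indexes, sizes):
    """Cut the index list into consecutive chunks, one per fraction in sizes."""
    total = len(indexes)
    partitions = []
    for frac in sizes:
        part_len = int(frac * total)
        partitions.append(indexes[:part_len])
        indexes = indexes[part_len:]
    return partitions


if __name__ == "__main__":
    # Two equal partitions, as one might use for two ranks.
    parts = split_indexes(shuffled_indexes(10), [0.5, 0.5])
    print([len(p) for p in parts])
```

Note that `Random` (capitalized) is the generator class; the lowercase `random.random()` function is a different thing and does not offer `seed`/`shuffle` on its return value.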

GGPYTORCH000 · May 28, 2024, 6:39pm

That might have worked, but errors continue to occur with this sample code. I fixed a number of them and am now seeing:

init_process: rank, size: 0 , 2
init_process: rank, size: 1 , 2
run: rank/size: 1 / 2
run: rank/size: 0 / 2
GG: size/bsz: 2 / 64
GG: size/bsz: 2 / 64
Process Process-1:
Traceback (most recent call last):
  File "/home/miniconda3/envs/root-test/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/miniconda3/envs/root-test/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/2-distributed-data-parallelism/dist-app-training.py", line 160, in init_process
    fn(rank, size)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/2-distributed-data-parallelism/dist-app-training.py", line 139, in run
    num_batches = ceil(len(train_set.dataset) / float(bsz))
NameError: name 'ceil' is not defined
Process Process-2:
Traceback (most recent call last):
  File "/home/miniconda3/envs/root-test/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/miniconda3/envs/root-test/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/2-distributed-data-parallelism/dist-app-training.py", line 160, in init_process
    fn(rank, size)
  File "/root/extdir/gg/git/codelab/gpu/ml/pytorch/distributed/tutorials/2-distributed-data-parallelism/dist-app-training.py", line 139, in run
    num_batches = ceil(len(train_set.dataset) / float(bsz))
NameError: name 'ceil' is not defined
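This NameError is the same kind of problem as the earlier one: `ceil` lives in the standard `math` module and has to be imported. A minimal sketch of the failing line with the import in place follows; `batches_per_epoch` is a hypothetical wrapper, and the sample count of 30000 is an assumption for illustration (only the batch size of 64 comes from the log above).

```python
from math import ceil  # without this import, ceil(...) raises NameError


def batches_per_epoch(num_samples, bsz):
    """Number of mini-batches needed to cover num_samples (hypothetical helper)."""
    return ceil(num_samples / float(bsz))


if __name__ == "__main__":
    # 30000 is an assumed per-rank sample count; bsz=64 matches "size/bsz: 2 / 64".
    print(batches_per_epoch(30000, 64))  # 469
```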

GGPYTORCH000 · May 28, 2024, 6:43pm

OK, fixed. Everything appears to be working, but not before fixing some obvious and some less obvious errors.