Data loader error as soon as I log out of remote machine

I have tried nohup, tmux, and screen, but as soon as I log out of the remote machine, the PyTorch DataLoader dies. Here’s the error stack:

Traceback (most recent call last):
  File "miika_method_construction_only.py", line 333, in <module>
    num_epochs=NUM_EPOCHS)
  File "miika_method_construction_only.py", line 158, in train_model
    for data in dset_loaders[phase]:
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 336, in __next__
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 357, in _process_next_batch
PermissionError: Traceback (most recent call last):
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 106, in _worker_loop
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 106, in <listcomp>
  File "/data/graphics/SpandanGraphsProject/Spandan_Experiments/Bayesian_Tool_Learning/neural_net/parameter_loader_all_tools.py", line 208, in __getitem__
    sample = self.loader(path)
  File "/data/graphics/SpandanGraphsProject/Spandan_Experiments/Bayesian_Tool_Learning/neural_net/parameter_loader_all_tools.py", line 257, in default_loader
    return pil_loader(path)
  File "/data/graphics/SpandanGraphsProject/Spandan_Experiments/Bayesian_Tool_Learning/neural_net/parameter_loader_all_tools.py", line 239, in pil_loader
    img = Image.open(f)
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/PIL/Image.py", line 2591, in open
  File "/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/PIL/Image.py", line 378, in preinit
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 780, in get_code
  File "<frozen importlib._bootstrap_external>", line 832, in get_data
PermissionError: [Errno 13] Permission denied: '/afs/csail.mit.edu/u/s/smadan/miniconda3/envs/py_36_tens_gpu/lib/python3.6/site-packages/PIL/BmpImagePlugin.py'

I made sure that all the files under python3.6 (recursively, including all children) have read and write permissions by running chmod -R 777.

This is happening in a conda environment with PyTorch 0.4.1. Any leads at all?

Thanks,
Spandan

The path here suggests you are running Python from a remote server or network drive. How do you log in to the server, and how is it set up? I guess it depends on how user drive access works at your institute.

My wild guess is that when you log in to a machine, your particular logical drive is plugged in automatically for you. When you log out, the logical drive is automatically removed as well.

Have you tried mosh?

Yep, you were correct. This was happening because we use AFS here, and my ticket would expire as soon as I logged out. The correct way is to use an internal system that gives you a longer ticket with continued access to AFS!

:wink: longtmux saves the day.
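
For anyone who lands here with the same symptom: since the failure was an expired ticket rather than file mode bits, a quick check that runs inside the job can confirm whether the environment files are still readable after you log out. Below is a minimal sketch using only the standard library. `unreadable_paths` is a hypothetical helper name, not part of PyTorch or PIL, and the paths to probe are whatever you pass in (for example, the file named in the PermissionError above).

```python
import sys


def unreadable_paths(paths):
    """Return (path, error name) pairs for every path that cannot be read.

    Hypothetical helper for illustration: it only tries to open each file
    and read one byte, which is enough to surface a PermissionError caused
    by lost filesystem credentials.
    """
    failed = []
    for path in paths:
        try:
            with open(path, "rb") as handle:
                handle.read(1)
        except OSError as exc:  # PermissionError and FileNotFoundError are OSError subclasses
            failed.append((path, exc.__class__.__name__))
    return failed


if __name__ == "__main__":
    # Paths to probe are taken from the command line.
    for path, error_name in unreadable_paths(sys.argv[1:]):
        print(f"cannot read {path}: {error_name}")
```

Calling something like this at the start of each epoch inside the tmux or screen session, and logging the result, should show whether access disappears at logout. It only diagnoses the problem; keeping the ticket alive still has to be handled by whatever your site provides.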