Random training freeze / shutdown

Hi!
I’m experiencing one, possibly two, problems with my training.
First problem: training freeze
It happens at random, even after hours of training (up to 12 h, 5 epochs).
After it happens the CPU/GPU usage is very low, but the process is still running.
No warnings or errors.

Second problem: training shutdown
Experienced only once, after trying to restart training.
Error:

malloc(): mismatching next->prev_size (unsorted)
Aborted (core dumped)

Unfortunately I’ve modified a lot of components of the model since the last working version, so I’m not able to identify the source exactly. I suspect it has something to do with my modified version of collate_fn:

import torch
from torch._six import container_abcs, int_classes, string_classes
from torch.utils.data._utils.collate import (
    default_collate_err_msg_format,
    np_str_obj_array_pattern,
)


def default_collate_mod(batch):
    r"""Puts each data field into a tensor with outer dimension batch size."""

    elem = batch[0]
    elem_type = type(elem)

    if isinstance(elem, torch.Tensor):
        out = None
        elem_size = elem.size()
        if not all(el.shape == elem_size for el in batch):
            # Variable-size tensors: return the per-sample lengths plus the
            # concatenated batch, so the result can be fed to a linear layer
            # and then split back into samples.
            return ([el.shape[0] for el in batch], torch.cat(batch))
        if torch.utils.data.get_worker_info() is not None:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum([x.numel() for x in batch])
            storage = elem.storage()._new_shared(numel)
            out = elem.new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
            # array of string classes and object
            if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
                raise TypeError(default_collate_err_msg_format.format(elem.dtype))

            return default_collate_mod([torch.as_tensor(b) for b in batch])
        elif elem.shape == ():  # scalars
            return torch.as_tensor(batch)
    elif isinstance(elem, float):
        return torch.tensor(batch, dtype=torch.float64)
    elif isinstance(elem, int_classes):
        return torch.tensor(batch)
    elif isinstance(elem, string_classes):
        return batch
    elif isinstance(elem, container_abcs.Mapping):
        return {key: default_collate_mod([d[key] for d in batch]) for key in elem}
    elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
        return elem_type(*(default_collate_mod(samples) for samples in zip(*batch)))
    elif isinstance(elem, container_abcs.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            # Don't raise the usual size-mismatch error; just return the list as-is.
            return batch
        transposed = zip(*batch)
        return [default_collate_mod(samples) for samples in transposed]

    raise TypeError(default_collate_err_msg_format.format(elem_type))
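
For reference, the variable-size branch is meant to be consumed roughly like this (a toy sketch; the feature size and the linear layer here are made up, not from my actual model):

import torch

# Two "samples" with different numbers of rows but the same feature size.
lengths, flat = default_collate_mod([torch.randn(3, 16), torch.randn(5, 16)])
out = torch.nn.Linear(16, 8)(flat)             # one pass over the concatenated batch
per_sample = torch.split(out, lengths, dim=0)  # split back into per-sample tensors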

Versions:
PyTorch 1.7
CUDA 11.1

Edit:
It appears there is also some problem with my GPU’s memory; maybe that is what caused the second error?
The amount of free memory looks lower than expected:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2060    On   | 00000000:06:00.0  On |                  N/A |
| 42%   47C    P2    28W / 160W |   3724MiB /  5931MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1093      G   /usr/lib/xorg/Xorg                100MiB |
|    0   N/A  N/A      1516      G   /usr/bin/plasmashell               73MiB |
|    0   N/A  N/A      2506      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      2557      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      3736      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      5249      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      5366      G   /usr/lib/firefox/firefox            2MiB |
|    0   N/A  N/A      5423      G   /usr/lib/firefox/firefox            2MiB |
+-----------------------------------------------------------------------------+

Thanks!

Check if your memory is stable with some memtest program.

Do you mean system RAM or GPU memory?
I’m actually encountering a Segmentation fault (core dumped). My fear is that it’s related to this problem: https://github.com/pytorch/pytorch/issues/31758.
I’m running the script under a debugger; I’ll update the post as soon as I get the error again.
Also, I got the error just by running an empty training loop that only loads the data into RAM.

System RAM issues can manifest as all kinds of weird/random errors; your malloc error is suspicious in that regard. They may also only show up when the power supply is under stress.

gdb reports “No stack.”
I’ll run a RAM memtest.
Do you have any advice on the memtester settings, i.e. how many MB and how many iterations? I have 32 GB of RAM.

That’s program-specific; just run it for some time (at least 5 minutes, I’d guess).

I got this; could it be the cause of my issue?

Loop 37/50:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : testing  41FAILURE: 0xffffffffffffffff != 0xffffffffffffdfff at offset 0x00f5dbf8.
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : ok         
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Loop 45/50:
  Stuck Address       : ok         
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok         
  Block Sequential    : ok         
  Checkerboard        : ok         
  Bit Spread          : ok         
  Bit Flip            : testing  76FAILURE: 0xffffffffffffddff != 0xfffffffffffffdff at offset 0x003f0c00.
  Walking Ones        : ok         
  Walking Zeroes      : ok         
  8-bit Writes        : ok
  16-bit Writes       : ok

Yes. Try playing with the BIOS memory settings (CL/frequency); otherwise you’ll need to find the problematic hardware part.

I recently added some RAM and had checked the BIOS settings; it was probably defective from the start.
Thanks for your help.

Unfortunately I’m back, this time with nothing wrong on the memtest.
The file names printed below come from this part of the code; they are printed just to make sure the problem is not related to corrupted data (I can load those files just fine):

try:
    with lz4.frame.open(f'{self.root_dir}/{name}', 'rb') as f:
        data = pickle.load(f)
except:
    # bare except: just print the file name if anything goes wrong
    print(name)

It gets stuck without doing anything; then, when I stop the program:

Traceback (most recent call last):
  File "train_AD.py", line 332, in <module>
sample_5383326.lz4
sample_12725641.lz4
sample_4159972.lz4
sample_2696410.lz4
    train_net(args,cfg, DEBUG, model)
  File "train_AD.py", line 211, in train_net
    for data in train_dataloader:
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
    success, data = self._try_get_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt

If particular files fail, it may be some bit corruption not detected by pickle. Otherwise, try using a non-zero timeout argument for the DataLoader and make sure exceptions are not being swallowed (for example, with Jupyter, stderr is not printed in the browser, I think). You may also check whether this is multiprocessing-related by disabling workers, as in the sketch below.
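
Something along these lines (just a sketch; `my_dataset`, the batch size and the worker counts are placeholders, not your actual config):

from torch.utils.data import DataLoader

# Step 1: rule out multiprocessing by loading in the main process.
loader = DataLoader(my_dataset, batch_size=128, num_workers=0,
                    collate_fn=default_collate_mod)

# Step 2: if workers are needed, give them a timeout so a dead or hung worker
# raises "DataLoader timed out" instead of blocking forever.
# loader = DataLoader(my_dataset, batch_size=128, num_workers=4, timeout=60,
#                     collate_fn=default_collate_mod)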

I’m just running the script in the terminal; do I have to worry about exceptions being swallowed?
Also, I’m not sure about the way I wrote the try/except. Is there a specific exception I should handle so that other important exceptions aren’t missed?
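
For instance, would narrowing it down to something like this be a step in the right direction? (Just a sketch; the `load_sample` wrapper and the exact exception list are my own guesses. I also realize a bare `except:` swallows KeyboardInterrupt too, which is probably why the file names get printed when I stop the run.)

import pickle
import lz4.frame

def load_sample(path):
    """Sketch: load one lz4-compressed pickle, catching only errors that
    plausibly mean a bad file, and re-raising so the failure stays visible."""
    try:
        with lz4.frame.open(path, 'rb') as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError) as e:
        print(f'failed to load {path}: {e!r}')
        raise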

Edit: the error below is probably just due to not having enough RAM. I will see whether any error happens with 0 workers.

With 8 workers the error below happens within 10 seconds, probably the time needed to load a single batch; I’m not sure whether it is the same error as the previous one.
With 0 workers I got no error within 16 minutes of execution. I will leave it running and see whether something happens.
I can’t say it is guaranteed never to happen even after hours and hours of training, since I previously managed to run the code for several hours with 4 workers.
I haven’t experimented enough with this, but it is clear that I never manage to fetch even the first batch with 8 workers, so it looks strongly dependent on the number of workers.

Traceback (most recent call last):
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 88, in rebuild_tensor
    t = torch._utils._rebuild_tensor(storage, storage_offset, size, stride)
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/_utils.py", line 133, in _rebuild_tensor
    return t.set_(storage, storage_offset, size, stride)
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4157) is killed by signal: Killed. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train_AD.py", line 333, in <module>
    train_net(args,cfg, DEBUG, model)
  File "train_AD.py", line 211, in train_net
    for data in train_dataloader:
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in _get_data
    success, data = self._try_get_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 4157) exited unexpectedly

Below is what happens with a non-zero timeout:

Traceback (most recent call last):
  File "train_AD.py", line 334, in <module>
    train_net(args,cfg, DEBUG, model)
  File "train_AD.py", line 212, in train_net
    for data in train_dataloader:
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1021, in _get_data
    raise RuntimeError('DataLoader timed out after {} seconds'.format(self._timeout))
RuntimeError: DataLoader timed out after 1 seconds

It is either OOM or corrupted unpickled data. In the latter case, it is hard to guess the reason, assuming the files themselves are fine.

That is OOM. To be honest, I think every error is OOM.
The problem is more or less solved (I hope) by reducing the number of workers to 3 (12 h of training right now).
With 3 workers I have 5 GB of free memory; I’m not really sure why it would need that much headroom, but with 4 workers it crashes after a few hours.
It’s not a memory leak either, since the memory usage doesn’t change over time.

I don’t think it’s worth investigating any further, since it works fine with 3 workers and I wouldn’t even know where to start.
Thanks for the help.

I’m sorry for making this issue more confusing than it’s supposed to be.
In the end, with 3 workers I had to restart my PC, because even the UI froze.
I’m currently running the script with 0 workers, and for now it’s working. With 3 workers I was also experiencing lag spikes and what looked like a UI refresh (I don’t know how to describe it better).
Following your advice, here is what happens if I use a timeout in the DataLoader:

Traceback (most recent call last):
  File "train_AD.py", line 332, in <module>
    train_net(args,cfg, DEBUG, model)
  File "train_AD.py", line 215, in train_net
    for inputs,x_2,target_availabilities,target_positions,targert_yaws in train_dataloader:
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/lorenzo/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1021, in _get_data
    raise RuntimeError('DataLoader timed out after {} seconds'.format(self._timeout))
RuntimeError: DataLoader timed out after 3 seconds

I suggested timeouts only because they reveal worker crashes. If you also have memory leaks or something similar, you should investigate that; you’re probably timing out as the system runs out of resources (swap would be my first guess).

RAM and GPU memory are both about half free, and their usage doesn’t seem to increase over time. Also, the only thing I do in the dataloader is load an lz4 archive and modify numpy arrays.
The freezing has happened to me before, simply because the RAM was spilling into swap, but that doesn’t seem to be the case here.
Going from 3 workers and batch size 128 to 4 workers and batch size 32 doesn’t seem to help with the workers timing out.

Could be a GPU issue with growing memory fragmentation or shared use (e.g. for display output). Worker timeouts are probably just symptoms of something else; I’d set the timeout to 0 or something like 60, and just monitor system performance, for example with a small logger like the sketch below.
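
A minimal sketch using psutil (an extra dependency, not something in your snippets) to log RAM/swap pressure from the training loop:

import psutil

def log_system_memory(step):
    """Print RAM and swap usage so an out-of-memory / swap spiral shows up
    in the log right before a worker dies or times out."""
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f'step {step}: RAM {vm.percent:.0f}% used '
          f'({vm.available / 2**30:.1f} GiB available), swap {sw.percent:.0f}% used')

# e.g. call log_system_memory(i) every N batches inside the training loop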

Just the process of loading the data causes the workers to time out; the data isn’t loaded onto the GPU yet.

Ah, I see. If you have a hang that reproduces in epoch one, it should be possible to localize it by attaching a debugger (Python or gdb), or maybe just with tracing messages. If it is a genuine deadlock, it can be hard to find the reason. But maybe it is some silly Linux quota setting or something. I can only guess at this point, sorry.
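
One low-effort option for the main process is Python’s built-in faulthandler, so that when it hangs you can dump every thread’s stack without attaching gdb (a sketch; the choice of SIGUSR1 is arbitrary):

import faulthandler
import signal

# Put this near the top of train_AD.py. When the loop hangs,
# `kill -USR1 <pid>` makes every thread print its Python stack to stderr.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optional: also dump the stacks automatically after 600 seconds,
# and again every 600 seconds after that.
# faulthandler.dump_traceback_later(600, repeat=True)

The DataLoader workers are separate processes, so for those you would have to attach to each worker PID individually (e.g. with gdb or py-spy).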