Hi!
I’m experiencing a couple of problems with my training.
First problem:
training freezes:
It happens at random, sometimes only after hours of training (up to 12 h / 5 epochs). Once it happens, CPU/GPU usage drops to almost nothing, but the process keeps running. There are no warnings or errors.
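Next time it freezes I plan to attach a stack dump to see where the process hangs. A minimal sketch using the standard-library faulthandler (the signal choice and the temp file are just placeholders; on the real run I'd pass an open log file and then `kill -USR1 <pid>` of the frozen process):

```python
import faulthandler
import signal
import tempfile

# Register a handler so that `kill -USR1 <pid>` makes the (otherwise frozen)
# process dump every thread's Python stack trace without terminating it.
log = tempfile.TemporaryFile(mode="w+")
faulthandler.register(signal.SIGUSR1, file=log)

# Trigger a dump directly here just to show what the output looks like.
faulthandler.dump_traceback(file=log)
log.seek(0)
text = log.read()
print("File" in text)  # the dump lists each frame as 'File "...", line ...'
```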
Second problem:
training crashes:
Experienced only once, after trying to restart training.
Error:
malloc(): mismatching next->prev_size (unsorted)
Aborted (core dumped)
Unfortunately I’ve modified a lot of components of the model since the last working version, so I’m not able to pinpoint the source exactly. I suspect it has something to do with my modified version of collate_fn:
import torch
from torch._six import container_abcs, string_classes, int_classes
from torch.utils.data._utils.collate import (
    np_str_obj_array_pattern,
    default_collate_err_msg_format,
)


def default_collate_mod(batch):
    r"""Puts each data field into a tensor with outer dimension batch size."""
    elem = batch[0]
    elem_type = type(elem)
    if isinstance(elem, torch.Tensor):
        out = None
        elem_size = elem.size()
        if not all(el.shape == elem_size for el in batch):
            # Variable-size tensors: return the per-sample lengths plus one
            # concatenated tensor; you can feed this to a linear layer and
            # then separate it again into batches.
            return ([el.shape[0] for el in batch], torch.cat(batch))
        if torch.utils.data.get_worker_info() is not None:
            # If we're in a background process, concatenate directly into a
            # shared memory tensor to avoid an extra copy
            numel = sum([x.numel() for x in batch])
            storage = elem.storage()._new_shared(numel)
            out = elem.new(storage)
        return torch.stack(batch, 0, out=out)
    elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
            and elem_type.__name__ != 'string_':
        if elem_type.__name__ == 'ndarray' or elem_type.__name__ == 'memmap':
            # array of string classes and object
            if np_str_obj_array_pattern.search(elem.dtype.str) is not None:
                raise TypeError(default_collate_err_msg_format.format(elem.dtype))
            return default_collate_mod([torch.as_tensor(b) for b in batch])
        elif elem.shape == ():  # scalars
            return torch.as_tensor(batch)
    elif isinstance(elem, float):
        return torch.tensor(batch, dtype=torch.float64)
    elif isinstance(elem, int_classes):
        return torch.tensor(batch)
    elif isinstance(elem, string_classes):
        return batch
    elif isinstance(elem, container_abcs.Mapping):
        return {key: default_collate_mod([d[key] for d in batch]) for key in elem}
    elif isinstance(elem, tuple) and hasattr(elem, '_fields'):  # namedtuple
        return elem_type(*(default_collate_mod(samples) for samples in zip(*batch)))
    elif isinstance(elem, container_abcs.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            # Don't display a warning; just return the list as-is.
            return batch
        transposed = zip(*batch)
        return [default_collate_mod(samples) for samples in transposed]

    raise TypeError(default_collate_err_msg_format.format(elem_type))
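For context, this is how I consume the variable-size branch downstream. A toy sketch (the shapes and the linear layer are made up just for illustration) of concatenating, processing, and splitting back into per-sample pieces:

```python
import torch

# Toy "batch" of variable-length tensors, as the modified collate_fn receives.
batch = [torch.randn(3, 8), torch.randn(5, 8), torch.randn(2, 8)]

# What the variable-size branch returns: per-sample lengths and one
# concatenated tensor along dim 0.
lengths = [t.shape[0] for t in batch]
flat = torch.cat(batch)

# Run the flat tensor through e.g. a linear layer, then split it back
# into per-sample tensors using the recorded lengths.
out = torch.nn.Linear(8, 4)(flat)
pieces = torch.split(out, lengths)
print([p.shape[0] for p in pieces])  # → [3, 5, 2]
```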
Versions:
PyTorch 1.7
CUDA 11.1
Edit:
It appears there is also some problem with my GPU’s memory; maybe that is what caused the second error?
The amount of free memory looks lower than expected:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 On | 00000000:06:00.0 On | N/A |
| 42% 47C P2 28W / 160W | 3724MiB / 5931MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1093 G /usr/lib/xorg/Xorg 100MiB |
| 0 N/A N/A 1516 G /usr/bin/plasmashell 73MiB |
| 0 N/A N/A 2506 G /usr/lib/firefox/firefox 2MiB |
| 0 N/A N/A 2557 G /usr/lib/firefox/firefox 2MiB |
| 0 N/A N/A 3736 G /usr/lib/firefox/firefox 2MiB |
| 0 N/A N/A 5249 G /usr/lib/firefox/firefox 2MiB |
| 0 N/A N/A 5366 G /usr/lib/firefox/firefox 2MiB |
| 0 N/A N/A 5423 G /usr/lib/firefox/firefox 2MiB |
+-----------------------------------------------------------------------------+
Thanks!