I just tried installing the nightly build and running with it. First, I got an error related to my distributed launch. The command I typically use to run my code is:
time CUDA_VISIBLE_DEVICES=1 python -m torch.distributed.launch --nproc_per_node=1 --use_env train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1
However, I’m not currently running distributed while debugging and have args.distributed defaulted to False. This command produces a lot of output and seems to run my code multiple times, and I’m wondering whether that is somehow causing my errors. Most of the output is INFO and WARNING messages, but you can see ERRORs as well.
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launch.py:164: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
"The module torch.distributed.launch is deprecated "
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 1
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_h69nlzad/none_lpixji6r
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0]
role_ranks=[0]
global_ranks=[0]
role_world_sizes=[1]
global_world_sizes=[1]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_h69nlzad/none_lpixji6r/attempt_0/0/error.json
not using distributed mode
Namespace(aspect_ratio_group_factor=3, aug_pct=0.5, batch_size=1, data_path='', dataset='mare', device='cuda', dist_url='env://', distributed=False, epochs=10, lr=0.1, lr_gamma=0.99, lr_step_size=8, lr_steps=[16, 22], model='fasterrcnn_resnet50_fpn', momentum=0.9, output_dir='exps/tmp', pretrained=False, print_freq=20, resume='', roi_drop_pct=0.0, rpn_score_thresh=None, short=False, start_epoch=0, start_weights='', test_only_weights='', trainable_backbone_layers=None, trainsplit='trainkf', use_ia=False, valsplit='valfull', weight_decay=0.0001, workers=0, world_size=1)
Loading data
../data/idd_lsts/trainkf_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
../data/idd_lsts/valfull_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [3684]
Creating model
OUT CHANNELS 256
using ExponentialLR
Start training
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
what(): linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor9241
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f76dbd97302 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f76dbd93c9b in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7f76dbd9418e in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7f7524c9a268 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7f7566bac423 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 81719) of binary: /home/mcever/.virtualenvs/tchnite/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
<snip: the launcher stops and restarts the worker group three more times (restart_count=1, 2, 3); each attempt prints the same startup output and crashes with the identical c10::Error at Indexing.cu:253, the final attempt as pid 82746>
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0006990432739257812 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "82746", "role": "default", "hostname": "mind", "state": "FAILED", "total_run_time": 40, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [1]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "mind", "state": "SUCCEEDED", "total_run_time": 40, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 82746 (local_rank 0) FAILED (exitcode -6)
Error msg: Signal 6 (SIGABRT) received by PID 82746
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launch.py", line 173, in <module>
main()
File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/run.py", line 624, in run
)(*cmd_args)
File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
*************************************************
train.py FAILED
=================================================
Root Cause:
[0]:
time: 2021-05-21_09:52:12
rank: 0 (local_rank: 0)
exitcode: -6 (pid: 82746)
error_file: <N/A>
msg: "Signal 6 (SIGABRT) received by PID 82746"
=================================================
Other Failures:
<NO_OTHER_FAILURES>
*************************************************
real 0m40.962s
user 1m6.204s
sys 0m51.743s
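As an aside, the CHILD PROCESS FAILED WITH NO ERROR_FILE warning above suggests decorating the entrypoint with @record so that the worker's traceback gets written to its reply file. Applied to my script, I believe it would look roughly like the sketch below (assuming main(args) is the entrypoint; the argument parsing here is just a stand-in for whatever train.py actually does):

import argparse
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main(args):
    # existing training loop goes here; @record only makes the elastic agent
    # record this worker's exception in its error file if the process fails
    ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # (real argument definitions omitted)
    main(parser.parse_args())

That wouldn't fix the crash itself, and since the worker dies with SIGABRT rather than a Python exception it may not capture much here, but it is what the warning recommends.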
Seeing this, I decided to try running without distributed.launch at all, which is something I haven’t done with this code because I wanted to keep the ability to run in a distributed manner. But if that is the cost of reproducibility, I can afford to run on a single GPU. So I tried the following command:
time CUDA_VISIBLE_DEVICES=1 python train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1
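(For context, the reason the same train.py can run with or without the launcher is that all of the process-group setup is skipped unless distributed mode is actually enabled. Very roughly, the pattern is something like the sketch below; the helper name and argument fields are illustrative, not the exact code in my script.)

import torch
import torch.distributed as dist

def init_distributed_mode(args):
    # Illustrative sketch: skip all process-group setup unless distributed
    # mode was requested, which is why a plain `python train.py` still works.
    if not args.distributed:
        print('not using distributed mode')
        return
    torch.cuda.set_device(args.gpu)  # args.gpu is illustrative
    dist.init_process_group(backend='nccl', init_method=args.dist_url,
                            world_size=args.world_size, rank=args.rank)
    dist.barrier()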
With the nightly build, I still get an error:
not using distributed mode
Namespace(aspect_ratio_group_factor=3, aug_pct=0.5, batch_size=1, data_path='', dataset='mare', device='cuda', dist_url='env://', distributed=False, epochs=10, lr=0.1, lr_gamma=0.99, lr_step_size=8, lr_steps=[16, 22], model='fasterrcnn_resnet50_fpn', momentum=0.9, output_dir='exps/tmp', pretrained=False, print_freq=20, resume='', roi_drop_pct=0.0, rpn_score_thresh=None, short=False, start_epoch=0, start_weights='', test_only_weights='', trainable_backbone_layers=None, trainsplit='trainkf', use_ia=False, valsplit='valfull', weight_decay=0.0001, workers=0, world_size=1)
Loading data
../data/idd_lsts/trainkf_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
../data/idd_lsts/valfull_conspec_cvat_frames.txt
ordered species are : ['bg', 'fragile pink urchin', 'gray gorgonian', 'squat lobster']
Creating data loaders
Using [0, 0.5, 0.6299605249474366, 0.7937005259840997, 1.0, 1.2599210498948732, 1.5874010519681994, 2.0, inf] as bins for aspect ratio quantization
Count of instances per bin: [3684]
Creating model
OUT CHANNELS 256
using ExponentialLR
Start training
/home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
terminate called after throwing an instance of 'c10::Error'
what(): linearIndex.numel()*sliceSize*nElemBefore == value.numel()INTERNAL ASSERT FAILED at "/pytorch/aten/src/ATen/native/cuda/Indexing.cu":253, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor9241
Exception raised from index_put_with_sort_kernel at /pytorch/aten/src/ATen/native/cuda/Indexing.cu:253 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f070fd5c302 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f070fd58c9b in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x3e (0x7f070fd5918e in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: at::native::(anonymous namespace)::index_put_with_sort_kernel(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x2218 (0x7f0558c5f268 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: at::native::_index_put_impl_(at::Tensor&, c10::List<c10::optional<at::Tensor> > const&, at::Tensor const&, bool, bool) + 0x553 (0x7f059ab71423 in /home/mcever/.virtualenvs/tchnite/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
Aborted (core dumped)
The error above seems to be the most important one, though also the most cryptic.
Going back to torch-1.8.1, I decided to try the non-distributed command:
time CUDA_VISIBLE_DEVICES=1 python train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1 --print-freq 1
I ran this command twice, and the losses still begin to diverge slightly between runs after only a few iterations.
run 1 output:
Epoch: [0] [ 0/3684] eta: 0:16:34 lr: 0.000200 loss: 2.0382 (2.0382) loss_classifier: 1.3432 (1.3432) loss_box_reg: 0.0000 (0.0000) loss_objectness: 0.6915 (0.6915) loss_rpn_box_reg: 0.0035 (0.0035) time: 0.2699 data: 0.1061 max mem: 715
Epoch: [0] [ 1/3684] eta: 0:17:34 lr: 0.000300 loss: 2.0382 (2.0409) loss_classifier: 1.3432 (1.3464) loss_box_reg: 0.0000 (0.0000) loss_objectness: 0.6900 (0.6907) loss_rpn_box_reg: 0.0035 (0.0038) time: 0.2864 data: 0.1331 max mem: 1001
Epoch: [0] [ 2/3684] eta: 0:14:35 lr: 0.000400 loss: 2.0406 (2.0408) loss_classifier: 1.3432 (1.3372) loss_box_reg: 0.0000 (0.0095) loss_objectness: 0.6900 (0.6902) loss_rpn_box_reg: 0.0040 (0.0039) time: 0.2378 data: 0.0984 max mem: 1002
Epoch: [0] [ 3/3684] eta: 0:14:23 lr: 0.000500 loss: 2.0382 (2.0326) loss_classifier: 1.3189 (1.3283) loss_box_reg: 0.0000 (0.0072) loss_objectness: 0.6900 (0.6903) loss_rpn_box_reg: 0.0040 (0.0069) time: 0.2345 data: 0.1017 max mem: 1002
Epoch: [0] [ 4/3684] eta: 0:13:13 lr: 0.000600 loss: 2.0382 (2.0142) loss_classifier: 1.3189 (1.3124) loss_box_reg: 0.0000 (0.0057) loss_objectness: 0.6900 (0.6896) loss_rpn_box_reg: 0.0041 (0.0065) time: 0.2157 data: 0.0870 max mem: 1002
Epoch: [0] [ 5/3684] eta: 0:13:01 lr: 0.000699 loss: 2.0080 (1.9913) loss_classifier: 1.3016 (1.2898) loss_box_reg: 0.0000 (0.0048) loss_objectness: 0.6891 (0.6891) loss_rpn_box_reg: 0.0041 (0.0076) time: 0.2123 data: 0.0864 max mem: 1002
Epoch: [0] [ 6/3684] eta: 0:13:13 lr: 0.000799 loss: 2.0080 (1.9580) loss_classifier: 1.3016 (1.2569) loss_box_reg: 0.0000 (0.0041) loss_objectness: 0.6891 (0.6883) loss_rpn_box_reg: 0.0052 (0.0087) time: 0.2158 data: 0.0918 max mem: 1002
run 2 output:
Epoch: [0] [ 0/3684] eta: 0:16:12 lr: 0.000200 loss: 2.0382 (2.0382) loss_classifier: 1.3432 (1.3432) loss_box_reg: 0.0000 (0.0000) loss_objectness: 0.6915 (0.6915) loss_rpn_box_reg: 0.0035 (0.0035) time: 0.2640 data: 0.0829 max mem: 715
Epoch: [0] [ 1/3684] eta: 0:16:46 lr: 0.000300 loss: 2.0382 (2.0410) loss_classifier: 1.3432 (1.3464) loss_box_reg: 0.0000 (0.0000) loss_objectness: 0.6900 (0.6907) loss_rpn_box_reg: 0.0035 (0.0038) time: 0.2732 data: 0.1137 max mem: 1001
Epoch: [0] [ 2/3684] eta: 0:13:54 lr: 0.000400 loss: 2.0405 (2.0408) loss_classifier: 1.3432 (1.3372) loss_box_reg: 0.0000 (0.0095) loss_objectness: 0.6900 (0.6902) loss_rpn_box_reg: 0.0040 (0.0039) time: 0.2267 data: 0.0848 max mem: 1002
Epoch: [0] [ 3/3684] eta: 0:13:45 lr: 0.000500 loss: 2.0382 (2.0330) loss_classifier: 1.3188 (1.3287) loss_box_reg: 0.0000 (0.0072) loss_objectness: 0.6900 (0.6903) loss_rpn_box_reg: 0.0040 (0.0069) time: 0.2243 data: 0.0908 max mem: 1002
Epoch: [0] [ 4/3684] eta: 0:12:38 lr: 0.000600 loss: 2.0382 (2.0154) loss_classifier: 1.3188 (1.3135) loss_box_reg: 0.0000 (0.0057) loss_objectness: 0.6900 (0.6896) loss_rpn_box_reg: 0.0041 (0.0065) time: 0.2061 data: 0.0781 max mem: 1002
Epoch: [0] [ 5/3684] eta: 0:12:27 lr: 0.000699 loss: 2.0094 (1.9924) loss_classifier: 1.3030 (1.2909) loss_box_reg: 0.0000 (0.0048) loss_objectness: 0.6891 (0.6891) loss_rpn_box_reg: 0.0041 (0.0076) time: 0.2032 data: 0.0785 max mem: 1002
Epoch: [0] [ 6/3684] eta: 0:12:43 lr: 0.000799 loss: 2.0094 (1.9581) loss_classifier: 1.3030 (1.2569) loss_box_reg: 0.0000 (0.0041) loss_objectness: 0.6891 (0.6883) loss_rpn_box_reg: 0.0052 (0.0087) time: 0.2077 data: 0.0855 max mem: 1002
If I run my script with the nightly version and comment out torch.use_deterministic_algorithms(True):
(tchnite) $ time CUDA_VISIBLE_DEVICES=1 python train.py --epochs 10 --output-dir exps/tmp --lr 0.1 --workers 0 --batch-size 1 --print-freq 1
It runs seemingly without error, but again the losses aren’t the same across runs, so the second error does seem important; I’m just not sure how to fix it.
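For completeness, the determinism setup I'm aiming for follows the PyTorch reproducibility notes; it is roughly the sketch below (this is my understanding of what's required, not necessarily a verbatim copy of what's in train.py):

import os
import random
import numpy as np
import torch

# Needed for deterministic cuBLAS kernels on CUDA >= 10.2, per the
# reproducibility notes; must be set before any CUDA work happens.
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)  # also seeds the CUDA RNGs

# This is the call I comment out to get past the Indexing.cu assert on the
# nightly build; with it enabled, ops that lack deterministic implementations
# are supposed to either use deterministic kernels or raise an error.
torch.use_deterministic_algorithms(True)

(With --workers 0 there are no DataLoader worker processes to seed, so I don't think worker seeding is a factor here.)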