Hi, I'm working on an experiment where I noticed large differences between models trained with identical configs and random seeds. I'm trying to understand the causes of this.
I've upgraded to a more recent PyTorch version that introduced flags for deterministic training between multiple executions:
https://pytorch.org/docs/1.11/notes/randomness.html?highlight=reproducibility
However, despite using these flags and the most recent detectron2 sources, the final trained models and their validation accuracies can differ greatly on a custom dataset of mine (by ~2 AP).
These differences occur in multiple runs on the same machine (identical device, code, config, random seed).
While trying to reproduce this problem, I also observed it with the unaltered detectron2 demo training code. I've added a minimal script that reproduces the training; the first logged losses already differ noticeably between three consecutive runs (for example, `loss_cls` at iteration 39 is 0.4312, 0.4004 and 0.3891).
## Instructions To Reproduce the Issue:
1. Full runnable code or full changes you made:
Script to reproduce the experiment (`deterministic_example.py`):
```
import os
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, default_argument_parser, default_setup, launch


def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(cfg, args)
    return cfg


def main(args):
    cfg = setup(args)
    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    return trainer.train()


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
```
```
git rev-parse HEAD; git diff
e091a07ef573915056f8c2191b774aad0e38d09c
```
2. What exact command you run:
```
CUDA_VISIBLE_DEVICES=0 python deterministic_example.py --num-gpus 1 --config-file ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml SOLVER.IMS_PER_BATCH 1 SEED 42 DATALOADER.NUM_WORKERS 1
```
3. __Full logs__ or other relevant observations:
run1 (full log):
```
Command Line Args: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:06 detectron2]: Rank of current process: 0. World size: 1
[05/23 15:49:08 detectron2]: Environment info:
---------------------- --------------------------------------------------------------------------------------------------------------------------
sys.platform linux
Python 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
numpy 1.22.3
detectron2 0.6 @/rootpath/git/detectron2/detectron2
Compiler GCC 9.3
CUDA compiler CUDA 11.5
detectron2 arch flags 6.1
DETECTRON2_ENV_MODULE <not set>
PyTorch 1.11.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0 NVIDIA TITAN Xp (arch=6.1)
Driver version 510.47.03
CUDA_HOME /usr/local/cuda-11.5
Pillow 9.1.0
torchvision 0.12.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20220504
iopath 0.1.9
cv2 4.5.5
---------------------- --------------------------------------------------------------------------------------------------------------------------
PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.5.2 (Git Hash a9302535553c73243c632ad3c4c80beec3d19a1e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.5
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.3.2
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.5, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.11.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
[05/23 15:49:08 detectron2]: Command line arguments: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:08 detectron2]: Contents of args.config_file=./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml:
_BASE_: "../Base-RCNN-FPN.yaml"
MODEL:
WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
MASK_ON: True
RESNETS:
DEPTH: 50
FILTER_EMPTY_ANNOTATIONS: true
NUM_WORKERS: 1
REPEAT_THRESHOLD: 0.0
SAMPLER_TRAIN: TrainingSampler
DATASETS:
PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
PROPOSAL_FILES_TEST: []
PROPOSAL_FILES_TRAIN: []
TEST:
- coco_2017_val
TRAIN:
- coco_2017_train
GLOBAL:
HACK: 1.0
INPUT:
CROP:
ENABLED: false
SIZE:
- 0.9
- 0.9
TYPE: relative_range
FORMAT: BGR
MASK_FORMAT: polygon
MAX_SIZE_TEST: 1333
MAX_SIZE_TRAIN: 1333
MIN_SIZE_TEST: 800
MIN_SIZE_TRAIN:
- 640
- 672
- 704
- 736
- 768
- 800
MIN_SIZE_TRAIN_SAMPLING: choice
RANDOM_FLIP: horizontal
MODEL:
ANCHOR_GENERATOR:
ANGLES:
- - -90
- 0
- 90
ASPECT_RATIOS:
- - 0.5
- 1.0
- 2.0
NAME
OFFSET: 0.0
SIZES:
- - 32
- - 64
- - 128
- - 256
- - 512
BACKBONE:
FREEZE_AT: 2
NAME: build_resnet_fpn_backbone
DEVICE: cuda
FPN:
FUSE_TYPE: sum
IN_FEATURES:
- res2
- res3
- res4
- res5
NORM: ''
OUT_CHANNELS: 256
KEYPOINT_ON: false
LOAD_PROPOSALS: false
MASK_ON: true
META_ARCHITECTURE: GeneralizedRCNN
PANOPTIC_FPN:
COMBINE:
ENABLED: true
INSTANCES_CONFIDENCE_THRESH: 0.5
OVERLAP_THRESH: 0.5
STUFF_AREA_LIMIT: 4096
INSTANCE_LOSS_WEIGHT: 1.0
PIXEL_MEAN:
- 103.53
- 116.28
- 123.675
PIXEL_STD:
- 1.0
- 1.0
- 1.0
PROPOSAL_GENERATOR:
MIN_SIZE: 0
NAME: RPN
RESNETS:
DEFORM_MODULATED: false
DEFORM_NUM_GROUPS: 1
DEFORM_ON_PER_STAGE:
-
- false
DEPTH: 50
NORM: FrozenBN
NUM_GROUPS: 1
OUT_FEATURES:
- res2
- res3
- res4
- res5
RES2_OUT_CHANNELS: 256
RES5_DILATION: 1
STEM_OUT_CHANNELS: 64
STRIDE_IN_1X1: true
WIDTH_PER_GROUP: 64
RETINANET:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_WEIGHTS: &id002
- 1.0
- 1.0
- 1.0
- 1.0
FOCAL_LOSS_ALPHA: 0.25
FOCAL_LOSS_GAMMA: 2.0
IN_FEATURES:
- p3
- p4
- p5
- p6
- p7
IOU_LABELS:
- 0
- -1
- 1
IOU_THRESHOLDS:
- 0.4
- 0.5
NMS_THRESH_TEST: 0.5
NORM: ''
NUM_CLASSES: 80
NUM_CONVS: 4
PRIOR_PROB: 0.01
SCORE_THRESH_TEST: 0.05
SMOOTH_L1_LOSS_BETA: 0.1
TOPK_CANDIDATES_TEST: 1000
ROI_BOX_CASCADE_HEAD:
BBOX_REG_WEIGHTS:
- &id
- 10.0
- 5.0
- 5.0
- - 20.0
- 20.0
- 10.0
- 10.0
- - 30.0
- 30.0
- 15.0
- 15.0
IOUS:
- 0.5
- 0.6
- 0.7
ROI_BOX_HEAD:
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS: *id001
CLS_AGNOSTIC_BBOX_REG: false
CONV_DIM: 256
FC_DIM: 1024
FED_LOSS_FREQ_WEIGHT_POWER: 0.5
FED_LOSS_NUM_CLASSES: 50
NAME: FastRCNNConvFCHead
NORM: ''
NUM_CONV: 0
NUM_FC: 2
POOLER_RESOLUTION: 7
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
SMOOTH_L1_BETA: 0.0
TRAIN_ON_PRED_BOXES: false
USE_FED_LOSS: false
USE_SIGMOID_CE: false
ROI_HEADS:
BATCH_SIZE_PER_IMAGE: 512
IN_FEATURES:
- p2
- p3
- p4
- p5
IOU_LABELS:
- 0
- 1
IOU_THRESHOLDS:
- 0.5
NAME: StandardROIHeads
NMS_THRESH_TEST: 0.5
NUM_CLASSES: 80
POSITIVE_FRACTION: 0.25
PROPOSAL_APPEND_GT: true
SCORE_THRESH_TEST: 0.05
ROI_KEYPOINT_HEAD:
CONV_DIMS:
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512
LOSS_WEIGHT: 1.0
MIN_KEYPOINTS_PER_IMAGE: 1
NAME: KRCNNConvDeconvUpsampleHead
NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
NUM_KEYPOINTS: 17
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
ROI_MASK_HEAD:
CLS_AGNOSTIC_MASK: false
CONV_DIM: 256
NAME: MaskRCNNConvUpsampleHead
NORM: ''
NUM_CONV: 4
POOLER_RESOLUTION: 14
POOLER_SAMPLING_RATIO: 0
POOLER_TYPE: ROIAlignV2
RPN:
BATCH_SIZE_PER_IMAGE: 256
BBOX_REG_LOSS_TYPE: smooth_l1
BBOX_REG_LOSS_WEIGHT: 1.0
BBOX_REG_WEIGHTS: *id002
BOUNDARY_THRESH: -1
CONV_DIMS:
- -1
HEAD_NAME: StandardRPNHead
IN_FEATURES:
-
- p4
- p5
- p6
IOU_LABELS:
- 0
- -1
- 1
IOU_THRESHOLDS:
- 0.3
- 0.7
LOSS_WEIGHT: 1.0
NMS_THRESH: 0.7
POSITIVE_FRACTION: 0.5
POST_NMS_TOPK_TEST: 1000
POST_NMS_TOPK_TRAIN: 1000
PRE_NMS_TOPK_TEST: 1000
PRE_NMS_TOPK_TRAIN: 2000
SMOOTH_L1_BETA: 0.0
SEM_SEG_HEAD:
COMMON_STRIDE: 4
CONVS_DIM: 128
IGNORE_VALUE: 255
IN_FEATURES:
- p2
- p3
- p4
- p5
LOSS_WEIGHT: 1.0
NAME: SemSegFPNHead
NORM: GN
NUM_CLASSES: 54
WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl
OUTPUT_DIR: ./output
SEED: 42
SOLVER:
AMP:
ENABLED: false
BASE_LR: 0.02
BASE_LR_END: 0.0
BIAS_LR_FACTOR: 1.0
CHECKPOINT_PERIOD: 5000
CLIP_GRADIENTS:
CLIP_TYPE: value
CLIP_VALUE: 1.0
ENABLED: false
NORM_TYPE: 2.0
GAMMA: 0.1
IMS_PER_BATCH: 1
LR_SCHEDULER_NAME: WarmupMultiStepLR
MAX_ITER: 90000
MOMENTUM: 0.9
NESTEROV: false
REFERENCE_WORLD_SIZE: 0
STEPS:
- 60000
- 80000
WARMUP_FACTOR: 0.001
WARMUP_ITERS: 1000
WARMUP_METHOD: linear
WEIGHT_DECAY: 0.0001
WEIGHT_DECAY_BIAS: null
WEIGHT_DECAY_NORM: 0.0
TEST:
AUG:
ENABLED: false
FLIP: true
MAX_SIZE: 4000
MIN_SIZES:
- 400
- 500
- 600
- 700
- 800
- 900
- 1000
- 1100
- 1200
DETECTIONS_PER_IMAGE: 100
EVAL_PERIOD: 0
EXPECTED_RESULTS: []
KEYPOINT_OKS_SIGMAS: []
PRECISE_BN:
ENABLED: false
NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0
[05/23 15:49:08 detectron2]: Full config saved to ./output/config.yaml
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
)
(res3): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv1): Conv2d(
256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(4): BottleneckBl
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
)
)
)
(proposal_generator): RPN(
(rpn_head): StandardRPNHead(
(conv): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(activation): ReLU()
)
(objectness_logits): Conv2d(256, 3, kernel_size=(1, 1), stride=(1, 1))
(anchor_deltas): Conv2d(256, 12, kernel_size=(1, 1), stride=(1, 1))
)
(anchor_generator): DefaultAnchorGenerator(
(cell_anchors): BufferList()
)
)
(roi_heads): StandardROIHeads(
(box_pooler): ROIPooler(
(level_poolers): ModuleList(
(0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=0, aligned=True)
(1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=0, aligned=True)
(2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=0, aligned=True)
(3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=0, aligned=True)
)
)
(box_head): FastRCNNConvFCHead(
(flatten): Flatten(start_dim=1, end_dim=-1)
(fc1): Linear(in_features=12544, out_features=1024, bias=True)
(fc_relu1): ReLU()
(fc2): Linear(in_features=1024, out_features=1024, bias=True)
(fc_relu2): ReLU()
)
(box_predictor): FastRCNNOutputLayers(
(cls_score): Linear(in_features=1024, out_features=81, bias=True)
(bbox_pred): Linear(in_features=1024, out_features=320, bias=True)
)
(mask_pooler): ROIPooler(
(level_poolers): ModuleList(
(0): ROIAlign(output_size=(14, 14), spatial_scale=0.25, sampling_ratio=0, aligned=True)
(1): ROIAlign(output_size=(14, 14), spatial_scale=0.125, sampling_ratio=0, aligned=True)
(2): ROIAlign(output_size=(14, 14), spatial_scale=0.0625, sampling_ratio=0, aligned=True)
(3): ROIAlign(output_size=(14, 14), spatial_scale=0.03125, sampling_ratio=0, aligned=True)
)
)
(mask_head): MaskRCNNConvUpsampleHead(
(mask_fcn1): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(activation): ReLU()
)
(mask_fcn2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(activation): ReLU()
)
(mask_fcn3): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(activation): ReLU()
)
(mask_fcn4): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(activation): ReLU()
)
(deconv): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2))
(deconv_relu): ReLU()
(predictor): Conv2d(256, 80, kernel_size=(1, 1), stride=(1, 1))
)
)
)
[05/23 15:49:30 d2.data.datasets.coco]: Loading datasets/coco/annotations/instances_train2017.json takes 18.03 seconds.
[05/23 15:49:31 d2.data.datasets.coco]: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
[05/23 15:49:37 d2.data.build]: Removed 1021 images with no usable annotations. 117266 images left.
[05/23 15:49:43 d2.data.build]: Distribution of instances among all 80 categories:
| category | #instances | category | #instances | category | #instances |
|:-------------:|:-------------|:------------:|:-------------|:-------------:|:-------------|
| person | 257253 | bicycle | 7056 | car | 43533 |
| motorcycle | 8654 | airplane | 5129 | bus | 6061 |
| train | 4570 | truck | 9970 | boat | 10576 |
| traffic light | 12842 | fire hydrant | 1865 | stop sign | 1983 |
| parking meter | 1283 | bench | 9820 | bird | 10542 |
| cat | 4766 | dog | 5500 | horse | 6567 |
| sheep | 9223 | cow | 8014 | elephant | 5484 |
| bear | 1294 | zebra | 5269 | giraffe | 5128 |
| backpack | 8714 | umbrella | 11265 | handbag | 12342 |
| tie | 6448 | suitcase | 6112 | frisbee | 2681 |
| skis | 6623 | snowboard | 2681 | sports ball | 6299 |
| kite | 8802 | baseball bat | 3273 | baseball gl.. | 3747 |
| skateboard | 5536 | surfboard | 6095 | tennis racket | 4807 |
| bottle | 24070 | wine glass | 7839 | cup | 20574 |
| fork | 5474 | knife | 7760 | spoon | 6159 |
| bowl | 14323 | banana | 9195 | apple | 5776 |
| sandwich | 4356 | orange | 6302 | broccoli | 7261 |
| carrot | 7758 | hot dog | 2884 | pizza | 5807 |
| donut | 7005 | cake | 6296 | chair | 38073 |
| couch | 5779 | potted plant | 8631 | bed | 4192 |
| dining table | 15695 | toilet | 4149 | tv | 5803 |
| laptop | 4960 | mouse | 2261 | remote | 5700 |
| keyboard | 2854 | cell phone | 6422 | microwave | 1672 |
| oven | 3334 | toaster | 225 | sink | 5609 |
| refrigerator | 2634 | book | 24077 | clock | 6320 |
| vase | 6577 | scissors | 1464 | teddy bear | 4729 |
| hair drier | 198 | toothbrush | 1945 | | |
| total | 849949 | | | | |
[05/23 15:49:43 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[05/23 15:49:43 d2.data.build]: Using training sampler TrainingSampler
[05/23 15:49:43 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[05/23 15:49:47 d2.data.common]: Serialized dataset takes 451.21 MiB
[05/23 15:50:04 fvcore.common.checkpoint]: [Checkpointer] Loading from detectron2://ImageNetPretrained/MSRA/R-50.pkl ...
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Renaming Caffe2 weights ......
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone.bottom_up:
| Names in Model | Names in Checkpoint | Shapes |
|:------------------|:-------------------------|:------------------------------------------------|
| res2.0.conv1.* | res2_0_branch2a_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,1,1) |
| res2.0.conv2.* | res2_0_branch2b_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3) |
| res2.0.conv3.* | res2_0_branch2c_{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) |
| res2.0.shortcut.* | res2_0_branch1_{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) |
| res2.1.conv1.* | res2_1_branch2a_{bn_*,w} | (64,) (64,) (64,) (64,) (64,256,1,1) |
| res2.1.conv2.* | res2_1_branch2b_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3) |
| res2.1.conv3.* | res2_1_branch2c_{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) |
| res2.2.conv1.* | res2_2_branch2a_{bn_*,w} | (64,) (64,) (64,) (64,) (64,256,1,1) |
| res2.2.conv2.* | res2_2_branch2b_{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3) |
| res2.2.conv3.* | res2_2_branch2c_{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) |
| res3.0.conv1.* | res3_0_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,256,1,1) |
| res3.0.conv2.* | res3_0_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) |
| res3.0.conv3.* | res3_0_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) |
| res3.0.shortcut.* | res3_0_branch1_{bn_*,w} | (512,) (512,) (512,) (512,) (512,256,1,1) |
| res3.1.conv1.* | res3_1_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1) |
| res3.1.conv2.* | res3_1_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) |
| res3.1.conv3.* | res3_1_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) |
| res3.2.conv1.* | res3_2_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1) |
| res3.2.conv2.* | res3_2_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) |
| res3.2.conv3.* | res3_2_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) |
| res3.3.conv1.* | res3_3_branch2a_{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1) |
| res3.3.conv2.* | res3_3_branch2b_{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) |
| res3.3.conv3.* | res3_3_branch2c_{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) |
| res4.0.conv1.* | res4_0_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,512,1,1) |
| res4.0.conv2.* | res4_0_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res4.0.conv3.* | res4_0_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) |
| res4.0.shortcut.* | res4_0_branch1_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,512,1,1) |
| res4.1.conv1.* | res4_1_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) |
| res4.1.conv2.* | res4_1_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res4.1.conv3.* | res4_1_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) |
| res4.2.conv1.* | res4_2_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) |
| res4.2.conv2.* | res4_2_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res4.2.conv3.* | res4_2_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) |
| res4.3.conv1.* | res4_3_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) |
| res4.3.conv2.* | res4_3_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res4.3.conv3.* | res4_3_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) |
| res4.4.conv1.* | res4_4_branch2a_{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) |
| res4.4.conv2.* | res4_4_branch2b_{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) |
| res4.4.conv3.* | res4_4_branch2c_{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) |
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [05/23 15:50:04 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
fc1000.{bias, weight}
stem.conv1.bias
[05/23 15:50:04 d2.engine.train_loop]: Starting training from iteration 0
/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
[05/23 15:50:12 d2.utils.events]: eta: 7:44:48 iter: 19 total_loss: 2.345 loss_cls: 0.5814 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3151 data_time: 0.0139 lr: 0.00039962 max_mem: 1481M
[05/23 15:50:19 d2.utils.events]: eta: 8:08:10 iter: 39 total_loss: 1.601 loss_cls: 0.4312 loss_box_reg: 0.04747 loss_mask: 0.6906 loss_rpn_cls: 0.4376 loss_rpn_loc: 0.0764 time: 0.3254 data_time: 0.0026 lr: 0.00079922 max_mem: 1481M
[05/23 15:50:26 d2.utils.events]: eta: 8:17:54 iter: 59 total_loss: 1.641 loss_cls: 0.4153 loss_box_reg: 0.09799 loss_mask: 0.691 loss_rpn_cls: 0.3649 loss_rpn_loc: 0.1253 time: 0.3259 data_time: 0.0028 lr: 0.0011988 max_mem: 1481M
[05/23 15:50:32 d2.utils.events]: eta: 8:20:12 iter: 79 total_loss: 1.439 loss_cls: 0.3282 loss_box_reg: 0.09175 loss_mask: 0.6924 loss_rpn_cls: 0.2477 loss_rpn_loc: 0.05234 time: 0.3288 data_time: 0.0027 lr: 0.0015984 max_mem: 1481M
[05/23 15:50:39 d2.utils.events]: eta: 8:20:06 iter: 99 total_loss: 1.285 loss_cls: 0.2667 loss_box_reg: 0.1191 loss_mask: 0.6891 loss_rpn_cls: 0.154 loss_rpn_loc: 0.05424 time: 0.3274 data_time: 0.0025 lr: 0.001998 max_mem: 1481M
[05/23 15:50:45 d2.utils.events]: eta: 8:15:39 iter: 119 total_loss: 1.52 loss_cls: 0.346 loss_box_reg: 0.1504 loss_mask: 0.6818 loss_rpn_cls: 0.2181 loss_rpn_loc: 0.09391 time: 0.3256 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:50:51 d2.utils.events]: eta: 8:12:57 iter: 139 total_loss: 1.546 loss_cls: 0.2511 loss_box_reg: 0.1242 loss_mask: 0.6869 loss_rpn_cls: 0.2738 loss_rpn_loc: 0.04643 time: 0.3242 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:50:58 d2.utils.events]: eta: 8:12:51 iter: 159 total_loss: 1.687 loss_cls: 0.3452 loss_box_reg: 0.09927 loss_mask: 0.6778 loss_rpn_cls: 0.2546 loss_rpn_loc: 0.1271 time: 0.3253 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:51:05 d2.utils.events]: eta: 8:15:19 iter: 179 total_loss: 1.557 loss_cls: 0.4099 loss_box_reg: 0.1837 loss_mask: 0.6872 loss_rpn_cls: 0.1388 loss_rpn_loc: 0.06568 time: 0.3271 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:51:12 d2.utils.events]: eta: 8:16:06 iter: 199 total_loss: 1.931 loss_cls: 0.5021 loss_box_reg: 0.2378 loss_mask: 0.6843 loss_rpn_cls: 0.2495 loss_rpn_loc: 0.1568 time: 0.3284 data_time: 0.0035 lr: 0.003996 max_mem: 1481M
```
run2:
```
[05/23 15:52:57 d2.utils.events]: eta: 7:49:54 iter: 19 total_loss: 2.349 loss_cls: 0.5801 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.09081 time: 0.3190 data_time: 0.0176 lr: 0.00039962 max_mem: 1481M
[05/23 15:53:04 d2.utils.events]: eta: 8:10:18 iter: 39 total_loss: 1.603 loss_cls: 0.4004 loss_box_reg: 0.04758 loss_mask: 0.6906 loss_rpn_cls: 0.4404 loss_rpn_loc: 0.07629 time: 0.3276 data_time: 0.0025 lr: 0.00079922 max_mem: 1481M
[05/23 15:53:10 d2.utils.events]: eta: 8:19:58 iter: 59 total_loss: 1.646 loss_cls: 0.4176 loss_box_reg: 0.1167 loss_mask: 0.6912 loss_rpn_cls: 0.3633 loss_rpn_loc: 0.1252 time: 0.3274 data_time: 0.0026 lr: 0.0011988 max_mem: 1481M
[05/23 15:53:17 d2.utils.events]: eta: 8:21:51 iter: 79 total_loss: 1.428 loss_cls: 0.299 loss_box_reg: 0.0902 loss_mask: 0.6921 loss_rpn_cls: 0.2449 loss_rpn_loc: 0.05256 time: 0.3296 data_time: 0.0026 lr: 0.0015984 max_mem: 1481M
[05/23 15:53:23 d2.utils.events]: eta: 8:21:44 iter: 99 total_loss: 1.319 loss_cls: 0.2876 loss_box_reg: 0.1062 loss_mask: 0.6898 loss_rpn_cls: 0.1512 loss_rpn_loc: 0.05531 time: 0.3289 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:53:30 d2.utils.events]: eta: 8:17:13 iter: 119 total_loss: 1.441 loss_cls: 0.28 loss_box_reg: 0.1317 loss_mask: 0.6835 loss_rpn_cls: 0.2149 loss_rpn_loc: 0.09209 time: 0.3274 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:53:36 d2.utils.events]: eta: 8:15:03 iter: 139 total_loss: 1.496 loss_cls: 0.272 loss_box_reg: 0.1103 loss_mask: 0.6876 loss_rpn_cls: 0.2564 loss_rpn_loc: 0.04832 time: 0.3262 data_time: 0.0025 lr: 0.0027972 max_mem: 1481M
[05/23 15:53:43 d2.utils.events]: eta: 8:14:56 iter: 159 total_loss: 1.737 loss_cls: 0.3486 loss_box_reg: 0.06897 loss_mask: 0.678 loss_rpn_cls: 0.2603 loss_rpn_loc: 0.1359 time: 0.3266 data_time: 0.0025 lr: 0.0031968 max_mem: 1481M
[05/23 15:53:49 d2.utils.events]: eta: 8:16:21 iter: 179 total_loss: 1.525 loss_cls: 0.3834 loss_box_reg: 0.1672 loss_mask: 0.6877 loss_rpn_cls: 0.1623 loss_rpn_loc: 0.08118 time: 0.3272 data_time: 0.0026 lr: 0.0035964 max_mem: 1481M
[05/23 15:53:56 d2.utils.events]: eta: 8:16:14 iter: 199 total_loss: 1.598 loss_cls: 0.3331 loss_box_reg: 0.1141 loss_mask: 0.6792 loss_rpn_cls: 0.2563 loss_rpn_loc: 0.1831 time: 0.3270 data_time: 0.0026 lr: 0.003996 max_mem: 1481M
```
run3:
```
[05/23 15:56:10 d2.utils.events]: eta: 7:45:39 iter: 19 total_loss: 2.348 loss_cls: 0.5763 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3167 data_time: 0.0122 lr: 0.00039962 max_mem: 1481M
[05/23 15:56:16 d2.utils.events]: eta: 8:10:26 iter: 39 total_loss: 1.605 loss_cls: 0.3891 loss_box_reg: 0.04755 loss_mask: 0.6906 loss_rpn_cls: 0.4403 loss_rpn_loc: 0.07635 time: 0.3277 data_time: 0.0027 lr: 0.00079922 max_mem: 1481M
[05/23 15:56:23 d2.utils.events]: eta: 8:23:04 iter: 59 total_loss: 1.679 loss_cls: 0.4163 loss_box_reg: 0.1102 loss_mask: 0.6912 loss_rpn_cls: 0.3563 loss_rpn_loc: 0.1251 time: 0.3293 data_time: 0.0031 lr: 0.0011988 max_mem: 1481M
[05/23 15:56:30 d2.utils.events]: eta: 8:21:28 iter: 79 total_loss: 1.433 loss_cls: 0.3133 loss_box_reg: 0.07978 loss_mask: 0.6921 loss_rpn_cls: 0.2468 loss_rpn_loc: 0.05257 time: 0.3303 data_time: 0.0028 lr: 0.0015984 max_mem: 1481M
[05/23 15:56:36 d2.utils.events]: eta: 8:22:50 iter: 99 total_loss: 1.317 loss_cls: 0.2764 loss_box_reg: 0.1469 loss_mask: 0.6895 loss_rpn_cls: 0.1487 loss_rpn_loc: 0.05474 time: 0.3291 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:56:43 d2.utils.events]: eta: 8:20:03 iter: 119 total_loss: 1.455 loss_cls: 0.3264 loss_box_reg: 0.1456 loss_mask: 0.6827 loss_rpn_cls: 0.209 loss_rpn_loc: 0.09486 time: 0.3281 data_time: 0.0030 lr: 0.0023976 max_mem: 1481M
[05/23 15:56:49 d2.utils.events]: eta: 8:16:57 iter: 139 total_loss: 1.475 loss_cls: 0.2835 loss_box_reg: 0.09706 loss_mask: 0.6861 loss_rpn_cls: 0.2541 loss_rpn_loc: 0.04725 time: 0.3260 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:56:56 d2.utils.events]: eta: 8:18:19 iter: 159 total_loss: 1.675 loss_cls: 0.3287 loss_box_reg: 0.1219 loss_mask: 0.6776 loss_rpn_cls: 0.2344 loss_rpn_loc: 0.1299 time: 0.3269 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:57:02 d2.utils.events]: eta: 8:19:43 iter: 179 total_loss: 1.568 loss_cls: 0.4459 loss_box_reg: 0.1866 loss_mask: 0.6875 loss_rpn_cls: 0.124 loss_rpn_loc: 0.06825 time: 0.3279 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:57:09 d2.utils.events]: eta: 8:19:37 iter: 199 total_loss: 1.803 loss_cls: 0.4938 loss_box_reg: 0.1835 loss_mask: 0.6884 loss_rpn_cls: 0.2585 loss_rpn_loc: 0.1701 time: 0.3281 data_time: 0.0029 lr: 0.003996 max_mem: 1481M
```
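To put numbers on the differences between the runs, the loss lines can be compared per iteration. The following is a minimal sketch, assuming the `d2.utils.events` line format shown in the logs above; `parse_losses` and `max_spread` are hypothetical helper names, not part of detectron2.

```python
import re

# "iter: 19" on a metrics line.
ITER_RE = re.compile(r"\biter:\s*(\d+)")
# "total_loss: 2.345", "loss_cls: 0.5814", ... (does not match "lr:" or "time:").
LOSS_RE = re.compile(r"\b(total_loss|loss_\w+):\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_losses(log_text):
    """Return {iteration: {loss_name: value}} for every metrics line in log_text."""
    result = {}
    for line in log_text.splitlines():
        it = ITER_RE.search(line)
        if it is None:
            continue
        losses = {name: float(value) for name, value in LOSS_RE.findall(line)}
        if losses:
            result[int(it.group(1))] = losses
    return result


def max_spread(runs, key="total_loss"):
    """For each iteration present in all runs, return max - min of `key` across runs."""
    if not runs:
        return {}
    common = set.intersection(*(set(run) for run in runs))
    return {
        i: max(run[i][key] for run in runs) - min(run[i][key] for run in runs)
        for i in sorted(common)
    }


# Two shortened lines taken from run1 and run2 above.
run1 = "[05/23 15:50:12 d2.utils.events]: iter: 19 total_loss: 2.345 loss_cls: 0.5814"
run2 = "[05/23 15:52:57 d2.utils.events]: iter: 19 total_loss: 2.349 loss_cls: 0.5801"
print(max_spread([parse_losses(run1), parse_losses(run2)], key="loss_cls"))
```

With the complete logs of the three runs saved to files, passing their contents through `parse_losses` and then `max_spread` gives the per-iteration spread for any of the logged loss terms.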
## Expected behavior:
I would expect the losses to be (largely) identical in the default training setup when using the same machine, code, random seed and config together with the PyTorch flags for deterministic training.
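To narrow down where the runs start to differ, one option would be to fingerprint the model weights after every iteration in each run and compare the two sequences. The sketch below is only an illustration under assumptions: `fingerprint` and `first_divergence` are hypothetical helpers (not part of detectron2 or PyTorch), and the way of obtaining raw bytes from a state dict shown in the comment is one possible way to hook it into training.

```python
import hashlib


def fingerprint(named_bytes):
    """Return a SHA-256 hex digest over (name, raw bytes) pairs in sorted name order.

    `named_bytes` maps a parameter name to its raw bytes. With PyTorch this could
    be built as (assumption, not tested against detectron2):
        {k: v.detach().cpu().numpy().tobytes() for k, v in model.state_dict().items()}
    """
    digest = hashlib.sha256()
    for name in sorted(named_bytes):
        payload = named_bytes[name]
        encoded_name = name.encode("utf-8")
        # Length prefixes keep (name, payload) boundaries unambiguous.
        digest.update(len(encoded_name).to_bytes(8, "little"))
        digest.update(encoded_name)
        digest.update(len(payload).to_bytes(8, "little"))
        digest.update(payload)
    return digest.hexdigest()


def first_divergence(fingerprints_a, fingerprints_b):
    """Return the index of the first differing fingerprint, or None if none differ."""
    for i, (a, b) in enumerate(zip(fingerprints_a, fingerprints_b)):
        if a != b:
            return i
    return None
```

If the fingerprints already differ after the first iteration, the divergence is introduced in a single forward/backward pass rather than accumulated over time, which would make it easier to isolate.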