Detectron model giving different results for different machines (constant seed)

My training script for the model:

seed = 42
import random
import os
import numpy as np
import torch

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # benchmark must be False for deterministic behavior

seed_everything(seed)


from detectron2 import model_zoo
from detectron2.engine import DefaultTrainer
from detectron2.config import get_cfg
from detectron2.data.catalog import Metadata

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("experiment",)
cfg.DATASETS.TEST = ("test",)
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
cfg.MODEL.DEVICE = "cuda"
cfg.SOLVER.IMS_PER_BATCH = 2
num_gpu = 1
bs = (num_gpu * 2)
cfg.SOLVER.BASE_LR = 0.02 * bs / 16
cfg.SOLVER.MAX_ITER = 7500   
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

My inference script on server-1 is:

import cv2
from detectron2.engine import DefaultPredictor

cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7

cfg.SEED = 42
predictor = DefaultPredictor(cfg)

img = cv2.imread('filename.jpg')
outputs = predictor(img)
print(outputs["instances"])

pred_classes = outputs['instances'].pred_classes.tolist()
classes = ["Handwritten", "Logo", "Markings", "Signature"]

for pred_class in pred_classes:
    print('*'*10)
    print(classes[pred_class])
    print('*'*10)

if any(classes[pred_class] == "Handwritten" for pred_class in pred_classes):
    print(True)
else:
    print(False)

My inference script on server-2 is:

import time
from typing import Any, Dict

import cv2
import numpy as np
from fastapi import FastAPI, File, Form, UploadFile
from requests.exceptions import HTTPError  # assumed source of the HTTPError caught below

from detectron2.config import get_cfg
from detectron2.data.catalog import Metadata
from detectron2.engine import DefaultPredictor


class Handwritten:
    """Detects handwritten pages in a PDF chart.

    Attributes
    ----------
    predictor : DefaultPredictor
        Predictor built from the trained weights.
    metadata : Metadata
        Class names for the four detected categories.
    """

    def __init__(self, path_of_weights: str) -> None:
        """Initialize the detector.

        Parameters
        ----------
        path_of_weights : str
            Path to the trained weights file.
        """
        self.cfg = get_cfg()
        self.cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7
        self.cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4
        self.cfg.MODEL.WEIGHTS = path_of_weights
        self.cfg.MODEL.DEVICE = "cpu"
        self.cfg.SEED = 42
        self.predictor = DefaultPredictor(self.cfg)
        self.metadata = Metadata()
        self.metadata.set(
            thing_classes=["Handwritten", "Logo", "Markings", "Signature"],
            thing_dataset_id_to_contiguous_id={0: 0, 1: 1, 2: 2, 3: 3},
        )

    def __call__(self, img: Any) -> Any:
        """Return the predicted output classes for the image."""
        self.outputs = self.predictor(img)
        return self.outputs["instances"]

    def detect_hw(self, image: Any) -> bool:
        """Detect a handwritten dx entity in the image and, if present, classify the page as handwritten.

        Parameters
        ----------
        image : Any
            Image matrix of a page.

        Returns
        -------
        bool
            True if the page is handwritten, False otherwise.
        """
        outputs = self.__call__(image)
        pred_classes = outputs.pred_classes.tolist()
        classes = ["Handwritten", "Logo", "Markings", "Signature"]

        if any(classes[pred_class] == "Handwritten" for pred_class in pred_classes):
            return True
        else:
            return False


app = FastAPI()
path_of_weights = "model/model_final.pth"
model = Handwritten(path_of_weights)

@app.post("/cv/predict", status_code=200)
def predict(
    page_no: int = Form(...), dimensions: list = Form(...), image: UploadFile = File(...)
) -> Dict[str, int]:
    """Predicts if image is handwritten page or not.
    Parameters
    ----------
    page_no : Page number of the given input page
    dimensions : Height and width of the page
    image : Image of the page as bytestream
    """
    image_bytes = image.file.read()
    decoded_image = cv2.imdecode(np.frombuffer(image_bytes, np.uint8), -1)
    height, width = int(dimensions[0]), int(dimensions[1])
    prediction_time = time.time()
    pg_image = cv2.resize(decoded_image, (width, height))  # cv2.resize expects (width, height)
    try:
        # Check if page is handwritten
        hw_result = model.detect_hw(pg_image)

        # If handwritten, consider for output
        if hw_result:
            hw_pages = page_no

        else:
            hw_pages = -99

        prediction_info = {
            "hw_pages": hw_pages,
            "prediction_time": prediction_time,
        }
        #_logger.info(f"prediction info: {prediction_info}")
    except HTTPError:
        # Error handling elided in the original post; re-raise so the endpoint fails loudly.
        raise
    
    return {"hw_pages": hw_pages}

While the model keeps giving good results on server-1, it is being very erratic on server-2. The weights and the seed are the same, yet I am unable to understand this difference in behavior between the two setups.

The model was trained on server-1.

Server-1 is a g4dn.2xlarge; server-2 is a g4dn.xlarge.

Am I doing something wrong?
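
For what it's worth, one way to confirm that both servers really load the same configuration and weights would be to fingerprint them on each machine and diff the output (just a sketch; the fingerprint helper and the dump filename are arbitrary, and cfg is whatever gets passed to DefaultPredictor):

import hashlib

def fingerprint(cfg, weights_path):
    # Hash of the weights file -- must be identical on both servers
    with open(weights_path, "rb") as f:
        weights_md5 = hashlib.md5(f.read()).hexdigest()
    # cfg.dump() returns the fully resolved config as YAML; diff the two dumps
    return weights_md5, cfg.dump()

md5sum, config_yaml = fingerprint(cfg, cfg.MODEL.WEIGHTS)
print(md5sum)
with open("cfg_dump.yaml", "w") as f:
    f.write(config_yaml)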

Please correct me if I’m wrong, but isn’t the main difference between g4dn.2xlarge and g4dn.xlarge the increase in vCPUs, RAM, storage, and network bandwidth?
Both should have the same T4 GPU, so unless the software stack differs I wouldn't know what causes the difference (assuming you are using the GPU).
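
If it helps, the relevant versions can be printed on both machines to compare the stacks directly (just a sketch, nothing here is specific to your setup):

import torch

print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none (CPU mode)")

try:
    import detectron2
    print("detectron2:", detectron2.__version__)
except ImportError:
    print("detectron2 not importable")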

I’ve experienced issues with differing results on detectron2 as well: even using the PyTorch flags for deterministic training, in addition to setting the same random seed, didn’t fix it so far.
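
For reference, these are roughly the flags I mean (a sketch; torch.use_deterministic_algorithms needs a fairly recent PyTorch, and even then PyTorch only promises reproducibility on identical hardware and library versions, not across different machines):

import os
import random

import numpy as np
import torch

os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic CUDA ops

random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)  # raises on ops that lack a deterministic implementation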