Purpose of normalization in Mask R-CNN

What is the purpose of the normalization layer in the first transform layer in Mask R-CNN?

MaskRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )

And how are those values calculated?

These stats were computed from the ImageNet training set and are used for the pretrained classification models, as explained here.
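
For reference, this is the standard classification preprocessing those stats belong to (a minimal sketch using torchvision transforms). Inside Mask R-CNN the same normalization is applied internally by GeneralizedRCNNTransform, so the model itself expects images in the [0, 1] range:

from torchvision import transforms

# ImageNet statistics used by the pretrained torchvision models
imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

preprocess = transforms.Compose([
    transforms.ToTensor(),                       # uint8 HWC in [0, 255] -> float CHW in [0, 1]
    transforms.Normalize(mean=imagenet_mean,     # per-channel: x = (x - mean) / std
                         std=imagenet_std),
])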

Is it OK to use mean = 0 and std = 1 when I am not using a pretrained model?

Yes, you could use these values, but note that the normalization would then be a no-op, so you might as well remove the Normalize transformation completely.
You might also see slower convergence, since your input data is no longer standardized.
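
If you go this route, a minimal sketch of the setup, assuming torchvision's maskrcnn_resnet50_fpn (num_classes is a placeholder, not from this thread), could look like this:

import torchvision

# Build Mask R-CNN without pretrained weights and make the internal
# normalization a no-op via mean=0 and std=1, since (x - 0) / 1 == x.
# num_classes=3 is a placeholder (2 object classes + background).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None,                 # use pretrained=False on older torchvision releases
    num_classes=3,
    image_mean=[0.0, 0.0, 0.0],
    image_std=[1.0, 1.0, 1.0],
)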

I want to process infrared images, which I convert to grayscale, and I am not sure how to normalize them properly. I have no access to the temperature information that is stored in the images.
Is there a paper that describes why and how normalization should be done for different types of images?

In the common approach (e.g. ImageNet training) you would first normalize the images to the value range [0, 1] (this is done via transforms.ToTensor()) and afterwards standardize these tensors to zero mean and unit variance via Normalize(mean=..., std=...). These stats were calculated from the training dataset beforehand, and you can do the same: iterate the training data once, calculate the mean and std over all training samples, and store these stats (see the sketch below).
If that's not possible for some reason, you might just normalize to the [0, 1] range and check whether this already allows the model to converge.
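
A minimal sketch of that one-off statistics pass, assuming each dataset sample yields an image tensor in [0, 1] with shape [C, H, W] (the function name and dataset interface are illustrative, not from this thread):

import torch

def compute_mean_std(dataset):
    # Accumulate per-channel sums over all training images; images may have
    # different sizes, so pixels are accumulated per image rather than batched.
    channel_sum = None
    channel_sq_sum = None
    num_pixels = 0
    for sample in dataset:
        image = sample[0] if isinstance(sample, (list, tuple)) else sample
        pixels = image.flatten(start_dim=1)          # [C, H*W]
        if channel_sum is None:
            channel_sum = torch.zeros(pixels.shape[0])
            channel_sq_sum = torch.zeros(pixels.shape[0])
        channel_sum += pixels.sum(dim=1)
        channel_sq_sum += pixels.pow(2).sum(dim=1)
        num_pixels += pixels.shape[1]
    mean = channel_sum / num_pixels
    std = (channel_sq_sum / num_pixels - mean.pow(2)).sqrt()
    return mean, std

# mean, std = compute_mean_std(train_dataset)   # one value per channel (a single value for grayscale)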

The images are normalized to [0, 1]. I can derive the mean and std from the training dataset, but what if the conditions when capturing images for inference are different? That would lead to a different mean and std. If the conditions were the same for all images this would work, but I am worried about changing conditions.
During training the loss indicates convergence and drops very fast, but the AP does not look good:

Epoch: [0]  [   0/2000]  eta: 4:03:44  lr: 0.000010  loss: 6.0182 (6.0182)  loss_classifier: 1.1262 (1.1262)  loss_box_reg: 0.0003 (0.0003)  loss_mask: 4.1898 (4.1898)  loss_objectness: 0.6997 (0.6997)  loss_rpn_box_reg: 0.0022 (0.0022)  time: 7.3125  data: 3.9825  max mem: 5796
Epoch: [0]  [  10/2000]  eta: 0:36:13  lr: 0.000060  loss: 4.2399 (4.2851)  loss_classifier: 1.0880 (1.0858)  loss_box_reg: 0.0002 (0.0003)  loss_mask: 2.4519 (2.4992)  loss_objectness: 0.6978 (0.6976)  loss_rpn_box_reg: 0.0019 (0.0022)  time: 1.0923  data: 0.3691  max mem: 6069
Epoch: [0]  [  20/2000]  eta: 0:26:16  lr: 0.000110  loss: 2.5391 (3.3457)  loss_classifier: 1.0217 (1.0338)  loss_box_reg: 0.0001 (0.0002)  loss_mask: 0.7984 (1.6136)  loss_objectness: 0.6949 (0.6960)  loss_rpn_box_reg: 0.0018 (0.0021)  time: 0.4703  data: 0.0094  max mem: 6069
Epoch: [0]  [  30/2000]  eta: 0:22:39  lr: 0.000160  loss: 2.1717 (2.9366)  loss_classifier: 0.8988 (0.9540)  loss_box_reg: 0.0001 (0.0003)  loss_mask: 0.5829 (1.2859)  loss_objectness: 0.6930 (0.6942)  loss_rpn_box_reg: 0.0019 (0.0022)  time: 0.4688  data: 0.0071  max mem: 6069
Epoch: [0]  [  40/2000]  eta: 0:20:46  lr: 0.000210  loss: 1.8764 (2.6279)  loss_classifier: 0.6145 (0.8312)  loss_box_reg: 0.0010 (0.0008)  loss_mask: 0.5584 (1.1024)  loss_objectness: 0.6871 (0.6913)  loss_rpn_box_reg: 0.0021 (0.0022)  time: 0.4680  data: 0.0032  max mem: 6069
Epoch: [0]  [  50/2000]  eta: 0:19:37  lr: 0.000260  loss: 1.4983 (2.3771)  loss_classifier: 0.2249 (0.6934)  loss_box_reg: 0.0032 (0.0015)  loss_mask: 0.5443 (0.9956)  loss_objectness: 0.6727 (0.6843)  loss_rpn_box_reg: 0.0022 (0.0022)  time: 0.4703  data: 0.0049  max mem: 6069
Epoch: [0]  [  60/2000]  eta: 0:18:50  lr: 0.000310  loss: 1.2695 (2.1780)  loss_classifier: 0.0651 (0.5891)  loss_box_reg: 0.0045 (0.0020)  loss_mask: 0.5135 (0.9153)  loss_objectness: 0.6278 (0.6693)  loss_rpn_box_reg: 0.0022 (0.0022)  time: 0.4734  data: 0.0065  max mem: 6069
...
Epoch: [0]  [1960/2000]  eta: 0:00:19  lr: 0.005000  loss: 0.0843 (0.2153)  loss_classifier: 0.0142 (0.0336)  loss_box_reg: 0.0071 (0.0117)  loss_mask: 0.0615 (0.1386)  loss_objectness: 0.0006 (0.0300)  loss_rpn_box_reg: 0.0005 (0.0014)  time: 0.4795  data: 0.0055  max mem: 6069
Epoch: [0]  [1970/2000]  eta: 0:00:14  lr: 0.005000  loss: 0.0849 (0.2147)  loss_classifier: 0.0142 (0.0335)  loss_box_reg: 0.0076 (0.0117)  loss_mask: 0.0615 (0.1383)  loss_objectness: 0.0005 (0.0299)  loss_rpn_box_reg: 0.0006 (0.0014)  time: 0.4804  data: 0.0069  max mem: 6069
Epoch: [0]  [1980/2000]  eta: 0:00:09  lr: 0.005000  loss: 0.0891 (0.2141)  loss_classifier: 0.0145 (0.0334)  loss_box_reg: 0.0080 (0.0116)  loss_mask: 0.0641 (0.1379)  loss_objectness: 0.0005 (0.0297)  loss_rpn_box_reg: 0.0006 (0.0014)  time: 0.4799  data: 0.0069  max mem: 6069
Epoch: [0]  [1990/2000]  eta: 0:00:04  lr: 0.005000  loss: 0.0877 (0.2134)  loss_classifier: 0.0145 (0.0333)  loss_box_reg: 0.0076 (0.0116)  loss_mask: 0.0648 (0.1375)  loss_objectness: 0.0005 (0.0296)  loss_rpn_box_reg: 0.0006 (0.0014)  time: 0.4795  data: 0.0063  max mem: 6069
Epoch: [0]  [1999/2000]  eta: 0:00:00  lr: 0.005000  loss: 0.0874 (0.2129)  loss_classifier: 0.0146 (0.0332)  loss_box_reg: 0.0075 (0.0116)  loss_mask: 0.0639 (0.1372)  loss_objectness: 0.0005 (0.0295)  loss_rpn_box_reg: 0.0005 (0.0014)  time: 0.4789  data: 0.0055  max mem: 6069
Epoch: [0] Total time: 0:16:03 (0.4819 s / it)
Test:  [  0/500]  eta: 0:13:52  model_time: 0.5020 (0.5020)  evaluator_time: 0.0313 (0.0313)  time: 1.6658  data: 1.1165  max mem: 6069
Test:  [100/500]  eta: 0:01:52  model_time: 0.2344 (0.2321)  evaluator_time: 0.0156 (0.0232)  time: 0.2664  data: 0.0102  max mem: 6069
Test:  [200/500]  eta: 0:01:22  model_time: 0.2344 (0.2309)  evaluator_time: 0.0156 (0.0233)  time: 0.2680  data: 0.0039  max mem: 6069
Test:  [300/500]  eta: 0:00:54  model_time: 0.2344 (0.2305)  evaluator_time: 0.0156 (0.0230)  time: 0.2666  data: 0.0078  max mem: 6069
Test:  [400/500]  eta: 0:00:27  model_time: 0.2344 (0.2306)  evaluator_time: 0.0157 (0.0236)  time: 0.2680  data: 0.0039  max mem: 6069
Test:  [499/500]  eta: 0:00:00  model_time: 0.2344 (0.2302)  evaluator_time: 0.0156 (0.0236)  time: 0.2672  data: 0.0055  max mem: 6069
Test: Total time: 0:02:15 (0.2714 s / it)
Averaged stats: model_time: 0.2344 (0.2302)  evaluator_time: 0.0156 (0.0236)
Accumulating evaluation results...
DONE (t=0.28s).
Accumulating evaluation results...
DONE (t=0.38s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.015
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.009
IoU metric: segm
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.001
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.011
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.014
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.014
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.014
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.014
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.009

The detected bounding boxes and masks are nearly perfect when I test on some images, but the classification is very bad: each of the two classes gets a score of around 0.5.
I have no idea how to improve this.

This is a general concern and not specific to normalization.
Even if you do not normalize the input data, the model will still “learn” the training data distribution. If the validation or test data distribution changes, worse performance is to be expected.
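
If you want to check how strongly the acquisition conditions differ, a rough sanity check is to compare the per-image statistics of the training data and the inference data. A hypothetical sketch (dataset names are placeholders):

def per_image_stats(dataset, max_samples=200):
    # Rough distribution check: average per-image mean/std over up to
    # max_samples images, assuming each image is a tensor in [0, 1].
    means, stds = [], []
    for i, sample in enumerate(dataset):
        if i >= max_samples:
            break
        image = sample[0] if isinstance(sample, (list, tuple)) else sample
        means.append(image.mean().item())
        stds.append(image.std().item())
    return sum(means) / len(means), sum(stds) / len(stds)

# print(per_image_stats(train_dataset))
# print(per_image_stats(inference_dataset))
# A large gap between the two suggests the acquisition conditions have shifted.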