Have I shot myself in the foot by training on a cluster but doing inference on a CPU?

I have trained a segmentation model on a large cluster (I don't have the exact specs at the moment, but think 512 GB RAM, multiple Ampere/Volta GPUs, etc.).

When I run inference on the cluster I get excellent results. When I run inference on the same data on my local GPU or CPU machine(s) I get significantly worse results.

In researching this I came across the topic of floating point precision: FP32, TF32, etc…

I cannot rely on the cluster to do inference. What is my best course of action?
I assume, from my limited understanding, that I need to retrain with TF32 turned off and probably train for longer to achieve comparable results?
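
From what I have read so far, disabling TF32 on the cluster for a quick inference-only test (before committing to a full retrain) would look roughly like this; this is just my understanding of the TF32 flags, so please correct me if I got it wrong:

```python
import torch

# Disable TF32 for matmuls and cuDNN convolutions so the Ampere GPUs compute
# in full FP32, which should match what my local CPU/GPU machines are doing.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```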

Try to narrow down where the difference comes from, as the same issue has already been discussed multiple times in this forum. Common reasons were:

  • different data processing,
  • model checkpointing failed (e.g. strict=False was used, thus not loading any parameters),
  • batchnorm layers collapsed in eval() mode, etc.

Numerical precision could of course also be involved, but I see the other potential reasons as more likely.
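
One quick way to narrow it down: save a single processed batch together with the corresponding model output on the cluster, copy the file over, and compare it against the locally recomputed output. A rough sketch, assuming the checkpoint is already loaded into `model`; the file name and tolerance are placeholders:

```python
import torch

# On the cluster, after preprocessing one batch:
#   torch.save({"batch": batch.cpu(), "out": model(batch).detach().cpu()}, "reference.pt")

def compare_with_cluster(model, reference_path="reference.pt", atol=1e-4):
    """Rerun the saved batch locally and report how far the outputs drift."""
    model.eval()
    ref = torch.load(reference_path, map_location="cpu")
    with torch.no_grad():
        local_out = model(ref["batch"])
    diff = (local_out - ref["out"]).abs()
    print(f"max abs diff:  {diff.max().item():.3e}")
    print(f"mean abs diff: {diff.mean().item():.3e}")
    print("allclose:", torch.allclose(local_out, ref["out"], atol=atol))
```

Small numerical noise is expected, but a large mismatch on an identical input points at the checkpoint or model setup rather than the data pipeline.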

different data processing

The same data and the same script were used on the cluster and locally, so I don't think this is it.
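
To rule it out completely, I could save one preprocessed sample on the cluster and diff it against what my local pipeline produces; something along these lines (the file name is a placeholder):

```python
import torch

def compare_preprocessed_sample(local_dataset, reference_path="cluster_sample0.pt"):
    """Compare the first locally preprocessed sample against the tensor saved on the cluster."""
    # On the cluster: torch.save(dataset[0][0], "cluster_sample0.pt"), then copy the file over.
    cluster_x = torch.load(reference_path, map_location="cpu")
    local_x = local_dataset[0][0]
    print("shapes:", tuple(local_x.shape), tuple(cluster_x.shape))
    print("max abs diff:", (local_x - cluster_x).abs().max().item())
```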

model checkpointing failed (e.g. strict=False was used, thus not loading any parameters),

strict was set to True
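
For reference, this is roughly how I load the checkpoint (the path is a placeholder, and I assume the file holds the raw state_dict); since strict=True raises on any missing or unexpected key, I don't think a silent partial load could have happened:

```python
import torch

def load_checkpoint(model, ckpt_path="checkpoint.pt"):
    """Load the state_dict strictly; any key mismatch raises instead of being skipped."""
    state_dict = torch.load(ckpt_path, map_location="cpu")
    result = model.load_state_dict(state_dict, strict=True)
    # With strict=True these lists are empty whenever the call returns at all.
    print("missing keys:", result.missing_keys)
    print("unexpected keys:", result.unexpected_keys)
    return model
```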

batchnorm layers collapsed in eval() mode, etc.

This is where I am not sure; how can I check this?
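
From what I found in other threads, I believe inspecting the running statistics of every batchnorm layer would be a starting point; is something like this what you had in mind?

```python
import torch
import torch.nn as nn

def inspect_batchnorm_stats(model):
    """Print running_mean/running_var ranges per BatchNorm layer to spot collapsed or invalid stats."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    for name, module in model.named_modules():
        if isinstance(module, bn_types) and module.running_mean is not None:
            mean, var = module.running_mean, module.running_var
            bad = torch.isnan(mean).any() or torch.isnan(var).any() or (var <= 0).any()
            print(
                f"{name}: mean [{mean.min().item():.3e}, {mean.max().item():.3e}], "
                f"var [{var.min().item():.3e}, {var.max().item():.3e}], "
                f"suspicious: {bool(bad)}"
            )
```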