I have trained a segmentation model on a large cluster (I don't have the exact specs at hand, but roughly 512 GB RAM and multiple Ampere/Volta GPUs).
When I run inference on the cluster I get excellent results, but when I run inference on the same data on my local GPU or CPU machines the results are significantly worse.
In researching this I came across the topic of floating-point precision: FP32, TF32, and so on.
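For reference, here is the kind of experiment I put together from my reading to check whether TF32 is actually in play (a sketch assuming PyTorch on an Ampere GPU; the matrix sizes and the expected difference are just illustrative, and I understand the TF32 default has changed across PyTorch versions):

```python
import torch

# On Ampere GPUs, PyTorch can execute FP32 matmuls as TF32, which keeps the
# FP32 exponent range but rounds the mantissa to 10 bits on the tensor cores.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

torch.backends.cuda.matmul.allow_tf32 = True   # force the TF32 path
out_tf32 = a @ b

torch.backends.cuda.matmul.allow_tf32 = False  # force true FP32
out_fp32 = a @ b

# A noticeable max difference here (roughly 1e-2 at this scale) would
# suggest TF32 is active on the cluster.
print((out_tf32 - out_fp32).abs().max())
```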
I cannot rely on the cluster to do inference. What is my best course of action?
From my limited understanding, I assume I need to retrain with TF32 off (and probably train for longer) to achieve comparable results?
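For what it's worth, if it helps to be concrete: the switches I've found for turning TF32 off in PyTorch look like the following (again a sketch based on the docs, not something I've verified end-to-end on the cluster):

```python
import torch

# Disable TF32 so FP32 matmuls and convolutions are computed in true FP32
# on the cluster, matching what my local non-Ampere GPU / CPU machines do.
torch.backends.cuda.matmul.allow_tf32 = False  # matmuls / linear layers
torch.backends.cudnn.allow_tf32 = False        # cuDNN convolutions
```

My plan would be to set these at the top of the training script before retraining. Is that the right approach, or is there a way to close the gap without a full retrain?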