About the memory: with AMP's automatic casting, for very small networks the memory cost of tensors that exist twice (the original FP32 copy plus the FP16 cast) might outweigh the savings. (This assumes you measure with PyTorch's own memory counting; if you look at nvidia-smi, you might also run into caching-allocator effects.)
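As a rough way to check this yourself, here is a minimal sketch (assuming a CUDA GPU and a deliberately tiny toy model, a single `nn.Linear`) that compares PyTorch's allocator counters with and without autocast; the exact numbers depend on your model and batch size:

```python
import torch

model = torch.nn.Linear(64, 64).cuda()          # very small network
x = torch.randn(32, 64, device="cuda")

def peak_mem(use_amp: bool) -> int:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    # autocast keeps FP32 weights and creates FP16 copies for the matmul
    with torch.autocast("cuda", dtype=torch.float16, enabled=use_amp):
        model(x).sum().backward()
    # bytes actually allocated by tensors, not what nvidia-smi reports
    return torch.cuda.max_memory_allocated()

print("fp32 peak:", peak_mem(False))
print("amp  peak:", peak_mem(True))   # can be higher for tiny models
```

`max_memory_allocated()` tracks tensor allocations only, so it avoids the caching-allocator noise you'd see in nvidia-smi (which shows reserved memory, not used memory).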