Running the transformer model nanoGPT with the 'mps' backend on macOS 13.4 (AMD Radeon Pro 5700 XT), I start getting -Inf and NaN values after several thousand training iterations.
When I switch the backend from 'mps' to 'cpu', there are no -Infs or NaNs.
When I decreased the block_size and the batch_size, the problem stopped.
With the original, larger block_size and batch_size, saving the tensor inside the forward method also stopped the issue (a change in timing?).
The -Infs first appear in the LayerNorm call:
F.layer_norm(input, self.weight.shape, self.weight, self.bias, 1e-5)
Does this sound like an MPS resource issue?
Is there a way to display resource usage from inside the forward method?
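One approach I have been experimenting with (not an official recipe, and the module names below are illustrative): torch.mps exposes current_allocated_memory() and driver_allocated_memory() in PyTorch 2.0+, and a forward hook can log them alongside a finiteness check on each layer's output, so the first layer producing -Inf/NaN is identified together with the memory state at that moment. A rough sketch:

```python
import torch
import torch.nn as nn


def make_debug_hook(name):
    """Forward hook: flag non-finite outputs and log MPS memory if available."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"{name}: non-finite values in output "
                  f"(min={output.min().item()}, max={output.max().item()})")
        if torch.backends.mps.is_available():
            # torch.mps memory counters (available in PyTorch >= 2.0)
            print(f"{name}: mps allocated={torch.mps.current_allocated_memory()} B, "
                  f"driver={torch.mps.driver_allocated_memory()} B")
    return hook


# Example: attach the hook to every LayerNorm in a (toy) model
model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8))
for name, mod in model.named_modules():
    if isinstance(mod, nn.LayerNorm):
        mod.register_forward_hook(make_debug_hook(name))
```

With the hooks in place, a normal forward pass through the model prints a line as soon as any monitored layer emits -Inf or NaN, which might help confirm whether the corruption really originates in the LayerNorm or upstream of it.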