When you call bit_tensor.view(torch.float32) on your uint32 tensor, you already have three values that are at or beyond the edge of what float32 can represent: the first one (3.4028e+38) is exactly float32's maximum and barely fits, and the other two are inf and nan. So you are moving tensor([3.4028e+38, inf, nan]) to your CUDA device. That's the issue. If you call tensor.view(torch.float32) on your uint32 tensor, you will see exactly this.
Sure. Actually, what I want is to convert tensor([3.4028e+38, inf, nan], dtype=torch.float32) to torch.float8_e4m3fn.
But tensor.to(torch.float8_e4m3fn) gives tensor([nan, nan, nan], dtype=torch.float8_e4m3fn), where nan = 0x7F. This differs from the CUDA C result, which maps [3.4028e+38, inf] to max_normal = 0x7E. Any reason for this gap?
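The gap looks like a difference in overflow policy, not a bug in either side: the CUDA C path appears to use a saturating conversion (clamping out-of-range finite values and inf to the largest e4m3fn normal, 0x7E = 448.0, which is what CUDA's `__NV_SATFINITE` mode does), while PyTorch's `.to()` maps anything that overflows to the format's single NaN encoding (0x7F, since "fn" means finite-only with one NaN). Here is a minimal stdlib-only sketch of just the overflow handling under that assumption; `float_to_e4m3fn_bits` is a hypothetical toy helper, not a real API, and the rounding/encoding of in-range values is omitted:

```python
import math

E4M3_MAX = 448.0   # largest finite e4m3fn value, bit pattern 0x7E
E4M3_NAN = 0x7F    # the single NaN encoding in e4m3fn ("fn" = finite, one NaN)

def float_to_e4m3fn_bits(x: float, satfinite: bool) -> int:
    """Toy model of float -> e4m3fn conversion, overflow handling only.
    satfinite=True  mimics a saturating conversion (CUDA __NV_SATFINITE-style):
                    overflow and inf clamp to max_normal 0x7E.
    satfinite=False mimics the observed torch .to() behaviour:
                    overflow and inf become NaN (0x7F).
    In-range encoding/rounding is deliberately omitted from this sketch."""
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    if math.isnan(x):
        return sign | E4M3_NAN            # NaN stays NaN in both modes
    if abs(x) > E4M3_MAX:                 # covers both inf and float32 max
        return sign | (0x7E if satfinite else E4M3_NAN)
    raise NotImplementedError("in-range encoding not shown in this sketch")

values = [3.4028e+38, float("inf"), float("nan")]
print([hex(float_to_e4m3fn_bits(v, satfinite=True)) for v in values])
print([hex(float_to_e4m3fn_bits(v, satfinite=False)) for v in values])
```

With `satfinite=True` the first two values clamp to 0x7E (matching the CUDA C result), and with `satfinite=False` all three come out 0x7F (matching what `.to(torch.float8_e4m3fn)` returned). If you want the saturating behaviour in PyTorch, you can clamp to ±448.0 before casting.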