# CPU and GPU tensor norm different

``````python
import torch

torch.manual_seed(0)
W = torch.FloatTensor(10, 5000).normal_(0, 1)

print(W.norm(2) ** 2, W.sum())
W = W.cuda()

print(W.norm(2) ** 2, W.sum())

W = W.cpu()
print(W.norm(2) ** 2, W.sum())
``````

When I run this code, the results are:

``````
tensor(97479.4609) tensor(-358.1119)
tensor(50004.2148, device='cuda:0') tensor(-358.1118, device='cuda:0')
tensor(97479.4609) tensor(-358.1119)
``````

The norm results are significantly different, but the sum results are almost the same.
When I change the size of the tensor W from (10, 5000) to (10, 1000), I get these results:

``````
tensor(10038.8496) tensor(-106.9817)
tensor(10038.8516, device='cuda:0') tensor(-106.9818, device='cuda:0')
tensor(10038.8496) tensor(-106.9817)
``````

Now the norm results are almost the same.
How can these results be explained?
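One way to sanity-check which number is plausible without any GPU at all: for i.i.d. standard-normal entries, E[x²] = 1, so the squared 2-norm of a tensor with n elements should concentrate near n. For shape (10, 5000) that is 50,000, which matches the GPU result (~50004), not the CPU one (~97479). A minimal sketch in plain stdlib Python (not PyTorch; `n` and the seed are arbitrary choices):

```python
import random

random.seed(0)
n = 50_000

# For standard-normal samples, E[x^2] = 1, so the sum of n squares
# concentrates around n (standard deviation ~ sqrt(2 * n) ~ 316 here).
sq = sum(x * x for x in (random.gauss(0.0, 1.0) for _ in range(n)))
print(sq)  # close to 50_000 -> the GPU value is the plausible one
```

The same argument for the (10, 1000) case predicts ~10,000, which both devices report.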

Interesting results! I tried it on Google Colab, and the problem occurred with a large tensor.

``````python
import torch
torch.manual_seed(0)
W = torch.FloatTensor(100, 5000000).normal_(0,1)

print(W.norm(2) ** 2, W.sum())
W = W.cuda()

print(W.norm(2) ** 2, W.sum())

W = W.cpu()
print(W.norm(2) ** 2, W.sum())
!nvidia-smi
``````
``````
tensor(1.5166e+08) tensor(-34600.6719)
tensor(5.0001e+08, device='cuda:0') tensor(-34600.1719, device='cuda:0')
tensor(1.5166e+08) tensor(-34600.6719)
Mon May 27 07:17:41 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    31W /  70W |   3244MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
``````

It seems to be a CPU-side problem.

``````python
import torch

HEIGHT = 100
WIDTH = 5000000
torch.manual_seed(0)
W = torch.FloatTensor(HEIGHT, WIDTH).normal_(0, 1)
Wv = W.view(HEIGHT * WIDTH, 1)
WTW = Wv * Wv
print(WTW.sum())
print(W.norm(2) ** 2, W.sum())

W = W.cuda()
Wv = W.view(HEIGHT * WIDTH, 1)
WTW = Wv * Wv
print(WTW.sum())
print(W.norm(2) ** 2, W.sum())

W = W.cpu()
print(W.norm(2) ** 2, W.sum())
!nvidia-smi
``````
``````
tensor(4.8664e+08)
tensor(1.5166e+08) tensor(-34600.6719)
tensor(5.0001e+08, device='cuda:0')
tensor(5.0001e+08, device='cuda:0') tensor(-34600.1719, device='cuda:0')
tensor(1.5166e+08) tensor(-34600.6719)
Mon May 27 07:41:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    31W /  70W |   5152MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
``````

Oh, thanks, I understand. It is a bug on the CPU side; the GPU side has no problem.
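If you hit this on an affected version, one hedged workaround is to accumulate in higher precision, e.g. `W.double().norm(2) ** 2` on the CPU. The stdlib analogue of that idea is `math.fsum`, which tracks rounding error instead of discarding it; a small sketch with contrived data whose naive single-precision sum is badly wrong:

```python
import math
import struct

def f32(x):
    """Round to the nearest IEEE-754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Contrived data: the exact sum is 2**24 + 1023 = 16_778_239.
data = [2.0 ** 24] + [1.0] * 1023

# Naive float32 running sum: every 1.0 is lost once the accumulator
# reaches 2**24, where adjacent float32 values are 2.0 apart.
naive = 0.0
for x in data:
    naive = f32(naive + x)

# math.fsum keeps exact partial sums and returns a correctly rounded total.
exact = math.fsum(data)
print(naive, exact)  # 16777216.0 16778239.0
```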