During deep learning training, the values stored in this variable are updated under `with torch.no_grad():`.
When training with DDP, each GPU just computes an average over its own batch samples, so no synchronization across GPUs is required. Is there a difference between the declarations below? The `register_buffer` version performs worse. Since no gradient flows through this tensor, I don't understand why the performance difference is so significant.
Example in the model's `__init__`:
- self.matrix = nn.Parameter(torch.zeros(197, 192).cuda(), requires_grad=False)
- self.matrix = torch.zeros(197, 192).cuda()
- self.register_buffer('matrix', torch.zeros(197, 192))
Options 1 and 3 give the same performance, but option 2 is different.
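For context, here is a minimal sketch of the three declarations in one module (on CPU for portability; the module and tensor shapes follow the snippets above, and the `update` method is a hypothetical example of a `no_grad` update). The comments note the known registration differences: parameters and buffers appear in `state_dict` and follow `.to()`/`.cuda()`, while a plain tensor attribute does neither; DDP also treats buffers specially (it broadcasts them from rank 0 each forward pass when `broadcast_buffers=True`, the default).

```python
import torch
import torch.nn as nn

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        # 1) Frozen parameter: registered in state_dict, moved by .to()/.cuda(),
        #    broadcast from rank 0 once when DDP wraps the model.
        self.matrix_param = nn.Parameter(torch.zeros(197, 192), requires_grad=False)
        # 2) Plain attribute: NOT in state_dict, NOT moved by .to()/.cuda(),
        #    and DDP never touches it.
        self.matrix_plain = torch.zeros(197, 192)
        # 3) Buffer: in state_dict, moved by .to()/.cuda(); by default DDP
        #    re-broadcasts buffers from rank 0 at every forward pass.
        self.register_buffer("matrix", torch.zeros(197, 192))

    @torch.no_grad()
    def update(self, x):
        # Hypothetical no_grad update: accumulate the per-batch mean.
        self.matrix += x.mean(dim=0)

m = Demo()
m.update(torch.ones(4, 197, 192))  # buffer is updated outside autograd
```

Checking `dict(m.named_parameters())`, `dict(m.named_buffers())`, and `m.state_dict()` confirms that only versions 1 and 3 are registered with the module, which is where DDP's differing treatment comes from.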