But when I do this, my clients' models stop updating. I think that's because I changed the parameters referenced in their optimizers.
Now I'm doing it with:
with torch.no_grad():
    for i in range(number_of_clients):
        state_dict = model_dict[name_of_models[i]].state_dict()
But when I do this, I get the error '_IncompatibleKeys' object has no attribute 'train' while training my model.
I would appreciate it if anyone could give me advice on how to do this properly.
Thanks
Since you mentioned federated learning, shouldn’t the data transfer be in a distributed environment? How about the following code snippet?
# Flatten all the parameters of the cloud model into a contiguous buffer to prepare for data transfer.
flat_params = torch.cat([p.data.view(-1) for p in model.parameters()])

# Broadcast the tensors or call process group send/recv?
...

# Copy the parameters into the client model layer by layer.
offset = 0
for p in module.parameters():
    p.data = flat_params[offset : offset + p.numel()].view_as(p)
    offset += p.numel()
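To make the transfer step concrete, the whole thing could look roughly like this (only a sketch; it assumes the torch.distributed process group is already initialized and that rank 0 holds the cloud model's weights):

import torch
import torch.distributed as dist

def broadcast_cloud_params(model):
    # flatten all parameters into one contiguous buffer
    flat_params = torch.cat([p.data.view(-1) for p in model.parameters()])
    # rank 0 sends its buffer; every other rank receives into it
    dist.broadcast(flat_params, src=0)
    # walk the buffer and copy each layer's slice back into its parameter
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            p.copy_(flat_params[offset : offset + p.numel()].view_as(p))
            offset += p.numel()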
Thanks for replying.
First of all, I thought it's not recommended to use .data in our code.
Second, sorry, I didn't understand what that offset is supposed to do. Could you explain more?
In my scenario I have 10 clients (or nodes), and each of them has its own model with the same architecture. My code works when I update the client models layer by layer, with code along these lines.
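(Simplified sketch of what I mean; main_model is just a placeholder for the central/server model, the other names are from my snippet above.)

with torch.no_grad():
    for i in range(number_of_clients):
        client_model = model_dict[name_of_models[i]]
        # copy each layer's weights in place, so the Parameter objects
        # that the client optimizers reference stay the same
        for client_param, main_param in zip(client_model.parameters(), main_model.parameters()):
            client_param.copy_(main_param)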
You can convert the model weights to a single tensor with the parameters_to_vector function, communicate it between ranks, and convert it back to weights with the vector_to_parameters function.
I'm using a collective communication function to synchronize the weight parameters on all GPUs, but you can change that to point-to-point communication functions such as torch.distributed.send and torch.distributed.recv.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from torch.nn.utils import parameters_to_vector, vector_to_parameters
model = ...
# synchronize model parameters across nodes
vector = parameters_to_vector(model.parameters())
dist.broadcast(vector, 0) # broadcast parameters to other processes
if dist.get_rank() != 0:
    vector_to_parameters(vector, model.parameters())
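If you want point-to-point instead of the collective, the same flow with send/recv could look like this (just a sketch using the same imports as above, assuming rank 0 is the server and every other rank is a client):

vector = parameters_to_vector(model.parameters())
if dist.get_rank() == 0:
    # server: send the flattened weights to every client rank
    for dst in range(1, dist.get_world_size()):
        dist.send(vector, dst=dst)
else:
    # client: receive the flattened weights and unpack them into the model
    dist.recv(vector, src=0)
    vector_to_parameters(vector, model.parameters())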