Hi, I am not sure if we call a layer define in the __init__ for multiple times, do they share weights in the traning? For example, we have a function fc1 defined as:
I am also trying to do the same for LLMs but when I use the same module two times in forward pass, the shared memory in the GPUs just gets doubled. If they were using same shared parameter then it would not happen right??
No, that not correct since forward activations will be stored if these are needed for the gradient computation in the backward pass. Reusing parameters does not change this behavior.