Solving an ODE using PyTorch

I am trying to solve an ODE using PyTorch. The ODE has the form

du/dt = cos(2*3.14*t)
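
For reference, this ODE can be integrated in closed form. A minimal sketch, assuming the initial condition u(0) = 1 that the trial solution further down enforces; `exact_u` and `OMEGA` are names introduced here for illustration, and 3.14 is kept as the approximation of pi used in the equation above:

```python
import math

OMEGA = 2 * 3.14  # angular frequency as written in the ODE (3.14 approximates pi)

def exact_u(t):
    """Closed-form solution of du/dt = cos(OMEGA * t) with u(0) = 1."""
    return 1.0 + math.sin(OMEGA * t) / OMEGA

# sample the reference solution at a few points in [0, 1]
for t in (0.0, 0.25, 0.5, 1.0):
    print(t, exact_u(t))
```

A trained network can be compared against these values to judge whether it has learned the solution.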

I parameterise my neural network as a two-layer network with tanh as the activation function in between. It takes a 1-dimensional input and returns a 1-dimensional output, with a hidden layer of size 32.
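
The architecture described above could be written with `torch.nn` roughly as follows. This is only a sketch: the class name `ODENet` is hypothetical, and `nn.Linear` adds bias terms by default, which the hand-written version below does not have.

```python
import torch
from torch import nn

class ODENet(nn.Module):
    """1 -> 32 -> 1 network with a tanh activation between the two linear layers."""

    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t):
        # t has shape (batch, 1); the output has the same shape
        return self.net(t)

model = ODENet()
t = torch.linspace(0, 1, 100).reshape(-1, 1)  # 100 sample times as a column
print(model(t).shape)  # torch.Size([100, 1])
```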

    
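import torch
import matplotlib.pyplot as plt
from torch.autograd import Variable
from tqdm import tqdm

def f(x):
    """
    Compute the output of the neural net for a single input x of shape (1,).
    """
    # hidden layer: (32, 1) @ (1,) -> (32,), reshaped to a column (32, 1)
    l1 = torch.matmul(W1.T, x).reshape(-1, 1)
    l1_act = torch.tanh(l1)  # tanh activation, as described above
    # output layer: (1, 32) @ (32, 1) -> (1, 1)
    l2 = torch.matmul(W2.T, l1_act)
    return l2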

def g(t):
    """
    Trial solution, chosen so that the initial condition
    u(0) = 1 is satisfied for any network weights.
    """
    return t*f(t) + torch.tensor([1.])

def loss(t, eps):
    """
    Squared error between the forward finite-difference derivative
    of g at t and the right-hand side of the ODE, cos(2*3.14*t).
    """
    du_dt = (g(t + eps) - g(t)) / eps
    return torch.mean((du_dt - torch.cos(2 * 3.14 * t)) ** 2)
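
The loss above approximates du/dt with a forward finite difference. As a sanity check on that approximation, it can be compared against the derivative computed by autograd. The sketch below uses a stand-in trial function `g_demo` with a hypothetical fixed weight `w` rather than the network above, so that it runs on its own:

```python
import torch

w = torch.tensor(0.7)  # hypothetical fixed weight, for illustration only

def g_demo(t):
    # same structure as the trial solution: t * net(t) + 1, so g_demo(0) = 1
    return t * torch.tanh(w * t) + 1.0

t = torch.linspace(0, 1, 100, requires_grad=True)
eps = 0.000345  # same step size as eps in this post

# forward finite difference, as in the loss
fd = (g_demo(t + eps) - g_demo(t)) / eps

# derivative from autograd; summing lets one call return dg/dt for every sample
(ad,) = torch.autograd.grad(g_demo(t).sum(), t)

# largest disagreement between the two derivative estimates
print((fd - ad).abs().max().item())
```

If the two estimates agree closely, the finite-difference step is unlikely to be the source of a training problem.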

This is the model setup:

x = torch.tensor([.01])
eps = torch.tensor([0.000345])
T0 = 0
T1 = 1
nsamples = 100
t = torch.linspace(T0, T1, nsamples)

W1 = Variable(torch.ones(1, 32), requires_grad=True)
W2 = Variable(torch.ones(32, 1), requires_grad=True)

I generate 100 data points between 0 and 1 and train the network on them for 5000 epochs. My training loop looks like this:


learning_rate = 1e-3
for it in tqdm(range(5000)):
    # accumulate the loss over all sample points
    err = 0
    for ti in t:
        ti = ti.reshape(1)  # 0-dim sample -> shape (1,)
        err += loss(ti, eps)

    err = err / nsamples
    err.backward()

    # plain gradient-descent step on both weight matrices
    W1.data -= learning_rate * W1.grad.data
    W2.data -= learning_rate * W2.grad.data

    if it % 100 == 0:
        print(err.item())
        # plot the current gradients of W1 and W2
        grad_w1 = W1.grad.data.numpy().flatten()
        grad_w2 = W2.grad.data.numpy().flatten()
        fig, a = plt.subplots(1, 2, figsize=(6, 3))
        a[0].plot(grad_w1)
        a[1].plot(grad_w2)
        plt.show()

    # reset the gradients before the next iteration
    W1.grad.data.zero_()
    W2.grad.data.zero_()

What I notice is that the gradients go to 0 and the network does not manage to learn the solution. I had been following an example in Julia, and there the code seems to work. Is there something wrong in my specification? If someone could point me in the right direction, that would be great. This is my second attempt at solving this: I also tried nn.Module with nn.Linear layers and the Adam optimizer, and the same problem appears there too.