End-to-end training of encoder-decoder architectures

Hi,
I am going through the advanced PyTorch tutorial on image captioning.


The training module combines the parameters of the CNN and the RNN as follows and passes them to the optimizer:

criterion = nn.CrossEntropyLoss()
params = list(decoder.parameters()) + list(encoder.linear.parameters()) + list(encoder.bn.parameters())
optimizer = torch.optim.Adam(params, lr=args.learning_rate)

# Train the Models
total_step = len(data_loader)
for epoch in range(args.num_epochs):
    for i, (images, captions, lengths) in enumerate(data_loader):
        
        # Set mini-batch dataset
        images = to_var(images, volatile=True)
        captions = to_var(captions)
        targets = pack_padded_sequence(captions, lengths, batch_first=True)[0]
        
        # Forward, Backward and Optimize
        decoder.zero_grad()
        encoder.zero_grad()
        features = encoder(images)
        outputs = decoder(features, captions, lengths)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

Is this considered end-to-end training? It looks as if the CNN and the RNN are being trained separately to minimize the loss function, and I cannot see how the gradients from the RNN flow back to the CNN. Does PyTorch take care of propagating the gradient from the RNN to the CNN, or does it just use the gradient of the loss to update all the parameters at once?
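
To make the question concrete, here is a minimal sketch of how one might check whether gradients reach the encoder. It is not the tutorial's code: `TinyDecoder`, the small convolutional encoder, the tensor shapes, and the helper `encoder_gets_gradients` are all hypothetical stand-ins, and it assumes a recent PyTorch where plain tensors are used instead of `Variable`. The idea is that if the decoder consumes the encoder's output inside the same autograd graph, a single `loss.backward()` should populate `.grad` on the encoder parameters, while detaching the features should leave them empty.

```python
import torch
import torch.nn as nn


class TinyDecoder(nn.Module):
    """Hypothetical stand-in for the tutorial's RNN decoder."""

    def __init__(self, feat=8, hidden=16, vocab=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, feat)
        self.rnn = nn.LSTM(feat, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, features, captions):
        # Feed the image feature as the first input step, followed by
        # the embedded caption tokens (all but the last one).
        inputs = torch.cat(
            [features.unsqueeze(1), self.embed(captions[:, :-1])], dim=1
        )
        hidden, _ = self.rnn(inputs)
        return self.out(hidden)


def encoder_gets_gradients(detach_features):
    """Return True if every encoder parameter has a .grad after backward()."""
    torch.manual_seed(0)
    # Hypothetical stand-in for the CNN encoder.
    encoder = nn.Sequential(
        nn.Conv2d(3, 4, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(4, 8),
    )
    decoder = TinyDecoder()

    images = torch.randn(2, 3, 8, 8)          # fake image batch
    captions = torch.randint(0, 10, (2, 5))   # fake token ids

    features = encoder(images)
    if detach_features:
        # Cut the autograd graph between encoder and decoder.
        features = features.detach()

    logits = decoder(features, captions)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, 10), captions.reshape(-1)
    )
    loss.backward()
    return all(p.grad is not None for p in encoder.parameters())


print("connected graph:", encoder_gets_gradients(detach_features=False))
print("detached features:", encoder_gets_gradients(detach_features=True))
```

If the first call reports that the encoder received gradients and the second reports that it did not, that would suggest autograd handles the RNN-to-CNN flow by itself, and that the optimizer only decides which parameters actually get updated (in the snippet above, the decoder plus `encoder.linear` and `encoder.bn`).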

Also, is there an example that implements the backward pass in PyTorch for encoder-decoder architectures?
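
For reference, this is the kind of example I have in mind, written as a rough sketch under my own assumptions rather than anything taken from the tutorial. The two `nn.Linear` layers are hypothetical stand-ins for the CNN encoder and the RNN decoder, and `cut` is a name I made up for a detached copy of the features. The sketch compares the usual single `loss.backward()` with a two-stage backward pass in which the gradient with respect to the features is handed to the encoder by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

encoder = nn.Linear(6, 4)   # hypothetical stand-in for the CNN encoder
decoder = nn.Linear(4, 3)   # hypothetical stand-in for the RNN decoder
criterion = nn.CrossEntropyLoss()

x = torch.randn(5, 6)               # fake inputs
y = torch.randint(0, 3, (5,))       # fake class targets

# (a) Usual approach: one graph, one backward call.
loss = criterion(decoder(encoder(x)), y)
loss.backward()
auto_grad = encoder.weight.grad.clone()

# (b) Two-stage backward: split the graph at the features.
encoder.zero_grad()
decoder.zero_grad()
features = encoder(x)
cut = features.detach().requires_grad_(True)  # new leaf, no link to encoder
loss = criterion(decoder(cut), y)
loss.backward()                  # stops at `cut`; cut.grad = dLoss/dfeatures
features.backward(cut.grad)      # chain rule: push that gradient into the encoder
manual_grad = encoder.weight.grad.clone()

same = torch.allclose(auto_grad, manual_grad)
print("manual and automatic encoder gradients match:", same)
```

If the two gradients match, that would indicate the manual hand-off is just what autograd already does when the encoder output is passed directly to the decoder, so no separate backward implementation should be needed.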