tensorboardX SummaryWriter not working when using gpu

pytorch: 1.1.0
tensorboardX: 2.1

The code is as follows:

import torch
from torch import nn
from torch.optim import Adam
from tensorboardX import SummaryWriter

device = "cuda" if torch.cuda.is_available() else "cpu"

net = Model()
net.to(device)
loss_fn = nn.BCELoss()  # MSELoss()
optimizer = Adam(params=net.parameters(), lr=0.0001, weight_decay=0.5)

writer = SummaryWriter("logs")
for epoch in range(50):
    for i, (x_batch, y_batch) in enumerate(train_loader):
        y_pred = net(x_batch.to(device))
        loss = loss_fn(y_pred, y_batch.to(device))

        writer.add_scalar("loss/train", loss, global_step=epoch * len(train_loader) + i)
        writer.flush()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

writer.close()

I want to see the loss change during training, not only after training ends.

First, I trained the model on a Titan Xp. When I check the event file during training, its size is 0, and the TensorBoard web page shows ‘No scalar data was found’. When I terminate the training, the event file grows to about 100 KB and I can see the loss curve in TensorBoard.

Then I trained the model on the CPU, and everything works fine. Can anyone tell me what the problem is?

Hello,
I have run into this problem before. In my case, writer.flush() solved it, but sometimes, even though several add_scalar calls have been made, the results still cannot be visualized.

Could the problem be the loss argument you pass to add_scalar? You are passing a torch.Tensor where a plain Python number is expected.
Maybe using loss.item() solves your issue?
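
For example (just a sketch of the one-line change; loss.item() returns a plain Python float instead of a CUDA tensor):

writer.add_scalar("loss/train", loss.item(), global_step=epoch * len(train_loader) + i)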

Cheers

Thank you. I tried it, but it still doesn’t work.

How about setting the max_queue argument to 1?

writer = SummaryWriter("logs", maxqueue=1)

https://tensorboardx.readthedocs.io/en/latest/tensorboard.html

The default is 10, so the writer should flush to disk after every 10 pending events… that’s weird anyway.

EDIT: I just found the flush_secs=120 default argument in the writer documentation.

Thank you. I just tried it. It still doesn’t work.

Like this?
writer = SummaryWriter("logs", max_queue=1, flush_secs=1)

Yes, I changed it to:

writer = SummaryWriter(check_point_dir, flush_secs=1, write_to_disk=True, max_queue=1)

It still doesn’t work. When I terminate the training or train on the CPU, the summary is written to disk. So weird.

I think the summary data sits in memory and just never gets written to disk.

Well, I think there is not much we can do for now to make this work. It seems a lot of people online have the same problem. See:
https://github.com/lanpa/tensorboardX/pull/451
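
One heavy-handed workaround you could try (just a sketch, assuming your training loop from above; I haven’t tested it on a GPU): close and recreate the writer every epoch. close() reliably writes the buffered events, and TensorBoard reads every event file in the log directory, so recreating the writer is safe.

from tensorboardX import SummaryWriter

writer = SummaryWriter("logs")
for epoch in range(50):
    for i, (x_batch, y_batch) in enumerate(train_loader):
        # ... forward / backward / step exactly as in your code ...
        writer.add_scalar("loss/train", loss.item(), global_step=epoch * len(train_loader) + i)
    # close() forces everything buffered so far onto disk;
    # a fresh writer then appends a new event file in the same directory
    writer.close()
    writer = SummaryWriter("logs")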

Thanks for your patient replies. I changed from tensorboardX import SummaryWriter to from torch.utils.tensorboard import SummaryWriter and now it works. That’s so weird.
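
In other words, the only change was the import; everything else stayed the same (torch.utils.tensorboard ships with PyTorch 1.1+ and has the same add_scalar API):

# from tensorboardX import SummaryWriter   # old import: events stayed in memory on GPU
from torch.utils.tensorboard import SummaryWriter  # new import: works for me

writer = SummaryWriter("logs")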

Thank you so much.
