Hi guys, I currently have a model with a lot of classes in the output layer (20k classes) and I’m having difficulty using DataParallel, mainly because the first GPU is going OOM.

Is it possible to use DataParallel but do the aggregation on the CPU instead of the GPU? If not, is there a way to have some sort of mix between data and model parallelism?

The increased memory usage on the default device is expected, since the larger batch is scattered from this device.
We generally recommend using DistributedDataParallel with a single process per device, as described here, for the best performance (and to avoid the imbalanced memory usage).
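For reference, a minimal single-process sketch of that setup (the model, feature size, and gloo/CPU backend here are placeholders so the snippet runs standalone; in a real multi-GPU run you would launch one process per GPU with `torchrun`, use the `nccl` backend, and pass `device_ids`):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process defaults so the sketch runs standalone; under a real
# launcher (e.g. torchrun) these environment variables are set for you.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group("gloo")  # use "nccl" for multi-GPU training

model = torch.nn.Linear(512, 20000)   # stand-in for the real net
ddp_model = DDP(model)                # pass device_ids=[local_rank] on GPU
out = ddp_model(torch.randn(4, 512))  # each process sees its own shard

dist.destroy_process_group()
```

With one process per GPU, each process only holds its own batch shard, so no single device pays the aggregation cost that DataParallel puts on GPU 0.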

Thanks @ptrblck, I will try it right away. Besides that, do you know of any other technique to avoid having one FC layer with a huge number of output neurons (i.e. num_classes = 20k)?

Initially I was thinking of returning a pair of numbers that would somehow map to the class I needed to classify, but I haven’t tried that yet (like turning the problem into a regression problem).

@leonardoaraujosantos you could try predicting bit-vectors, i.e. create N = np.ceil(np.log2(num_classes)) = 15 outputs (for num_classes = 20k), with each one being a sigmoid. The loss would be per-bit binary cross-entropy, and the output probabilities would be converted back to 0s/1s using some threshold.

Empirically, this worked well on some experiments with small datasets (MNIST and Fashion-MNIST) but of course, they only have 10 classes.
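A sketch of what such an output head could look like (the 128-dim feature size and the `Sequential` head are made up for the demo; only the 15-bit width follows from num_classes = 20k):

```python
import math
import torch
import torch.nn as nn

num_classes = 20000
n_bits = math.ceil(math.log2(num_classes))  # 15 bits are enough for 20k classes

# Hypothetical head: one linear layer with a sigmoid output per bit.
head = nn.Sequential(nn.Linear(128, n_bits), nn.Sigmoid())
criterion = nn.BCELoss()  # per-bit binary cross-entropy

features = torch.randn(4, 128)  # pretend backbone features
probs = head(features)          # shape (4, 15), values in (0, 1)
bits = (probs > 0.5).long()     # threshold back to 0/1
```

This shrinks the classifier from 20k output neurons to 15, at the cost of predicting every bit correctly to get the class right.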

Cool Sanjay, that’s a very nice idea. Do you know of any paper that presents this idea (or an implementation)?
I wonder if it would be good to implement some kind of error correction (like adding redundancy bits) to avoid losing accuracy.
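The simplest version of that redundancy idea is a toy repetition code (just an illustration of the concept, not from any paper): emit each bit r times and take a per-bit majority vote when decoding, so any single flipped copy of a bit gets corrected.

```python
import numpy as np

def encode(bits, r=3):
    """Repeat every bit r times (adds redundancy)."""
    return np.repeat(bits, r)

def decode(codeword, r=3):
    """Majority-vote each group of r copies back to one bit."""
    return (codeword.reshape(-1, r).sum(axis=1) > r // 2).astype(int)

bits = np.array([1, 0, 1])
noisy = encode(bits).copy()
noisy[1] = 0                 # corrupt one copy of the first bit
recovered = decode(noisy)    # majority vote fixes the flip
```

Real error-correcting output codes would use a cleverer code than repetition, but the trade-off is the same: more output neurons in exchange for tolerance to per-bit mistakes.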

I don’t know of a paper (although I am sure this idea has been revisited many times before), but in terms of implementation, something like the following would work:

Functions to convert integer labels back and forth between base-10 and binary representations:

import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def convert_to_binary(labels, width=15):
    '''Convert base-10 integer labels to binary bit-vectors.
    `width` must satisfy 2**width >= num_classes (15 bits cover 20k classes).
    '''
    binrep = []
    for d in labels:
        binrep.append([int(i) for i in np.binary_repr(d, width=width)])
    return torch.tensor(binrep, dtype=torch.float32, device=device)

def convert_to_decimal(binrep):
    '''Convert binary bit-vectors back to base-10 integer labels.
    '''
    labels = []
    for d in binrep:
        labels.append(int("".join(str(int(i)) for i in d), base=2))
    return labels
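A quick round-trip sanity check, self-contained and using the same `np.binary_repr` / `int(..., base=2)` calls as the helpers above (the labels and 15-bit width are just example values):

```python
import numpy as np

width = 15                # enough for 20k classes (2**15 = 32768)
labels = [0, 42, 19999]
# encode each label to a list of bits, then decode it back
bits = [[int(b) for b in np.binary_repr(x, width=width)] for x in labels]
decoded = [int("".join(str(b) for b in row), base=2) for row in bits]
assert decoded == labels  # encode -> decode is lossless
```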

and the training loop would just be:

for epoch in range(N_epochs):
    loss_list = []
    for idx, (data_example, data_target) in enumerate(train_dataloader):
        data_example = data_example.to(device)
        # convert integer labels to bit-vector targets (already on `device`)
        data_target = convert_to_binary(data_target)
        pred = net(data_example)
        loss = criterion(pred, data_target)  # e.g. nn.BCELoss()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_list.append(loss.item())

The predictions (for an input of N examples) would be a tensor of shape (N, k), where k = ceil(log2(num_classes)), with all values being floats between 0 and 1. To compute metrics like accuracy, you would have to convert them back to decimals (or convert the labels to binary): first threshold the scores (say, at 0.5) to turn them into 0s and 1s, then convert the bits to decimals using convert_to_decimal above. Hope this helps!
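For example, exact-match accuracy on the bit-vectors could be computed like this (toy numbers; the real k would be 15):

```python
import torch

k = 4  # toy bit width for the demo
scores = torch.tensor([[0.9, 0.1, 0.8, 0.2],
                       [0.4, 0.6, 0.7, 0.3]])  # sigmoid outputs
targets = torch.tensor([[1., 0., 1., 0.],
                        [0., 1., 1., 1.]])     # true bit-vectors

pred_bits = (scores > 0.5).float()         # threshold at 0.5
exact = (pred_bits == targets).all(dim=1)  # all k bits must match
accuracy = exact.float().mean().item()     # here: 0.5
```

Note a prediction only counts as correct when every bit matches, so the per-class accuracy is stricter than the per-bit accuracy.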