Sparse Optimization DataLoader

I have put together a small gist with a self-contained experiment:

I am working with high-dimensional data (~50000 features). The profile below is from the training procedure with num_workers = 0.

I am wondering how much I could optimize the DataLoader iteration step, which looks quite time-consuming to me.
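For concreteness, here is roughly the kind of setup I mean; this is an illustrative sketch rather than the exact gist code (SparseDataset, the sizes and the density are placeholders). Each 50000-dimensional sparse row is densified per sample inside __getitem__, which is where I suspect the time goes:

```python
import numpy as np
import scipy.sparse as sp
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader


class SparseDataset(Dataset):
    """Holds a CSR matrix and densifies one row per __getitem__ call."""

    def __init__(self, csr, labels):
        self.csr = csr
        self.labels = labels

    def __len__(self):
        return self.csr.shape[0]

    def __getitem__(self, idx):
        # Densifying a ~50000-dim row here is the per-sample cost
        # I am worried about.
        x = torch.from_numpy(self.csr[idx].toarray().ravel()).float()
        y = torch.tensor(self.labels[idx]).float()
        return x, y


n_samples, n_features = 10_000, 50_000
X = sp.random(n_samples, n_features, density=0.001, format="csr", dtype=np.float32)
y = np.random.rand(n_samples).astype(np.float32)

loader = DataLoader(SparseDataset(X, y), batch_size=128, num_workers=0)
model = nn.Linear(n_features, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for xb, yb in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(xb).squeeze(1), yb)
    loss.backward()
    optimizer.step()
```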

Adding more workers doesn't seem to change anything, even with torch.multiprocessing.set_start_method("spawn") (which, by the way, had to be called with force=True, otherwise it raises a "context already set" error).
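Concretely, the multi-worker variant looked roughly like this (a sketch, not the exact code: the worker count is arbitrary, and SparseDataset / X / y refer to the sketch above):

```python
import torch.multiprocessing as mp
from torch.utils.data import DataLoader

# Without force=True this raises the "context already set" error,
# presumably because a start method was already initialised earlier.
mp.set_start_method("spawn", force=True)

loader = DataLoader(
    SparseDataset(X, y),  # same dataset as in the sketch above
    batch_size=128,
    num_workers=4,        # worker count chosen arbitrarily for the sketch
)
```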

I had a blog article in mind (Optimizing PyTorch training code - Sagivtech), where they claim that "no time is spent on data-loading / preprocessing", but following it didn't make things any faster in my case.
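If I understand the article correctly, the idea is to overlap data loading (done by worker processes) with GPU compute, using pinned memory and asynchronous host-to-device copies. Roughly this pattern is what I tried (my sketch, not code from the article; it assumes a CUDA device and reuses the names from the first sketch):

```python
loader = DataLoader(
    SparseDataset(X, y),
    batch_size=128,
    num_workers=4,
    pin_memory=True,  # page-locked host memory for faster H2D copies
)

device = torch.device("cuda")
model = nn.Linear(n_features, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for xb, yb in loader:
    # non_blocking=True lets the copy overlap with GPU compute
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(xb).squeeze(1), yb)
    loss.backward()
    optimizer.step()
```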

When I reduce the dimensionality significantly, say to 1000 (instead of 50000), the DataLoader accounts for only a minor part of the overall training time. So, is the behaviour in the first case expected, or is there any way to reduce that time?