I’m working with a dataset of tens of thousands of high-frequency time series, meaning each segment of one second of data is ~50 MB. I’m trying to avoid downsampling or other lossy strategies if at all possible. This obviously presents significant computational challenges, one of which is that any copying of the data incurs a significant time penalty.
Say the data is a 2D array where the rows are samples and the columns are the time series, and I have it all loaded into RAM (in reality, I have much more data than can fit in even regular memory, but let’s leave that aside for now).
In theory, this should mean I can use views into the full array to work with subsets of it rather than copying data, at least when they’re contiguous. And although I have much less GPU memory than regular memory, the same general principle should apply if the dataset is on the GPU.
One place where I can’t think of a way to avoid copying is in constructing minibatches of segments of data, where each segment is a contiguous stretch of, say, one second of samples from every time series – i.e. a slice of the rows of the original array – but the segments making up the batch aren’t necessarily contiguous with each other.
Is there any way to do something like constructing a tensor for a batch where the first (batch size) dimension is actually just indexing views of row slices of another tensor?
Or, perhaps there wouldn’t be much benefit to using minibatches anyway, and I should just stick with pure (single-sample) stochastic gradient descent. It’s also entirely possible that I’m overly focused on this particular issue when others would make it moot (e.g. the computation will be much slower than the copying). At the moment I’m not using DataLoader, Dataset, or any other PyTorch data management tools, but if they would address this issue I’d be happy to hear it. Any thoughts on this subject would be much appreciated.
Your analysis of when you copy or not is quite accurate.
- If your dataset is of size 100, for example, and you only want to work with contiguous batches of size 10, you can actually create a Dataset of size 10, where each index corresponds to a slice of 10 samples in the true dataset. You can then get these batches randomly with your DataLoader.
- If you use Dataset/DataLoader, you can use the built-in multiprocessing to load samples from disk asynchronously in multiple processes. See for example the ImageFolder Dataset, which loads images one by one from disk and is used for datasets like ImageNet. The multiprocessing allows such a loader to still feed data to the GPU fast enough to keep it busy.
Thanks for the reply!
Re your first point: so in my case (assuming all my data is in RAM), I could create a Dataset object that represents the entire dataset, as well as a DataLoader to generate batches from the dataset. I could set it up so that each batch returned by the DataLoader is a 3D tensor with dimensions (batch_size, n_samples, n_time_series), and each [i, :, :] slice of the batch is a view of a slice [j:j+n_samples, :] of my original 2D array, so data in the original array does not need to be copied to create the batch, and I can pass the batch through my model function as if it were a normal tensor. Is that accurate?
Re your second point, I think in my case disk read speed is more likely to be the bottleneck than the number of CPUs. I guess for e.g. JPEG images that need to be decompressed, the bottleneck is compute rather than I/O, so more processes would help, but in some cases multiple processes simultaneously trying to load e.g. multiple .npy arrays from a hard disk might actually hurt performance, right?
Re first point:
If you create a Dataset that returns one sample for each index, then the DataLoader will ask for the elements one by one and then concatenate them, so copies will occur.
You can get around this by making the Dataset return a whole batch every time one sample is requested, and making the DataLoader load samples one by one (but one “sample” from the dataset is actually a batch).
This is a hack around the existing machinery that lets you avoid copies.
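A minimal sketch of this hack (sizes and names are made up; it assumes the whole array is already a single tensor in RAM, a PyTorch version recent enough to support `batch_size=None`, and `num_workers=0`, since worker processes would serialize and hence copy the batches):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SliceBatchDataset(Dataset):
    """Each index maps to one contiguous block of rows of the big tensor."""
    def __init__(self, data, batch_rows):
        self.data = data                # full (n_samples, n_series) tensor
        self.batch_rows = batch_rows

    def __len__(self):
        return self.data.shape[0] // self.batch_rows

    def __getitem__(self, idx):
        # narrow() returns a view into self.data: no rows are copied
        return self.data.narrow(0, idx * self.batch_rows, self.batch_rows)

data = torch.zeros(100, 8)                    # 100 samples, 8 time series
ds = SliceBatchDataset(data, batch_rows=10)   # 10 "items", each a 10-row batch
# batch_size=None disables automatic collation, so each item comes back
# exactly as __getitem__ produced it: a view, not a collated stack
loader = DataLoader(ds, batch_size=None, shuffle=True)

batch = next(iter(loader))
start = data.data_ptr()
end = start + data.numel() * data.element_size()
print(batch.shape)                       # torch.Size([10, 8])
print(start <= batch.data_ptr() < end)   # True: the batch aliases `data`
```

The pointer check at the end is just a quick way to confirm the batch lives inside the original tensor’s storage rather than in freshly allocated memory.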
Re second point:
Yes, if you don’t do any preprocessing then it might be I/O bound. Unfortunately, there is not much you can do here but get a faster disk.
Ok, thanks. By the way, just to make sure neither of us is misinterpreting the other’s terminology: since my data is time series, I was using the term “sample” to refer to a single measurement element of one or more time series (equivalent to one pixel for image data). But I want to train on a segment (or batch of segments) of, say, one second of contiguous samples (equivalent to one image). I’ll refer to each element of a batch as a time segment.
So if I make the Dataset return a whole batch, as you describe, is it possible to have the Dataset make the batch out of multiple time segments, where the segments aren’t contiguous with each other (but the samples within each segment are), and avoid copying?
I’m afraid it is not. It will be once NestedTensor is merged, though.
If you want to make sure not to do any extra copy, I would advise not to use advanced indexing (like data[i, j, :]) and to use only the .select() and .narrow() methods. If you only use these two methods, you are 100% sure you’re not doing any copy. You will see that if you cannot use these methods to do what you want, it usually means that you have to copy.
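For concreteness, a quick check (arbitrary sizes) that `narrow()` and `select()` return views sharing the original storage, whereas advanced indexing allocates a copy:

```python
import torch

data = torch.arange(24.0).reshape(6, 4)   # 6 samples, 4 time series

seg = data.narrow(0, 2, 3)    # rows 2..4 -- a view, not a copy
ts = data.select(1, 0)        # first time series (column 0) -- also a view
seg[0, 0] = -1.0              # writing through the view...
print(data[2, 0].item())      # -1.0: ...mutates the original
print(ts[2].item())           # -1.0: visible through the other view too

fancy = data[torch.tensor([0, 3])]   # advanced indexing -- allocates a copy
fancy[0, 0] = 99.0
print(data[0, 0].item())      # 0.0: the original is untouched
```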
I think I could use .slice() to get the views of the original array that I want, but that doesn’t solve the problem of compiling them together into a batch tensor that can be propagated through a model without copying data. NestedTensor looks like a useful project – I hope it continues development.
I don’t think .slice() exists, but .narrow() can be used to get a set of contiguous indices.
There is no way currently to aggregate Tensors that don’t represent contiguous storage into another Tensor.
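To illustrate the point (made-up sizes): gathering two non-adjacent row slices into one batch tensor forces a copy, because a single strided tensor needs one contiguous storage layout:

```python
import torch

data = torch.zeros(100, 8)
seg_a = data.narrow(0, 10, 5)   # rows 10..14 -- a view
seg_b = data.narrow(0, 50, 5)   # rows 50..54 -- a view, not adjacent to seg_a

batch = torch.stack([seg_a, seg_b])   # shape (2, 5, 8): new storage
batch[0, 0, 0] = 123.0                # writing to the batch...
print(data[10, 0].item())             # 0.0: ...does not touch the original
```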
NestedTensor is actively developed and should be fully implemented in the coming months.