Big Data & Performance


I have a large dataset with 118 numerical features and 957,920 samples in total, almost one million. All columns are floats. I am using an autoencoder to reduce the dimensionality from 118 to 80 features, roughly a one-third reduction.

Nearly one million rows with 118 features is too much for Google Colab (12 GB RAM with one GPU node), and I don't have a GPU-enabled machine of my own, so I randomly resampled 50% of the data, which brings it down to about 479 thousand rows.

Here is my code for resampling


(478960, 118)

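Here is my autoencoder network:

import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()  # must run before submodules are assigned
        self.encoder = nn.Sequential(
            # encoder layers were not included in the post
        )
        self.decoder = nn.Sequential(
            # decoder layers were not included in the post
        )
    def forward(self,x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
    def get_encoder_state(self,x):
        # return only the latent representation
        encoded = self.encoder(x)
        return encoded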

I am not using a plain vanilla autoencoder. I introduce swap noise during dataset creation: in each batch, some percentage of the values are swapped with values taken from other rows. For example, if the full sample is

[[1 2]
 [3 4]
 [5 6]
 [7 8]]

and it is split into 2 batches with 20% swap noise, the first batch looks as follows:

 tensor([[1., 6.],
        [3., 4.]], device='cuda:0')
 tensor([[1., 2.],
        [3., 4.]], device='cuda:0')

Above, the first tensor has had values swapped in at random from other rows of the whole sample set. This first (noisy) tensor is fed forward through the network, and the output is compared against the second (original) tensor to compute the loss. It works like an augmentation.

The second iteration has no swapping, because 20% of 4 rows is 0.8, which rounds to a single swap, and that one fell in the first batch:

tensor([[5., 6.],
        [7., 8.]], device='cuda:0')
 tensor([[5., 6.],
        [7., 8.]], device='cuda:0')
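
For readers who want to reproduce the idea, here is a minimal sketch of swap noise applied to a whole batch at once. It assumes NumPy arrays and one randomly chosen donor row per sample; the function name `swap_noise` and its parameters are illustrative and do not come from the post. Because the corruption is random, the exact output will differ from the tensors shown above.

```python
import numpy as np

def swap_noise(batch, pool, noise, rng):
    """Return a corrupted copy of `batch`.

    Each cell is replaced, with probability `noise`, by the value in the
    same column of a donor row drawn at random from `pool`.
    """
    n_rows, n_cols = batch.shape
    mask = rng.random((n_rows, n_cols)) < noise            # cells to corrupt
    donor_rows = rng.integers(0, pool.shape[0], size=n_rows)
    donors = pool[donor_rows]                              # one donor row per sample
    return np.where(mask, donors, batch)                   # new array, inputs untouched

full = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=np.float32)
rng = np.random.default_rng(4)

clean = full[:2]                                           # first batch of two rows
noisy = swap_noise(clean, full, noise=0.2, rng=rng)
print(noisy)   # network input
print(clean)   # reconstruction target
```

The mask and the donor lookup are both array operations, so there is no Python loop over rows.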

Hence the network is a denoising autoencoder (DAE). The problem is that each iteration takes at least 3 to 4 minutes on Google Colab, even with the 50% sample (478,960 of the 957,920 rows) and a batch size of 1000. One epoch is 478960 / 1000 ≈ 479 iterations, and I want to run 40 epochs. Imagine how long it will take to complete even one epoch, and I don't want to judge the model based on a single epoch.

I see the loss decreasing in each iteration, which is not the case with a plain vanilla autoencoder. That suggests the latent space is learning something meaningful from the swap noise.
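
A quick back-of-the-envelope check, using only the figures quoted in the post (478,960 rows, batch size 1000, 3 to 4 minutes per iteration), puts a single epoch at roughly 24 to 32 hours. The variable names are illustrative.

```python
import math

n_samples = 478960
batch_size = 1000

# The last batch is partial, so round up.
iters_per_epoch = math.ceil(n_samples / batch_size)
print(f"iterations per epoch: {iters_per_epoch}")

for minutes_per_iter in (3, 4):
    hours = iters_per_epoch * minutes_per_iter / 60
    print(f"{minutes_per_iter} min/iteration: about {hours:.0f} hours per epoch")
```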

seed = 4

num_epochs = 40
outputs = []
running_loss = 0.0

for epoch in range(num_epochs):
    for data in data_loader:
        inputs, targets = data['x'].to(device), data['y'].to(device)
        optimizer.zero_grad()  # clear gradients left over from the previous step
        recon = autoencoder(inputs)  # forward pass on the noisy batch
        loss  = criterion(recon,targets) # calculate loss against the clean batch
        loss.backward()  # gradients = backward pass
        optimizer.step() # update weights 
        #if not  scheduler.__class__ ==  torch.optim.lr_scheduler.ReduceLROnPlateau:
        #    my_lr_scheduler.step()
        running_loss += loss.item()  # accumulates across all iterations and epochs
        print(f'Epoch: {epoch+1}, Loss:{loss.item():.4f}, Running Loss:{running_loss}')

This slowdown only happens when I apply swap noise during dataset creation. Without swap noise it is far faster: one iteration takes about 1 second with the same records and batch size, but the loss stagnates after some epochs. How can I improve performance?

Is it OK to reduce the random sample from 50% to 20%? I worry that this would lose information from the data. How can I speed up execution for the scenario above?

Here are a few iterations from the first epoch as evidence that the loss is decreasing. This is not the case with a simple autoencoder, where the loss fluctuates and then stagnates after some epochs.

Epoch: 1, Loss:0.2868, Running Loss:0.2867644727230072
Epoch: 1, Loss:0.1375, Running Loss:0.42426012456417084
Epoch: 1, Loss:0.0921, Running Loss:0.5163918137550354
Epoch: 1, Loss:0.0722, Running Loss:0.5885843113064766
Epoch: 1, Loss:0.0567, Running Loss:0.6453189551830292
Epoch: 1, Loss:0.0537, Running Loss:0.6990276090800762
Epoch: 1, Loss:0.0506, Running Loss:0.7496537789702415
Epoch: 1, Loss:0.0496, Running Loss:0.7992095649242401
Epoch: 1, Loss:0.0480, Running Loss:0.8471624776721001
Epoch: 1, Loss:0.0479, Running Loss:0.8950740993022919
Epoch: 1, Loss:0.0464, Running Loss:0.9414540380239487
Epoch: 1, Loss:0.0453, Running Loss:0.9867834560573101
Epoch: 1, Loss:0.0447, Running Loss:1.0314723961055279
Epoch: 1, Loss:0.0441, Running Loss:1.0755952261388302
Epoch: 1, Loss:0.0439, Running Loss:1.1194915883243084
Epoch: 1, Loss:0.0432, Running Loss:1.1627228409051895
Epoch: 1, Loss:0.0431, Running Loss:1.2058349698781967
Epoch: 1, Loss:0.0425, Running Loss:1.2483432218432426

Based on these statements, it seems the massive slowdown is caused by the “noise swap”? If so, could you post the code you are using for this transformation so that we can check why it’s so slow?


@ptrblck here is my code. I noticed that it is not utilising the GPU, even though I assign the GPU device to the model as well as to the dataset.

class SwapDataset:
    def __init__(self, features, noise,batch_size):
        self.features = features
        self.noise = noise
        self.batch_size = batch_size
    def __len__(self):
        return (self.features.shape[0])
    def __getitem__(self, idx):
        sample = self.features[idx, :].copy()
        sample = self.features.copy()  # replaces the row above with a copy of the whole array
        sample = self.swap_sample(sample)
        dct = {
            'x' : torch.tensor(sample[idx, :], dtype=torch.float).to(device),
            'y' : torch.tensor(self.features[idx, :], dtype=torch.float).to(device)
        }
        return dct
    def swap_sample(self,sample):
        num_samples = self.features.shape[0]
        num_features = self.features.shape[1]

        if len(sample.shape) == 2:
            # batch_size = sample.shape[0]
            random_row = np.random.randint(0, num_samples, size=self.batch_size)

            for i in range(self.batch_size):
                random_col = np.random.rand(num_features) < self.noise
                sample[i,random_col] = self.features[random_row[i], random_col]
        else:
            batch_size = 1
            random_row = np.random.randint(0, num_samples, size=self.batch_size)
            random_col = np.random.rand(num_features) < self.noise
            #sample[:,random_col] = self.features[random_row, random_col]

        return sample

device   = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_dataset = SwapDataset(X1,0.02,batch_size=1000) # Swapping data 
data_loader   = torch.utils.data.DataLoader(train_dataset, batch_size=1000, shuffle=False)

Based on your descriptions, you might be creating bottlenecks in the data processing, so a low GPU utilization might be expected.
Could you remove the GPU usage from the Dataset and profile the “swap” code?
I don’t know how large your tensors are, but it seems you are copying the entire self.features in each call and are also overwriting the previously indexed sample?
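
One way to act on this suggestion is sketched below, under a few assumptions: the features are a NumPy array, each requested row gets one randomly chosen donor row, and the dataset stays on the CPU. The class name `RowSwapDataset`, the `seed` argument, and the random stand-in data are hypothetical. The sketch touches only the requested row instead of copying the whole array, and times 1000 item lookups so the cost of the swap step can be measured in isolation.

```python
import time
import numpy as np

class RowSwapDataset:
    """Hypothetical map-style dataset that corrupts only the requested row."""

    def __init__(self, features, noise, seed=0):
        self.features = np.ascontiguousarray(features, dtype=np.float32)
        self.noise = noise
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return self.features.shape[0]

    def __getitem__(self, idx):
        target = self.features[idx]                           # view of one row, no copy
        donor = self.features[self.rng.integers(len(self))]   # random donor row
        mask = self.rng.random(target.shape[0]) < self.noise  # columns to swap
        noisy = np.where(mask, donor, target)                 # new 1-D array
        return {'x': noisy, 'y': target}

# Random stand-in data with the same number of columns as in the post.
features = np.random.default_rng(0).random((10_000, 118), dtype=np.float32)
dataset = RowSwapDataset(features, noise=0.02)

start = time.perf_counter()
for i in range(1000):
    item = dataset[i]
elapsed = time.perf_counter() - start
print(f"1000 items in {elapsed:.4f} s")
```

The items are plain NumPy arrays; the default collate function of `torch.utils.data.DataLoader` converts them to tensors, and the training loop above already moves each batch to the device with `.to(device)`.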