Hi,
I have a large dataset with 118 numerical features and 957920 samples in total, which is almost 1 million rows. All columns are floats. I am using an autoencoder to reduce the dimensionality of the 118 features (the network below has a 60-unit bottleneck).
Roughly 1 million rows with 118 features is too much for Google Colab (12 GB RAM with 1 GPU node), and I don't have a GPU-enabled machine, so I resampled the data by 50%, which brings it down to about 0.48 million rows.
Here is my code for resampling:
X1=np.array(pd.DataFrame(X).sample(frac=0.5,random_state=41))
X1.shape
(478960, 118)
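As a rough sanity check on the raw size of the feature matrix, here is the arithmetic for the shapes quoted above. This is only a sketch: it assumes the array holds nothing but these values and ignores any extra copies made by pandas, the noise step, or the DataLoader, which can multiply the real footprint.

```python
# Rough memory footprint of the full feature matrix (shapes taken from the post).
n_rows, n_cols = 957_920, 118

bytes_f64 = n_rows * n_cols * 8  # float64 uses 8 bytes per value
bytes_f32 = n_rows * n_cols * 4  # float32 uses 4 bytes per value

print(f"float64: {bytes_f64 / 1e6:.0f} MB")
print(f"float32: {bytes_f32 / 1e6:.0f} MB")
```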
Here is my autoencoder network:
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 118 -> 100 -> 80 -> 60 (latent space)
        self.encoder = nn.Sequential(
            nn.Linear(118, 100),
            nn.BatchNorm1d(100),
            nn.LeakyReLU(0.01),
            nn.Linear(100, 80),
            nn.BatchNorm1d(80),
            nn.LeakyReLU(0.01),
            nn.Linear(80, 60)
        )
        # Decoder: 60 -> 80 -> 100 -> 118 (reconstruction)
        self.decoder = nn.Sequential(
            nn.Linear(60, 80),
            nn.LeakyReLU(0.01),
            nn.Linear(80, 100),
            nn.LeakyReLU(0.01),
            nn.Linear(100, 118)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

    def get_encoder_state(self, x):
        # Return only the latent representation
        encoded = self.encoder(x)
        return encoded
I am not using a plain vanilla autoencoder. I introduce swap noise during dataset creation: in each batch, some percentage of the values are replaced with values taken from other rows. For example, if the full sample is
[[1 2]
[3 4]
[5 6]
[7 8]]
and it is split into 2 batches with 20% swap noise, the first batch looks as follows
tensor([[1., 6.],
[3., 4.]], device='cuda:0')
target
tensor([[1., 2.],
[3., 4.]], device='cuda:0')
Above, the first tensor has had a value swapped in at random from another row; the donor row can be any row in the whole sample set, not only the current batch. The first tensor is fed forward through the network, and the output is compared against the second, original tensor to compute the loss, so it works like an augmentation.
The second iteration has no swapping, because 20% of 4 rows is 0.8, which rounds to a single swap, and that swap landed in the first batch
tensor([[5., 6.],
[7., 8.]], device='cuda:0')
target
tensor([[5., 6.],
[7., 8.]], device='cuda:0')
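For reference, the kind of swap noise described above can be sketched as a single vectorized operation on a batch tensor. This is not my actual dataset code: the function name `swap_noise`, the `pool` argument, and the choice to corrupt individual cells with probability `p` are assumptions made for illustration, and `batch` and `pool` are assumed to live on the same device.

```python
import torch

def swap_noise(batch: torch.Tensor, pool: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Replace each cell, with probability p, by the value of the same
    column taken from a randomly chosen row of `pool`."""
    n_rows, n_cols = batch.shape
    device = batch.device
    # Boolean mask marking the cells to corrupt
    mask = torch.rand(n_rows, n_cols, device=device) < p
    # One random donor row per cell
    donor_rows = torch.randint(0, pool.shape[0], (n_rows, n_cols), device=device)
    # Matching column index per cell, so donors come from the same feature
    col_idx = torch.arange(n_cols, device=device).expand(n_rows, n_cols)
    donors = pool[donor_rows, col_idx]
    # Keep the original value where mask is False, the donor value where True
    return torch.where(mask, donors, batch)

torch.manual_seed(0)
full = torch.tensor([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])
target = full[:2]                         # clean rows used as the loss target
noisy = swap_noise(target, full, p=0.2)   # corrupted rows fed to the network
print(noisy)
print(target)
```

Because the whole batch is corrupted with a few tensor operations, this could run per batch inside the training loop rather than row by row while building the dataset.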
Hence the network is a denoising autoencoder (DAE). The problem is that each iteration takes at least 3 to 4 minutes on Google Colab, even with 50% of the dataset (about 479 thousand rows instead of about 958 thousand) and a batch size of 1000. One epoch is 478960 / 1000 ≈ 479 iterations and I want to run 40 epochs, so imagine how long it takes to complete even one epoch, and I don't want to judge the model based on a single epoch.
I can see the loss decreasing in each iteration, which is not the case with a pure vanilla autoencoder, so the latent space seems to be learning something meaningful from the swap noise.
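To put numbers on this, here is the arithmetic for one epoch using the figures above (478960 rows, batch size 1000, 3 to 4 minutes per iteration). It assumes the last partial batch is kept, which is why the iteration count is rounded up.

```python
import math

n_samples = 478_960
batch_size = 1000
# 478960 / 1000 = 478.96, so a partial last batch makes it 479 iterations
iters_per_epoch = math.ceil(n_samples / batch_size)

for minutes_per_iter in (3, 4):
    hours = iters_per_epoch * minutes_per_iter / 60
    print(f"{minutes_per_iter} min/iter -> about {hours:.0f} h per epoch")
```

That works out to roughly a day or more for a single epoch, so 40 epochs is not feasible at this speed.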
seed = 4
torch.manual_seed(seed)
num_epochs = 40
outputs = []
running_loss = 0.0
for epoch in range(num_epochs):
    for data in data_loader:
        inputs, targets = data['x'].to(device), data['y'].to(device)
        recon = autoencoder(inputs)
        loss = criterion(recon, targets)  # calculate loss
        optimizer.zero_grad()
        loss.backward()  # gradients = backward pass
        optimizer.step()  # update weights
        #if not scheduler.__class__ == torch.optim.lr_scheduler.ReduceLROnPlateau:
        #    my_lr_scheduler.step()
        running_loss += loss.item()
        print(f'Epoch: {epoch+1}, Loss:{loss.item():.4f}, Running Loss:{running_loss}')
This slowdown only happens when I apply swap noise during dataset creation. Without swap noise it is far faster: one iteration takes about 1 second with the same records and batch size, but the loss becomes stagnant after some epochs. How can I improve performance?
Is it OK to go from a 50% random sample down to a 20% sample? I am worried about losing information from the data that way. How can I speed up execution for the scenario above?
Here are the first few iterations of the first epoch as evidence that the loss is decreasing. This is not the case with a simple autoencoder, where the loss fluctuates and becomes stagnant after some epochs.
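One way to confirm that the time goes into batch creation rather than the forward and backward pass is to time the loader by itself. The sketch below is only illustrative: `time_loader` is a hypothetical helper, and the random `TensorDataset` is a stand-in so the snippet runs on its own; in practice the real `data_loader` would be passed in instead.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

def time_loader(loader, max_batches=5):
    """Average seconds per batch spent only on fetching data (no model)."""
    start = time.perf_counter()
    n_seen = 0
    for _ in loader:
        n_seen += 1
        if n_seen >= max_batches:
            break
    return (time.perf_counter() - start) / max(n_seen, 1)

# Stand-in data with the same number of features as the post (118 columns)
stand_in = TensorDataset(torch.randn(10_000, 118))
stand_in_loader = DataLoader(stand_in, batch_size=1000, shuffle=True)

sec_per_batch = time_loader(stand_in_loader)
print(f"{sec_per_batch:.4f} s per batch for data loading alone")
```

If this number is already in the range of minutes for the swap-noise loader, the model is not the bottleneck.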
Epoch: 1, Loss:0.2868, Running Loss:0.2867644727230072
Epoch: 1, Loss:0.1375, Running Loss:0.42426012456417084
Epoch: 1, Loss:0.0921, Running Loss:0.5163918137550354
Epoch: 1, Loss:0.0722, Running Loss:0.5885843113064766
Epoch: 1, Loss:0.0567, Running Loss:0.6453189551830292
Epoch: 1, Loss:0.0537, Running Loss:0.6990276090800762
Epoch: 1, Loss:0.0506, Running Loss:0.7496537789702415
Epoch: 1, Loss:0.0496, Running Loss:0.7992095649242401
Epoch: 1, Loss:0.0480, Running Loss:0.8471624776721001
Epoch: 1, Loss:0.0479, Running Loss:0.8950740993022919
Epoch: 1, Loss:0.0464, Running Loss:0.9414540380239487
Epoch: 1, Loss:0.0453, Running Loss:0.9867834560573101
Epoch: 1, Loss:0.0447, Running Loss:1.0314723961055279
Epoch: 1, Loss:0.0441, Running Loss:1.0755952261388302
Epoch: 1, Loss:0.0439, Running Loss:1.1194915883243084
Epoch: 1, Loss:0.0432, Running Loss:1.1627228409051895
Epoch: 1, Loss:0.0431, Running Loss:1.2058349698781967
Epoch: 1, Loss:0.0425, Running Loss:1.2483432218432426