Can anyone share the code of stack auto-encoder? Please


Click the first link:

Thanks, but it’s a simple auto-encoder, and I want the code for a stack autoencoder.

What is “stack autoencoder”?

Trying to implement this in PyTorch

It’s called unsupervised pretraining, and they do it one layer at a time. No one does it like that anymore, because you get the same or better accuracy by training the same network from scratch (random initialization).

If you still insist on doing it like that (why?), then the code from the link I posted above is sufficient. Just remove upper layers, then add them back one by one.
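For what it’s worth, a minimal sketch of the layer-at-a-time idea might look like the following. It is not the code from the linked example: the helper `pretrain_layer`, the `sizes` list, and the random `data` tensor (a stand-in for flattened, normalized MNIST batches) are all assumptions made for illustration. Each encoder/decoder pair is trained on the frozen output of the pairs before it, and the trained pieces are then stacked into one network.

```python
import torch
from torch import nn

torch.manual_seed(0)


def pretrain_layer(enc, dec, inputs, epochs=5, lr=1e-3):
    """Train one encoder/decoder pair to reconstruct `inputs` (hypothetical helper)."""
    params = list(enc.parameters()) + list(dec.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(dec(enc(inputs)), inputs)
        loss.backward()
        optimizer.step()
    return loss.item()


# Layer widths assumed to mirror a 784-128-64-12-3 MLP autoencoder
sizes = [28 * 28, 128, 64, 12, 3]
# Random values in [-1, 1] standing in for real image batches (assumption)
data = torch.rand(256, sizes[0]) * 2 - 1

encoders, decoders = [], []
features = data
for i, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
    is_last = i == len(sizes) - 2
    # No nonlinearity on the final code layer
    enc = nn.Sequential(nn.Linear(n_in, n_out),
                        nn.Identity() if is_last else nn.ReLU())
    # The first decoder rebuilds inputs in [-1, 1]; the others rebuild ReLU outputs
    dec = nn.Sequential(nn.Linear(n_out, n_in),
                        nn.Tanh() if i == 0 else nn.ReLU())
    pretrain_layer(enc, dec, features)
    encoders.append(enc)
    decoders.append(dec)
    with torch.no_grad():
        features = enc(features)  # frozen codes become the next layer's input

# Stack the pretrained pieces: encoders in order, decoders in reverse
stacked = nn.Sequential(*encoders, *reversed(decoders))
```

The resulting `stacked` model can then be fine-tuned as a whole with an ordinary end-to-end training loop.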

Seems like you’re a beginner, so I suggest you take an online course first ( is pretty good).


Of course, I’m a newcomer to this world :blush:. But I have to do it for some reasons. Here is the code you posted; could you please tell me which layers I have to remove and then add back one by one?

import os

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST
from torchvision.utils import save_image

# Directory for the reconstructed-image snapshots
if not os.path.exists('./mlp_img'):
    os.mkdir('./mlp_img')


def to_img(x):
    # Undo the [-1, 1] normalization and reshape to image tensors
    x = 0.5 * (x + 1)
    x = x.clamp(0, 1)
    x = x.view(x.size(0), 1, 28, 28)
    return x


num_epochs = 100
batch_size = 128
learning_rate = 1e-3

# MNIST is single-channel, so Normalize takes one mean and one std
img_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

dataset = MNIST('./data', transform=img_transform, download=True)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(True),
            nn.Linear(128, 64),
            nn.ReLU(True), nn.Linear(64, 12), nn.ReLU(True), nn.Linear(12, 3))
        self.decoder = nn.Sequential(
            nn.Linear(3, 12),
            nn.ReLU(True),
            nn.Linear(12, 64),
            nn.ReLU(True),
            nn.Linear(64, 128),
            nn.ReLU(True), nn.Linear(128, 28 * 28), nn.Tanh())

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x


# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = autoencoder().to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(
    model.parameters(), lr=learning_rate, weight_decay=1e-5)

for epoch in range(num_epochs):
    for data in dataloader:
        img, _ = data
        # Flatten each 28x28 image into a 784-dimensional vector
        img = img.view(img.size(0), -1).to(device)
        # ===================forward=====================
        output = model(img)
        loss = criterion(output, img)
        # ===================backward====================
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # ===================log========================
    print('epoch [{}/{}], loss:{:.4f}'
          .format(epoch + 1, num_epochs, loss.item()))
    if epoch % 10 == 0:
        pic = to_img(output.cpu().data)
        save_image(pic, './mlp_img/image_{}.png'.format(epoch))

torch.save(model.state_dict(), './sim_autoencoder.pth')

But why would you ever not opt for reducing the number of weights by 1/2 in your model? This is what an old-school stacked AE accomplishes. Can you at the very least point to the experiments / papers that demonstrate the superiority or non-inferiority of back-to-back training (without shared parameters), besides the statement that “all cool kids do it these days”?

What are you talking about? How do you reduce weights with stacked AEs?

Actually, I was really wrong on this one. My apologies.

Early instances of (denoising) AEs use exactly the same (transposed) weights for each decoder/encoder layer (but different biases). This is described in basically all DL textbooks; happy to send the references. Sharing the transposed weights allows you to reduce the number of parameters by about 1/2 (training each decoder/encoder pair one layer at a time).
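To make the parameter saving concrete, here is a small sketch of one tied-weight layer next to an untied one. The class name `TiedAutoencoderLayer` and the 784/128 sizes are assumptions chosen for illustration, not taken from any particular textbook or paper. The encoder uses a weight matrix `W`, the decoder uses its transpose, and each side keeps its own bias.

```python
import torch
from torch import nn
import torch.nn.functional as F


class TiedAutoencoderLayer(nn.Module):
    """One encoder/decoder pair sharing a single weight matrix (illustrative)."""

    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_hidden, n_in))
        nn.init.xavier_uniform_(self.weight)
        self.enc_bias = nn.Parameter(torch.zeros(n_hidden))
        self.dec_bias = nn.Parameter(torch.zeros(n_in))

    def forward(self, x):
        # Encoder: x @ W^T + b_enc
        code = torch.relu(F.linear(x, self.weight, self.enc_bias))
        # Decoder: code @ W + b_dec, i.e. the transposed weights
        return F.linear(code, self.weight.t(), self.dec_bias)


tied = TiedAutoencoderLayer(28 * 28, 128)
untied = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(),
                       nn.Linear(128, 28 * 28))

n_tied = sum(p.numel() for p in tied.parameters())
n_untied = sum(p.numel() for p in untied.parameters())
print(n_tied, n_untied)
```

The weight matrices are halved exactly; the biases are not shared, so the total parameter count drops to slightly more than half.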

My point of reference was an unsupervised clustering paper from facebook (that used denoising AEs as a start):

Even though they trained the denoising (stacked) AE one layer at a time, they used separate parameters for the decoder and the encoder layers. So I was clearly wrong.

However, my question remains the same: where is the blog post, experiment, or paper that shows that weight sharing (between encoding/decoding layers) in a stacked denoising AE gives absolutely no benefit? If you have a question about what a stacked AE is, just read the referenced paper.

I’ve never actually said anything about weight sharing. I was merely pointing out that unsupervised pre-training for classification is unnecessary, and that AEs can be trained end to end, rather than one layer at a time. This has nothing to do with weight sharing.

Regarding weight sharing, it might act as a regularizer, which could help when the model is overparametrized. For small models it could lead to underfitting. I don’t have an intuition for how it affects the training process.

More importantly, I’m not sure if you can share weights when you use convolutional layers in encoder and decoder: could the same filters be used for both convolution and “deconvolution” purposes? Do you know any papers that did it?
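On the purely mechanical side of that question, here is a small sketch of one filter bank being passed to both `F.conv2d` and `F.conv_transpose2d`. The tensor shapes, stride, and padding are assumptions picked so that a 28x28 input maps back to 28x28. It only shows that the two calls accept the same weight tensor and that gradients from both uses accumulate on it; it says nothing about whether tying the filters this way helps or hurts training.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One filter bank, laid out as (out_channels, in_channels, kH, kW) for conv2d
weight = torch.randn(8, 1, 3, 3, requires_grad=True)
x = torch.randn(4, 1, 28, 28)

# Encoder: strided convolution, 28x28 -> 14x14 with 8 channels
code = F.relu(F.conv2d(x, weight, stride=2, padding=1))

# Decoder: transposed convolution with the same tensor, 14x14 -> 28x28.
# conv_transpose2d reads the weight as (in_channels, out_channels, kH, kW),
# so the (8, 1, 3, 3) tensor maps 8 channels back to 1.
recon = F.conv_transpose2d(code, weight, stride=2, padding=1,
                           output_padding=1)

loss = F.mse_loss(recon, x)
loss.backward()  # gradients from both uses accumulate in weight.grad
```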

Got it! Makes sense. Any chance you are aware of any experiments that show this? Would be curious to read, no worries if you don’t have these handy.

Weight sharing is not possible at all in a convolution-deconvolution scenario, as far as I understand it. That is probably the reason why weight sharing has fallen out of favor in feedforward AEs as well. Even though I could not find a single reference for this, I’ve noticed that at some point feedforward AE implementations moved away from weight sharing.