Real-time inference on an SG3 PyTorch GAN

I am currently developing a Flask server in Python to serve generated images in real time. I know this is possible because RunwayML does it at about 90 ms per frame for a 1024x1024 generator. Right now I am running some tests and noticed that it is very slow (around 1000 ms per frame). Is this just because I am saving the image to disk, and that write is the bottleneck? Is it something else? What is the fastest way to pass the image to another application running on the same machine?

    # blah blah, when Flask receives a jQuery POST containing the array I use for z ...
    img = G(z, c, trunc)                           # NCHW, float32, dynamic range [-1, +1]
    img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)
    PIL.Image.fromarray(img[0].cpu().numpy(), 'RGB').save('newtest.png')
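
To see where the time actually goes, it may help to time the generator and the encode/save step separately. Below is a minimal timing sketch, assuming `G`, `z`, `c`, and `trunc` are as in the snippet above; the `torch.cuda.synchronize()` calls matter because CUDA kernels launch asynchronously, so naive timing would attribute GPU work to the wrong line.

    import time
    import torch
    import PIL.Image

    torch.cuda.synchronize()                        # flush any pending GPU work
    t0 = time.perf_counter()
    with torch.no_grad():
        img = G(z, c, trunc)
    torch.cuda.synchronize()                        # wait for the generator kernels to finish
    t1 = time.perf_counter()

    img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)
    PIL.Image.fromarray(img[0].cpu().numpy(), 'RGB').save('newtest.png')
    t2 = time.perf_counter()

    print(f'generate: {(t1 - t0) * 1e3:.1f} ms, encode+save: {(t2 - t1) * 1e3:.1f} ms')

If the second number dominates, the PNG encode and disk write are the bottleneck rather than the network itself.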

It looks like Runway's C++ lib receives the raw pixels, re-encodes them as JPEG (or PNG), and then base64-encodes the compressed bytes for transport:

    // Save the incoming pixels to a buffer using JPG compression.
    ofBuffer compressedPixels;
    ofSaveImage(pixelsToReceive, compressedPixels, (type == OFX_RUNWAY_JPG) ? OF_IMAGE_FORMAT_JPEG : OF_IMAGE_FORMAT_PNG);

    // Encode the compressed pixels in base64.
    ofxIO::ByteBuffer base64CompressedPixelsIn;
    ofxIO::Base64Encoding base64Encoder(false, false, true);
    base64Encoder.encode(ofxIO::ByteBuffer(compressedPixels.getData(), compressedPixels.size()), base64CompressedPixelsIn);
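
If that base64-over-the-wire approach is acceptable, the Flask side can do the same thing without the disk round trip. A minimal sketch, assuming `img` is the uint8 tensor from the snippet above (the `quality=90` setting is an arbitrary choice, not Runway's):

    import base64
    import io
    import PIL.Image

    # JPEG-compress the uint8 HWC frame in memory instead of writing a PNG to disk,
    # then base64-encode it so it can travel in a JSON/HTTP response body.
    buf = io.BytesIO()
    PIL.Image.fromarray(img[0].cpu().numpy(), 'RGB').save(buf, format='JPEG', quality=90)
    payload = base64.b64encode(buf.getvalue()).decode('ascii')

An in-memory JPEG encode is typically much cheaper than a PNG write to disk, which is likely a large part of the 1000 ms you are seeing.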

It sounds like you're trying to build an inferencing framework from scratch. Have you considered using something like TorchServe? https://github.com/pytorch/serve
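
For reference, a TorchServe deployment wraps the model in a custom handler built on TorchServe's `BaseHandler` API. Below is a minimal sketch; everything specific to this use case (the `G.pt` checkpoint name, the `"z"` JSON field, the JPEG output) is a hypothetical illustration, not StyleGAN3's or Runway's actual serving code.

    import base64
    import io
    import numpy as np
    import torch
    import PIL.Image
    from ts.torch_handler.base_handler import BaseHandler

    class GanHandler(BaseHandler):
        def initialize(self, context):
            self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
            # Hypothetical: assumes the generator was exported as a TorchScript module.
            self.model = torch.jit.load('G.pt', map_location=self.device).eval()
            self.initialized = True

        def preprocess(self, data):
            # Expect a JSON body like {"z": [...]} (hypothetical field name).
            z = np.asarray(data[0].get('body')['z'], dtype=np.float32)
            return torch.from_numpy(z).unsqueeze(0).to(self.device)

        def inference(self, z):
            with torch.no_grad():
                return self.model(z)

        def postprocess(self, img):
            # Same uint8 conversion as above, but JPEG-encoded in memory.
            img = (img.permute(0, 2, 3, 1) * 127.5 + 128).clamp(0, 255).to(torch.uint8)
            buf = io.BytesIO()
            PIL.Image.fromarray(img[0].cpu().numpy(), 'RGB').save(buf, format='JPEG')
            return [base64.b64encode(buf.getvalue()).decode('ascii')]

That gets you batching, worker management, and metrics for free instead of hand-rolling them in Flask.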