Help me understand CUDA multiprocessing

Hey everyone!
I’m trying to build a Docker container with a small server that I can use to run Stable Diffusion. I’ve used the example code from banana.dev as a base and have uploaded my container to RunPod. In the server, I first call a function that initialises the model so that it is available as soon as the server is running:

from sanic import Sanic, response
import json

import torch

import app as user_src

# We do the model load-to-GPU step on server startup
# so the model object is available globally for reuse
user_src.init()

# Create the HTTP server app
server = Sanic("my_app")

@server.route('/', methods=["POST"])
async def inference(request):
    try:
        # the body may arrive as a JSON string or as an already-parsed dict
        model_inputs = json.loads(request.json)
    except (TypeError, ValueError):
        model_inputs = request.json

    output = user_src.inference(model_inputs)

    return response.json(output)


if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn', force=True)
    server.run(host='0.0.0.0', port=8000, workers=1)

The actual model is defined in app.py, where the model is first initialised:


import os

from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

def init():
    global model
    HF_AUTH_TOKEN = os.getenv("HF_AUTH_TOKEN")

    # this substitutes the default PNDM scheduler for K-LMS
    lms = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear")

    model = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", scheduler=lms, use_auth_token=HF_AUTH_TOKEN).to('cuda')

The inference function then uses the global model variable to run inference:

from torch import autocast

def inference(model_inputs: dict) -> dict:
    global model

[...]
    
    # Run the model
    with autocast('cuda'):
        images = model(
            prompt,
            width=width,
            height=height,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale
        )["sample"]

[...]

Every time I try to run inference, I get the error "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method", even though I have a) added the call to use the spawn start method in the server, and b) am (to my understanding) not using multiprocessing at all. Can you help me understand how CUDA is initialised under the hood and how I can fix this error?
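From reading around, my current guess is that Sanic itself forks a worker process for the server, while user_src.init() has already created a CUDA context in the parent at import time. Here is a minimal sketch of what I think is happening under the hood (a hypothetical repro, not my actual code):

import torch
import torch.multiprocessing as mp

def child():
    # fails with "RuntimeError: Cannot re-initialize CUDA in forked subprocess"
    torch.zeros(1, device="cuda")

if __name__ == "__main__":
    torch.zeros(1, device="cuda")  # CUDA context is created in the parent
    ctx = mp.get_context("fork")   # fork after CUDA init, as I suspect Sanic does
    p = ctx.Process(target=child)
    p.start()
    p.join()

Is this the right mental model?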

Best and thank you,

Sami

I have the same problem; did you solve it?

Yes, I solved it by not using Sanic anymore. We’re now using an SNS message queue from AWS, which does not spawn a child process. Maybe you can try using a Flask server or something similar to avoid the problem 🙂
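Alternatively, if you want to keep Sanic: my understanding is that the root cause is user_src.init() creating the CUDA context at import time in the parent process, before Sanic forks its worker. Moving the model load into a listener that runs inside the worker process should avoid that. An untested sketch (assuming Sanic’s before_server_start listener):

from sanic import Sanic, response
import app as user_src

server = Sanic("my_app")

# Load the model inside the worker process instead of at import time,
# so the parent never initialises CUDA before the fork.
@server.listener("before_server_start")
async def load_model(app, loop):
    user_src.init()

@server.route("/", methods=["POST"])
async def inference(request):
    output = user_src.inference(request.json)
    return response.json(output)

if __name__ == "__main__":
    server.run(host="0.0.0.0", port=8000, workers=1)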