Slow FLUX image generation: is the problem the hardware or the libraries?

My hardware is an 11th Gen Intel i5 at 2.40 GHz, 32 GB RAM, an NVMe SSD, and a GeForce RTX 4090. The libraries are PyTorch, diffusers, and accelerate on Python 3.10. I have already generated images with SDXL, and now I'm trying to load FLUX (either Dev or Schnell). The pipeline itself loads very fast, but the inference steps take a long time, even when I request only 5 steps. I tested the same configuration on a 4070 Ti, where it is also slow but averages 7 to 9 minutes for three images; on the 4090 it takes up to 20 or 30 minutes. The code is very basic, but maybe it gives you an idea of what to change.

    import torch
    from torch import autocast
    from diffusers import FluxPipeline

    print(torch.__version__)  # Torch version: 2.6.0+cu118
    print(torch.cuda.is_available())  # True

    global pipe_flux
    torch.cuda.empty_cache()
    model_id_flux = "C:/Users/GFAdmin/Documents/model/flux.1-dev/"
    # Make newly created float tensors default to the GPU
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    torch.cuda.set_device(0)

    # Load the full pipeline from the local model folder in bfloat16
    pipe_flux = FluxPipeline.from_pretrained(
        model_id_flux,
        torch_dtype=torch.bfloat16
    )
    # Move every component to the GPU, then switch to model-level CPU offload
    pipe_flux.to("cuda")
    pipe_flux.reset_device_map()
    pipe_flux.enable_model_cpu_offload()

    print("Default FLUX pipeline loaded.")

    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    # Generate a single image with 5 denoising steps
    with autocast("cuda"), torch.inference_mode():
        image = pipe_flux(
            "A capybara holding a sign that reads Hello World",
            num_inference_steps=5,
            guidance_scale=3.5,
        ).images[0]
        image.save("./pruebas/capybara.png")

In another case, I have other code that loads the model from a checkpoint file, but it does not work either:

    import torch
    from diffusers import FluxPipeline, AutoencoderKL
    from transformers import CLIPTextModel, T5EncoderModel
    torch.cuda.set_device(0)

    # Single-file checkpoint plus separately loaded text encoders and VAE
    model_file = "C:/Users/GFAdmin/Documents/model/flux_checkpoint/flux_dev.safetensors"
    text_encoder = CLIPTextModel.from_pretrained("C:/Users/GFAdmin/Documents/model/clip-vit-base-patch32")
    text_encoder_2 = T5EncoderModel.from_pretrained("t5-base")
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")

    # Build the pipeline from the checkpoint, passing in the components above
    pipe_flux = FluxPipeline.from_single_file(
        model_file,
        torch_dtype=torch.float16,
        use_safetensors=True,
        text_encoder=text_encoder,
        text_encoder_2=text_encoder_2,
        vae=vae
    )

    # Move the pipeline to the GPU and cast all components to float16
    pipe_flux.to("cuda")
    pipe_flux.reset_device_map()
    pipe_flux.to(torch.float16)
    prompt = "A capybara holding a sign that reads Hello World"
    result = pipe_flux(prompt, num_inference_steps=5, guidance_scale=3.5)
    result.images[0].save("./pruebas/capybara.png")

    print("Image saved to './pruebas/capybara.png'")

I would be grateful for any help.

I would recommend profiling your code via e.g. Nsight Systems to narrow down the bottleneck of your application.
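
Before opening a full Nsight Systems trace, a coarse wall-clock measurement of each stage (loading, generation, saving) may already show where most of the time goes. The following is only a minimal sketch: the helper `timed` and its `sync` and `results` parameters are hypothetical names introduced here, not part of PyTorch or diffusers, and the workload is a stand-in for the real pipeline call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sync=None, results=None):
    """Measure the wall-clock time of the enclosed block.

    `sync` is an optional zero-argument callable that is invoked before the
    timer starts and again before it stops, so that work queued on a device
    is included in the measurement. (Hypothetical helper, for illustration.)
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


# Stand-in workload; in the real script this would be the pipeline call.
timings = {}
with timed("stand-in workload", results=timings):
    total = sum(i * i for i in range(100_000))
```

In the script from the question, one could wrap `FluxPipeline.from_pretrained(...)` and the `pipe_flux(...)` call in separate `timed` blocks and pass `sync=torch.cuda.synchronize`, since CUDA kernels are launched asynchronously and an unsynchronized timer can stop before the GPU has finished. For the full trace, Nsight Systems is typically run as `nsys profile -o flux_report python your_script.py`, where the report and script names are placeholders.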