ValueError: torch.cuda.is_available() should be True but is False

I’m trying to train a custom model using kohya and have been running into the same error for over a week now. I’ve attached photos of my error message as well as some information about my computer and the applications on it. I think there is an issue with my setup, but I’m not sure what it is or how to fix it. Any help would be much appreciated!

Error Message:

Traceback (most recent call last):
  File "D:\Kohya\kohya_ss\sdxl_train_network.py", line 189, in <module>
    trainer.train(args)
  File "D:\Kohya\kohya_ss\train_network.py", line 242, in train
    vae.set_use_memory_efficient_attention_xformers(args.xformers)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 263, in set_use_memory_efficient_attention_xformers
    fn_recursive_set_mem_eff(module)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 259, in fn_recursive_set_mem_eff
    fn_recursive_set_mem_eff(child)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 259, in fn_recursive_set_mem_eff
    fn_recursive_set_mem_eff(child)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 259, in fn_recursive_set_mem_eff
    fn_recursive_set_mem_eff(child)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\models\modeling_utils.py", line 256, in fn_recursive_set_mem_eff
    module.set_use_memory_efficient_attention_xformers(valid, attention_op)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\diffusers\models\attention_processor.py", line 255, in set_use_memory_efficient_attention_xformers
    raise ValueError(
ValueError: torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only available for GPU
Traceback (most recent call last):
  File "C:\Users\Allison\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Allison\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\Kohya\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "D:\Kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\\Kohya\\kohya_ss\\venv\\Scripts\\python.exe', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=D:/Kohya/kohya_ss/sdXL_v10VAEFix.safetensors', '--train_data_dir=C:/Users/Allison/Downloads/jrm/jrm_lora\\img', '--resolution=1024, 1024', '--output_dir=C:/Users/Allison/Downloads/jrm/jrm_lora\\model', '--logging_dir=C:/Users/Allison/Downloads/jrm/jrm_lora\\log', '--network_alpha=64', '--training_comment=jrm', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=4e-07', '--unet_lr=0.0001', '--network_dim=128', '--output_name=jrm_lora', '--lr_scheduler_num_cycles=1', '--no_half_vae', '--learning_rate=4e-07', '--lr_scheduler=constant_with_warmup', '--lr_warmup_steps=200', '--train_batch_size=1', '--max_train_steps=2001', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=Adafactor', '--max_grad_norm=1', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--mem_eff_attn', '--gradient_checkpointing', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0']' returned non-zero exit status 1.

nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

nvidia-smi: (output attached as a screenshot)

python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 2.1.2+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Home
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 12.3.107
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2070
Nvidia driver version: 546.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=3200
DeviceID=CPU0
Family=107
L2CacheSize=4096
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=3200
Name=AMD Ryzen 7 2700 Eight-Core Processor
ProcessorType=3
Revision=2050

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] torch==2.1.2+cu118
[pip3] torchaudio==2.1.2+cu118
[pip3] torchvision==0.16.2+cu118
[conda] Could not collect

Your output seems to indicate different PyTorch versions are used.
In your collect_env output it shows:

Is CUDA available: True

which is set via cuda_available_str = str(torch.cuda.is_available()) in collect_env and indicates that torch.cuda.is_available() returns True when collect_env runs.
However, your stacktrace shows the opposite:

ValueError: torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only available for GPU

so I assume another PyTorch binary is found and used in this environment.
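
To narrow that down, you could also print where torch is actually imported from, e.g. something like:

import sys
import torch

# check which interpreter and which torch package are actually picked up
print(sys.executable)
print(torch.__file__)
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

If torch.__file__ does not point into D:\Kohya\kohya_ss\venv, a different installation is being imported than the one you expect.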

Thank you so much for this insight! Do you know how I would go about fixing this issue? I’m new to all this.

I would probably start by checking how many virtual environments you are using and which PyTorch version is used by your script.
E.g. start your default environment (assuming you are using more than one) and run:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

Then add the same print statements to the beginning of the failing script, i.e. to D:\Kohya\kohya_ss\sdxl_train_network.py, and run it again.
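
Something like this right after the existing imports at the top of sdxl_train_network.py would do (just a temporary debug print you can remove afterwards):

import torch

# temporary debug output: confirm which torch build the training run actually sees
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

That way you can also see which binary is picked up when the script is launched via accelerate rather than directly from the command line.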

You have been so helpful! I believe I am using two virtual environments. They both print “True” and both appear to be running CUDA 12.1 (if I ran those commands correctly), yet I keep hitting the exact same error (in the photo above). I even uninstalled and reinstalled PyTorch to confirm the versions were all consistent, and I restarted everything (on the slight chance that was the cause), but I’m still getting the same error.

Virtual Environment #1 (kohya):

 D:\Kohya\kohya_ss\venv>python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
2.1.2+cu121 12.1 True

Virtual Environment #2 (stable diffusion):

D:\Stable Diffusion\stable-diffusion-webui\venv>python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
2.1.2+cu121 12.1 True

sdxl_train_network.py:

import argparse
import torch

try:
    import intel_extension_for_pytorch as ipex

    if torch.xpu.is_available():
        from library.ipex import ipex_init

        ipex_init()
except Exception:
    pass
from library import sdxl_model_util, sdxl_train_util, train_util
import train_network


class SdxlNetworkTrainer(train_network.NetworkTrainer):
    def __init__(self):
        super().__init__()
        self.vae_scale_factor = sdxl_model_util.VAE_SCALE_FACTOR
        self.is_sdxl = True

    def assert_extra_args(self, args, train_dataset_group):
        super().assert_extra_args(args, train_dataset_group)
        sdxl_train_util.verify_sdxl_training_args(args)

        if args.cache_text_encoder_outputs:
            assert (
                train_dataset_group.is_text_encoder_output_cacheable()
            ), "when caching Text Encoder output, either caption_dropout_rate, shuffle_caption, token_warmup_step or caption_tag_dropout_rate cannot be used / Text Encoderの出力をキャッシュするときはcaption_dropout_rate, shuffle_caption, token_warmup_step, caption_tag_dropout_rateは使えません"

        assert (
            args.network_train_unet_only or not args.cache_text_encoder_outputs
        ), "network for Text Encoder cannot be trained with caching Text Encoder outputs / Text Encoderの出力をキャッシュしながらText Encoderのネットワークを学習することはできません"

        train_dataset_group.verify_bucket_reso_steps(32)

    def load_target_model(self, args, weight_dtype, accelerator):
        (
            load_stable_diffusion_format,
            text_encoder1,
            text_encoder2,
            vae,
            unet,
            logit_scale,
            ckpt_info,
        ) = sdxl_train_util.load_target_model(args, accelerator, sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, weight_dtype)

        self.load_stable_diffusion_format = load_stable_diffusion_format
        self.logit_scale = logit_scale
        self.ckpt_info = ckpt_info

        return sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, [text_encoder1, text_encoder2], vae, unet

    def load_tokenizer(self, args):
        tokenizer = sdxl_train_util.load_tokenizers(args)
        return tokenizer

    def is_text_encoder_outputs_cached(self, args):
        return args.cache_text_encoder_outputs

    def cache_text_encoder_outputs_if_needed(
        self, args, accelerator, unet, vae, tokenizers, text_encoders, dataset: train_util.DatasetGroup, weight_dtype
    ):
        if args.cache_text_encoder_outputs:
            if not args.lowram:
                # reduce memory consumption
                print("move vae and unet to cpu to save memory")
                org_vae_device = vae.device
                org_unet_device = unet.device
                vae.to("cpu")
                unet.to("cpu")
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()

            # When the TE is not being trained, it will not be prepared, so we need to use explicit autocast
            with accelerator.autocast():
                dataset.cache_text_encoder_outputs(
                    tokenizers,
                    text_encoders,
                    accelerator.device,
                    weight_dtype,
                    args.cache_text_encoder_outputs_to_disk,
                    accelerator.is_main_process,
                )

            text_encoders[0].to("cpu", dtype=torch.float32)  # Text Encoder doesn't work with fp16 on CPU
            text_encoders[1].to("cpu", dtype=torch.float32)
            if torch.cuda.is_available():
                torch.cuda.empty_cache()

            if not args.lowram:
                print("move vae and unet back to original device")
                vae.to(org_vae_device)
                unet.to(org_unet_device)
        else:
            # outputs are fetched from the Text Encoder every step, so keep it on the GPU
            text_encoders[0].to(accelerator.device)
            text_encoders[1].to(accelerator.device)

    def get_text_cond(self, args, accelerator, batch, tokenizers, text_encoders, weight_dtype):
        if "text_encoder_outputs1_list" not in batch or batch["text_encoder_outputs1_list"] is None:
            input_ids1 = batch["input_ids"]
            input_ids2 = batch["input_ids2"]
            with torch.enable_grad():
                # Get the text embedding for conditioning
                # TODO support weighted captions
                # if args.weighted_captions:
                #     encoder_hidden_states = get_weighted_text_embeddings(
                #         tokenizer,
                #         text_encoder,
                #         batch["captions"],
                #         accelerator.device,
                #         args.max_token_length // 75 if args.max_token_length else 1,
                #         clip_skip=args.clip_skip,
                #     )
                # else:
                input_ids1 = input_ids1.to(accelerator.device)
                input_ids2 = input_ids2.to(accelerator.device)
                encoder_hidden_states1, encoder_hidden_states2, pool2 = train_util.get_hidden_states_sdxl(
                    args.max_token_length,
                    input_ids1,
                    input_ids2,
                    tokenizers[0],
                    tokenizers[1],
                    text_encoders[0],
                    text_encoders[1],
                    None if not args.full_fp16 else weight_dtype,
                    accelerator=accelerator,
                )
        else:
            encoder_hidden_states1 = batch["text_encoder_outputs1_list"].to(accelerator.device).to(weight_dtype)
            encoder_hidden_states2 = batch["text_encoder_outputs2_list"].to(accelerator.device).to(weight_dtype)
            pool2 = batch["text_encoder_pool2_list"].to(accelerator.device).to(weight_dtype)

            # # verify that the text encoder outputs are correct
            # ehs1, ehs2, p2 = train_util.get_hidden_states_sdxl(
            #     args.max_token_length,
            #     batch["input_ids"].to(text_encoders[0].device),
            #     batch["input_ids2"].to(text_encoders[0].device),
            #     tokenizers[0],
            #     tokenizers[1],
            #     text_encoders[0],
            #     text_encoders[1],
            #     None if not args.full_fp16 else weight_dtype,
            # )
            # b_size = encoder_hidden_states1.shape[0]
            # assert ((encoder_hidden_states1.to("cpu") - ehs1.to(dtype=weight_dtype)).abs().max() > 1e-2).sum() <= b_size * 2
            # assert ((encoder_hidden_states2.to("cpu") - ehs2.to(dtype=weight_dtype)).abs().max() > 1e-2).sum() <= b_size * 2
            # assert ((pool2.to("cpu") - p2.to(dtype=weight_dtype)).abs().max() > 1e-2).sum() <= b_size * 2
            # print("text encoder outputs verified")

        return encoder_hidden_states1, encoder_hidden_states2, pool2

    def call_unet(self, args, accelerator, unet, noisy_latents, timesteps, text_conds, batch, weight_dtype):
        noisy_latents = noisy_latents.to(weight_dtype)  # TODO check why noisy_latents is not weight_dtype

        # get size embeddings
        orig_size = batch["original_sizes_hw"]
        crop_size = batch["crop_top_lefts"]
        target_size = batch["target_sizes_hw"]
        embs = sdxl_train_util.get_size_embeddings(orig_size, crop_size, target_size, accelerator.device).to(weight_dtype)

        # concat embeddings
        encoder_hidden_states1, encoder_hidden_states2, pool2 = text_conds
        vector_embedding = torch.cat([pool2, embs], dim=1).to(weight_dtype)
        text_embedding = torch.cat([encoder_hidden_states1, encoder_hidden_states2], dim=2).to(weight_dtype)

        noise_pred = unet(noisy_latents, timesteps, text_embedding, vector_embedding)
        return noise_pred

    def sample_images(self, accelerator, args, epoch, global_step, device, vae, tokenizer, text_encoder, unet):
        sdxl_train_util.sample_images(accelerator, args, epoch, global_step, device, vae, tokenizer, text_encoder, unet)


def setup_parser() -> argparse.ArgumentParser:
    parser = train_network.setup_parser()
    sdxl_train_util.add_sdxl_training_arguments(parser)
    return parser


if __name__ == "__main__":
    parser = setup_parser()

    args = parser.parse_args()
    args = train_util.read_config_from_file(args, parser)

    trainer = SdxlNetworkTrainer()
    trainer.train(args)

OK, this is still a bit strange, since both envs show that your PyTorch installation should support CUDA.
Could you run one more small test in these envs, creating a random tensor on the GPU, just to make sure the GPU is actually usable?

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available()); print(torch.randn(1).cuda())"