Need to export a wheel with a specific CUDA + PyTorch version for H100 support

Hello.
Recently, I was lucky enough to get access to an H100. I was trying to run some code of mine to test the new hardware, but PyTorch wasn't working correctly: weird high memory allocation, inconsistent results, and behavior that looked like a buffer overflow. I checked some forums and PRs and found that the H100 requires CUDA >= 11.8.
My code currently runs with

  • python 3.9
  • torch 1.13.1 + cu117

I would like to keep the same torch version, because the next closest release with CUDA >= 11.8 support is 2.0.0, which breaks my code.

I tried to build a wheel from source that I could install in different environments, but had no luck sorting out all the library dependencies.

I checked the builder repo and tried conda/build_pytorch.sh and manywheel/build.sh, but something was always wrong or missing.

I also tried a Docker container, so I'd have the exact CUDA and cuDNN versions I was building against, and I managed to build a wheel, but when I copied it into another environment some of the libs were missing.

Dockerfile example

FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04

ENV PACKAGE_TYPE=conda
ENV DESIRED_CUDA=118
ENV DESIRED_PYTHON=3.9
ENV PYTORCH_BUILD_VERSION=1.13.1
ENV PYTORCH_BUILD_NUMBER=1
ENV TORCH_CONDA_BUILD_FOLDER=pytorch-nightly
WORKDIR /
RUN apt update && apt upgrade -y && apt install -y curl git nano
RUN git clone https://github.com/pytorch/builder.git
RUN git clone https://github.com/pytorch/pytorch.git
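For reference, the in-container source build I was aiming for looks roughly like this. The v1.13.1 tag and the env vars are my assumptions from the standard PyTorch build-from-source instructions, so I may well be holding it wrong:

```shell
# Sketch of the manual build inside the container; tag name and env
# vars are assumptions from the standard PyTorch source-build flow.
export TORCH_CUDA_ARCH_LIST="9.0"      # H100 is compute capability 9.0 (sm_90)
export PYTORCH_BUILD_VERSION=1.13.1
export PYTORCH_BUILD_NUMBER=1
export USE_CUDA=1
cd pytorch
git checkout v1.13.1
git submodule sync
git submodule update --init --recursive
pip install -r requirements.txt
python setup.py bdist_wheel            # the wheel should land in dist/
```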

After installing Anaconda, I ran

./builder/conda/build.sh

And after the build, the whole process crashes because of a missing library or a wrong path (e.g. /usr/local/cuda/lib64/*.so not found, because the lib is actually in /usr/lib/x86_64-linux-gnu).

I also tried this repo with this command:

$ CUDA_TAG=11.8.0-cudnn8-devel-ubuntu22.04 COMMIT=49444c3e546bf240bed24a101e747422d1f8a0ee PYTHON_VERSION=3.9 USE_MPI=1 TORCH_CUDA_ARCH_LIST="9.0" bash build.sh

and I get the same result: a wheel torch-1.13.0a0+git49444c3-cp39-cp39-linux_x86_64.whl, but with no .so files bundled in it.
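Since a wheel is just a zip archive, its contents can be listed to confirm whether the shared objects were bundled; the auditwheel step is only my guess at how one would inspect (and, with `repair`, vendor in) the external libraries:

```shell
# A wheel is just a zip archive, so the bundled shared objects (or
# their absence) can be checked before installing it anywhere:
unzip -l torch-1.13.0a0+git49444c3-cp39-cp39-linux_x86_64.whl | grep '\.so'

# auditwheel can report which external shared libraries the wheel
# still depends on (my assumption that it applies cleanly here):
auditwheel show torch-1.13.0a0+git49444c3-cp39-cp39-linux_x86_64.whl
```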

Am I missing some steps?

  • Is it possible to have this combination: Python 3.9, PyTorch 1.13.1, and CUDA 11.8?
  • Are there some flags or env variables that I'm missing in order to get a complete, portable .whl with all the dependencies (CUDA 11.8, cuDNN 8)?
  • Do I need to copy the cuDNN and MKL libs manually after the wheel is built?
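From reading the manywheel scripts, my understanding (which may be wrong) is that they do exactly this after the build: copy the runtime .so files into torch/lib and fix the rpath with patchelf. The exact file names below are my assumptions for CUDA 11.8 / cuDNN 8:

```shell
# My reading of what builder's manywheel scripts do post-build: copy
# the runtime libraries next to the torch libs inside the unpacked
# wheel, then make the rpath point at them so the wheel is portable.
cp /usr/lib/x86_64-linux-gnu/libcudnn.so.8 torch/lib/
patchelf --set-rpath '$ORIGIN' torch/lib/libtorch_cuda.so
```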

Are there any blogs / tutorials / topics / READMEs / breadcrumb trails I can follow to get a PyTorch 1.13.1+cu118 wheel with all its dependencies?
Thanks in advance

# line 128 of manywheel/build_libtorch.sh
CMAKE_ARGS=${CMAKE_ARGS[@]} \
        EXTRA_CAFFE2_CMAKE_FLAGS="${EXTRA_CAFFE2_CMAKE_FLAGS[@]} $STATIC_CMAKE_FLAG" \
        CFLAGS='-Wno-deprecated-declarations' \
        BUILD_LIBTORCH_CPU_WITH_DEBUG=1 \
        python setup.py install # <--------- python setup.py bdist_wheel ???

Configuration

root@f7594da342f2:/Downloads/pytorch# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

root@f7594da342f2:/Downloads/pytorch# nvidia-smi
Thu Jan 11 17:54:03 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 PCIe    Off  | 00000000:02:00.0 Off |                    0 |
| N/A   52C    P0    92W / 350W |   1114MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Instead of trying to backport the CUDA support to an older PyTorch version, which might also need source code changes, it might be easier to update your code to be compatible with newer PyTorch releases. I don’t think the update from PyTorch 1.x to 2.x introduced any breaking changes.