Universal CUDA Tools: GPU-Safe Execution Made Simple for PyTorch

Note: This is not an ad, but a suggestion for functionality that could be added to the PyTorch library.

What is universal-cuda-tools?

universal-cuda-tools is a lightweight Python utility package designed to simplify and harden GPU/CPU execution workflows in PyTorch (and optionally TensorFlow).

It provides:

  • A DeviceContext context manager for clean device, AMP, and cache handling
  • Simple and advanced decorators (@cuda, @cuda_advanced) to make any function run safely on GPU or CPU
  • Optional automatic tensorization (e.g. int, list, np.ndarray → torch.Tensor)
  • Built-in support for:
    • Memory profiling
    • Timeout handling
    • Retry on failure
    • Fallback to CPU when GPU out-of-memory (OOM)
    • Mixed precision (AMP)
    • Multi-GPU dispatch
    • Dry-run and telemetry logging
  • Utilities for converting and moving arbitrary Python, NumPy, or even TensorFlow inputs to the correct device

It’s built for real-world training pipelines, especially on constrained hardware, and aims to reduce the boilerplate and fragility in device management, AMP setup, and error handling.


Why does this matter?

Working with GPU-accelerated code in PyTorch often requires repetitive setup:

  • Explicit device selection
  • Manual .to(device) calls for all inputs
  • Handling CUDA memory errors (OOM)
  • Managing AMP (autocast) scopes
  • Clearing cache and profiling memory
  • Converting raw Python or NumPy types into tensors

universal-cuda-tools abstracts all of that into a minimal, composable, and Pythonic interface.
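For comparison, the manual version of that checklist in plain PyTorch (standard torch APIs only; run_step is just an illustrative name, and torch.cuda.OutOfMemoryError requires PyTorch ≥ 1.13, with older versions raising a plain RuntimeError) typically looks like this:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def run_step(model, x, y):
    # Explicit device selection and transfers for every input
    model, x, y = model.to(device), x.to(device), y.to(device)
    try:
        # Manual AMP scope
        with torch.autocast(device_type=device.type, enabled=(device.type == "cuda")):
            return (model(x) - y).square().mean()
    except torch.cuda.OutOfMemoryError:
        # Manual OOM handling: clear the cache and fall back to CPU
        torch.cuda.empty_cache()
        model, x, y = model.cpu(), x.cpu(), y.cpu()
        return (model(x) - y).square().mean()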


Installation

pip install universal-cuda-tools

Or install from source:

git clone https://github.com/Tunahanyrd/universal-cuda-tools.git
cd universal-cuda-tools
python -m build
pip install dist/universal_cuda_tools-*.whl

API Overview

@cuda Decorator

A simple device-aware wrapper for lightweight functions. Supports retry, cache clearing, auto tensorization, and CPU fallback.
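A sketch of how those options might combine: device and auto_tensorize appear in the example further below, while retry and clear_cache are assumed keyword names inferred from the feature list, not confirmed API.

from cuda_tools import cuda

# retry / clear_cache are assumed parameter names (see note above);
# check the package docs for the exact signature.
@cuda(device="cuda", auto_tensorize=True, retry=1, clear_cache=True)
def scale(x, factor):
    return x * factor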

@cuda_advanced Decorator

All of the above, plus timeout, AMP, multi-GPU dispatching, live dashboard, and dry-run support.

DeviceContext

A scoped context manager that handles device selection, AMP enablement, and auto tensorization for all code within the block.
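A usage sketch, assuming keyword names that mirror the decorators (use_amp, auto_tensorize); consult the repo for the exact signature:

import torch
from cuda_tools import DeviceContext

# Keyword names below are assumptions based on the description above.
with DeviceContext(device="cuda", use_amp=True, auto_tensorize=True):
    model = torch.nn.Linear(4, 1)
    x = torch.randn(8, 4)
    loss = model(x).square().mean()  # runs inside the managed device/AMP scope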

Utilities

  • tensorize_for_universal(obj, device)
  • move_to_torch(device, obj)
  • patch_numpy_with_cupy()
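For example (a sketch; the argument semantics are inferred from the signatures above, and the exact import path may differ from cuda_tools.utils):

import numpy as np
import torch
from cuda_tools.utils import tensorize_for_universal, move_to_torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convert a raw NumPy array into a torch.Tensor on the target device
t = tensorize_for_universal(np.arange(4), device)

# Move an arbitrary Python object (here a plain list) onto the device
t2 = move_to_torch(device, [1.0, 2.0, 3.0])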

Example

from cuda_tools import cuda

@cuda(device="cuda", auto_tensorize=True, to_list=True)
def add(a, b):
    return a + b

print(add([1, 2], [3, 4]))  # → [4, 6]

And with the advanced decorator:

from cuda_tools import cuda_advanced

@cuda_advanced(timeout=0.5, use_amp=True, telemetry=True)
def train_step(model, x, y):
    pred = model(x)
    loss = (pred - y).square().mean()
    loss.backward()
    return loss.item()
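Assuming the decorator handles the device transfers for the model and tensors (a hedged reading of the feature list above), a call could look like:

import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = train_step(model, x, y)  # returns the scalar loss; telemetry is logged
print(loss)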

Documentation & Demo

Full documentation and a demo are available in the GitHub repository: https://github.com/Tunahanyrd/universal-cuda-tools

Project Structure

cuda_tools/
├── __init__.py        # Exports decorators and context
├── decorators.py      # @cuda and @cuda_advanced
├── context.py         # DeviceContext class
└── utils.py           # tensorize, device tools, CuPy support

License

MIT License © 2025 – Tunahan Yardımcı

Thanks for sharing! While some utilities sound interesting for PyTorch Core, the majority of the proposal sounds like a feature for higher-level APIs. Did you already talk to these teams, e.g. pytorch-lightning?

Thank you so much for taking the time to respond!

You’re absolutely right — most of the logic can indeed live in higher-level libraries like PyTorch Lightning or Accelerate. However, the internal structure of this tool is highly modular and granular.

For example, even without using the decorators, core functions such as safe_to_device(), try_batch_size(), or run_with_amp() can be used individually — making them potential candidates for inclusion in torch.utils or torch.cuda.
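To make that concrete, here is a rough sketch of what a safe_to_device()-style helper could look like (illustrative only, not the package's actual implementation):

import torch

def safe_to_device(obj, device):
    # Recursively move tensors (and common containers of tensors) to
    # `device`; on CUDA OOM, clear the cache and keep the tensor on CPU.
    if isinstance(obj, torch.Tensor):
        try:
            return obj.to(device)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return obj.cpu()
    if isinstance(obj, (list, tuple)):
        return type(obj)(safe_to_device(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: safe_to_device(v, device) for k, v in obj.items()}
    return obj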

To be clear: this is not a proposal to merge the whole package as-is. Rather, I wanted to offer a small set of reusable utilities that address common patterns like:

  • clean device transfers
  • OOM-safe function calls
  • automatic AMP context
  • fallback handling
  • safe tensorization of inputs

These tools came out of repeated pain points in long training runs on low-memory devices, and I believe parts of them could help simplify day-to-day PyTorch code, even in core-level examples or docs.

If there’s any interest from the core team, I’d be happy to isolate and PR one or two focused components (e.g., safe_to_device() or a simple cuda_guard() context manager).
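For illustration, a minimal cuda_guard() could be as small as the sketch below (a sketch, not a final design):

import contextlib
import torch

@contextlib.contextmanager
def cuda_guard():
    # Run the enclosed block; on CUDA OOM, release cached memory
    # before re-raising so the caller can retry or fall back.
    try:
        yield
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise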

Thanks again!