Universal CUDA Tools: GPU-Safe Execution Made Simple for PyTorch

Note: This is not an ad, but a suggestion for functionality that could be added to the PyTorch library.

What is universal-cuda-tools?

universal-cuda-tools is a lightweight Python utility package designed to simplify and harden GPU/CPU execution workflows in PyTorch (and optionally TensorFlow).

It provides:

  • A DeviceContext context manager for clean device, AMP, and cache handling
  • Simple and advanced decorators (@cuda, @cuda_advanced) to make any function run safely on GPU or CPU
  • Optional automatic tensorization (e.g. int, list, np.ndarray → torch.Tensor)
  • Built-in support for:
    • Memory profiling
    • Timeout handling
    • Retry on failure
    • Fallback to CPU when GPU out-of-memory (OOM)
    • Mixed precision (AMP)
    • Multi-GPU dispatch
    • Dry-run and telemetry logging
  • Utilities for converting and moving arbitrary Python, NumPy, or even TensorFlow inputs to the correct device

It’s built for real-world training pipelines, especially on constrained hardware, and aims to reduce the boilerplate and fragility in device management, AMP setup, and error handling.


Why does this matter?

Working with GPU-accelerated code in PyTorch often requires repetitive setup:

  • Explicit device selection
  • Manual .to(device) calls for all inputs
  • Handling CUDA memory errors (OOM)
  • Managing AMP (autocast) scopes
  • Clearing cache and profiling memory
  • Converting raw Python or NumPy types into tensors

universal-cuda-tools abstracts all of that into a minimal, composable, and Pythonic interface.
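For comparison, the manual version of that checklist in plain PyTorch (standard torch APIs only; run_step is just an illustrative name, and torch.cuda.OutOfMemoryError requires PyTorch ≥ 1.13, with older versions raising a plain RuntimeError) typically looks like this:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def run_step(model, x, y):
    # Explicit device selection and transfers for every input
    model, x, y = model.to(device), x.to(device), y.to(device)
    try:
        # Manual AMP scope
        with torch.autocast(device_type=device.type, enabled=(device.type == "cuda")):
            return (model(x) - y).square().mean()
    except torch.cuda.OutOfMemoryError:
        # Manual OOM handling: clear the cache and fall back to CPU
        torch.cuda.empty_cache()
        model, x, y = model.cpu(), x.cpu(), y.cpu()
        return (model(x) - y).square().mean()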


Installation

pip install universal-cuda-tools

Or install from source:

git clone https://github.com/Tunahanyrd/universal-cuda-tools.git
cd universal-cuda-tools
python -m build
pip install dist/universal_cuda_tools-*.whl

API Overview

@cuda Decorator

A simple device-aware wrapper for lightweight functions. Supports retry, cache clearing, auto tensorization, and CPU fallback.
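A sketch of how those options might combine: device and auto_tensorize appear in the example further below, while retry and clear_cache are assumed keyword names inferred from the feature list, not confirmed API.

from cuda_tools import cuda

# retry / clear_cache are assumed parameter names (see note above);
# check the package docs for the exact signature.
@cuda(device="cuda", auto_tensorize=True, retry=1, clear_cache=True)
def scale(x, factor):
    return x * factor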

@cuda_advanced Decorator

All of the above, plus timeout, AMP, multi-GPU dispatching, live dashboard, and dry-run support.

DeviceContext

A scoped context manager that handles device selection, AMP enablement, and auto tensorization for all code within the block.
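A usage sketch, assuming keyword names that mirror the decorators (use_amp, auto_tensorize); consult the repo for the exact signature:

import torch
from cuda_tools import DeviceContext

# Keyword names below are assumptions based on the description above.
with DeviceContext(device="cuda", use_amp=True, auto_tensorize=True):
    model = torch.nn.Linear(4, 1)
    x = torch.randn(8, 4)
    loss = model(x).square().mean()  # runs inside the managed device/AMP scope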

Utilities

  • tensorize_for_universal(obj, device)
  • move_to_torch(device, obj)
  • patch_numpy_with_cupy()
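For example (a sketch; the argument semantics are inferred from the signatures above, and the exact import path may differ from cuda_tools.utils):

import numpy as np
import torch
from cuda_tools.utils import tensorize_for_universal, move_to_torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convert a raw NumPy array into a torch.Tensor on the target device
t = tensorize_for_universal(np.arange(4), device)

# Move an arbitrary Python object (here a plain list) onto the device
t2 = move_to_torch(device, [1.0, 2.0, 3.0])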

Example

from cuda_tools import cuda

@cuda(device="cuda", auto_tensorize=True, to_list=True)
def add(a, b):
    return a + b

print(add([1, 2], [3, 4]))  # → [4, 6]

And with the advanced decorator:

from cuda_tools import cuda_advanced

@cuda_advanced(timeout=0.5, use_amp=True, telemetry=True)
def train_step(model, x, y):
    pred = model(x)
    loss = (pred - y).square().mean()
    loss.backward()
    return loss.item()
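Assuming the decorator handles the device transfers for the model and tensors (a hedged reading of the feature list above), a call could look like:

import torch

model = torch.nn.Linear(4, 1)
x, y = torch.randn(8, 4), torch.randn(8, 1)
loss = train_step(model, x, y)  # returns the scalar loss; telemetry is logged
print(loss)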

Documentation & Demo

Full documentation and a demo are available in the GitHub repository: https://github.com/Tunahanyrd/universal-cuda-tools

Project Structure

cuda_tools/
├── __init__.py        # Exports decorators and context
├── decorators.py      # @cuda and @cuda_advanced
├── context.py         # DeviceContext class
└── utils.py           # tensorize, device tools, CuPy support

License

MIT License © 2025 – Tunahan Yardımcı

Thanks for sharing! While some utilities sound interesting for PyTorch Core, the majority of the proposal sounds like a feature for higher-level APIs. Did you already talk to these teams, e.g. pytorch-lightning?

Thank you so much for taking the time to respond!

You’re absolutely right — most of the logic can indeed live in higher-level libraries like PyTorch Lightning or Accelerate. However, the internal structure of this tool is highly modular and granular.

For example, even without using the decorators, core functions such as safe_to_device(), try_batch_size(), or run_with_amp() can be used individually — making them potential candidates for inclusion in torch.utils or torch.cuda.
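To make that concrete, here is a rough sketch of what a safe_to_device()-style helper could look like (illustrative only, not the package's actual implementation):

import torch

def safe_to_device(obj, device):
    # Recursively move tensors (and common containers of tensors) to
    # `device`; on CUDA OOM, clear the cache and keep the tensor on CPU.
    if isinstance(obj, torch.Tensor):
        try:
            return obj.to(device)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return obj.cpu()
    if isinstance(obj, (list, tuple)):
        return type(obj)(safe_to_device(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: safe_to_device(v, device) for k, v in obj.items()}
    return obj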

To be clear: this is not a proposal to merge the whole package as-is. Rather, I wanted to offer a small set of reusable utilities that address common patterns like:

  • clean device transfers
  • OOM-safe function calls
  • automatic AMP context
  • fallback handling
  • safe tensorization of inputs

These tools came out of repeated pain points in long training runs on low-memory devices, and I believe parts of them could help simplify day-to-day PyTorch code, even in core-level examples or docs.

If there’s any interest from the core team, I’d be happy to isolate and PR one or two focused components (e.g., safe_to_device() or a simple cuda_guard() context manager).
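For illustration, a minimal cuda_guard() could be as small as the sketch below (a sketch, not a final design):

import contextlib
import torch

@contextlib.contextmanager
def cuda_guard():
    # Run the enclosed block; on CUDA OOM, release cached memory
    # before re-raising so the caller can retry or fall back.
    try:
        yield
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise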

Thanks again!