Note: This is not an ad, but a suggestion for functionality that could be added to the PyTorch library.
What is universal-cuda-tools?
universal-cuda-tools is a lightweight Python utility package designed to simplify and harden GPU/CPU execution workflows in PyTorch (and optionally TensorFlow).
It provides:
- A `DeviceContext` context manager for clean device, AMP, and cache handling
- Simple and advanced decorators (`@cuda`, `@cuda_advanced`) to make any function run safely on GPU or CPU
- Optional automatic tensorization (e.g. `int`, `list`, `np.ndarray` → `torch.Tensor`)
- Built-in support for:
  - Memory profiling
  - Timeout handling
  - Retry on failure
  - Fallback to CPU on GPU out-of-memory (OOM)
  - Mixed precision (AMP)
  - Multi-GPU dispatch
  - Dry-run and telemetry logging
- Utilities for converting and moving arbitrary Python, NumPy, or even TensorFlow inputs to the correct device
It’s built for real-world training pipelines, especially on constrained hardware, and aims to reduce the boilerplate and fragility in device management, AMP setup, and error handling.
Why does this matter?
Working with GPU-accelerated code in PyTorch often requires repetitive setup:
- Explicit device selection
- Manual `.to(device)` calls for all inputs
- Handling CUDA memory errors (OOM)
- Managing AMP (`autocast`) scopes
- Clearing cache and profiling memory
- Converting raw Python or NumPy types into tensors
universal-cuda-tools abstracts all of that into a minimal, composable, and Pythonic interface.
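For reference, the hand-written version of that boilerplate typically looks something like the sketch below; it uses only standard PyTorch APIs and is not part of the library.

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def manual_train_step(model, x, y):
    # Per-call device management and AMP scoping that the decorators are meant to absorb.
    x, y = x.to(device), y.to(device)
    with torch.autocast(device_type=device.type):
        pred = model(x)
        loss = (pred - y).square().mean()
    loss.backward()
    if device.type == "cuda":
        torch.cuda.empty_cache()
    return loss.item()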
Installation
pip install universal-cuda-tools
Or install from source:
git clone https://github.com/Tunahanyrd/universal-cuda-tools.git
cd universal-cuda-tools
python -m build
pip install dist/universal_cuda_tools-*.whl
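To sanity-check the install, importing the top-level exports (inferred from the `__init__.py` entry in the project structure below) should succeed:

from cuda_tools import cuda, cuda_advanced, DeviceContext  # should import without error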
API Overview
`@cuda` Decorator
A simple device-aware wrapper for lightweight functions. Supports retry, cache clearing, auto tensorization, and CPU fallback.
`@cuda_advanced` Decorator
All of the above, plus timeout, AMP, multi-GPU dispatching, live dashboard, and dry-run support.
DeviceContext
A scoped context manager that handles device selection, AMP enablement, and auto tensorization for all code within the block.
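A minimal usage sketch is shown below; the constructor arguments (device, use_amp) are illustrative assumptions, so check the documentation for the exact signature.

import torch
from cuda_tools import DeviceContext

# NOTE: the argument names here are assumptions, not the confirmed API.
with DeviceContext(device="cuda", use_amp=True):   # requires a CUDA device
    x = torch.randn(32, 16)   # work created inside the managed scope
    y = x @ x.T               # runs under the context's device/AMP policy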
Utilities
tensorize_for_universal(obj, device)
move_to_torch(device, obj)
patch_numpy_with_cupy()
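A rough sketch of how the first two helpers might be called, based only on the signatures listed above (the cuda_tools.utils import path is an assumption inferred from the project structure below):

import numpy as np
from cuda_tools.utils import tensorize_for_universal, move_to_torch  # import path assumed

arr = np.arange(6, dtype=np.float32).reshape(2, 3)
t = tensorize_for_universal(arr, "cuda")    # expected: np.ndarray -> torch.Tensor on the target device
batch = move_to_torch("cuda", {"x": arr})   # expected: arbitrary inputs converted/moved to the device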
Example
from cuda_tools import cuda

@cuda(device="cuda", auto_tensorize=True, to_list=True)
def add(a, b):
    return a + b

print(add([1, 2], [3, 4]))  # → [4, 6]

from cuda_tools import cuda_advanced

@cuda_advanced(timeout=0.5, use_amp=True, telemetry=True)
def train_step(model, x, y):
    pred = model(x)
    loss = (pred - y).square().mean()
    loss.backward()
    return loss.item()
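A hypothetical call site for the decorated train_step; whether inputs are moved and autocast is applied automatically is an assumption based on the decorator options above.

import torch
import torch.nn as nn

model = nn.Linear(8, 1)
x, y = torch.randn(4, 8), torch.randn(4, 1)

# Device placement and AMP are expected to be handled by @cuda_advanced (assumption).
loss = train_step(model, x, y)
print(loss)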
Documentation & Demo
- Full Documentation
- PyTorch RFC Issue #152679
- PyPI: 1.4k+ downloads
Project Structure
cuda_tools/
├── __init__.py # Exports decorators and context
├── decorators.py # @cuda and @cuda_advanced
├── context.py # DeviceContext class
├── utils.py # tensorize, device tools, CuPy support
License
MIT License © 2025 – Tunahan Yardımcı