Welcome to the 1st issue of PyTorch Weekly, a weekly newsletter covering developments in the PyTorch AI development platform. You can subscribe to the newsletter via firstname.lastname@example.org or at Lingcc/pytorchweekly (github.com).
Hidet is introduced on the PyTorch blog as a deep learning compiler for efficient model serving. Hidet Script lets tensor program developers easily handle the tile-based programming model, and it simplifies tensor programming by managing fine-grained computation and memory resources (e.g., warps, shared memory).
TorchBench is introduced by Yueming Hao and colleagues from Meta Platforms, Inc. TorchBench is a novel benchmark suite for studying the performance of the PyTorch software stack; it has been used to identify GPU performance inefficiencies in PyTorch and has been integrated into the PyTorch continuous integration system.
- Towards Data Science published an excellent article, Build your own Transformer from scratch using Pytorch, written by Arjun Sarkar. It teaches the reader how to build a transformer model step by step in PyTorch.
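As a taste of what such a walkthrough covers, PyTorch's built-in encoder modules can sketch the overall shape of a transformer (the sizes below are illustrative, not the article's exact model):

```python
import torch
import torch.nn as nn

# A tiny encoder stack with illustrative sizes
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn(8, 16, 64)  # (batch, sequence, embedding)
y = encoder(x)              # output keeps the same shape: (8, 16, 64)
```

The article builds the attention and feed-forward pieces by hand instead of using these ready-made modules, which is what makes it a good learning exercise.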
- The latest PyTorch 2.0 Ask the Engineers Q&A Series featured TorchRL, presented by Vincent Moens and Shashank Prasanna.
- Zachary DeVito posted on the PyTorch Forum about Fast combined C++/Python/TorchScript/Inductor tracebacks.
- David Stutz proposed a way of Loading and Saving PyTorch Models Without Knowing the Architecture in Advance.
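One common approach in this spirit (a minimal sketch, not necessarily the exact method from the post) is to pickle the entire nn.Module with torch.save, so the architecture travels with the weights:

```python
import io
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

buf = io.BytesIO()
torch.save(model, buf)  # pickles the whole module, architecture included
buf.seek(0)

# weights_only=False is required on recent PyTorch to unpickle full modules
restored = torch.load(buf, weights_only=False)
out = restored(torch.randn(1, 4))
```

The trade-off is that the checkpoint becomes tied to the module's import path, so saving a state_dict remains the recommended default when the model code is available at load time.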
- Want to check the differences between PyTorch and JAX? Check out JAX vs. PyTorch: Differences and Similarities.
- The Run PyTorch on Multiple GPUs thread became active again after SM2023 tried to fine-tune the GPT-2 model on multiple GPUs. Running a model on multiple GPUs is not easy to handle, especially regarding load balancing and parallel optimizations. Newcomers are recommended to go through the Multi-GPU examples tutorial.
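The simplest entry point from the tutorial is nn.DataParallel, which splits each batch across visible GPUs; a minimal sketch (it falls back to a single device when no extra GPUs are present):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# Wrap in DataParallel only when several GPUs are visible;
# the same forward call works unchanged on CPU or a single GPU.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
out = model(torch.randn(4, 10, device=device))  # batch split across GPUs
```

For serious fine-tuning workloads, DistributedDataParallel is generally preferred over DataParallel for better scaling, at the cost of a more involved launch setup.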
- According to Would pytorch for cuda 11.6 work when cuda is actually 12.0, the PyTorch binary currently ships with its own CUDA dependencies (cuBLAS, etc.) and uses CUDA 11.8 by default. Only when PyTorch is built from source will it use the locally installed CUDA toolkit. Users are recommended to follow the official install instructions.
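You can check which CUDA version your installed binary was built against, independently of whatever toolkit is on the system:

```python
import torch

# The wheel bundles its own CUDA libraries; this reports the build's CUDA
# version (e.g. "11.8"), or None for a CPU-only build. It says nothing
# about the locally installed CUDA toolkit.
print(torch.version.cuda)
```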
- Result reproducibility is always a headache in ML training. The thread Different training results on different machines has lasted for more than two years discussing this, and the PyTorch Reproducibility doc also notes that PyTorch does not guarantee completely reproducible results. The thread recently surfaced a new difference between Windows and Linux that can cause non-reproducible results: glob.glob produces a sorted list by default on Windows, whereas on Linux the file order is arbitrary, which leads to different results.
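The fix for this particular pitfall is cheap: always sort the glob result so the file order is deterministic on every platform.

```python
import glob
import os
import tempfile

# glob.glob makes no cross-platform ordering guarantee; sorting the
# result gives a deterministic file order everywhere.
with tempfile.TemporaryDirectory() as d:
    for name in ["b.txt", "a.txt", "c.txt"]:
        open(os.path.join(d, name), "w").close()
    files = sorted(glob.glob(os.path.join(d, "*.txt")))
    names = [os.path.basename(f) for f in files]
```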
- JOROZCO proposed a way to convert a PyTorch model to
- How to fix “CUDA error: device-side assert triggered” error? introduced CUDA_LAUNCH_BLOCKING=1 to disable asynchronous kernel launches.
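The variable must be set before CUDA initializes; one way is to set it at the top of the script, before importing torch:

```python
import os

# With asynchronous launches disabled, each kernel runs synchronously,
# so the device-side assert surfaces at the actual failing call instead
# of at some later, unrelated CUDA operation.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import and run the failing code after setting the variable
```

Setting it in the shell (`CUDA_LAUNCH_BLOCKING=1 python train.py`) works just as well. Expect a slowdown, so use it only while debugging.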
- PyTorch main develop branch changed from
- CUDA 12.1 builds are enabled again on Windows
- Plenty of Triton bug fixes and improvements, such as: add support for serializing real tensor data in the after-AOT minifier; basic Dynamo support for traceable collectives; introduce FXGraphExtractor into torch.onnx.dynamo_export
- Dan Dale fixed a CPU offload performance issue for ShardedGradScaler. The performance analysis in that work is impressive.
- Related changes to remove CUDA 11.6 support
- Improved debugging methods for after-AOT accuracy debugging
- Improved new-architecture support: making FSDP device-agnostic for custom backends that implement CUDA semantics, and a new hook for the MTIA architecture
- Optimized EMA implementation
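For readers unfamiliar with the technique, the update rule behind an exponential moving average (EMA) of model weights is simple; a minimal sketch with plain floats standing in for parameter tensors (illustrative only, not the optimized implementation referenced above):

```python
# Minimal EMA sketch: shadow_i <- decay * shadow_i + (1 - decay) * param_i
class EMA:
    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = [float(p) for p in params]

    def update(self, params):
        d = self.decay
        self.shadow = [d * s + (1.0 - d) * float(p)
                       for s, p in zip(self.shadow, params)]

ema = EMA([0.0], decay=0.5)
ema.update([1.0])  # shadow becomes 0.5 * 0.0 + 0.5 * 1.0 = 0.5
```

In practice the averaged ("shadow") weights are used for evaluation, while training keeps updating the raw parameters.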
- Updated CUTLASS to v3.1
- Modular AI announced its two initial products: the first is billed as the fastest unified AI inference engine in the world, and the second is a new programming language for all AI developers