HGPU group
@hgpu.bsky.social
High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC
Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles

#LLVM #Compilers

hgpu.org?p=30440
Ensuring the correctness of compiler optimizations is critical, but existing fuzzers struggle to test optimizations effectively. First, most fuzzers use optimization pipelines (heuristics-based, fi…
December 7, 2025 at 9:32 PM
Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels

#Triton #Compilers #MachineLearning #ML #Thesis

hgpu.org?p=30439
Machine-learning (ML) applications frequently utilize high-performance ML kernels to execute tensor operations like matrix product and softmax. An ML kernel can be decomposed into two components: t…
December 7, 2025 at 9:31 PM
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

#Triton #CUDA #AI #CodeGeneration #LLM

hgpu.org?p=30413
Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise fo…
November 30, 2025 at 7:12 PM
KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit

#Triton #CUDA #LLM #CodeGeneration

hgpu.org?p=30412
High-quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and softwa…
November 30, 2025 at 7:11 PM
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

#CUDA #AI #Package

hgpu.org?p=30409
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existin…
November 30, 2025 at 7:07 PM
AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs

#ROCm #CUDA #LLM

hgpu.org?p=30373
The rise of Large Language Models (LLMs) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, often unrea…
November 23, 2025 at 5:53 PM
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

#CUDA #LLM #CodeGeneration

hgpu.org?p=30354
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel g…
November 16, 2025 at 3:00 PM
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

#CUDA #Compression #Package

hgpu.org?p=30353
The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leve…
November 16, 2025 at 2:59 PM
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

#CUDA #PTX #HIP #Benchmarking #Package

hgpu.org?p=30352
Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. …
November 16, 2025 at 2:58 PM
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs

#CUDA #HIP #Compression #Package

hgpu.org?p=30342
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP …
November 9, 2025 at 4:28 PM
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

#FP8 #Precision

hgpu.org?p=30341
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computatio…
November 9, 2025 at 4:28 PM