HGPU group
@hgpu.bsky.social
High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC
Targeted Testing of Compiler Optimizations via Grammar-Level Composition Styles

#LLVM #Compilers

hgpu.org?p=30440
Ensuring the correctness of compiler optimizations is critical, but existing fuzzers struggle to test optimizations effectively. First, most fuzzers use optimization pipelines (heuristics-based, fi…
December 7, 2025 at 9:32 PM
Decoupled Triton: A Block-Level Decoupled Language for Writing and Exploring Efficient Machine-Learning Kernels

#Triton #Compilers #MachineLearning #ML #Thesis

hgpu.org?p=30439
Machine-learning (ML) applications frequently utilize high-performance ML kernels to execute tensor operations like matrix product and softmax. An ML kernel can be decomposed into two components: t…
December 7, 2025 at 9:31 PM
QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

#Triton #CUDA #AI #CodeGeneration #LLM

hgpu.org?p=30413
Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise fo…
November 30, 2025 at 7:12 PM
KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit

#Triton #CUDA #LLM #CodeGeneration

hgpu.org?p=30412
High-quality kernels are critical for reducing training and inference costs of Large Language Models (LLMs), yet they traditionally require significant expertise in hardware architecture and softwa…
November 30, 2025 at 7:11 PM
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

#CUDA #AI #Package

hgpu.org?p=30409
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existin…
November 30, 2025 at 7:07 PM
AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs

#ROCm #CUDA #LLM

hgpu.org?p=30373
The rise of Large Language Models (LLMs) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, often unrea…
November 23, 2025 at 5:53 PM
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

#CUDA #LLM #CodeGeneration

hgpu.org?p=30354
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel g…
November 16, 2025 at 3:00 PM
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

#CUDA #Compression #Package

hgpu.org?p=30353
The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leve…
November 16, 2025 at 2:59 PM
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

#CUDA #PTX #HIP #Benchmarking #Package

hgpu.org?p=30352
Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. …
November 16, 2025 at 2:58 PM
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs

#CUDA #HIP #Compression #Package

hgpu.org?p=30342
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP …
November 9, 2025 at 4:28 PM
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

#FP8 #Precision

hgpu.org?p=30341
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computatio…
November 9, 2025 at 4:28 PM