Cloud service providers routinely colocate high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services on the same GPU to improve resource utilization in data centers. Among the critical shared GPU resources, the dynamic allocation of compute units and VRAM bandwidth has received very limited analysis, mainly for two reasons: (1) native GPU resource management solutions are hardware-specific, unable to dynamically allocate resources to different tenants, or both; (2) NVIDIA does not expose interfaces for VRAM bandwidth allocation, and its software stack and VRAM channel architecture are black boxes, both of which limit software-level resource management. These limitations have driven prior work to design either conservative sharing policies that hurt throughput or static resource partitioning that applies to only a few GPU models.

To bridge this gap, this paper proposes SGDRC, a fully software-defined solution for dynamically managing VRAM bandwidth and compute units for concurrent DNN inference services. SGDRC aims to guarantee service quality, maximize overall throughput, and remain generally applicable across NVIDIA GPUs. SGDRC first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs through comprehensive reverse engineering and eliminates VRAM channel conflicts using software-level cache coloring. It then applies bimodal tensors and tidal SM masking to dynamically allocate VRAM bandwidth and compute units, with resource allocation guided by offline profiling. We evaluate 11 mainstream DNNs with real-world workloads on two NVIDIA GPUs. The results show that, compared with state-of-the-art GPU sharing solutions, SGDRC achieves the highest SLO attainment rates (99.0% on average) and improves overall throughput by up to 1.47x and BE job throughput by up to 2.36x.
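The coloring idea mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the XOR-fold hash in `channel_of`, the channel count, and the helper names `color_pages` and `partition` are hypothetical and do not reproduce the mapping reverse-engineered in the paper or SGDRC's actual allocator. The sketch only shows how, once a page-to-channel mapping is known, pages can be grouped by channel so that LS and BE jobs draw memory from disjoint channel sets.

```python
from collections import defaultdict

CHANNEL_BITS = 3                   # assumed: 8 VRAM channels (hypothetical)
NUM_CHANNELS = 1 << CHANNEL_BITS


def channel_of(page_number: int) -> int:
    """Hypothetical XOR-fold hash from a page number to a channel index.

    This is NOT the real NVIDIA mapping; it stands in for whatever
    hash a reverse-engineering step would recover.
    """
    h = 0
    n = page_number
    while n:
        h ^= n & (NUM_CHANNELS - 1)  # fold the low bits into the hash
        n >>= CHANNEL_BITS           # move on to the next bit group
    return h


def color_pages(num_pages: int) -> dict:
    """Group page numbers into per-channel pools (one 'color' per channel)."""
    pools = defaultdict(list)
    for page in range(num_pages):
        pools[channel_of(page)].append(page)
    return pools


def partition(pools: dict, ls_channels: set) -> tuple:
    """Give LS jobs the pages of `ls_channels`; BE jobs get all other pages."""
    ls_pages = [p for c in sorted(ls_channels) for p in pools[c]]
    be_pages = [p for c in sorted(pools) if c not in ls_channels
                for p in pools[c]]
    return ls_pages, be_pages


if __name__ == "__main__":
    pools = color_pages(64)
    ls_pages, be_pages = partition(pools, {0, 1, 2})
    # LS and BE pages never share a channel under this partition.
    print(len(ls_pages), len(be_pages))
```

Changing the size of the `ls_channels` set is one simple way to picture shifting bandwidth between the two job classes; how SGDRC actually does this with bimodal tensors is described in the paper, not here.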

Tue 4 Mar

Displayed time zone: Pacific Time (US & Canada)

11:20 - 12:20
Session 7: Scheduling and Resource Management (Session Chair: Jie Ren)
Main Conference at Acacia D
11:20
20m
Talk
SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs
Main Conference
Yongkang Zhang HKUST, Haoxuan Yu HKUST, Chenxia Han CUHK, Cheng Wang Alibaba Group, Baotong Lu Microsoft Research, Yunzhe Li Shanghai Jiao Tong University, Zhifeng Jiang HKUST, Yang Li China University of Geosciences, Xiaowen Chu Data Science and Analytics Thrust, HKUST(GZ), Huaicheng Li Virginia Tech
11:40
20m
Talk
DORADD: Deterministic Parallel Execution in the Era of Microsecond-Scale Computing
Main Conference
Scofield Liu Imperial College London, Musa Unal EPFL, Matthew J. Parkinson Microsoft Azure Research, Marios Kogias Imperial College London; Microsoft Research
12:00
20m
Talk
WaterWise: Co-optimizing Carbon- and Water-Footprint Toward Environmentally Sustainable Cloud Computing
Main Conference
Yankai Jiang Northeastern University, Rohan Basu Roy Northeastern University, Raghavendra Kanakagiri Indian Institute of Technology Tirupati, Devesh Tiwari Northeastern University