SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs (PPoPP 2025 - Main Conference)

Who

Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yunzhe Li, Zhifeng Jiang, Yang Li, Xiaowen Chu, Huaicheng Li

Track

PPoPP 2025 Main Conference

Time Zone

The program is currently displayed in (GMT-08:00) Pacific Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 4 Mar 2025 11:20 - 11:40 at Acacia D - Session 7: Scheduling and Resource Management (Session Chair: Jie Ren)

Abstract

Cloud service providers heavily colocate high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services on the same GPU to improve resource utilization in data centers. Among the critical shared GPU resources, the dynamic allocation of compute units and VRAM bandwidth has received very limited analysis, mainly for two reasons: (1) the native GPU resource management solutions are hardware-specific, unable to dynamically allocate resources to different tenants, or both; (2) NVIDIA does not expose interfaces for VRAM bandwidth allocation, and both the software stack and the VRAM channel architecture are black boxes, which limits software-level resource management. These constraints have driven prior work to design either conservative sharing policies that hurt throughput or static resource partitioning that applies to only a few GPU models.

To bridge this gap, this paper proposes SGDRC, a fully software-defined solution for dynamically managing VRAM bandwidth and compute units for concurrent DNN inference services. SGDRC aims to guarantee service quality, maximize overall throughput, and apply generally across NVIDIA GPUs. SGDRC first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs through comprehensive reverse engineering, and eliminates VRAM channel conflicts using software-level cache coloring. It applies bimodal tensors and tidal SM masking to dynamically allocate VRAM bandwidth and compute units, guiding the allocation with offline profiling. We evaluate 11 mainstream DNNs with real-world workloads on two NVIDIA GPUs. The results show that, compared with state-of-the-art GPU sharing solutions, SGDRC achieves the highest SLO attainment rates (99.0% on average) and improves overall throughput by up to 1.47x and BE job throughput by up to 2.36x.
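
As a rough illustration of the coloring idea named in the abstract, the Python sketch below groups memory pages by the channel a hash maps them to, then hands each tenant pages from a disjoint set of channels. The hash function, page count, channel count, and the names `channel_of`, `build_color_pools`, and `allocate` are all assumptions invented for this example. The abstract does not give the mapping that SGDRC recovers, and this is not the authors' implementation.

```python
# Toy illustration of page coloring. The hash below is made up for this
# example and is NOT the channel mapping of any real GPU.
from collections import defaultdict

NUM_CHANNELS = 4  # assumed number of memory channels (must be a power of 4 here)


def channel_of(page_number: int) -> int:
    """Hypothetical channel hash: XOR-fold the page number two bits at a time."""
    channel = 0
    while page_number:
        channel ^= page_number & (NUM_CHANNELS - 1)
        page_number >>= 2
    return channel


def build_color_pools(num_pages: int) -> dict[int, list[int]]:
    """Group page numbers by the channel (the "color") they hash to."""
    pools = defaultdict(list)
    for page in range(num_pages):
        pools[channel_of(page)].append(page)
    return dict(pools)


def allocate(pools: dict[int, list[int]], colors: list[int], count: int) -> list[int]:
    """Take `count` free pages, drawing only from the channels in `colors`."""
    free = sum(len(pools.get(color, [])) for color in colors)
    if free < count:
        raise MemoryError("not enough free pages in the requested colors")
    picked = []
    for color in colors:
        pool = pools.get(color, [])
        while pool and len(picked) < count:
            picked.append(pool.pop())
    return picked


if __name__ == "__main__":
    pools = build_color_pools(64)
    # Assumed policy: the LS tenant gets three channels, the BE tenant gets one.
    ls_pages = allocate(pools, [0, 1, 2], 10)
    be_pages = allocate(pools, [3], 5)
    print("LS channels:", sorted({channel_of(p) for p in ls_pages}))
    print("BE channels:", sorted({channel_of(p) for p in be_pages}))
```

Because the two tenants draw from disjoint channel sets, their pages never share a channel under this toy hash. Changing the split between the `colors` lists is what would shift bandwidth between tenants in such a scheme.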

Yongkang Zhang (HKUST)
Haoxuan Yu (HKUST)
Chenxia Han (CUHK)
Cheng Wang (Alibaba Group)
Baotong Lu (Microsoft Research)
Yunzhe Li (Shanghai Jiao Tong University)
Zhifeng Jiang (HKUST)
Yang Li (China University of Geosciences)
Xiaowen Chu (Data Science and Analytics Thrust, HKUST(GZ))
Huaicheng Li (Virginia Tech)

Session Program

Tue 4 Mar
Displayed time zone: Pacific Time (US & Canada)

11:20 - 12:20	Session 7: Scheduling and Resource Management (Session Chair: Jie Ren), Main Conference, at Acacia D

11:20 20m Talk	SGDRC: Software-Defined Dynamic Resource Control for Concurrent DNN Inference on NVIDIA GPUs. Main Conference. Yongkang Zhang (HKUST), Haoxuan Yu (HKUST), Chenxia Han (CUHK), Cheng Wang (Alibaba Group), Baotong Lu (Microsoft Research), Yunzhe Li (Shanghai Jiao Tong University), Zhifeng Jiang (HKUST), Yang Li (China University of Geosciences), Xiaowen Chu (Data Science and Analytics Thrust, HKUST(GZ)), Huaicheng Li (Virginia Tech)
11:40 20m Talk	DORADD: Deterministic Parallel Execution in the Era of Microsecond-Scale Computing. Main Conference. Scofield Liu (Imperial College London), Musa Unal (EPFL), Matthew J. Parkinson (Microsoft Azure Research), Marios Kogias (Imperial College London; Microsoft Research)
12:00 20m Talk	WaterWise: Co-optimizing Carbon- and Water-Footprint Toward Environmentally Sustainable Cloud Computing. Main Conference. Yankai Jiang (Northeastern University), Rohan Basu Roy (Northeastern University), Raghavendra Kanakagiri (Indian Institute of Technology Tirupati), Devesh Tiwari (Northeastern University)