Nvidia Maxwell 2 die layout, c/o Anandtech

CIS 601: Special Topics in Computer Architecture: GPGPU Programming, Spring 2016


Joe Devietti

Office Hours: by appointment in Levine 572

Class discussion and announcements are via Piazza

When & Where

Tuesday/Thursday 10:30am-12:00 noon, Towne 321

Course Description

Graphics Processing Units (GPUs) have become extremely popular and are used to accelerate an increasingly diverse set of non-graphics workloads. This seminar will examine modern GPU architectures, the programming models used to write general-purpose code for GPUs, and the complexities of programming such highly parallel architectures. There will be a special emphasis on concurrency correctness issues as they relate to GPUs, including GPU memory consistency models and GPU concurrency bugs. Graduate-level coursework in computer architecture (e.g., CIS 501) will be very helpful.

Course Materials

No textbooks are required; links to all readings will be provided at this website.


Grading

  • Project: 40%
  • Participation: 20%
  • Assignments: 15%
  • Future work write-ups: 15%
  • Reading quizzes: 10%

There will be no exams.

Submit homework, reading quizzes, and future-work write-ups via Canvas.

The class project can be done in groups of up to two students. The project is open-ended: it should be something related to GPUs, but the specifics are up to you. Choosing a project that incorporates your interests (research or otherwise) is a great idea!


Schedule

This schedule is subject to change.

Many of the paper links below are to publisher sites (like the ACM Digital Library). You'll need to download the papers from an on-campus computer or via the UPenn Library proxy.

Date Topic + Reading Presenter Assignment
Thu 14 Jan Intro Joe
Tue 19 Jan GPU Architecture Overview Joe
Thu 21 Jan Joe
Tue 26 Jan Joe
Thu 28 Jan Performance Analysis and Tuning for GPGPUs (Sections 1.4-1.5) Joe
Tue 2 Feb CUDA Basics Joe
Thu 4 Feb Cache-Conscious Wavefront Scheduling Joe
Tue 9 Feb Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow Tim, David K.
Thu 11 Feb CUDA Profiling Joe
Tue 16 Feb A Primer on Memory Consistency and Cache Coherence Chapters 3-4 (SC & TSO) [ slides ] Ariel, David K.
Thu 18 Feb A Primer on Memory Consistency and Cache Coherence Chapter 5 (RC) [ slides ] Joe
Tue 23 Feb Mathematizing C++ Concurrency Joe
Thu 25 Feb Heterogeneous-Race-Free Memory Models Sarvesh, Ariel
Tue 1 Mar GPU concurrency: Weak Behaviours and Programming Assumptions David G., Karthik
Thu 3 Mar CUDA Synchronization Joe
Tue 8 Mar No class: Spring Break
Thu 10 Mar No class: Spring Break
Tue 15 Mar No class: Joe at HPCA
Thu 17 Mar No class: Joe at HPCA
Tue 22 Mar Rhythm: Harnessing Data Parallel Hardware for Server Workloads Sana, Toma
Thu 24 Mar Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU Sana, Karthik
Tue 29 Mar GPUfs: integrating a file system with GPUs Christian D., Tim
Thu 31 Mar GRace: a low-overhead mechanism for detecting data races in GPU programs [ HW2 slides ] Joe
Tue 5 Apr LDetector: A Low Overhead Race Detector For GPU Programs Toma, Sarvesh
Thu 7 Apr GMRace: Detecting Data Races in GPU Programs via a Low-Overhead Scheme [ HW2 v2 slides ] Joe
Tue 12 Apr HAccRG: Hardware-Accelerated Data Race Detection in GPUs Christian D.
Thu 14 Apr Hardware Transactional Memory for GPU Architectures Christian B.
Tue 19 Apr Verifying GPU Kernels by Test Amplification David G., Toma
Thu 21 Apr Project Presentations
Tue 26 Apr

Project Ideas

  • Investigate time savings from approximate GPU computing. Consider replacing data types with narrower-width versions, e.g., converting 64-bit doubles to 32-bit floats, or 16-bit integers to 8-bit integers. How does this affect running time and accuracy of the computation?
  • Investigate CUDA Memcheck, a tool for detecting errors in CUDA programs. What is its performance overhead? What kinds of bugs does it catch, and what kinds does it miss?
  • Investigate scalable locking in CUDA, from simple spin-locks to something like MCS locks. The lack of cache coherence on GPUs should add an interesting wrinkle. Useful resources include Michael Scott’s webpage and the SSync library from EPFL.
  • Port an application of interest to you to CUDA.
  • Your idea here!