CIS 601: Special Topics in Computer Architecture: GPGPU Programming, Spring 2016

Instructor

Office Hours: by appointment in Levine 572

Class discussion and announcements take place on Piazza.

When & Where

Tuesday/Thursday 10:30am-12:00 noon, Towne 321

Course Description

Graphics Processing Units (GPUs) have become extremely popular and are used to accelerate an increasingly diverse set of non-graphics workloads. This seminar will examine modern GPU architectures, the programming models used to write general-purpose code for GPUs, and the complexities of programming such highly parallel architectures. There will be a special emphasis on concurrency correctness issues as they relate to GPUs, including GPU memory consistency models and GPU concurrency bugs. Graduate-level coursework in computer architecture (e.g., CIS 501) will be very helpful.

Course Materials

No textbooks are required; links to all readings will be provided at this website.

Grading

Project: 40%
Participation: 20%
Assignments: 15%
Future work write-ups: 15%
Reading quizzes: 10%

There will be no exams.

Submit homework assignments, reading quizzes, and future-work write-ups via Canvas.

The class project can be done in groups of up to two. The project is open-ended: it should be related to GPUs, but the specifics are up to you. Choosing a project that incorporates your interests (research or otherwise) is a great idea!

Schedule

This schedule is subject to change.

Many of the paper links below point to publisher sites (such as the ACM Digital Library). You’ll need to download the papers from an on-campus computer or via the UPenn Library proxy.

Date	Topic + Reading	Presenter
Thu 14 Jan	Intro	Joe
Tue 19 Jan	GPU Architecture Overview	Joe
Thu 21 Jan	GPU Architecture Overview (continued)	Joe
Tue 26 Jan	GPU Architecture Overview (continued)	Joe
Thu 28 Jan	Performance Analysis and Tuning for GPGPUs (Sections 1.4-1.5)	Joe
Tue 2 Feb	CUDA Basics	Joe
Thu 4 Feb	Cache-Conscious Wavefront Scheduling	Joe
Tue 9 Feb	Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow	Tim, David K.
Thu 11 Feb	CUDA Profiling	Joe
Tue 16 Feb	A Primer on Memory Consistency and Cache Coherence Chapters 3-4 (SC & TSO) [ slides ]	Ariel, David K.
Thu 18 Feb	A Primer on Memory Consistency and Cache Coherence Chapter 5 (RC) [ slides ]	Joe
Tue 23 Feb	Mathematizing C++ Concurrency	Joe
Thu 25 Feb	Heterogeneous-Race-Free Memory Models	Sarvesh, Ariel
Tue 1 Mar	GPU concurrency: Weak Behaviours and Programming Assumptions	David G., Karthik
Thu 3 Mar	CUDA Synchronization	Joe
Tue 8 Mar	No class: Spring Break
Thu 10 Mar	No class: Spring Break
Tue 15 Mar	No class: Joe at HPCA
Thu 17 Mar	No class: Joe at HPCA
Tue 22 Mar	Rhythm: Harnessing Data Parallel Hardware for Server Workloads	Sana, Toma
Thu 24 Mar	Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU	Sana, Karthik
Tue 29 Mar	GPUfs: integrating a file system with GPUs	Christian D., Tim
Thu 31 Mar	GRace: a low-overhead mechanism for detecting data races in GPU programs [ HW2 slides ]	Joe
Tue 5 Apr	LDetector: A Low Overhead Race Detector For GPU Programs	Toma, Sarvesh
Thu 7 Apr	GMrace: Detecting Data Races in GPU Programs via a Low-Overhead Scheme [ HW2 v2 slides ]	Joe
Tue 12 Apr	HAccRG: Hardware-Accelerated Data Race Detection in GPUs	Christian D.
Thu 14 Apr	Hardware Transactional Memory for GPU Architectures	Christian B.
Tue 19 Apr	Verifying GPU Kernels by Test Amplification	David G., Toma
Thu 21 Apr	Project Presentations
Tue 26 Apr	Project Presentations (continued)

Project Ideas

Investigate time savings from approximate GPU computing. Consider replacing data types with narrower-width versions, e.g., converting 64-bit doubles to 32-bit floats, or 16-bit integers to 8-bit integers. How does this affect running time and accuracy of the computation?
Investigate CUDA Memcheck, a tool for detecting errors in CUDA programs. What is its performance overhead like? What kinds of bugs does it catch, and what kinds does it miss?
Investigate scalable locking in CUDA, from simple spin-locks to something like MCS locks. The lack of coherence on GPUs should add an interesting wrinkle. Useful resources are Michael Scott’s webpage and the SSync library from EPFL.
Port an application of interest to you to CUDA.
Your idea here!
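
As a possible starting point for the first idea above (narrower data types), the accuracy side of the trade-off can be explored on the host before any GPU code is written. The following is a minimal C++ sketch, not part of the course materials: the helper names `naive_sum` and `relative_error` are hypothetical, and it measures only accuracy, not running time. It accumulates the same input once in a 32-bit float and once in a 64-bit double so the two results can be compared against a known reference.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: sum the inputs using an accumulator of type T.
// Each element is narrowed to T before it is added, which mimics
// storing and processing the data in the narrower type.
template <typename T>
T naive_sum(const std::vector<double>& xs) {
    T acc = 0;
    for (double x : xs) {
        acc += static_cast<T>(x);
    }
    return acc;
}

// Hypothetical helper: relative error of `approx` with respect to a
// nonzero `reference` value.
double relative_error(double approx, double reference) {
    return std::fabs(approx - reference) / std::fabs(reference);
}
```

For example, summing one million copies of 0.1 with `naive_sum<float>` and `naive_sum<double>` and passing each result to `relative_error` with a reference of 100000.0 should show a noticeably larger error for the float version. A project would pair this kind of accuracy measurement with timing of the corresponding CUDA kernels.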