CS610: Programming For Performance

Course Description

To obtain good performance, one needs to write correct and scalable parallel programs using programming language abstractions like threads. In addition, the developer needs to be aware of and exploit many architecture-specific features, like vectorization, to extract the full performance potential. In this course, we will combine programming language abstractions with architecture-aware development to learn to write scalable parallel programs.

This course will involve programming assignments that apply the concepts learnt in class and highlight the challenges in extracting performance.

Course Content

The course will primarily focus on the following topics.

  • Introduction: Challenges in parallel programming, correctness and performance errors, understanding performance, performance models
  • Memory hierarchy: exploiting spatial and temporal locality with caches, analytical cache-miss analysis
  • Shared-memory programming with Pthreads
  • Compiler transformations: Dependence analysis, Loop transformations
  • Compiler vectorization: vector ISA, auto-vectorizing compiler, vector intrinsics, assembly
  • OpenMP: core and advanced OpenMP
  • Parallel Programming Models and Patterns: Intel Threading Building Blocks
  • GPU architecture and CUDA Programming
  • Performance bottleneck analysis: PAPI counters, performance analysis tools

We may add new topics, drop existing ones, or reorder them depending on progress and class feedback.

The course may also involve reading and critiquing related research papers.