### CIS 371 Computer Organization and Design

Unit 10: Shared Memory Multiprocessors

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Multiplying Performance

- A single processor can only be so fast
  - Limited clock frequency
  - Limited instruction-level parallelism
  - Limited cache hierarchy
- What if we need even more computing power?
  - Use multiple processors!
  - But how?

• High-end example: Sun Ultra Enterprise 25k

3

- 72 UltraSPARC IV+ processors, 1.5Ghz
- 1024 GBs of memory
- Niche: large database servers
- \$\$\$

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### This Unit: Shared Memory Multiprocessors



| Core 1<br>Core 2<br>1.5MB L2<br>L3 tags                                     | Multicore chips<br>IBM Power5<br>• Two 2+GHz PowerPC cores<br>• Shared 1.5 MB L2, L3 tags<br>AMD Quad Phenom<br>• Four 2.5-GHz cores<br>• Per-core 512KB L2 cache<br>• Shared 2MB L3 cache<br>Intel Core 2 Quad<br>• Four cores, shared 4 MB L2<br>• Two 4MB L2 caches |
|-----------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| •<br>Why multicore? What else would<br>you do with 500 million transistors? | Sun Niagara <ul> <li>8 cores, each 4-way threaded</li> <li>Shared 2MB L2, shared FP</li> </ul>                                                                                                                                                                         |

| Scientific computing/supercomputing                                                                                     |                 |
|-------------------------------------------------------------------------------------------------------------------------|-----------------|
| <ul> <li>Examples: weather simulation, aerodynamics, pro</li> <li>Large grids, integrating changes over time</li> </ul> | tein folding    |
| <ul> <li>Each processor computes for a part of the grid</li> </ul>                                                      |                 |
| Server workloads                                                                                                        |                 |
| Example: airline reservation database                                                                                   |                 |
| Many concurrent updates, searches, lookups, que                                                                         | ries            |
| Processors handle different requests                                                                                    |                 |
| Media workloads                                                                                                         |                 |
| • Processors compress/decompress different parts of                                                                     | of image/frames |
| Desktop workloads                                                                                                       |                 |
| Gaming workloads                                                                                                        |                 |
| But software must be written to expose                                                                                  | parallelism     |

### Multithreaded Programming Model

| Programmer explicitly creates multiple threads                                                                                                |
|-----------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>All loads &amp; stores to a single shared memory space</li> <li>Each thread has a private stack frame for local variables</li> </ul> |
| <ul> <li>A "thread switch" can occur at any time</li> <li>Pre-emptive multithreading by OS</li> </ul>                                         |
| Common uses:                                                                                                                                  |
| Handling user interaction (GUI programming)                                                                                                   |
| Handling I/O latency (send network message, wait for response)                                                                                |
| Expressing parallel work via Thread-Level Parallelism (TLP)                                                                                   |

### But First, Uniprocessor Concurrency

- Software "thread"
  - Independent flow of execution
  - Context state: PC, registers
  - Threads generally share the same memory space
  - "Process" like a thread, but different memory space
  - Java has thread support built in, C/C++ supports P-threads library
- Generally, system software (the O.S.) manages threads
  - "Thread scheduling", "context switching"
  - All threads share the one processor
    - Hardware timer interrupt occasionally triggers O.S.
    - Quickly swapping threads gives illusion of concurrent execution
  - Much more in CIS380

CIS 371 (Martin/Roth): Shared Memory Multiprocessors



### Simplest Multiprocessor



- Replicate entire processor pipeline!
  - Instead of replicating just register file & PC
  - Exception: share caches (we'll address this bottleneck later)
- Same "shared memory" or "multithreaded" model
  - Loads and stores from two processors are interleaved
- Advantages/disadvantages over hardware multithreading?

| CIS 371 (Martin/Roth): Shared Memory Multiprocessors |  |
|------------------------------------------------------|--|
|                                                      |  |

### Thread-Level Parallelism Example

| 0: addi r1,accts,r3 |
|---------------------|
| 1: ld 0(r3),r4      |
| 2: blt r4,r2,6      |
| 3: sub r4,r2,r4     |
| 4: st r4,0(r3)      |
| 5: call give_cash   |
|                     |

### • Thread-level parallelism (TLP)

- Collection of asynchronous tasks: not started and stopped together
- Data shared "loosely" (sometimes yes, mostly no), dynamically
- Example: database/web server (each query is a thread)
  - accts is shared, can't register allocate even if it were scalar
  - id and amt are private variables, register allocated to r1, r2

11

Running example

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Shared Memory Implementations

### • Multiplexed uniprocessor

- Runtime system and/or OS occasionally pre-empt & swap threads
- Interleaved, but no parallelism

### • Hardware multithreading

- Tolerate pipeline latencies, higher efficiency
- Same interleaved shared-memory model

### Multiprocessing

- Multiply execution resources, higher peak performance
- Same interleaved shared-memory model
- Foreshadowing: allow private caches, further disentangle cores

### All have same shared memory programming model

10

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### An Example Execution

| Thread 0                               | <u>Thread 1</u>                     | Mem                                     |
|----------------------------------------|-------------------------------------|-----------------------------------------|
| 0: addi r1,accts,r3                    |                                     | 500                                     |
| 1: ld 0(r3),r4 ******                  |                                     |                                         |
| 2: blt r4,r2,6                         |                                     |                                         |
| 3: sub r4,r2,r4                        |                                     |                                         |
| 4: st r4,0(r3)                         |                                     | 400                                     |
| 5: call give_cash                      | 0: addi r1,accts,r3                 |                                         |
|                                        | 1: ld 0(r3),r4                      | ••••••••••••••••••••••••••••••••••••••• |
|                                        | 2: blt r4,r2,6                      |                                         |
|                                        | 3: sub r4,r2,r4                     |                                         |
|                                        | 4: st r4,0(r3)                      | 300                                     |
|                                        | 5: call give_cash                   |                                         |
| • Two \$100 withdray                   | wals from account #241 at to        | wo ATMs                                 |
| <ul> <li>Each transaction m</li> </ul> | haps to thread on different process | sor                                     |
| <ul> <li>Track accts [241</li> </ul>   | ].bal (address is in r3)            |                                         |
| CIS 371 (Martin/Roth): Shared Mem      | on / Multiprocessors                | 12                                      |

### A **Problem** Execution

| Thread 0                              | Thread 1                     | Mem |
|---------------------------------------|------------------------------|-----|
| 0: addi r1,accts,r3                   |                              | 500 |
| 1: ld 0(r3),r4                        |                              |     |
| 2: blt r4,r2,6                        |                              |     |
| 3: sub r4,r2,r4                       |                              |     |
| <<< Interrupt >>>                     |                              |     |
|                                       | 0: addi r1,accts,r3          |     |
|                                       | 1: ld 0(r3),r4               | •   |
|                                       | 2: blt r4,r2,6               |     |
|                                       | 3: sub r4,r2,r4              | 400 |
|                                       | 4: st r4,0(r3)               | 400 |
|                                       | 5: call give_cash            |     |
| 4: st r4,0(r3)                        |                              | 400 |
| 5: call give_cash                     |                              |     |
| <ul> <li>Problem: wrong ac</li> </ul> | count balance! Why?          |     |
| Solution: synchroni                   | ze access to account balance |     |
| CIS 371 (Martin/Roth): Shared Memo    | . Multinue engenue           | 13  |

### A Synchronized Execution

| Thread 0                              | <u>Thread 1</u>          | Mem |
|---------------------------------------|--------------------------|-----|
| call acquire(lock)                    |                          | 500 |
| 0: addi r1,accts,r3                   |                          |     |
| l: ld 0(r3),r4 ******                 |                          |     |
| 2: blt r4,r2,6                        |                          |     |
| 3: sub r4,r2,r4                       |                          |     |
| <<< Interrupt >>>                     | call acquire(lock) Spins |     |
|                                       | <<< Interrupt >>>        |     |
| : st r4,0(r3)                         |                          | 400 |
| call release(lock)                    |                          |     |
| 5: call give_cash                     | (still in acquire)       |     |
|                                       | 0: addi r1,accts,r3      |     |
| <ul> <li>Fixed, but how do</li> </ul> | 1: ld 0(r3),r4           |     |
|                                       | 2: blt r4,r2,6           |     |
| we implement                          | 3: sub r4,r2,r4          | 300 |
| acquire & release?                    | 4: st r4,0(r3)           |     |
|                                       | 5: call give cash        |     |

### Synchronization

| • Synchronization: a key issue for shared memor                                                                        | γ          |
|------------------------------------------------------------------------------------------------------------------------|------------|
| Regulate access to shared data (mutual exclusion)                                                                      |            |
| Software constructs: semaphore, monitor, mutex                                                                         |            |
| Low-level primitive: lock                                                                                              |            |
| • Operations: acquire (lock) and release (lock)                                                                        |            |
| Region between acquire and release is a critic                                                                         | al section |
| Must interleave acquire and release                                                                                    |            |
| Interfering acquire will block                                                                                         |            |
| <pre>struct acct_t { int bal; };<br/>shared struct acct_t accts[MAX_ACCT];<br/>shared int lock;<br/>int id, amt;</pre> |            |
| acquire (lock);                                                                                                        |            |
| <pre>if (accts[id].bal &gt;= amt) { // critical section<br/>accts[id].bal -= amt;</pre>                                |            |
| <pre>give_cash(); }</pre>                                                                                              |            |
| release(lock);                                                                                                         |            |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors                                                                   | 14         |

| Spin lock: | software lock impler        | mentation           |
|------------|-----------------------------|---------------------|
| • acquire  | (lock): while (loc          | (k != 0); lock = 1; |
| • "Spin" \ | while lock is 1, wait for i | t to turn 0         |
|            | ld 0(&lock),r6              |                     |
|            | bnez r6,A0                  |                     |
| A2 :       | addi r6,1,r6                |                     |
| A3:        | st r6,0(&lock)              |                     |
| • release  | (lock): lock = 0;           |                     |
| R0 :       | st r0,0(&lock)              | // r0 holds 0       |
|            |                             |                     |
|            |                             |                     |
|            |                             |                     |
|            |                             |                     |

### Strawman Lock (Incorrect)

| Thread 0                               | Thread 1                                | Mem                    |
|----------------------------------------|-----------------------------------------|------------------------|
| A0: ld 0(&lock),r6                     |                                         | ····· 0 <mark>6</mark> |
| A1: bnez r6,#A0                        | A0: ld r6,0(&lock) -                    |                        |
| A2: addi r6,1,r6                       | A1: bnez r6,#A0                         |                        |
| A3: st r6,0(&lock)                     | A2: addi r6,1,r6                        |                        |
| CRITICAL_SECTION                       | A3: st r6,0(&lock)                      |                        |
|                                        | CRITICAL_SECTION                        |                        |
|                                        |                                         | +                      |
| • Spin lock makes i                    | ntuitive sense, but doesn't ac          | tually work            |
| <ul> <li>Loads/stores of ty</li> </ul> | wo <b>acquire</b> sequences can be inte | rleaved                |
| <ul> <li>Lock acquire Set</li> </ul>   | quence also not atomic                  |                        |
| Same problem                           |                                         |                        |
|                                        |                                         |                        |

17

• Note, **release** is trivially atomic

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Better Spin Lock: Use Atomic Swap

| <ul> <li>ISA provides an atomic lock acquisition instruction</li> </ul>             |
|-------------------------------------------------------------------------------------|
| Example: atomic swap                                                                |
| swap $r1,0(\&lock)$ mov $r1 -> r2$                                                  |
| Atomically executes: 1d r1,0(&lock)                                                 |
| st r2,0(&lock)                                                                      |
| New acquire sequence                                                                |
| (value of r1 is 1)                                                                  |
| A0: swap r1,0(&lock)                                                                |
| A1: bnez r1,A0                                                                      |
| <ul> <li>If lock was initially busy (1), doesn't change it, keep looping</li> </ul> |
| • If lock was initially free (0), acquires it (sets it to 1), break loop            |
| Insures lock held by at most one thread                                             |
| <ul> <li>Other variants: exchange, compare-and-swap, test-and-set,</li> </ul>       |
| or fetch-and-add                                                                    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors 19                             |
| or fetch-and-add                                                                    |

### A Correct Implementation: SYSCALL Lock

| 1: disable_interrupts                          | atomic                              |  |
|------------------------------------------------|-------------------------------------|--|
| 2: ld r6,0(&lock)                              |                                     |  |
| A3: bnez r6,#A0                                |                                     |  |
| 4: addi r6,1,r6                                |                                     |  |
| A5: st r6,0(&lock)                             |                                     |  |
| A6: enable_interrupts                          |                                     |  |
| 7: return                                      |                                     |  |
|                                                |                                     |  |
| Implement lock in a SVS                        | CALL                                |  |
| <ul> <li>Implement lock in a SYS</li> </ul>    | BCALL                               |  |
| <ul> <li>Only kernel can control in</li> </ul> | nterleaving by disabling interrupts |  |
| + Works                                        |                                     |  |
| <ul> <li>Large system call overhea</li> </ul>  | ad                                  |  |
| - But not in a hardware mu                     | Iltithreading or a multiprocessor   |  |
| Bat not in a hardware me                       | and including of a multiprocessor   |  |
|                                                |                                     |  |



### Lock Correctness

| Thread 0                           | Thread 1               |    |
|------------------------------------|------------------------|----|
| A0: swap r1,0(&loc                 | k)                     |    |
| A1: bnez r1,#A0                    | A0: swap r1,0(&lock)   |    |
| CRITICAL_SECTION                   | A1: bnez r1,#A0        |    |
|                                    | A0: swap r1,0(&lock)   |    |
|                                    | A1: bnez r1,#A0        |    |
| + Test-and-set loc                 | k actually works       |    |
| <ul> <li>Thread 1 keeps</li> </ul> | spinning               |    |
|                                    |                        |    |
|                                    |                        |    |
|                                    |                        |    |
|                                    |                        |    |
|                                    |                        |    |
|                                    |                        |    |
|                                    |                        |    |
|                                    |                        |    |
|                                    |                        |    |
| CIS 371 (Martin/Roth): Shared 1    | Memory Multiprocessors | 21 |

### Coarse-Grain Locks: Correct but Slow

• **Coarse-grain locks**: e.g., one lock for entire database

- + Easy to make correct: no chance for unintended interference
- No P in TLP: no two critical sections can proceed in parallel

| <pre>struct acct_t { int bal; };<br/>shared struct acct_t accts[MAX_ACCT];<br/>int id,amt;<br/>shared int lock;</pre> |    |
|-----------------------------------------------------------------------------------------------------------------------|----|
| <pre>acquire(lock);<br/>if (accts[id].bal &gt;= amt) {<br/>accts[id].bal -= amt;<br/>give_cash(); }</pre>             |    |
| release(lock);                                                                                                        |    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors                                                                  | 23 |
|                                                                                                                       |    |

### Programming With Locks Is Difficult

- Multicore processors are the way of the foreseeable future
  - TLP anointed as parallelism model of choice
  - Just one problem...

• Writing lock-based multi-threaded programs is difficult!

### • More precisely:

- Writing programs that are correct is "easy" (not really)
- Writing programs that are highly parallel is "easy" (not really)

### - Writing programs that are both correct and parallel is difficult

22

- Very difficult (true)
- Unfortunate goal (but that's the whole point after all)
- Locking granularity issues
- CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Fine-Grain Locks: Parallel But Difficult

| <ul> <li>Fine-grain locks: e.g., multiple locks, one p</li> </ul>                             | er record        |
|-----------------------------------------------------------------------------------------------|------------------|
| + Fast: critical sections (to different records) can proc                                     | ceed in parallel |
| <ul> <li>Difficult to make correct: easy to make mistakes</li> </ul>                          |                  |
| This particular example is easy                                                               |                  |
| <ul> <li>Requires only one lock per critical section</li> </ul>                               |                  |
| Consider critical section that requires two locks.                                            |                  |
| <pre>struct acct_t { int bal,lock; }; shared struct acct_t accts[MAX_ACCT]; int id,amt;</pre> |                  |
| acquire (accts[id].lock);                                                                     |                  |
| <pre>if (accts[id].bal &gt;= amt) {     accts[id].bal -= amt;     give_cash(); }</pre>        |                  |
| release(accts[id].lock);                                                                      |                  |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors                                          | 24               |

### **Multiple Locks**

- Multiple locks: e.g., acct-to-acct transfer
  - Must acquire both id\_from, id\_to locks
  - Running example with accts 241 and 37
  - Simultaneous transfers 241  $\rightarrow$  37 and 37  $\rightarrow$  241
  - Contrived... but even contrived examples must work correctly too

25

struct acct\_t { int bal,lock; };
shared struct acct\_t accts[MAX\_ACCT];
int id\_from,id\_to,amt;

acquire(accts[id\_from].lock); acquire(accts[id\_to].lock); if (accts[id\_from].bal >= amt) { accts[id\_from].bal -= amt; accts[id\_to].bal += amt; } release(accts[id\_to].lock); release(accts[id\_from].lock); CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Correct Multiple Lock Program

| <ul> <li>Always acquire multiple locks in same order</li> <li>Just another thing to keep in mind when programming</li> <li>Ho hum</li> </ul> |  |
|----------------------------------------------------------------------------------------------------------------------------------------------|--|
| <pre>struct acct t { int bal,lock; };</pre>                                                                                                  |  |
| shared struct acct t accts[MAX ACCT];                                                                                                        |  |
| int id from, id to, amt;                                                                                                                     |  |
| <pre>int id_first = min(id_from, id_to);</pre>                                                                                               |  |
| <pre>int id_second = max(id_from, id_to);</pre>                                                                                              |  |
| acquire (accts[id first].lock);                                                                                                              |  |
| acquire(accts[id_second].lock);                                                                                                              |  |
| if (accts[id_from].bal >= amt) {                                                                                                             |  |
| accts[id_from].bal -= amt;                                                                                                                   |  |
| <pre>accts[id_to].bal += amt; }</pre>                                                                                                        |  |
| release(accts[id_second].lock);                                                                                                              |  |
| release (accts [id_first].lock);<br>CIS 371 (Martin/Roth): Shared Memory Multiprocessors 27                                                  |  |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors 27                                                                                      |  |
|                                                                                                                                              |  |

### Multiple Locks And Deadlock

| Thread 0                                       | Thread 1                               |
|------------------------------------------------|----------------------------------------|
| id_from = 241;                                 | id_from = 37;                          |
| id_to = 37;                                    | id_to = 241;                           |
| acquire(accts[241].lock);                      | <pre>acquire(accts[37].lock);</pre>    |
| <pre>// wait to acquire lock</pre>             | <pre>// wait to acquire lock 241</pre> |
| 37                                             | // waiting                             |
| // waiting                                     | //                                     |
| <pre>// still waiting</pre>                    |                                        |
| <ul> <li>Deadlock: circular wait</li> </ul>    | for shared resources                   |
| <ul> <li>Thread 0 has lock 241 was</li> </ul>  | aits for lock 37                       |
| Thread 1 has lock 37 wai                       | ts for lock 241                        |
| <ul> <li>Obviously this is a proble</li> </ul> | m                                      |
| The solution is                                |                                        |
| CIS 371 (Martin/Roth): Shared Memory Multip    | processors 26                          |

### Correct Multiple Lock Execution

| <u>Ihread 0</u>                                | <u>Thread 1</u>                     |
|------------------------------------------------|-------------------------------------|
| id_from = 241;                                 | id_from = 37;                       |
| id_to = 37;                                    | id_to = 241;                        |
| id_first = min(241,37)=37;                     | id_first = min(37,241)=37;          |
| id_second = max(37,241)=241;                   | id_second = max(37,241)=241;        |
| acquire(accts[37].lock);                       | // wait to acquire lock 37          |
| acquire(accts[241].lock);                      | // waiting                          |
| // do stuff                                    | 11                                  |
| release(accts[241].lock);                      | 11                                  |
| release(accts[37].lock);                       | //                                  |
|                                                | <pre>acquire(accts[37].lock);</pre> |
| • Great, are we done? No                       |                                     |
| CIS 371 (Martin/Roth): Shared Memory Multiproc | ressors 28                          |

### More Lock Madness

### What if...

- Some actions (e.g., deposits, transfers) require 1 or 2 locks...
- ....and others (e.g., prepare statements) require all of them?
- Can these proceed in parallel?
- What if...
  - There are locks for global variables (e.g., operation id counter)?
  - When should operations grab this lock?
- What if... what if... what if...

### • So lock-based programming is difficult...

• ...wait, it gets worse

Transactional Memory

| CIS 371 (Martin/Roth): Shared Memory Multipro | cessors |
|-----------------------------------------------|---------|
|                                               |         |

29

### Research: Transactional Memory (TM)

+ Programming simplicity of coarse-grain locks

|                                   | , emploie, el course gran locito                   |
|-----------------------------------|----------------------------------------------------|
| + Higher concu                    | rrency (parallelism) of fine-grain locks           |
| <ul> <li>Critical set</li> </ul>  | ections only serialized if data is actually shared |
| + No lock acqu                    | isition overhead                                   |
| <ul> <li>Hottest thing</li> </ul> | since sliced bread                                 |
| <ul> <li>No fewer that</li> </ul> | n 9 research projects: Brown, Stanford, MIT, Intel |
| <ul> <li>Penn too</li> </ul>      |                                                    |
|                                   |                                                    |
|                                   |                                                    |
|                                   |                                                    |
|                                   |                                                    |
|                                   |                                                    |
|                                   |                                                    |
|                                   |                                                    |
| IS 371 (Martin/Roth): Shar        | red Memory Multiprocessors 31                      |

### And To Make It Worse...

## Acquiring locks is expensive... By definition requires a slow atomic instructions Specifically, acquiring write permissions to the lock Ordering constraints (see soon) make it even slower ...and 99% of the time un-necessary Most concurrent actions don't actually share data You paying to acquire the lock(s) for no reason Fixing these problem is an area of active research One proposed solution "Transactional Memory"

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

30

### Transactional Memory: The Big Idea

| Big idea I: no locks, just shared data                                       |   |
|------------------------------------------------------------------------------|---|
| Look ma, no locks                                                            |   |
| Big idea II: optimistic (speculative) concurrency                            |   |
| Execute critical section speculatively, abort on conflicts                   |   |
| "Better to beg for forgiveness than to ask for permission"                   |   |
|                                                                              |   |
|                                                                              |   |
| <pre>struct acct_t { int bal; }; shared struct acct t accts[MAX ACCT];</pre> |   |
| int id_from, id_to, amt;                                                     |   |
| <pre>begin_transaction();</pre>                                              |   |
| if (accts[id_from].bal >= amt) {                                             |   |
| accts[id_from].bal -= amt;<br>accts[id_to].bal += amt; }                     |   |
| end_transaction();                                                           |   |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors 3                       | 2 |

### Transactional Memory: Read/Write Sets

- **Read set**: set of shared addresses critical section reads
  - Example: accts[37].bal, accts[241].bal
- Write set: set of shared addresses critical section writes
  - Example: accts[37].bal, accts[241].bal

struct acct\_t { int bal; }; shared struct acct\_t accts[MAX\_ACCT]; int id from,id to,amt;

### begin\_transaction();

| <pre>if (accts[id_from].bal &gt;= amt) {     accts[id_from].bal -= amt;</pre> |    |
|-------------------------------------------------------------------------------|----|
|                                                                               |    |
| accts[id to].bal += amt; }                                                    |    |
| end_transaction();                                                            |    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors                          | 33 |
|                                                                               |    |

### Transactional Memory: End

### • end transaction

- Check read set: is all data you read still valid (i.e., no writes to any)
- Yes? Commit transactions: commit writes
- No? Abort transaction: restore checkpoint

struct acct\_t { int bal; }; shared struct acct\_t accts[MAX\_ACCT]; int id\_from,id\_to,amt;

### begin\_transaction();

| if (accts[id_from].bal >= amt) {                     |    |
|------------------------------------------------------|----|
| accts[id_from].bal -= amt;                           |    |
| accts[id to].bal += amt; }                           |    |
| <pre>end transaction();</pre>                        |    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors | 35 |
|                                                      |    |

### Transactional Memory: Begin

| • begin_transaction                                                                   |
|---------------------------------------------------------------------------------------|
| Take a local register checkpoint                                                      |
| <ul> <li>Begin locally tracking read set (remember addresses you read)</li> </ul>     |
| See if anyone else is trying to write it                                              |
| <ul> <li>Locally buffer all of your writes (invisible to other processors)</li> </ul> |
| + Local actions only: no lock acquire                                                 |
|                                                                                       |
|                                                                                       |
| <pre>struct acct_t { int bal; }; shared struct acct t accts[MAX ACCT];</pre>          |
| int id from, id to, amt;                                                              |
|                                                                                       |
| <pre>begin_transaction();</pre>                                                       |
| if (accts[id_from].bal >= amt) {                                                      |
| accts[id_from].bal -= amt;                                                            |
| <pre>accts[id_to].bal += amt; }</pre>                                                 |
| end_transaction();                                                                    |

### CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Transactional Memory Implementation

| • | <ul><li>How are read-set/write-set implemented?</li><li>Track locations accessed using bits in the cache</li></ul>                                                                                                                                                    |
|---|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| • | <ul> <li>Read-set: additional "transactional read" bit per block</li> <li>Set on reads between begin_transaction and end_transaction</li> <li>Any other write to block with set bit → triggers abort</li> <li>Flash cleared on transaction abort or commit</li> </ul> |
| • | <ul> <li>Write-set: additional "transactional write" bit per block</li> <li>Set on writes between begin_transaction and end_transaction</li> <li>Flash cleared on transaction commit</li> <li>On transaction abort: blocks with set bit are invalidated</li> </ul>    |

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

34

### **Transactional Execution**

| Thread 0                                      | Thread 1                                  |
|-----------------------------------------------|-------------------------------------------|
| id_from = 241;                                | id_from = 37;                             |
| id_to = 37;                                   | id_to = 241;                              |
| <pre>begin_transaction();</pre>               | <pre>begin_transaction();</pre>           |
| if(accts[241].bal > 100) {                    | if(accts[37].bal > 100) {                 |
|                                               | <pre>accts[37].bal -= amt;</pre>          |
| <pre>// write accts[241].bal</pre>            | <pre>acts[241].bal += amt;</pre>          |
| // abort                                      | }                                         |
|                                               | <pre>end transaction();</pre>             |
|                                               | <pre>// no writes to accts[241].bal</pre> |
|                                               | <pre>// no writes to accts[37].bal</pre>  |
|                                               | // commit                                 |
|                                               |                                           |
|                                               |                                           |
|                                               |                                           |
| CIS 371 (Martin/Roth): Shared Memory Multipro | cessors 37                                |

### So, Let's Just Do Transactions?

| What if                                                                     |    |
|-----------------------------------------------------------------------------|----|
| Read-set or write-set bigger than cache?                                    |    |
| <ul> <li>Transaction gets swapped out in the middle?</li> </ul>             |    |
| <ul> <li>Transaction wants to do I/O or SYSCALL (not-abortable)?</li> </ul> |    |
| <ul> <li>How do we transactify existing lock based programs?</li> </ul>     |    |
| <ul> <li>Replace acquire with begin_trans does not always wor</li> </ul>    | k  |
| Several different kinds of transaction semantics                            |    |
| Which one do we want?                                                       |    |
| That's what these research groups are looking at                            |    |
| Industry adoption:                                                          |    |
| Sun's Rock processor has best-effort hardware TM                            |    |
| Speculative locking: Azul systems and Intel (rumor)                         |    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors                        | 39 |

### Transactional Execution II (More Likely)

| Thread 0                                         | Thread 1                                 |
|--------------------------------------------------|------------------------------------------|
| id_from = 241;                                   | id_from = 450;                           |
| id_to = 37;                                      | id_to = 118;                             |
| <pre>begin_transaction();</pre>                  | <pre>begin_transaction();</pre>          |
| if(accts[241].bal > 100) {                       | if(accts[450].bal > 100) {               |
| accts[241].bal -= amt;                           | accts[450].bal -= amt;                   |
| <pre>acts[37].bal += amt;</pre>                  | <pre>acts[118].bal += amt;</pre>         |
| }                                                | }                                        |
| end_transaction();                               | <pre>end_transaction();</pre>            |
| <pre>// no write to accts[240].bal</pre>         | <pre>// no write to accts[450].bal</pre> |
| <pre>// no write to accts[37].bal</pre>          | <pre>// no write to accts[118].bal</pre> |
| // commit                                        | // commit                                |
| • Critical sections execute in                   | parallel                                 |
|                                                  |                                          |
| CIS 371 (Martin/Roth): Shared Memory Multiproces | isors 38                                 |

### Roadmap Checkpoint

| System software | <ul> <li>Shared memory model</li> </ul>                                 |
|-----------------|-------------------------------------------------------------------------|
|                 | Multiplexed uniprocessor     Hardware multihreading     Multiprocessing |
|                 | Synchronization                                                         |
|                 | Lock implementation     Locking gotchas                                 |
|                 | Cache coherence                                                         |
|                 | Bus-based protocols                                                     |
|                 | Directory protocols                                                     |
|                 | Memory consistency models                                               |



### Shared-Memory Multiprocessors



# Shared-Memory Multiprocessors Conceptual model The shared-memory abstraction Familiar and feels natural to programmers The would be easy if systems actually looked like this... P<sub>0</sub> P<sub>1</sub> P<sub>2</sub> P<sub>3</sub> P<sub>1</sub> P<sub>2</sub> P<sub>3</sub> Memory

### Revisiting Our Motivating Example

| Processor 0                                                                             | Processor 1                                    | CPU0 CPU1 Mem                         |
|-----------------------------------------------------------------------------------------|------------------------------------------------|---------------------------------------|
| 0: addi \$r3,\$r1,&accts                                                                | 5                                              |                                       |
| 1: lw \$r4,0(\$r3)<br>2: blt \$r4,\$r2,6<br>3: sub \$r4,\$r4,\$r2<br>4: sw \$r4,0(\$r3) | critical section<br>(locks not shown)          |                                       |
| 5: jal dispense_cash                                                                    | 0: addi \$r3,\$r1,&accts<br>1: lw \$r4,0(\$r3) | )                                     |
|                                                                                         | 2: blt \$r4,\$r2,6<br>3: sub \$r4,\$r4,\$r2    | critical section<br>(locks not shown) |
|                                                                                         | 4: sw \$r4,0(\$r3)<br>5: jal dispense_cash     | J                                     |
| • Two \$100 withdraw                                                                    | als from account #241 a                        | t two ATMs                            |
| Each transaction ma                                                                     | aps to thread on different proc                | cessor                                |
| <ul> <li>Track accts [241]</li> </ul>                                                   | .bal (address is in \$r3)                      |                                       |
| CIS 371 (Martin/Roth): Shared Memor                                                     | ry Multiprocessors                             | 44                                    |

### No-Cache, No-Problem

| Processor 0                         | Processor 1              | CPU0 CPU1 Mem |
|-------------------------------------|--------------------------|---------------|
| 0: addi \$r3,\$r1,&accts            |                          | \$500         |
| 1: lw \$r4,0(\$r3) ******           |                          | \$500         |
| 2: blt \$r4,\$r2,6                  |                          |               |
| 3: sub \$r4,\$r4,\$r2               |                          |               |
| 4: sw \$r4,0(\$r3)                  |                          | \$400         |
| 5: jal dispense_cash                | 0: addi \$r3,\$r1,&accts |               |
|                                     | 1: lw \$r4,0(\$r3) ◄     | \$400         |
|                                     | 2: blt \$r4,\$r2,6       |               |
|                                     | 3: sub \$r4,\$r4,\$r2    |               |
|                                     | 4: sw \$r4,0(\$r3)       | \$300         |
|                                     | 5: jal dispense_cash     |               |
| Scenario I: processo                | ors have no caches       |               |
| No problem                          |                          |               |
|                                     |                          |               |
| CIS 371 (Martin/Roth): Shared Memor | y Multiprocessors        | 45            |

### Write-Through Doesn't Fix It

| Processor 0                                | Processor 1 CPU0 CPU1 Mem                               |
|--------------------------------------------|---------------------------------------------------------|
| 0: addi \$r3,\$r1,&accts                   | \$500                                                   |
| 1: lw \$r4,0(\$r3) ******                  | • \$500 \$500                                           |
| 2: blt \$r4,\$r2,6                         |                                                         |
| 3: sub \$r4,\$r4,\$r2                      |                                                         |
| 4: sw \$r4,0(\$r3)                         | <mark>▶\$400 </mark> \$400                              |
| 5: jal dispense_cash                       | 0: addi \$r3,\$r1,&accts                                |
|                                            | 1: lw \$r4,0(\$r3) ************************************ |
|                                            | 2: blt \$r4,\$r2,6                                      |
|                                            | 3: sub \$r4,\$r4,\$r2                                   |
|                                            | 4: sw \$r4,0(\$r3)                                      |
|                                            | 5: jal dispense_cash                                    |
| • Scenario II(b): proce                    | essors have write-through caches                        |
| <ul> <li>This time only 2 (diff</li> </ul> | ferent) copies of accts [241] .bal                      |
| <ul> <li>No problem? What if</li> </ul>    | another withdrawal happens on processor 0?              |

47

### Cache Incoherence

| Processor 0                                 | Processor 1                             | CPU1 Mem                        |
|---------------------------------------------|-----------------------------------------|---------------------------------|
| 0: addi \$r3,\$r1,&accts                    |                                         | \$500                           |
| 1: lw \$r4,0(\$r3)                          | ••••••••••••••••••••••••••••••••••••••• | \$500                           |
| 2: blt \$r4,\$r2,6                          |                                         |                                 |
| 3: sub \$r4,\$r4,\$r2                       |                                         |                                 |
| 4: sw \$r4,0(\$r3)                          | <mark>⊳\$400</mark>                     | \$500                           |
| 5: jal dispense_cash                        | 0: addi \$r3,\$r1,&accts                |                                 |
|                                             | 1: lw \$r4,0(\$r3) -\$400               | ·· <mark>· \$500</mark> • \$500 |
|                                             | 2: blt \$r4,\$r2,6                      |                                 |
|                                             | 3: sub \$r4,\$r4,\$r2                   |                                 |
|                                             | 4: sw \$r4,0(\$r3)                      | ********                        |
|                                             | 5: jal dispense_cash                    |                                 |
| • Scenario II(a): proce                     | essors have write-back caches           | ;                               |
| <ul> <li>Potentially 3 copies of</li> </ul> | f accts[241].bal: memory, p0\$          | , p1\$                          |
| Can get incoherent (                        |                                         |                                 |
| CIS 371 (Martin/Roth): Shared Memory        | Multiprocessors                         | 46                              |

### What To Do?

| No caches?                                                         |         |
|--------------------------------------------------------------------|---------|
| - Slow                                                             |         |
| Make shared data uncachable?                                       |         |
| <ul> <li>Faster, but still too slow</li> </ul>                     |         |
| <ul> <li>Entire accts database is technically "shared"</li> </ul>  |         |
| <ul> <li>Definition of "loosely shared"</li> </ul>                 |         |
| Data only really shared if two ATMs access same acct               | at once |
| • Flush all other caches on writes to shared data?                 |         |
| May as well not have caches                                        |         |
| Hardware cache coherence                                           |         |
| Rough goal: all caches have same data at all times                 |         |
| + Minimal flushing, maximum caching $\rightarrow$ best performance |         |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors               | 48      |

### **Bus-based Multiprocessor**

- Simple multiprocessors use a bus
  - All processors see all requests at the same time, same order



### Hardware Cache Coherence Coherence all copies have same data at all times Coherence controller: Examines bus traffic (addresses and data) Executes coherence protocol What to do with local copy when you set

bus

- What to do with local copy when you see different things happening on bus
- Three processor-initiated events
- R: read W: write WB: write-back
- One response event: **SD:** send data
- Two remote-initiated events
  - BR: bus-read, read miss from *another* processor
  - BW: bus-write, write miss from another processor

| CIS 371 (Martin/Roth): Shared Memory Multiprocessors 50 |   |                   |               |           |             |   |  |    |  |
|---------------------------------------------------------|---|-------------------|---------------|-----------|-------------|---|--|----|--|
|                                                         | C | CIS 371 (Martin/R | Roth): Shared | Memory Mu | tiprocessor | s |  | 50 |  |

### VI (MI) Coherence Protocol BR/BW • VI (valid-invalid) protocol: aka MI Two states (per block in cache) • V (valid): have block • I (invalid): don't have block BR/BW⇒SD, WB⇒SD + Can implement with valid bit R⇒BR, W⇒BW Protocol diagram (left) Convention: event⇒generated-event Summary • If anyone wants to read/write block Give it up: transition to I state Write-back if your own copy is dirty This is an invalidate protocol V • Update protocol: copy data, don't invalidate R/W • Sounds good, but wastes a lot of bandwidth

51

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### VI Protocol (Write-Back Cache)

| Processor 0              | Processor 1                                               | CPU0 CPU1 Mer    |
|--------------------------|-----------------------------------------------------------|------------------|
| 0: addi \$r3,\$r1,&accts |                                                           | 500              |
| 1: lw \$r4,0(\$r3)       |                                                           | V:500 500        |
| 2: blt \$r4,\$r2,6       |                                                           |                  |
| 3: sub \$r4,\$r4,\$r2    |                                                           |                  |
| 4: sw \$r4,0(\$r3)       |                                                           | V:400 <b>500</b> |
| 5: jal dispense_cash     | 0: addi \$r3,\$r1,&accts                                  |                  |
|                          | 1: lw \$r4,0(\$r3)                                        | I: V:400 400     |
|                          | 2: blt \$r4,\$r2,6                                        |                  |
|                          | 3: sub \$r4,\$r4,\$r2                                     |                  |
|                          | 4: sw \$r4,0(\$r3)                                        | V:300 400        |
|                          | 5: jal dispense_cash                                      |                  |
|                          | generates a BR (bus rea<br>s by sending its dirty copy, t | -                |
|                          |                                                           | 52               |



### Exclusive Clean Protocol Optimization

| Processor 0                                | Processor 1                 | CPU0 CPU1 Mem               |
|--------------------------------------------|-----------------------------|-----------------------------|
| 0: addi \$r3,\$r1,&accts                   |                             | 500                         |
| 1: lw \$r4,0(\$r3)                         |                             | <b>E:500</b> 500            |
| 2: blt \$r4,\$r2,6                         |                             |                             |
| 3: sub \$r4,\$r4,\$r2                      |                             |                             |
| 4: sw \$r4,0(\$r3)                         | (No mis                     | s) M:400 500                |
| 5: jal dispense_cash                       | 0: addi \$r3,\$r1,&accts    | s                           |
|                                            | 1: lw \$r4,0(\$r3)          | S:400 S:400 400             |
|                                            | 2: blt \$r4,\$r2,6          |                             |
|                                            | 3: sub \$r4,\$r4,\$r2       |                             |
|                                            | 4: sw \$r4,0(\$r3)          | I: M:300 400                |
|                                            | 5: jal dispense_cash        |                             |
| <ul> <li>Most modern protocol</li> </ul>   | cols also include E (exc    | lusive) state               |
| <ul> <li>Interpretation: "I had</li> </ul> | we the only cached copy, an | d it's a <b>clean</b> copy" |
| <ul> <li>Why would this stat</li> </ul>    | e be useful?                |                             |
|                                            |                             |                             |
| CIS 371 (Martin/Roth): Shared Memor        | y Multiprocessors           | 55                          |

### MSI Protocol (Write-Back Cache)

| Processor 0                                                                                                                                | Processor 1                                      | CPU0 CPU1 Mem     |
|--------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------|-------------------|
| 0: addi \$r3,\$r1,&accts                                                                                                                   |                                                  | 500               |
| 1: lw \$r4,0(\$r3)                                                                                                                         |                                                  | S:500 500         |
| 2: blt \$r4,\$r2,6                                                                                                                         |                                                  |                   |
| 3: sub \$r4,\$r4,\$r2                                                                                                                      |                                                  |                   |
| 4: sw \$r4,0(\$r3)                                                                                                                         |                                                  | M:400 <b>500</b>  |
| 5: jal dispense_cash                                                                                                                       | 0: addi \$r3,\$r1,&accts                         |                   |
|                                                                                                                                            | 1: lw \$r4,0(\$r3)                               | S:400 S:400 400   |
|                                                                                                                                            | 2: blt \$r4,\$r2,6                               |                   |
|                                                                                                                                            | 3: sub \$r4,\$r4,\$r2                            |                   |
|                                                                                                                                            | 4: sw \$r4,0(\$r3)                               | I: M:300 400      |
|                                                                                                                                            | 5: jal dispense_cash                             |                   |
| <ul> <li>lw by processor 1 gene</li> <li>Processor 0 responds t</li> <li>sw by processor 1 gene</li> <li>Processor 0 responds t</li> </ul> | y sending its dirty copy, transiti<br>rates a BW | oning to <b>S</b> |
| CIS 371 (Martin/Roth): Shared Memory                                                                                                       | ,                                                | 54                |

### Cache Coherence and Cache Misses

| • A coherence protocol can effect a cache's miss rate (% <sub>miss</sub> )     |
|--------------------------------------------------------------------------------|
| Requests from other processors can invalidate (evict) local blocks             |
| 4C miss model: compulsory, capacity, conflict, coherence                       |
| Coherence miss: miss to a block evicted by bus event                           |
| As opposed to a processor event                                                |
|                                                                                |
| Cache parameters interact with coherence misses                                |
| <ul> <li>Larger capacity: more coherence misses</li> </ul>                     |
| <ul> <li>But offset by reduction in capacity misses</li> </ul>                 |
| <ul> <li>Increased block size: more coherence misses</li> </ul>                |
| <ul> <li>False sharing: "sharing" a cache line without sharing data</li> </ul> |
| Creates pathological "ping-pong" behavior                                      |
| Careful data placement may help, but is difficult                              |
|                                                                                |

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Cache Coherence and Cache Misses

• A coherence protocol can effect a cache's miss rate (%<sub>miss</sub>)

- Requests from other processors can invalidate (evict) local blocks
- 4C miss model: compulsory, capacity, conflict, coherence
- Coherence miss: miss to a block evicted by bus event
  - As opposed to a processor event
- Example: direct-mapped 4B cache, 1B blocks, 4-bit memory

### Cache contents (state:address)

|   | i j                                                  |            |                  |
|---|------------------------------------------------------|------------|------------------|
|   | Set00 Set01 Set10 Set11                              | Event      | Outcome          |
|   | S:0000, M:0001, S:0010, S:0011                       | Wr:0011    | Upgrade Miss     |
|   | S:0000, M:0001, S:0010, M:0011                       | BusRd:0000 | Nothing          |
|   | S:0000, M:0001, S:0010, M:0011                       | BusWr:0010 | S→I Invalidation |
|   | S:0000, M:0001, <b>I</b> :0010, M:0011               | Rd:1011    | Compulsory Miss  |
|   | S:0000, M:0001, I:0010, <b>S:1011</b>                | Rd:0010    | Coherence Miss   |
| _ | S:0000, M:0001, <b>S:0010</b> , S:1011               | Kd.0010    | Concretence Miss |
| _ | CIS 371 (Martin/Roth): Shared Memory Multiprocessors |            | 57               |
|   | CIS 5/1 (Hartin/Rour). Shared Heriory Hundprocessors |            | 57               |

### More Snooping Bandwidth Problems

| A   | lso processor snooping bandwidth                                                      |
|-----|---------------------------------------------------------------------------------------|
|     | • 0.01 events/insn * 2 insn/cycle = 0.02 events/cycle per processor                   |
|     | 16 processors: 0.32 bus-side tag lookups per cycle                                    |
|     | Add 1 port to cache tags? Sure                                                        |
|     | • Invalidate over upgrade: Tags smaller data, ports less expensive                    |
|     | 128 processors: 2.56 bus-side tag lookups per cycle!                                  |
|     | Add 3 ports to cache tags? Oy vey!                                                    |
|     | <ul> <li>Implementing inclusion (L1 is strict subset of L2) helps a little</li> </ul> |
|     | 2 additional ports on L2 tags only                                                    |
|     | Processor doesn't use existing tag port most of the time                              |
|     | • If L2 doesn't care (99% of the time), no need to bother L1                          |
|     | – Still kind of bad though                                                            |
| • ( | Ipshot: bus-based coherence doesn't scale well                                        |
|     |                                                                                       |

### Snooping Bandwidth Requirements

| Coherence events generated on                                               |    |
|-----------------------------------------------------------------------------|----|
| L2 misses (and writebacks)                                                  |    |
| Some parameters                                                             |    |
| • 2 GHz CPUs, 2 IPC, 33% memory operations,                                 |    |
| • 2% of which miss in the L2, 64B blocks, 50% dirty                         |    |
| • (0.33 * 0.02 * 1.5) = 0.01 events/insn                                    |    |
| • 0.01 events/insn * 2 insn/cycle * 2 cycle/ns = 0.04 events/ns             |    |
| <ul> <li>Address request: 0.04 events/ns * 4 B/event = 0.16 GB/s</li> </ul> |    |
| <ul> <li>Data response: 0.04 events/ns * 64 B/event = 2.56 GB/s</li> </ul>  |    |
| That's 2.5 GB/s per processor                                               |    |
| With 16 processors, that's 40 GB/s!                                         |    |
| With 128 processors, that's 320 GB/s!!                                      |    |
| You can use multiple buses but that hinders global ordering                 |    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors 5                      | 58 |





### Directory Coherence Protocols

| Observe: address space statically partitioned                                         |
|---------------------------------------------------------------------------------------|
| + Can easily determine which memory module holds a given line                         |
| That memory module sometimes called "home"                                            |
| <ul> <li>Can't easily determine which processors have line in their caches</li> </ul> |
| Bus-based protocol: broadcast events to all processors/caches                         |
| ± Simple and fast, but non-scalable                                                   |
| Directories: non-broadcast coherence protocol                                         |
| Extend memory to track caching information                                            |
| For each physical cache line whose home this is, track:                               |
| • Owner: which processor has a dirty copy (I.e., M state)                             |
| Sharers: which processors have clean copies (I.e., S state)                           |
| Processor sends coherence event to home directory                                     |
| Home directory only sends events to processors that care                              |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors 62                               |

### **MSI Directory Protocol**



### Directory MSI Protocol

| Processor 0                               | Processor 1                                 | P0 P1 Director           |
|-------------------------------------------|---------------------------------------------|--------------------------|
| 0: addi r1,accts,r3                       |                                             | _:_:500                  |
| 1: ld 0(r3),r4                            |                                             |                          |
| 2: blt r4,r2,6                            |                                             | S:500 S:0:500            |
| 3: sub r4,r2,r4                           |                                             |                          |
| 4: st r4,0(r3)                            |                                             | M:400                    |
| 5: call dispense_cash                     | 0: addi r1,accts,r3                         | M:400 M:0:500<br>(stale) |
|                                           | 1: ld 0(r3),r4                              | (otato)                  |
|                                           | 2: blt r4,r2,6                              | S:400 S:400 S:0,1:40     |
|                                           | 3: sub r4,r2,r4                             |                          |
|                                           | 4: st r4,0(r3)                              |                          |
|                                           | 5: call dispense_cas                        | M:300 M:1:400            |
| • 1d by P1 sends BR to                    | o directory<br>P0, P0 sends P1 data, does V | WB goes to S             |
|                                           |                                             | vD, goes to <b>3</b>     |
| • st by P1 sends BW t                     | · · · · · · · · · · · · · · · · · · ·       |                          |
| <ul> <li>Directory sends BW to</li> </ul> | P0, P0 goes to <b>I</b>                     |                          |
| CIS 371 (Martin/Roth): Shared Memory      | / Multiprocessors                           | 64                       |

### Directory Flip Side: Latency



### Coherence on Real Machines

| • | Many | unipr | ocesso | rs desig | ned with | on-chip | snooping lo | ogic |
|---|------|-------|--------|----------|----------|---------|-------------|------|
|   |      |       |        | 5        |          |         |             | 5    |

- Can be easily combined to form multi-processors
  - E.g., Intel Pentium4 Xeon
- Multi-core
- Larger scale (directory) systems built from smaller MPs
   E.g., Sun Wildfire, NUMA-Q, IBM Summit
- Some shared memory machines are **not cache coherent** 
  - E.g., CRAY-T3D/E
  - Shared data is uncachable
  - If you want to cache shared data, copy it to private data section
  - Basically, cache coherence implemented in software
    - Have to really know what you are doing as a programmer

67

### Directory Flip Side: Complexity

- Latency not only issue for directories
  - Subtle correctness issues as well
  - Stem from unordered nature of underlying inter-connect
- Individual requests to single cache must be ordered
  - Bus-based Snooping: all processors see all requests in same order
    - Ordering automatic
  - Point-to-point network: requests may arrive in different orders
    - Directory has to enforce ordering explicitly
    - Cannot initiate actions on request B...
    - Until all relevant processors have completed actions on request A

66

- Requires directory to collect acks, queue requests, etc.
- Directory protocols
  - Obvious in principle
  - Complicated in practice
- CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Best of Both Worlds?

| • | Ignore processor snooping bandwidth for a minute                       |
|---|------------------------------------------------------------------------|
|   | Can we combine best features of snooping and directories               |
|   |                                                                        |
|   | From snooping: fast two-hop cache-to-cache transfers                   |
|   | <ul> <li>From directories: scalable point-to-point networks</li> </ul> |
|   | In other words                                                         |
| Þ | Can we use broadcast on an unordered network?                          |
|   | Yes, and most of the time everything is fine                           |
|   | But sometimes it isn't protocol race                                   |
| • | Research Proposal: Token Coherence (TC)                                |
|   | • An unordered broadcast snooping protocol without data races          |
|   | 5 371 (Martin/Roth): Shared Memory Multiprocessors 68                  |

### Roadmap Checkpoint

| <ul> <li>Shared memory model <ul> <li>Multiplexed uniprocessor</li> <li>Hardware multihreading</li> <li>Multiprocessing</li> <li>Synchronization</li> <li>Lock implementation</li> <li>Locking gotchas</li> </ul> </li> <li>Cache coherence <ul> <li>Bus-based protocols</li> <li>Directory protocols</li> <li>Memory consistency models</li> </ul> </li> </ul> | System software | Thread-level parallelism (TLP)               |
|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------------------------------------------|
| Hardware multihreading     Multiprocessing     Synchronization     Lock implementation     Locking gotchas     Cache coherence     Bus-based protocols     Directory protocols                                                                                                                                                                                  |                 | 7                                            |
| Hardware multihreading     Multiprocessing     Synchronization     Lock implementation     Locking gotchas     Cache coherence     Bus-based protocols     Directory protocols                                                                                                                                                                                  |                 | <ul> <li>Multiplexed uniprocessor</li> </ul> |
| Multiprocessing     Synchronization     Lock implementation     Locking gotchas     Cache coherence     Bus-based protocols     Directory protocols                                                                                                                                                                                                             |                 | Hardware multihreading                       |
| Lock implementation     Locking gotchas     Cache coherence     Bus-based protocols     Directory protocols                                                                                                                                                                                                                                                     |                 | Multiprocessing                              |
| Locking gotchas     Cache coherence     Bus-based protocols     Directory protocols                                                                                                                                                                                                                                                                             |                 | Synchronization                              |
| Cache coherence     Bus-based protocols     Directory protocols                                                                                                                                                                                                                                                                                                 |                 | <ul> <li>Lock implementation</li> </ul>      |
| Bus-based protocols     Directory protocols                                                                                                                                                                                                                                                                                                                     |                 | Locking gotchas                              |
| Directory protocols                                                                                                                                                                                                                                                                                                                                             |                 | Cache coherence                              |
|                                                                                                                                                                                                                                                                                                                                                                 |                 | <ul> <li>Bus-based protocols</li> </ul>      |
| Memory consistency models                                                                                                                                                                                                                                                                                                                                       |                 | Directory protocols                          |
|                                                                                                                                                                                                                                                                                                                                                                 |                 | Memory consistency models                    |
|                                                                                                                                                                                                                                                                                                                                                                 |                 |                                              |

### Recall: Write Misses and Write Buffers

| <ul> <li>Read miss?</li> <li>Load can't go on without the data, it must stall</li> <li>Write miss?</li> <li>Technically, no instruction is waiting for data, why stall?</li> </ul>                                                                                                                                                    | Processor           |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------|
| <ul> <li>Write buffer: a small buffer</li> <li>Stores put address/value to write buffer, keep going</li> <li>Write buffer writes stores to D\$ in the background</li> <li>Loads must search write buffer (in addition to D\$)</li> <li>+ Eliminates stalls on write misses (mostly)</li> <li>Creates some problems (later)</li> </ul> | WB<br>Cache         |
| <ul> <li>Write buffer vs. writeback-buffer</li> <li>Write buffer: "in front" of D\$, for hiding store misses</li> <li>Writeback buffer: "behind" D\$, for hiding writebacks</li> <li>CIS 371 (Martin/Roth): Shared Memory Multiprocessors</li> </ul>                                                                                  | Next-level<br>cache |



### Memory Consistency

| Memory coherence                                                             |    |
|------------------------------------------------------------------------------|----|
| Creates globally uniform (consistent) view                                   |    |
| <ul> <li>Of a single memory location (in other words: cache line)</li> </ul> | )  |
| – Not enough                                                                 |    |
| Cache lines A and B can be individually consistent                           |    |
| But inconsistent with respect to each other                                  |    |
|                                                                              |    |
| Memory consistency                                                           |    |
| Creates globally uniform (consistent) view                                   |    |
| Of all memory locations relative to each other                               |    |
|                                                                              |    |
| Who cares? Programmers                                                       |    |
| <ul> <li>Globally inconsistent memory creates mystifying behavior</li> </ul> |    |
| , , , , , , , , , , , , , , , , , , , ,                                      |    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors                         | 72 |

### Coherence vs. Consistency

| A=flag=            | =0 ;                   |
|--------------------|------------------------|
| Processor 0        | Processor 1            |
| A=1;               | while (!flag); // spin |
| <pre>flag=1;</pre> | print A;               |
|                    |                        |

- Intuition says: P1 prints A=1
- Coherence says: absolutely nothing
  - P1 can see P0's write of flag before write of A!!! How?
    - Maybe coherence event of  ${\bf A}$  is delayed somewhere in network
    - Or P0 has a coalescing write buffer that reorders writes

73

75

- Imagine trying to figure out why this code sometimes "works" and sometimes doesn't
- Real systems act in this strange manner

| CIS 371 | (Martin/R | oth): Sha | red Memo | ory Multiproce | ssors |
|---------|-----------|-----------|----------|----------------|-------|
|         |           |           |          |                |       |

### SC Doesn't "Happen Naturally" Why?

- What is consistency concerned with?
  - P1 doesn't actually view P0's committed loads and stores
  - Views their coherence events instead
  - "Consistency model": how observed order of coherence events relates to order of committed insns
- What does SC say?
  - Coherence event order must match committed insn order
    And be identical for all processors
  - Let's see what that implies

### Sequential Consistency (SC)

| A=f                                                                                                 | lag=0;                                                                                                                                                                                                                                                                                                                        |
|-----------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Processor 0                                                                                         | Processor 1                                                                                                                                                                                                                                                                                                                   |
| A=1;                                                                                                | while (!flag); // spin                                                                                                                                                                                                                                                                                                        |
| flag=1;                                                                                             | print A;                                                                                                                                                                                                                                                                                                                      |
| Sequentia                                                                                           | Il consistency (SC)                                                                                                                                                                                                                                                                                                           |
| <ul> <li>Processor<br/>+ Provid</li> <li>But also:</li> <li>And finally</li> <li>Last to</li> </ul> | lefinition of memory view programmers expect<br>s see their own loads and stores in program order<br>ed naturally, even with out-of-order execution<br>processors see others' loads and stores in program order<br>y: all processors see same global load/store ordering<br>wo conditions not naturally enforced by coherence |
|                                                                                                     | lefinition: multiprocessor ordering                                                                                                                                                                                                                                                                                           |
| <ul> <li>Correspor</li> </ul>                                                                       | nds to some sequential interleaving of uniprocessor orders                                                                                                                                                                                                                                                                    |
| • I.e., indi<br>processo                                                                            | stinguishable from multi-programmed uni-<br>pr                                                                                                                                                                                                                                                                                |
| CIS 371 (Martin/Roth):                                                                              | Shared Memory Multiprocessors 74                                                                                                                                                                                                                                                                                              |

### SC + Write Buffers

| Store misses are slow                                                         |  |
|-------------------------------------------------------------------------------|--|
| Global acquisition of M state (write permission)                              |  |
| <ul> <li>Multiprocessors have more store misses than uniprocessors</li> </ul> |  |
| Upgrade miss: I have block in S, require global upgrade to M                  |  |
|                                                                               |  |
| Apparent solution: write buffer                                               |  |
| Commit store to write buffer, let it absorb store miss latency                |  |
| But a write buffer means                                                      |  |
| I see my own stores commit before everyone else sees them                     |  |
|                                                                               |  |
|                                                                               |  |
|                                                                               |  |
|                                                                               |  |
|                                                                               |  |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors 76                       |  |
|                                                                               |  |

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### SC + Write Buffers

| A=0; B=0;                   |                             |
|-----------------------------|-----------------------------|
| Processor 0                 | Processor 1                 |
| A=1; // in-order to WB      | B=1; // in-order to WB      |
| if(B==0) // in-order commit | if(A==0) // in-order commit |
| A=1; // in-order to D\$     | B=1; // in-order to D\$     |
|                             |                             |

• Possible for both (B==0) and (A==0) to be true

- Because **B=1** and **A=1** are just sitting in the write buffers
  - Which is wrong
  - So does SC mean no write buffer?
  - Yup, and that hurts

| CIS 371 (Martin/Roth): Shared Memory Multiprocessors |
|------------------------------------------------------|
|------------------------------------------------------|

### Weak Memory Ordering

- For properly synchronized programs...
- ...only acquires/releases must be strictly ordered
- Why? acquire-release pairs define critical sections
  - Between critical-sections: data is private
     Globally unordered access OK
  - Within critical-section: access to shared data is exclusive
  - Globally unordered access also OK
  - Implication: compiler or dynamic scheduling is OK
    - As long as re-orderings do not cross synchronization points

| • | Weak O | rdering | (WO): / | Alpha, | IA-64, | PowerPC |
|---|--------|---------|---------|--------|--------|---------|
|---|--------|---------|---------|--------|--------|---------|

- ISA provides fence insns to indicate scheduling barriers
- Proper use of fences is somewhat subtle

### CIS 371 (Martin/Roth): Shared Memory Multiprocessors

### Is SC Really Necessary?

• SC

77

79

- + Most closely matches programmer's intuition (don't under-estimate)
- Restricts optimization by compiler, CPU, memory system
- Supported by MIPS, HP PA-RISC
- Is full-blown SC really necessary? What about...
  - All processors see others' loads/stores in program order
  - But not all processors have to see same global order
  - + Allows processors to have in-order write buffers
  - Doesn't confuse programmers too much
    - Synchronized programs (e.g., our example) work as expected
  - Processor Consistency (PC): e.g., Intel iA32, SPARC

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

| _ | P | 0 | р | Q | u | İΖ |  |  |  |
|---|---|---|---|---|---|----|--|--|--|
|   |   |   |   |   |   |    |  |  |  |

| <ul> <li>Answer the following two quest</li> </ul>                                   | tions:                                                                                                      |
|--------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|
| Initially: x==<br>thread 1                                                           | 0, y==0<br>thread 2                                                                                         |
| ld x<br>ld y                                                                         | $\begin{array}{c} \texttt{st 1} \rightarrow \texttt{y} \\ \texttt{st 1} \rightarrow \texttt{x} \end{array}$ |
| •What value pairs can be read by the<br>•(x, y) pairs:                               | e two loads?                                                                                                |
| Initially: x==                                                                       | 0, <b>y==</b> 0                                                                                             |
| thread 1                                                                             | thread 2                                                                                                    |
| $\begin{array}{c} \texttt{st 1} \rightarrow \texttt{y} \\ \texttt{ld x} \end{array}$ | st 1 $\rightarrow \mathbf{x}$<br>ld y                                                                       |

What value pairs can be read by the two loads?
 (x, y) pairs:

CIS 371 (Martin/Roth): Shared Memory Multiprocessors

78

### Fences aka Memory Barriers

| • Fences (memory barriers): special insns                                   |           |
|-----------------------------------------------------------------------------|-----------|
| <ul> <li>Ensure that loads/stores don't cross acquire release bo</li> </ul> | oundaries |
| Very roughly                                                                |           |
| acquire                                                                     |           |
| fence                                                                       |           |
| critical section                                                            |           |
| fence                                                                       |           |
| release                                                                     |           |
|                                                                             |           |
| How do they work?                                                           |           |
| <ul> <li>fence insn must commit before any younger insn disp</li> </ul>     | oatches   |
| <ul> <li>This also means write buffer is emptied</li> </ul>                 |           |
| <ul> <li>Makes lock acquisition and release slow(er)</li> </ul>             |           |
| <ul> <li>Use synchronization library, don't write yo</li> </ul>             | ur own    |
| CIS 371 (Martin/Roth): Shared Memory Multiprocessors                        | 81        |

### Summary

