### CIS 501 Introduction To Computer Architecture

Unit 11: Shared-Memory Multiprocessors

CIS 501 (Martin/Roth): Shared Memory Multiprocessors



### This Unit: Shared Memory Multiprocessors

| Application<br>OS<br>Compiler Firmware<br>CPU I/O<br>Memory<br>Digital Circuits<br>Gates & Transistors | <ul> <li>Three issues <ul> <li>Cache coherence</li> <li>Synchronization</li> <li>Memory consistency</li> </ul> </li> <li>Two cache coherence approaches <ul> <li>"Snooping" (SMPs): &lt; 16 processors</li> <li>"Directory"/Scalable: lots of processors</li> </ul> </li> </ul> |
|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CIS 501 (Martin/Roth): Shared Memory                                                                   | Multiprocessors 2                                                                                                                                                                                                                                                               |

### Thread-Level Parallelism

| <pre>struct acct_t { int bal; }; shared struct acct t accts[MAX ACCT]</pre> | ,                   |
|-----------------------------------------------------------------------------|---------------------|
| int id, amt;                                                                | 0: addi r1,accts,r3 |
| <pre>if (accts[id].bal &gt;= amt)</pre>                                     | 1: ld 0(r3),r4      |
| {                                                                           | 2: blt r4,r2,6      |
| <pre>accts[id].bal -= amt;</pre>                                            | 3: sub r4,r2,r4     |
| <pre>spew cash();</pre>                                                     | 4: st r4,0(r3)      |
| }                                                                           | 5: call spew cash   |
|                                                                             |                     |

### Thread-level parallelism (TLP)

- Collection of asynchronous tasks: not started and stopped together
- Data shared loosely, dynamically
- Example: database/web server (each query is a thread)
  - accts is shared, can't register allocate even if it were scalar
- id and amt are private variables, register allocated to r1, r2
- Focus on this

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

### **Shared Memory**

# Shared memory Multiple execution contexts sharing a single address space Multiple programs (MIMD) Or more frequently: multiple copies of one program (SPMD) Implicit (automatic) communication via loads and stores Simple software No need for messages, communication happens naturally Maybe too naturally Supports irregular, dynamic communication patterns Both DLP and TLP Complex hardware Must create a uniform view of memory Several aspects to this as we will see

# Shared-Memory Multiprocessors • Provide a shared-memory abstraction • amiliar and efficient for programmers P1 P2 P3 P4 Memory System Ct 501 (Martin/Roth): Shared Memory Multiprocessors

### Shared-Memory Multiprocessors



### Paired vs. Separate Processor/Memory?

- Separate processor/memory
   Uniform memory access (UMA): equal latency to all memory
  - + Simple software, doesn't matter where you put data
  - Lower peak performance
  - Bus-based UMAs common: symmetric multi-processors (SMP)
- Paired processor/memory
  - Non-uniform memory access (NUMA): faster to local memory
  - More complex software: where you put data matters
  - + Higher peak performance: assuming proper data placement



### Shared vs. Point-to-Point Networks

- Shared network: e.g., bus (left)
  - + Low latency
  - Low bandwidth: doesn't scale beyond  $\sim 16$  processors
  - + Shared property simplifies cache coherence protocols (later)
- Point-to-point network: e.g., mesh or ring (right)
  - Longer latency: may need multiple "hops" to communicate
  - + Higher bandwidth: scales to 1000s of processors
  - Cache coherence protocols are complex



# Implementation #1: Snooping Bus MP CPU(\$) U(\$) CPU(\$) CPU(\$) Mem uppercessors opuint the point opuint o

### **Organizing Point-To-Point Networks**

- Network topology: organization of network
  - Tradeoff performance (connectivity, latency, bandwidth) ↔ cost
- Router chips
  - Networks that require separate router chips are indirect
  - Networks that use processor/memory/router packages are direct + Fewer components, "Glueless MP"
- Point-to-point network examples
  - Indirect tree (left)
  - Direct mesh or ring (right)





CIS 501 (Martin/Roth): Shared Memory Multiprocessors







- General point-to-point network-based systems
  - Typically processor/memory/router blocks (NUMA)
    - Glueless MP: no need for additional "glue" chips
  - Can be arbitrarily large: 1000's of processors
    - Massively parallel processors (MPPs)
  - In reality only government (DoD) has MPPs...
    - Companies have much smaller systems: 32–64 processors
    - Scalable multi-processors

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

### Issues for Shared Memory Systems

Three in particular
Cache coherence
Synchronization
Memory consistency model
Not unrelated to each other
Different solutions for SMPs and MPPs

### No-Cache, No-Problem

| Processor 0                      | Processor 1          |     |
|----------------------------------|----------------------|-----|
| 0: addi r1,accts,r3              |                      | 500 |
| 1: ld 0(r3),r4                   |                      | 500 |
| 2: blt r4,r2,6                   |                      |     |
| 3: sub r4,r2,r4                  |                      |     |
| 4: st r4,0(r3)                   |                      | 400 |
| 5: call spew_cash                | 0: addi r1,accts,r3  |     |
|                                  | 1: ld 0(r3),r4       | 400 |
|                                  | 2: blt r4,r2,6       |     |
|                                  | 3: sub r4,r2,r4      |     |
|                                  | 4: st r4,0(r3)       | 300 |
|                                  | 5: call spew_cash    |     |
| Scenario I: proces               | ssors have no caches |     |
| <ul> <li>No problem</li> </ul>   |                      |     |
|                                  |                      |     |
| CIS 501 (Martin/Roth): Shared Me | mony Multiprocessors | 15  |

### An Example Execution

| Processor 0         | Processor 1                     |         | 00114 |     |
|---------------------|---------------------------------|---------|-------|-----|
| 0: addi r1,accts,r3 |                                 | CPU0    | CPU1  | Mem |
| 1: ld 0(r3),r4      |                                 |         |       |     |
| 2: blt r4,r2,6      |                                 |         |       |     |
| 3: sub r4,r2,r4     |                                 |         |       |     |
| 4: st r4,0(r3)      |                                 |         |       |     |
| 5: call spew_cash   | 0: addi r1,accts,r3             |         |       |     |
|                     | 1: ld 0(r3),r4                  |         |       |     |
|                     | 2: blt r4,r2,6                  |         |       |     |
|                     | 3: sub r4,r2,r4                 |         |       |     |
|                     | 4: st r4,0(r3)                  |         |       |     |
|                     | 5: call spew_cash               |         |       |     |
| Two \$100 withdray  | vals from account #241 a        | t two / | ATMs  |     |
| -++++++++           | aps to thread on different proc |         |       |     |
|                     | .bal (address is in r3)         | C3301   |       |     |

14

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

### Cache Incoherence

| <u>Processor 0</u>                       | Processor 1                 |                |         |
|------------------------------------------|-----------------------------|----------------|---------|
| 0: addi r1,accts,r3                      |                             |                | 500     |
| 1: ld 0(r3),r4                           |                             | V:500          | 500     |
| 2: blt r4,r2,6                           |                             |                |         |
| 3: sub r4,r2,r4                          |                             |                |         |
| 4: st r4,0(r3)                           |                             | D:400          | 500     |
| 5: call spew_cash                        | 0: addi r1,accts,r3         |                |         |
|                                          | 1: ld 0(r3),r4              | D:400 V:5      | 500 500 |
|                                          | 2: blt r4,r2,6              |                |         |
|                                          | 3: sub r4,r2,r4             |                |         |
|                                          | 4: st r4,0(r3)              | D:400 D:4      | 00 500  |
|                                          | 5: call spew_cash           |                |         |
| Scenario II: proce                       | essors have write-back ca   | aches          |         |
| <ul> <li>Potentially 3 copies</li> </ul> | es of accts [241] .bal: mer | nory, p0\$, p1 | \$      |
| Can get incohere                         | nt (inconsistent)           |                |         |
|                                          |                             |                | 16      |

### Write-Thru Doesn't Help

| Processor 0         | Processor 1         |                 |
|---------------------|---------------------|-----------------|
| 0: addi r1,accts,r3 |                     | 500             |
| 1: ld 0(r3),r4      |                     | V:500 500       |
| 2: blt r4,r2,6      |                     |                 |
| 3: sub r4,r2,r4     |                     |                 |
| 4: st r4,0(r3)      |                     | V:400 400       |
| 5: call spew_cash   | 0: addi r1,accts,r3 |                 |
|                     | 1: ld 0(r3),r4      | V:400 V:400 400 |
|                     | 2: blt r4,r2,6      |                 |
|                     | 3: sub r4,r2,r4     |                 |
|                     | 4: st r4,0(r3)      | V:400 V:300 300 |
|                     | 5: call spew cash   |                 |

- - This time only 2 (different) copies of accts[241].bal
  - No problem? What if another withdrawal happens on processor 0?

| CIS 501 (Martin/Roth): Shared Memory Multiprocessors | 17 |  |
|------------------------------------------------------|----|--|
|                                                      |    |  |

### What To Do?

| <ul> <li>Have already seen this problem before with DM.</li> <li>DMA controller acts as another processor</li> </ul> | A          |
|----------------------------------------------------------------------------------------------------------------------|------------|
| Changes cached versions behind processor's back                                                                      |            |
| Possible solutions                                                                                                   |            |
| No caches? Slow                                                                                                      |            |
| Make shared data uncachable? Still slow                                                                              |            |
| • Timely flushes and wholesale invalidations? Slow, but                                                              | OK for DMA |
| Hardware cache coherence                                                                                             |            |
| + Minimal flushing, maximum caching $\rightarrow$ best perfor                                                        | mance      |
|                                                                                                                      |            |
|                                                                                                                      |            |
|                                                                                                                      |            |
| IS 501 (Martin/Roth): Shared Memory Multiprocessors                                                                  | 18         |

### Hardware Cache Coherence

| CPU                         | <ul> <li>Absolute coherence</li> <li>All copies have same data at all times</li> <li>Hard to implement and slow</li> <li>+ Not strictly necessary</li> </ul> |     |
|-----------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| data data                   | Relative coherence                                                                                                                                           |     |
|                             | <ul><li>Temporary incoherence OK (e.g., write-bac</li><li>As long as all loads get right values</li></ul>                                                    | :k) |
|                             | Coherence controller:                                                                                                                                        |     |
| bus                         | • Examines bus traffic (addresses and data)                                                                                                                  |     |
|                             | Executes coherence protocol                                                                                                                                  |     |
|                             | <ul> <li>What to do with local copy when you so<br/>different things happening on bus</li> </ul>                                                             | ee  |
| CIS 501 (Martin/Roth): Shar | ed Memory Multiprocessors                                                                                                                                    | 19  |

### **Bus-Based Coherence Protocols**

| <ul> <li>Bus-based coherence protocols</li> </ul>              |                    |
|----------------------------------------------------------------|--------------------|
| <ul> <li>Also called snooping or broadcast</li> </ul>          |                    |
| ALL controllers see ALL transactions IN                        | SAME ORDER         |
| <ul> <li>Protocol relies on this</li> </ul>                    |                    |
| Three processor-side events                                    |                    |
| • R: read                                                      |                    |
| • W: write                                                     |                    |
| • WB: write-back                                               |                    |
| Two bus-side events                                            |                    |
| • BR: bus-read, read miss on another proce                     | essor              |
| • BW: bus-write, write miss or write-back o                    | n another processo |
|                                                                |                    |
| <ul> <li>Point-to-point network protocols also exis</li> </ul> | st                 |
| Called directories                                             |                    |
|                                                                |                    |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors           | 20                 |
| CIS SUI (Marun/Rour). Shareu Memory Multiprocessors            | 20                 |

### VI (MI) Coherence Protocol

|                           | • VI (valid-invalid) protocol: aka MI                          |
|---------------------------|----------------------------------------------------------------|
|                           | an a                       |
|                           | Two states (per block)                                         |
|                           | • V (valid): have block                                        |
|                           | <ul> <li>I (invalid): don't have block</li> </ul>              |
|                           | + Can implement with valid bit                                 |
| Å Å                       | <ul> <li>Protocol diagram (left)</li> </ul>                    |
| R⇒BR, W⇒BW<br>3R/BW⇒WB, W | <ul> <li>Convention: event⇒generated-event</li> </ul>          |
| a l                       | Summary                                                        |
| R⇒BR, I                   | <ul> <li>If anyone wants to read/write block</li> </ul>        |
|                           | <ul> <li>Give it up: transition to I state</li> </ul>          |
| - <b>y</b> -              | <ul> <li>Write-back if your own copy is dirty</li> </ul>       |
|                           | <ul> <li>This is an invalidate protocol</li> </ul>             |
|                           | Update protocol: copy data, don't invalidate                   |
| R/W                       | <ul> <li>Sounds good, but wastes a lot of bandwidth</li> </ul> |
|                           |                                                                |
| CIS 501 (Martin/Roth      | ): Shared Memory Multiprocessors 21                            |

# VI Protocol (Write-Back Cache)

| Processor 0                        | Processor 1                 |                   |                |
|------------------------------------|-----------------------------|-------------------|----------------|
| 0: addi r1,accts,r3                |                             |                   | 500            |
| 1: ld 0(r3),r4                     |                             | V:500             | 500            |
| 2: blt r4,r2,6                     |                             |                   |                |
| 3: sub r4,r2,r4                    |                             |                   |                |
| 4: st r4,0(r3)                     |                             | V:400             | 500            |
| 5: call spew_cash                  | 0: addi r1,&accts,r         | 3                 |                |
|                                    | 1: ld 0(r3),r4              | <b>I:WB</b> V:4   | 400            |
|                                    | 2: blt r4,r2,6              |                   |                |
|                                    | 3: sub r4,r2,r4             |                   |                |
|                                    | 4: st r4,0(r3)              | V:3               | 300 <b>400</b> |
|                                    | 5: call spew_cash           |                   |                |
| • 1d by processor                  | L generates a BR            |                   |                |
| <ul> <li>processor 0 re</li> </ul> | sponds by WB its dirty copy | , transitioning t | :o <b>I</b>    |
|                                    |                             |                   |                |
| CIS 501 (Martin/Roth): Shared Mer  | nory Multiprocessors        |                   | 22             |

### $VI \rightarrow MSI$ BR/BW • VI protocol is inefficient - Only one cached copy allowed in entire system - Multiple copies can't exist even if read-only • Not a problem in example • Big problem in reality BW⇒WB, WB⇒ BR MSI (modified-shared-invalid) BW • Fixes problem: splits "V" state into two states W⇒BW • M (modified): local dirty copy • S (shared): local clean copy Allows either W⇒BW • Multiple read-only copies (S-state) --OR--S Μ • Single read/write copy (M-state) BR⇒WB R/W R/BR

23

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

### MSI Protocol (Write-Back Cache)

| Processor 0                         | Processor 1                                           |                 |            |
|-------------------------------------|-------------------------------------------------------|-----------------|------------|
| 0: addi r1,accts,r3                 |                                                       |                 | 500        |
| 1: ld 0(r3),r4                      |                                                       | S:500           | 500        |
| 2: blt r4,r2,6                      |                                                       |                 |            |
| 3: sub r4,r2,r4                     |                                                       |                 |            |
| 4: st r4,0(r3)                      |                                                       |                 |            |
| 5: call spew_cash                   | 0: addi r1,accts,r3                                   | M:400           | 500        |
|                                     | 1: ld 0(r3),r4                                        | S:400 S:4       | 00 400     |
|                                     | 2: blt r4,r2,6                                        |                 |            |
|                                     | 3: sub r4,r2,r4                                       |                 |            |
|                                     | 4: st r4,0(r3)                                        |                 |            |
|                                     | 5: call spew_cash                                     | I: M:3          | <b>400</b> |
| • 1d by processor                   | T 2 1 2 2 1 2 2 1 2 3 1 2 3 1 2 3 3 1 3 3 3 1 3 3 3 3 |                 |            |
| <ul> <li>processor 0 re</li> </ul>  | sponds by WB its dirty copy, t                        | transitioning t | o <b>S</b> |
| <ul> <li>st by processor</li> </ul> | 1 generates a BW                                      |                 |            |
| <ul> <li>processor 0 re</li> </ul>  | sponds by transitioning to <b>I</b>                   |                 |            |
| CIS 501 (Martin/Roth): Shared Me    |                                                       |                 | 24         |

### A Protocol Optimization



### Cache Coherence and Cache Misses

- A coherence protocol can effect a cache's miss rate (%<sub>miss</sub>)
  - Requests from other processors can invalidate (evict) local blocks
  - 4C miss model: compulsory, capacity, conflict, coherence
  - Coherence miss: miss to a block evicted by bus event
    - As opposed to a processor event
  - Example: direct-mapped 16B cache, 4B blocks, nibble notation

| Cache contents (state:address)                       | Event      | Outcome          |
|------------------------------------------------------|------------|------------------|
| S:0000, M:0010, S:0020, S:0030                       | Wr:0030    | Upgrade Miss     |
| S:0000, M:0010, S:0020, M:0030                       |            | 15               |
| S:0000, M:0010, S:0020, M:0030                       | BusRd:0000 | Nothing          |
| S:0000, M:0010, <b>I</b> :0020, M:0030               | BusWr:0020 | S→I Invalidation |
| S:0000, M:0010, I:0020, S:3030                       | Rd:3030    | Compulsory Miss  |
| S:0000, M:0010, <b>S:0020</b> , S:3030               | Rd:0020    | Coherence Miss   |
| S:0000, M:0010, S:0020, S:0030                       | Rd:0030    | Conflict Miss    |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors |            | 27               |

### Another Protocol Optimization

- Most modern protocols also include E (exclusive) state
  - Interpretation: can write to this block, but haven't yet
  - Why is this state useful?



### Cache Coherence and Cache Misses

| Cache parameters interact with coheren                                                                        | ce misses           |
|---------------------------------------------------------------------------------------------------------------|---------------------|
| <ul> <li>Larger capacity: more coherence misses</li> </ul>                                                    |                     |
| <ul> <li>But offset by reduction in capacity misse</li> </ul>                                                 | S                   |
| <ul> <li>Increased block size: more coherence misse</li> </ul>                                                | S                   |
| <ul> <li>False sharing: "sharing" a cache line w</li> </ul>                                                   | ithout sharing data |
| <ul> <li>Creates pathological "ping-pong" behavior</li> </ul>                                                 | or                  |
| <ul> <li>Careful data placement may help, but is</li> </ul>                                                   | difficult           |
| Number of processors also affects coho                                                                        | onco missos         |
| Number of processors also affects cohe                                                                        | rence misses        |
| Number of processors also affects cohe<br>– More processors: more coherence misses                            | rence misses        |
|                                                                                                               | rence misses        |
|                                                                                                               | rence misses        |
|                                                                                                               | rence misses        |
| <ul> <li>Number of processors also affects coher</li> <li>– More processors: more coherence misses</li> </ul> | rence misses        |

| Coherence Bandwidth Requirements                                                  |    |
|-----------------------------------------------------------------------------------|----|
| How much address bus bandwidth does snooping need?                                |    |
| Well, coherence events generated on                                               |    |
| <ul> <li>Misses (only in L2, not so bad)</li> </ul>                               |    |
| Dirty replacements                                                                |    |
| Some parameters                                                                   |    |
| • 2 GHz CPUs, 2 IPC, 33% memory operations,                                       |    |
| <ul> <li>2% of which miss in the L2, 50% of evictions are dirty</li> </ul>        |    |
| <ul> <li>(0.33 * 0.02) + (0.33 * 0.02 * 0.50)) = 0.01 events/insn</li> </ul>      |    |
| <ul> <li>0.01 events/insn * 2 insn/cycle * 2 cycle/ns = 0.04 events/ns</li> </ul> |    |
| <ul> <li>Request: 0.04 events/ns * 4 B/event = 0.16 GB/s = 160 MB/s</li> </ul>    |    |
| <ul> <li>Data Response: 0.04 events/ns * 64 B/event = 2.56 GB/s</li> </ul>        |    |
| That's 2.5 GB/s per processor                                                     |    |
| With 16 processors, that's 40 GB/s!                                               |    |
| <ul> <li>With 128 processors, that's 320 GB/s!!</li> </ul>                        |    |
| Yes, you can use multiple buses but that hinders global ordering                  |    |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                              | 29 |

### More Coherence Bandwidth

| <ul> <li>Also processor snooping bandwid</li> </ul>                                                |                           |
|----------------------------------------------------------------------------------------------------|---------------------------|
| <ul> <li>Recall: snoop implies matching address a</li> </ul>                                       | gainst current cache tag  |
| <ul> <li>Just a tag lookup, not data</li> </ul>                                                    |                           |
| <ul> <li>0.01 events/insn * 2 insn/cycle = 0.01 ev</li> </ul>                                      | ents/cycle per processo   |
| <ul> <li>With 16 processors, each would do 0.16 t<br/>± Add a port to the cache tags OK</li> </ul> | tag lookups per cycle     |
| • With 128 processors, each would do 1.28                                                          | tag lookups per cycle     |
| • If caches implement inclusion (L1 is                                                             | strict subset of L2)      |
| <ul> <li>Additional snooping ports only needed</li> </ul>                                          | d on L2, still bad though |
| <ul> <li>Upshot: bus-based coherence doesn</li> </ul>                                              | 't scale beyond 8–16      |
|                                                                                                    |                           |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                                               | 30                        |

### Scalable Cache Coherence

### • Part I: bus bandwidth

- Replace non-scalable bandwidth substrate (bus)...
- ...with scalable bandwidth one (point-to-point network, e.g., mesh)

### • Part II: processor snooping bandwidth

- Interesting: most snoops result in no action
  - For loosely shared data, other processors probably
- Replace non-scalable broadcast protocol (spam everyone)...
- ...with scalable **directory protocol** (only spam processors that care)

| CI | S 50 | 1 (N | 1arti | in/R | oth) | : Sh | arec | d Me | emo | ry M | lultip | oroc | esso | ors |  |  |  |  |  | 3 | 31 |  |
|----|------|------|-------|------|------|------|------|------|-----|------|--------|------|------|-----|--|--|--|--|--|---|----|--|
|    |      |      |       |      |      |      |      |      |     |      |        |      |      |     |  |  |  |  |  |   |    |  |

# **Directory Coherence Protocols**

| Observe: physical address space statically partitioned         |  |
|----------------------------------------------------------------|--|
| + Can easily determine which memory module holds a given line  |  |
| <ul> <li>That memory module sometimes called "home"</li> </ul> |  |

- Can't easily determine which processors have line in their caches
- Bus-based protocol: broadcast events to all processors/caches ± Simple and fast, but non-scalable
- Directories: non-broadcast coherence protocol
  - Extend memory to track caching information
  - For each physical cache line whose home this is, track:
    - Owner: which processor has a dirty copy (I.e., M state)
    - Sharers: which processors have clean copies (I.e., S state)
  - Processor sends coherence event to home directory
    - Home directory only sends events to processors that care

### **MSI Directory Protocol**



### **Directory MSI Protocol** P0 P1 Directory Processor 1 Processor 0 -:-:500 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 S:500 S:0:500 3: sub r4,r2,r4 4: st r4,0(r3) M:400 M:0:500 0: addi r1,accts,r3 5: call spew cash (stale) 1: ld 0(r3),r4 2: blt r4,r2,6 S:400 S:400 S:0,1:400 3: sub r4,r2,r4 4: st r4,0(r3) M:300 M:1:400 5: call spew cash 1d by P1 sends BR to directory Directory sends BR to P0, P0 sends P1 data, does WB, goes to S • st by P1 sends BW to directory Directory sends BW to P0, P0 goes to I CIS 501 (Martin/Roth): Shared Memory Multiprocessors 34

### Directory Flip Side: Latency

| <ul> <li>Directory protocols</li> <li>+ Lower bandwidth consumption → more scalable</li> </ul>                                                         | le                    |
|--------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------|
| - Longer latencies                                                                                                                                     | <u>3 hop miss</u>     |
| Two read miss situations                                                                                                                               | $(P_0)$               |
| <ul> <li>Unshared block: get data from memory</li> <li>Bus: 2 hops (P0→memory→P0)</li> <li>Directory: 2 hops (P0→memory→P0)</li> </ul>                 | Dir                   |
| <ul> <li>Shared or exclusive block: get data from other</li> <li>Assume cache-to-cache transfer optimizatio</li> <li>Bus: 2 hops (P0→P1→P0)</li> </ul> |                       |
| – Directory: <b>3 hops</b> (P0→memory→P1→P0)                                                                                                           |                       |
| Common, with many processors high probability                                                                                                          | bility someone has it |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                                                                                                   | 35                    |

# Directory Flip Side: Complexity

| •  | Latency not only issue for directories                                            |
|----|-----------------------------------------------------------------------------------|
|    | Subtle correctness issues as well                                                 |
|    | <ul> <li>Stem from unordered nature of underlying inter-connect</li> </ul>        |
| •  | Individual requests to single cache line must appear atom                         |
|    | Bus: all processors see all requests in same order                                |
|    | Atomicity automatic                                                               |
|    | Point-to-point network: requests may arrive in different orders                   |
|    | <ul> <li>Directory has to enforce atomicity explicitly</li> </ul>                 |
|    | <ul> <li>Cannot initiate actions on request B</li> </ul>                          |
|    | <ul> <li>Until all relevant processors have completed actions on reque</li> </ul> |
|    | <ul> <li>Requires directory to collect acks, queue requests, etc.</li> </ul>      |
| •  | Directory protocols                                                               |
|    | Obvious in principle                                                              |
|    | <ul> <li>Extremely complicated in practice</li> </ul>                             |
| CI | S 501 (Martin/Roth): Shared Memory Multiprocessors 36                             |

### **Coherence on Real Machines**

- Many uniprocessors designed with on-chip snooping logic
  - Can be easily combined to form SMPs
  - E.g., Intel Pentium4 Xeon
- Larger scale (directory) systems built from smaller SMPs
  - E.g., Sun Wildfire, NUMA-Q, IBM Summit
- Some shared memory machines are not cache coherent
  - E.g., CRAY-T3D/E
  - Shared data is uncachable
  - If you want to cache shared data, copy it to private data section
  - Basically, cache coherence implemented in software
    - Have to really know what you are doing as a programmer

37

# One Down, Two To Go

- Coherence only one part of the equation
  - Synchronization
  - Consistency

| CIS | EOI | 1 (N | larti | n /D | oth) | <br> | I Ma | <br> | +:. | <br> |  |  |  |  |  | _ | 9 |  |
|-----|-----|------|-------|------|------|------|------|------|-----|------|--|--|--|--|--|---|---|--|

### Best of Both Worlds?

- Ignore processor snooping bandwidth for a minute
- Can we combine best features of snooping and directories?
  - From snooping: fast 2-hop cache-to-cache transfers
  - From directories: scalable point-to-point networks
  - In other words...
- Can we use broadcast on an unordered network?
  - Yes, and most of the time everything is fine
  - But sometimes it isn't ... data race

### Token Coherence (TC)

- An unordered broadcast snooping protocol ... without data races
- Interesting, but won't talk about here

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

38

# The Need for Synchronization

| Proces  | sor O         | Pro | ocessor 1               |            |          |
|---------|---------------|-----|-------------------------|------------|----------|
| 0: add  | i r1,accts,r3 |     |                         | hann an Au | 50       |
| 1: ld ( | 0(r3),r4      |     |                         |            |          |
| 2: blt  | r4,r2,6       | 0:  | addi r1,accts,r3        | S:500      | 50       |
| 3: sub  | r4,r2,r4      | 1:  | ld 0(r3),r4             | S.200      | S:500 50 |
| 4: st : | r4,0(r3)      | 2:  | blt r4,r2,6             | 3.300      | 3.300 30 |
| 5: cal  | l spew_cash   | 3:  | <pre>sub r4,r2,r4</pre> | M:400      | 1: 40    |
|         |               | 4 : | st r4,0(r3)             |            |          |
|         |               | 5:  | call spew cash          | 1:         | M:400 40 |

- We got it wrong ... and coherence had nothing to do with it

### The Need for Synchronization

| Processor 0         | Processor 1         |                 |
|---------------------|---------------------|-----------------|
| 0: addi r1,accts,r3 |                     | 500             |
| 1: ld 0(r3),r4      |                     |                 |
| 2: blt r4,r2,6      | 0: addi r1,accts,r3 | S:500 500       |
| 3: sub r4,r2,r4     | 1: ld 0(r3),r4      | S:500 S:500 500 |
| 4: st r4,0(r3)      | 2: blt r4,r2,6      | 3.300 3.300 300 |
| 5: call spew_cash   | 3: sub r4,r2,r4     | M:400 I: 400    |
|                     | 4: st r4,0(r3)      |                 |
|                     | 5: call spew_cash   | I: M:400 400    |
| • What really hanne | ned?                |                 |

What really happened?

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

- Access to accts [241].bal should conceptually be atomic
  - Transactions should not be "interleaved"
  - But that's exactly what happened
  - Same thing can happen on a multiprogrammed uniprocessor!

41

• Solution: synchronize access to accts [241].bal

| ini | chr | on | ızat | non |
|-----|-----|----|------|-----|
|     |     |    | 1741 |     |
|     |     |    |      |     |

| • Synchronization: second issue for shared memo                                                                                          | orv        |
|------------------------------------------------------------------------------------------------------------------------------------------|------------|
| <ul> <li>Regulate access to shared data</li> </ul>                                                                                       |            |
| <ul> <li>Software constructs: semaphore, monitor</li> </ul>                                                                              |            |
| Hardware primitive: lock                                                                                                                 |            |
| <ul> <li>Operations: acquire (lock) and release (lock)</li> </ul>                                                                        |            |
| <ul> <li>Region between acquire and release is a critic</li> <li>Must interleave acquire and release</li> </ul>                          | al section |
| Second consecutive acquire will fail (actually it will                                                                                   | block)     |
| <pre>struct acct_t { int bal; };<br/>shared struct acct_t accts[MAX_ACCT];<br/>shared int lock;<br/>int id,amt;<br/>acquire(lock);</pre> |            |
| <pre>if (accts[id].bal &gt;= amt) { // critical section<br/>accts[id].bal -= amt;<br/>spew_cash(); }</pre>                               |            |
| release(lock);                                                                                                                           |            |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                                                                                     | 42         |

### Spin Lock Strawman (Does not work)

| <ul> <li>acquire</li> </ul> | (lock): while (loc                                                                                  | k != 0); lock = | 1; |
|-----------------------------|-----------------------------------------------------------------------------------------------------|-----------------|----|
| A0:<br>A1:<br>A2:           | while lock is 1, wait for it<br>ld 0(&lock), r6<br>bnez r6, A0<br>addi r6, 1, r6<br>st r6, 0(&lock) | to turn 0       |    |
| <ul> <li>release</li> </ul> | (lock): lock = 0;                                                                                   |                 |    |
| R0:                         | st r0,0(&lock)                                                                                      | // r0 holds 0   |    |
|                             |                                                                                                     |                 |    |
|                             | bared Memory Multiprocessors                                                                        |                 | 43 |

### Spin Lock Strawman (Does not work)

| A0: ld 0(&lock),r6<br>A1: bnez r6,#A0  | A0: $ld r6,0(\&lock)$                     |
|----------------------------------------|-------------------------------------------|
| A2: addi r6,1,r6                       | A1: bnez r6, $\#$ A0                      |
| A3: st r6,0(&lock)                     | A2: addi r6,1,r6                          |
| CRITICAL_SECTION                       | A3: st r6,0(&lock)                        |
|                                        | CRITICAL_SECTION                          |
|                                        |                                           |
| • Spin lock makes in                   | ntuitive sense, but doesn't actually work |
| <ul> <li>Loads/stores of tw</li> </ul> | vo acquire sequences can be interleaved   |

- Lock acquire sequence also not atomic
- Definition of "squeezing toothpaste"
- Note, release is trivially atomic

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

### Better Implementation: SYSCALL Lock

| ACQUIRE_LOCK:                                   |                                |  |
|-------------------------------------------------|--------------------------------|--|
| A0: enable_interrupts                           |                                |  |
| A1: disable_interrupts                          | atomic                         |  |
| A2: ld r6,0(&lock)                              |                                |  |
| A3: bnez r6,#A0                                 |                                |  |
| A4: addi r6,1,r6                                |                                |  |
| A5: st r6,0(&lock)                              |                                |  |
| A6: enable_interrupts                           |                                |  |
| A7: jr \$r31                                    |                                |  |
|                                                 |                                |  |
| - Implement lock in a CV                        | C ALL                          |  |
| <ul> <li>Implement lock in a SYS</li> </ul>     |                                |  |
| <ul> <li>Kernel can control interle</li> </ul>  | eaving by disabling interrupts |  |
| + Works                                         |                                |  |
| <ul> <li>But only in a multi-program</li> </ul> | ammed uni-processor            |  |
|                                                 |                                |  |
| <ul> <li>Hugely expensive in the</li> </ul>     | common case, lock is free      |  |
| CIS 501 (Martin/Roth): Shared Memory Multi      | iprocessors 45                 |  |

### Working Spinlock: Test-And-Set

| <ul> <li>ISA provides an atomic lock acquisition instructio</li> <li>Example: test-and-set</li> </ul> | 1 <del></del> |
|-------------------------------------------------------------------------------------------------------|---------------|
| t&s r1,0(&lock)                                                                                       |               |
| <ul> <li>Atomically executes</li> <li>mov r1,r2</li> <li>ld r1,0(&amp;lock)</li> </ul>                |               |
| st r2,0(&lock)                                                                                        |               |
| • If lock was initially free (0), acquires it (sets it to 1)                                          |               |
| <ul> <li>If lock was initially busy (1), doesn't change it</li> </ul>                                 |               |
| <ul> <li>New acquire sequence</li> <li>A0: t&amp;s r1,0(&amp;lock)</li> <li>A1: bnez r1,A0</li> </ul> |               |
| <ul> <li>Also known as swap, exchange, or fetch-and-add</li> </ul>                                    |               |
|                                                                                                       |               |
| 5 501 (Martin/Roth): Shared Memory Multiprocessors                                                    | 46            |

### Test-and-Set Lock Correctness

| Processor 0         | Processor 1         |  |
|---------------------|---------------------|--|
| A0: t&s r1,0(&lock) |                     |  |
| A1: bnez r1,#A0     | A0: t&s r1,0(&lock) |  |
| CRITICAL_SECTION    | A1: bnez r1,#A0     |  |
|                     | A0: t&s r1,0(&lock) |  |
|                     | A1: bnez r1,#A0     |  |
| + Test-and-set loc  | k actually works    |  |
| Processor 1 kee     | ps spinning         |  |
|                     |                     |  |
|                     |                     |  |
|                     |                     |  |
|                     |                     |  |
|                     |                     |  |
|                     |                     |  |
|                     |                     |  |
|                     |                     |  |
|                     |                     |  |

### Test-and-Set Lock Performance

| <u>Processor 1</u>  | Processor 2         |     |     |   |
|---------------------|---------------------|-----|-----|---|
| A0: t&s r1,0(&lock) |                     | M:1 | l:  | 1 |
| A1: bnez r1,#A0     | A0: t&s r1,0(&lock) | l:  | M:1 | 1 |
| A0: t&s r1,0(&lock) | A1: bnez r1,#A0     | M:1 | l:  | 1 |
| A1: bnez r1,#A0     | A0: t&s r1,0(&lock) | l:  | M:1 | 1 |
|                     | A1: bnez r1,#A0     | M:1 | 1:  | 1 |

- But performs poorly in doing so
  - Consider 3 processors rather than 2
  - Processor 0 (not shown) has the lock and is in the critical section
  - But what are processors 1 and 2 doing in the meantime?
    - Loops of t&s, each of which includes a st
    - Taking turns invalidating each others cache lines
    - Generating a ton of useless bus (network) traffic

### Test-and-Test-and-Set Locks

| • | Solution: | test-and | -test-and | l-set locks |
|---|-----------|----------|-----------|-------------|
|---|-----------|----------|-----------|-------------|

- New acquire sequence
  - A0: ld r1,0(&lock)
  - A1: bnez r1,A0
  - A2: addi r1,1,r1 A3: t&s r1,0(&lock)
  - A4: bnez r1,A0
- Within each loop iteration, before doing a t&s
  - Spin doing a simple test (1d) to see if lock value has changed
  - Only do a t&s (st) if lock is actually free
- Processors can spin on a busy lock locally (in their own cache)
- Less unnecessary bus traffic

| CIS 501 (Martin/Roth): Shared Memory Multiprocessors | CIS 501 | (Martin/Roth): | Shared | Memory | Multiprocessors |  |
|------------------------------------------------------|---------|----------------|--------|--------|-----------------|--|
|------------------------------------------------------|---------|----------------|--------|--------|-----------------|--|

### Test-and-Test-and-Set Lock Performance

| Processor 1        | Processor 2         |     |     | _ |
|--------------------|---------------------|-----|-----|---|
| A0: ld r1,0(&lock) |                     | S:1 | 1:  | 1 |
| A1: bnez r1,A0     | A0: ld r1,0(&lock)  | S:1 | S:1 | 1 |
| A0: ld r1,0(&lock) | A1: bnez r1,A0      | S:1 | S:1 | 1 |
| // lock relea      | ased by processor 0 | l:  | 1:  | 0 |
| A0: ld r1,0(&lock) | A1: bnez r1,A0      | S:0 | 1:  | 0 |
| A1: bnez r1,A0     | A0: ld r1,0(&lock)  | S:0 | S:0 | 0 |
| A2: addi r1,1,r1   | A1: bnez r1,A0      | S:0 | S:0 | 0 |
| A3: t&s r1,(&lock) | A2: addi r1,1,r1    | M:1 | 1:  | 1 |
| A4: bnez r1,A0     | A3: t&s r1,(&lock)  | l:  | M:1 | 1 |
| CRITICAL_SECTION   | A4: bnez r1,A0      | l:  | M:1 | 1 |
|                    | A0: ld r1,0(&lock)  | l:  | M:1 | 1 |
|                    | A1: bnez r1,A0      | l:  | M:1 | 1 |

### Queue Locks

| <ul> <li>Test-and-test-and-set locks can still perform poorly</li> <li>If lock is contended for by many processors</li> <li>Lock release by one processor, creates "free-for-all" by others</li> <li>Network gets swamped with tes requests</li> </ul>                                                                                                                                                                                      |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Queue lock</li> <li>When lock is released by one processor</li> <li>Directory doesn't notify (by invalidations) all waiting processors</li> <li>Instead, chooses one and sends invalidation only to it <ul> <li>Others continue spinning locally, unaware lock was released</li> </ul> </li> <li>Effectively, directory passes lock from one processor to the next <ul> <li>Greatly reduced network traffic</li> </ul> </li> </ul> |

### Queue Lock Performance

| <u>Processor 1</u>              | Processor 2                         |         |     | _ |
|---------------------------------|-------------------------------------|---------|-----|---|
| A0: ld r1,0(&lock)              |                                     | S:1     | 1:  | 1 |
| A1: bnez r1,A0                  | A0: ld r1,0(&lock)                  | S:1     | S:1 | 1 |
| A0: ld r1,0(&lock)              | A1: bnez r1,A0                      | S:1     | S:1 | 1 |
| // lock relea                   | ased by processor 0                 | 1:      | S:1 | 0 |
| A0: ld r1,0(&lock)              | A1: bnez r1,A0                      | S:0     | 1:  | 0 |
| A1: bnez r1,A0                  | A0: ld r1,0(&lock)                  | S:0     | S:0 | 0 |
| A2: addi r1,1,r1                | A1: bnez r1,A0                      | S:0     | S:0 | 0 |
| A3: t&s r1,(&lock)              |                                     | M:1     | 1:  | 1 |
| A4: bnez r1,A0                  | A0: ld r1,0(&lock)                  | S:1     | S:1 | 1 |
| CRITICAL_SECTION                | A1: bnez r1,A0                      | S:1     | S:1 | 1 |
|                                 | A0: ld r1,0(&lock)                  | S:1     | S:1 | 1 |
|                                 | A1: bnez r1,A0                      | S:1     | S:1 | 1 |
| Processor 0 rele                | ases lock, <b>informs only proc</b> | essor 1 |     |   |
| CIS 501 (Martin/Roth): Shared N | Aemory Multiprocessors              |         | 5   | 2 |

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

49

### A Final Word on Locking

- A single lock for the whole array may restrict parallelism
  - · Will force updates to different accounts to proceed serially
  - Solution: one lock per account
  - Locking granularity: how much data does a lock lock?
  - A software issue, but one you need to be aware of

| <pre>struct acct_t { int bal,lock; };</pre>          |    |
|------------------------------------------------------|----|
| <pre>shared struct acct_t accts[MAX_ACCT];</pre>     |    |
| <pre>int id,amt;<br/>acquire(accts[id].lock);</pre>  |    |
| if $(accts[id], bal >= amt)$ {                       |    |
| accts[id].bal -= amt;                                |    |
| <pre>spew_cash(); }</pre>                            |    |
| <pre>release(accts[id].lock);</pre>                  |    |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors | 53 |
|                                                      |    |

### Memory Consistency

| Memory coherence                                                             |    |
|------------------------------------------------------------------------------|----|
| <ul> <li>Creates globally uniform (consistent) view</li> </ul>               |    |
| <ul> <li>Of a single memory location (in other words: cache line)</li> </ul> |    |
| – Not enough                                                                 |    |
| <ul> <li>Cache lines A and B can be individually consistent</li> </ul>       |    |
| But inconsistent with respect to each other                                  |    |
| Memory consistency                                                           |    |
| <ul> <li>Creates globally uniform (consistent) view</li> </ul>               |    |
| Of all memory locations relative to each other                               |    |
| Who cares? Programmers                                                       |    |
| <ul> <li>Globally inconsistent memory creates mystifying behavior</li> </ul> |    |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                         | 54 |

### Coherence vs. Consistency

| Processor 0        | Processor 1            |  |
|--------------------|------------------------|--|
| A=1;               | while (!flag); // spin |  |
| <pre>flag=1;</pre> | print A;               |  |

- Intuition says: P1 prints A=1
- Coherence says: absolutely nothing
  - P1 can see P0's write of flag before write of A!!! How?
    - Maybe coherence event of  ${\bf A}$  is delayed somewhere in network

55

- Maybe P0 has a coalescing write buffer that reorders writes
- Imagine trying to figure out why this code sometimes "works" and sometimes doesn't
- Real systems act in this strange manner

CIS 501 (Martin/Roth): Shared Memory Multiprocessors

Sequential Consistency (SC)

| A=fl                             | Lag=0 ;                                               |     |
|----------------------------------|-------------------------------------------------------|-----|
| Processor 0                      | Processor 1                                           |     |
| A=1;                             | while (!flag); // spin                                |     |
| flag=1;                          | print A;                                              |     |
| <ul> <li>Sequential</li> </ul>   | consistency (SC)                                      |     |
| • Formal de                      | efinition of memory view programmers expect           |     |
| <ul> <li>Processors</li> </ul>   | see their own loads and stores in program order       |     |
| + Provide                        | ed naturally, even with out-of-order execution        |     |
| • But also: p                    | processors see others' loads and stores in program or | der |
| <ul> <li>And finally:</li> </ul> | : all processors see same global load/store ordering  |     |
|                                  | o conditions not naturally enforced by coherence      |     |
| Lamport de                       | efinition: multiprocessor ordering                    |     |
| <ul> <li>Correspond</li> </ul>   | ds to some sequential interleaving of uniprocessor or | der |
| •                                | tinguishable from multi-programmed uni-               |     |
| processor                        | i<br>Shared Memory Multiprocessors 5                  |     |

### **Enforcing SC**

 What does it take to enforce SC? · Definition: all loads/stores globally ordered • Translation: coherence events of all loads/stores globally ordered • When do coherence events happen naturally? · On cache access • For stores: retirement  $\rightarrow$  in-order  $\rightarrow$  good • No write buffer? Yikes, but OK with write-back D\$ For loads: execution → out-of-order → bad - No out-of-order execution? Double yikes Is it true that multi-prcessors cannot be out-of-order? - That would be really bad • Out-of-order is needed to hide cache miss latency • And multi-processors not only have more misses... • ... but miss handling takes longer (coherence actions) 57 CIS 501 (Martin/Roth): Shared Memory Multiprocessors

### Is SC Really Necessary?

• SC

|    | <ul> <li>+ Most closely matches programmer's intuition (don't under-estimate)</li> <li>- Restricts optimization by compiler, CPU, memory system</li> <li>• Supported by MIPS, HP PA-RISC</li> </ul>                                                                                                                                                                                                                                                     |
|----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|    | <ul> <li>Is full-blown SC really necessary? What about</li> <li>All processors see others' loads/stores in program order</li> <li>But not all processors have to see same global order</li> <li>Allows processors to have in-order write buffers</li> <li>Doesn't confuse programmers too much <ul> <li>Synchronized programs (e.g., our example) work as expected</li> </ul> </li> <li>Processor Consistency (PC): e.g., Intel IA-32, SPARC</li> </ul> |
| CI | S 501 (Martin/Roth): Shared Memory Multiprocessors 59                                                                                                                                                                                                                                                                                                                                                                                                   |

### SC + 000

| • Recall: opportunistic load scheduling in a uni-p                                                                                                                                                                  | rocessor    |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------|
| <ul> <li>Loads issue speculatively relative to older stor</li> </ul>                                                                                                                                                |             |
| <ul> <li>Stores scan for younger loads to same address have</li> <li>Find one? Ordering violation → flush and restart</li> </ul>                                                                                    |             |
| <ul> <li>In-flight loads effectively "snoop" older stores from s</li> </ul>                                                                                                                                         | ame process |
| <ul> <li>SC + OOO can be reconciled using same tech</li> <li>Write bus requests from other processors snoop in-fl</li> <li>Think of MOB as extension of the cache hierarchy</li> <li>MIPS R10K does this</li> </ul> |             |
| • SC implementable, but overheads still remain:                                                                                                                                                                     |             |
| Write buffer issues                                                                                                                                                                                                 |             |
| Complicated Id/st logic                                                                                                                                                                                             |             |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                                                                                                                                                                | 58          |

# Weak Memory Ordering

| Four even where we also as it and succession of                                 |          |
|---------------------------------------------------------------------------------|----------|
| For properly synchronized programs                                              |          |
| <ul> <li>Only acquires/releases must be strictly ordered</li> </ul>             |          |
| <ul> <li>Why? Acquire-release pairs define critical s</li> </ul>                | ections  |
| <ul> <li>Between critical-sections: data is private</li> </ul>                  |          |
| <ul> <li>Globally unordered access OK</li> </ul>                                |          |
| <ul> <li>Within critical-section: access to shared data is exclusive</li> </ul> | re       |
| <ul> <li>Globally unordered access also OK</li> </ul>                           |          |
| <ul> <li>Implication: compiler or dynamic scheduling is OK</li> </ul>           |          |
| <ul> <li>As long as re-orderings do not cross synchronization</li> </ul>        | n points |
| • Weak Ordering (WO): Alpha, IA-64, PowerPC                                     |          |
| <ul> <li>ISA provides fence insns to indicate scheduling barriers</li> </ul>    |          |
| <ul> <li>Proper use of fences is somewhat subtle</li> </ul>                     |          |
| <ul> <li>Use synchronization library, don't write your</li> </ul>               | own      |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                            | 60       |

### SC + 000 vs. W0

### • Big debate these days

- Is SC + OOO equal to WO performance wise?
- And if so, which is preferred?

### • Another hot button issue

- Can OOO be used to effectively speculate around locks?
- Short answer: yes

| TC 501 /M- | ortin/Doth). Ch | ared Memory | Multiprocosso |  | 61 |
|------------|-----------------|-------------|---------------|--|----|

### Multiprocessing & Power Consumption

| Multiprocessing can be very power efficient                                                                                                                                                                            |       |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|
| <ul> <li>Recall: dynamic voltage and frequency scaling</li> <li>Performance vs power is NOT linear</li> <li>Example: Intel's Xscale         <ul> <li>1 GHz → 200 MHz reduces energy used by 30x</li> </ul> </li> </ul> |       |
| <ul> <li>Impact of parallel execution</li> <li>What if we used 5 Xscales at 200Mhz?</li> <li>Similar performance as a 1Ghz Xscale, but 1/6th the er</li> <li>5 cores * 1/30th = 1/6th</li> </ul>                       | iergy |
| <ul> <li>Assumes parallel speedup (a difficult task)</li> <li>Remember Ahmdal's law</li> </ul>                                                                                                                         |       |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                                                                                                                                                                   | 63    |

### Multiprocessors Are Here To Stay

| •  | <ul> <li>Moore's law is making the multiprocessor a commodity part</li> <li>500M transistors on a chip, what to do with all of them?</li> <li>Not enough ILP to justify a huge uniprocessor</li> <li>Really big caches? t<sub>hit</sub> increases, diminishing %<sub>miss</sub> returns</li> </ul> |  |  |
|----|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| •  | Chip multiprocessors (CMPs)                                                                                                                                                                                                                                                                        |  |  |
|    | Multiple full processors on a single chip                                                                                                                                                                                                                                                          |  |  |
|    | • Example: IBM POWER4: two 1GHz processors, 1MB L2, L3 tags                                                                                                                                                                                                                                        |  |  |
| •  | <ul> <li>Multiprocessors a huge part of computer architecture</li> <li>Multiprocessor equivalent of H+P is 50% bigger than H+P</li> </ul>                                                                                                                                                          |  |  |
|    | Another entire cores on multiprocessor architecture                                                                                                                                                                                                                                                |  |  |
|    | Hopefully every other year or so (not this year)                                                                                                                                                                                                                                                   |  |  |
| CI | S 501 (Martin/Roth): Shared Memory Multiprocessors 62                                                                                                                                                                                                                                              |  |  |

# Shared Memory Summary

| <ul> <li>Shared-memory multiprocessors</li> <li>+ Simple software: easy data sharing, handles both I</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | DI P and TI P    |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|
| <ul> <li>Complex hardware: must provide illusion of global</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                  |
| <ul> <li>Two basic implementations</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | address space    |
| Symmetric (UMA) multi-processors (SMPs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |                  |
| Underlying communication network: bus (order                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | red)             |
| + Low-latency, simple protocols that rely on globa                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |                  |
| <ul> <li>Low-bandwidth, poor scalability</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |                  |
| • Scalable (NUMA) multi-processors (MPPs)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |                  |
| <ul> <li>Underlying communication network: point-to-perturbed to the perturbation of the perturbation</li></ul> | oint (unordered) |
| <ul> <li>Higher-latency, complex protocols</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |                  |
| CIS 501 (Martin/Roth): Shared Memory Multiprocessors                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | 64               |

# Shared Memory Summary

65

| CIS 501 (Martin/Roth): Shared Memory Multiprocessors |
|------------------------------------------------------|
|                                                      |