Introduction

ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture)
Shared Memory MPs: Coherence & Snooping
Copyright 2004 Daniel J. Sorin, Duke University
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Outline
- Motivation for Cache-Coherent Shared Memory
- Snooping Cache Coherence (Chapter 5): basic systems, design tradeoffs
- Implementing Snooping Systems (Chapter 6)
- Advanced Snooping Systems

What is (Hardware) Shared Memory?
- Take multiple microprocessors

- Implement a memory system with a single global physical address space (usually)
- Communication assist hardware does the magic of cache coherence
- Goal 1: Minimize memory latency — use co-location & caches
- Goal 2: Maximize memory bandwidth — use parallelism & caches

Some Memory System Options
[Figure: four organizations — (a) shared cache: processors behind a switch share an interleaved first-level cache and main memory; (b) bus-based shared memory: per-processor caches on a bus with centralized memory and I/O devices; (c) dancehall: caches and memories on opposite sides of an interconnection network; (d) distributed memory: a memory attached to each processor node, connected by an interconnection network.]

Cache Coherence
- According to Webster's dictionary:
  - Cache: a secure place of storage
  - Coherent: logically consistent
  - Cache Coherence: keep storage logically consistent
- Coherence requires enforcement of 2 properties

1) Write propagation: all writes eventually become visible to other processors
2) Write serialization: all processors see writes to the same block in the same order

Why Cache-Coherent Shared Memory?
- Pluses
  - For applications: looks like a multitasking uniprocessor
  - For the OS: only evolutionary extensions required
  - Easy to do communication without the OS
  - Software can worry about correctness first and then performance
- Minuses
  - Proper synchronization is complex
  - Communication is implicit, so it may be harder to optimize
  - More work for hardware designers (i.e., us!)

- Result
  - Symmetric Multiprocessors (SMPs) are the most successful parallel machines ever
  - And the first with multi-billion-dollar markets!

In More Detail
- Efficient naming
  - Virtual-to-physical mapping with TLBs
  - Ability to name relevant portions of objects
- Easy and efficient caching
  - Caching is natural and well understood
  - Can be done in hardware automatically
- Low communication overhead
  - Low overhead since protection is built into the memory system
  - Easy for hardware (i.e., the cache controller) to packetize requests and replies
- Integration of latency tolerance
  - Demand-driven: consistency models, prefetching, multithreading
  - Can extend to push data to PEs and use bulk transfer

Symmetric Multiprocessors (SMPs)
- Multiple microprocessors, each with a cache (or multiple caches in a hierarchy)
- Connect with a logical bus (totally-ordered broadcast)
  - Physical bus = set of shared wires
  - Logical bus = functional equivalent of a physical bus
- Implement a snooping cache coherence protocol
  - Broadcast all cache misses on the bus
  - All caches snoop the bus and may act (e.g., respond with data)
  - Memory responds otherwise

Outline
- Motivation for Cache-Coherent Shared Memory
- Snooping Cache Coherence (Chapter 5): basic systems, design tradeoffs

- Implementing Snooping Systems (Chapter 6)
- Advanced Snooping Systems

Cache Coherence Problem (Steps 1-3)
[Figure sequence: P1 and P2 reach main memory, which holds x, through an interconnection network.
Step 1: P1 executes "ld r2, x" and caches a copy of x.
Step 2: P2 also executes "ld r2, x" and caches its own copy of x.
Step 3: P2 executes "add r1, r2, r4" then "st x, r1" — P2's copy of x changes, leaving P1's cached copy stale.]

Snooping Cache-Coherence Protocols
- Bus provides the serialization point (more on this later)
- Each cache controller snoops all bus transactions
  - A transaction is relevant if it is for a block this cache contains
  - Take action to ensure coherence: invalidate, update, or supply the value to the requestor if Owner
  - Actions depend on the state of the block and the protocol
- The main memory controller also snoops on the bus

- If no cache is the owner, then memory is the owner
- Simultaneous operation of independent controllers

Snooping Design Choices
- The controller updates the state of blocks in response to processor and snoop events and generates bus transactions
[Figure: processor issues ld/st to the cache (state, tag, data); the controller snoops the bus; often have duplicate cache tags]
- Snooping protocol
  - Set of states, set of events
  - State-transition diagram
  - Actions

- Snoop (observed bus transaction)
- Basic choices
  - Write-through vs. write-back
  - Invalidate vs. update
  - Which states to allow

Simple 2-State Invalidate Snooping Protocol
- Write-through, no-write-allocate cache
- Processor actions: Load, Store
- Bus actions: BusRd, BusWr
- Notation: observed event / action taken
- Transitions:
  - Valid: Load / --; Store / OwnBusWr; OtherBusRd / --; OtherBusWr / -- (go to Invalid)
  - Invalid: Load / OwnBusRd (go to Valid); Store / OwnBusWr; OtherBusRd / --; OtherBusWr / --

A 3-State Write-Back Invalidation Protocol
- 2-state protocol
  - Plus: simple hardware and protocol
  - Minus: uses lots of bandwidth (every write goes on the bus!)
- 3-state protocol (MSI)
  - Modified: one cache exclusively has the valid (modified) copy — it is the Owner; memory is stale
  - Shared: one or more caches and memory have a valid copy (memory = owner)
  - Invalid (only memory has a valid copy, and memory is the owner)
- Must invalidate all other copies before entering the Modified state

- Requires a bus transaction (to order and invalidate)

MSI Processor and Bus Actions
- Processor:
  - Load
  - Store
  - Writeback on replacement of a modified block
- Bus:
  - Bus Read (BusRd): read without intent to modify; data could come from memory or another cache
  - Bus Read-Exclusive (BusRdX): read with intent to modify; must invalidate all other caches' copies
  - Writeback (BusWB): cache controller puts contents on the bus and memory is updated
- Definition: a cache-to-cache transfer occurs when another cache satisfies a BusRd or BusRdX request
- Let's draw it!

MSI State Diagram
[State-transition diagram, written out:
- M: Load / --; Store / --; OtherBusRd / OwnBusWB (go to S); OtherBusRdX / OwnBusWB (go to I); Writeback / OwnBusWB (go to I)
- S: Load / --; Store / OwnBusRdX (go to M); OtherBusRd / --; OtherBusRdX / -- (go to I); Writeback / -- (go to I)
- I: Load / OwnBusRd (go to S); Store / OwnBusRdX (go to M)]
- Note: we never take any action on an OtherBusWB

An MSI Protocol Example

Proc Action     P1 state   P2 state   P3 state   Bus Act   Data from
initially       I          I          I          --        --
1. P1 load u    I -> S     I          I          BusRd     Memory
2. P3 load u    S          I          I -> S     BusRd     Memory
3. P3 store u   S -> I     I          S -> M     BusRdX    Memory or P1 (?)
4. P1 load u    I -> S     I          M -> S     BusRd     P3's cache
5. P2 load u    S          I -> S     S          BusRd     Memory

- Single-writer, multiple-reader protocol
- Why Modified to Shared in line 4?
- What if the block is not in any cache? Memory responds
- Read then Write produces 2 bus transactions
  - Slow and wasteful of bandwidth for a common sequence of actions
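The example above can be replayed as a small simulation. The following Python sketch (names and structure are my own, not from the slides) models per-cache MSI states and the bus transaction each access generates:

```python
# Minimal MSI snooping-protocol sketch for a single block u (illustrative,
# not a full simulator: no data values, no writeback buffer).

def access(caches, requester, op):
    """Perform a load/store by `requester`; return the bus transaction used."""
    state = caches[requester]
    bus = None
    if op == "load":
        if state == "I":
            bus = "BusRd"
            caches[requester] = "S"
            # Snoopers: an owner in M must write back and drop to S.
            for p in caches:
                if p != requester and caches[p] == "M":
                    caches[p] = "S"
    elif op == "store":
        if state != "M":
            bus = "BusRdX"  # read-exclusive: invalidates all other copies
            for p in caches:
                if p != requester:
                    caches[p] = "I"
            caches[requester] = "M"
    return bus

caches = {"P1": "I", "P2": "I", "P3": "I"}
trace = [("P1", "load"), ("P3", "load"), ("P3", "store"),
         ("P1", "load"), ("P2", "load")]
for proc, op in trace:
    bus = access(caches, proc, op)
    print(proc, op, bus, caches)
```

Running the five-step trace reproduces the table: BusRd, BusRd, BusRdX, BusRd, BusRd, with P3 dropping from M to S when P1 re-reads u.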

4-State (MESI) Invalidation Protocol
- Often called the Illinois protocol
- Modified (dirty)
- Exclusive (clean, unshared): only copy, not dirty
- Shared
- Invalid
- Requires a "shared" signal to detect whether other caches have a copy of the block
  - I -> S if the shared signal is asserted by any other cache
  - I -> E otherwise
- The E -> M transition doesn't require a bus transaction
  - Why is this a good thing?

More Generally: MOESI

[Sweazey & Smith, ISCA '86]
- M - Modified (dirty)
- O - Owned (dirty but shared) — WHY?
- E - Exclusive (clean, unshared): only copy, not dirty
- S - Shared
- I - Invalid
[Figure: M and O imply ownership; M and E imply exclusiveness; every state but I implies validity]
- Variants: MSI, MESI, MOSI, MOESI

4-State Write-back Update Protocol
- Dragon (Xerox PARC)
- States:
  - Exclusive (E): one copy, clean, memory is up-to-date
  - Shared-Clean (SC): could be two or more copies; memory state unknown
  - Shared-Modified (SM): could be two or more copies; memory stale
  - Modified (M)
- On a write, send the new data to update the other caches
- Adds a BusUpdate transaction
  - Do not confuse Update with Upgrade!

- Adds a cache controller Update operation
- Must obtain the bus before updating the local copy
- The bus is the serialization point — important for consistency (later...)

Tradeoffs in Protocol Design (C&S 5.4)
- Design issues:
  - Which state transitions
  - What types of bus transactions
  - Cache block size
  - Cache associativity
  - Write-back vs. write-through caching
  - Etc.
- Methodology: count protocol state transitions
  - Can then compute bandwidth, miss rates, etc.

- Results depend on the workload (different workloads → different transition rates)

Computing Bandwidth
- Why bandwidth?
- Monitor state transitions
- Count bus transactions
- We know how many bytes each bus transaction requires

MESI State Transitions and Bandwidth
[Table reconstructed from the slide; each entry gives the bus transaction and its cost in bytes (6 bytes of command/address, plus 64 bytes for a data block). NP = not present.]

FROM/TO   NP           I            E            S            M
NP        --           --           BusRd 6+64   BusRd 6+64   BusRdX 6+64
I         --           --           BusRd 6+64   BusRd 6+64   BusRdX 6+64
E         --           --           --           --           --
S         --           --           NA           --           BusUpgr 6
M         BusWB 6+64   BusWB 6+64   NA           BusWB 6+64   --

Study #1: Bandwidth of MSI vs. MESI
- Assume an X MIPS/MFLOPS processor
  - Use with measured state transition counts to obtain transitions/sec
  - Compute state transitions/sec → bus transactions/sec → bytes/sec
- What is the bandwidth savings of MESI over MSI?
  - The difference between the protocols is the Exclusive state
  - MSI requires an extra BusUpgr where MESI makes the silent E -> M transition
- Result: very small benefit! Why?
  - Small number of E -> M transitions
  - BusUpgr consumes only 6 bytes on the bus
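The methodology can be made concrete with a short sketch. The transition rates below are hypothetical numbers of my own, purely to show the arithmetic; only the per-transaction byte costs come from the slides:

```python
# Bytes per bus transaction: 6 bytes of command/address, +64 if data moves.
COST = {"BusRd": 6 + 64, "BusRdX": 6 + 64, "BusWB": 6 + 64, "BusUpgr": 6}

def bytes_per_second(rates):
    """Sum bus traffic implied by per-transaction rates (transactions/sec)."""
    return sum(rate * COST[t] for t, rate in rates.items())

# Hypothetical measured rates for one workload (transactions per second):
msi  = {"BusRd": 2.0e6, "BusRdX": 5.0e5, "BusWB": 3.0e5, "BusUpgr": 1.0e5}
# MESI: the stores that needed BusUpgr in MSI become silent E -> M moves.
mesi = {"BusRd": 2.0e6, "BusRdX": 5.0e5, "BusWB": 3.0e5, "BusUpgr": 0.0}

savings = bytes_per_second(msi) - bytes_per_second(mesi)
print(savings / bytes_per_second(msi))  # a tiny fraction: BusUpgr is 6 bytes
```

With these numbers the saving is well under 1% of total bus traffic, which is the slide's point: the transitions MESI eliminates are both rare and cheap.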

Study #2: MSI BusUpgr vs. BusRdX
- MSI's S -> M transition issues a BusUpgr
  - The block could be invalidated while waiting for the BusUpgr response
  - Adds complexity to detect this
- Could instead just issue BusRdX
  - Pretend the block is in the Invalid state
- BusUpgr achieves ~10% to 20% improvement compared to just issuing BusRdX
  - Application dependent (as are all studies in architecture!)

Cache Block Size
- Block size is the unit of transfer and of coherence
  - It doesn't have to be; coherence could be at a smaller granularity [Goodman]
- Uniprocessor 3Cs: Compulsory, Capacity, Conflict
- Shared memory adds the Coherence miss type (a 4th C)
  - True sharing miss: fetches data written by another processor
  - False sharing miss: results from independent data in the same coherence block

- Increasing block size
  - Usually fewer 3C misses but more bandwidth
  - Usually more false sharing misses
- P.S. on increasing cache size
  - Usually fewer capacity/conflict misses (& compulsory misses don't matter)
  - No effect on true/false coherence misses (so they may dominate)

Study #3: Invalidate vs. Update
- Pattern 1 (one write before reads):
    for i = 1 to k
      P1 (write, x);
      P2 through PN-1 (read, x);
    end for
- Pattern 2 (many writes before reads):
    for i = 1 to k
      for j = 1 to m
        P1 (write, x);
      end for
      P2 (read, x);
    end for

Invalidate vs. Update, cont.
- Pattern 1 (one write before reads), N = 16, M = 10, K = 10
  - Update
    - Iteration 1: N regular cache misses (70 bytes each)
    - Remaining iterations: one update per iteration (14 bytes: 6 control, 8 data)
    - Total update traffic = 16*70 + 9*14 = 1,246 bytes
    - (Book assumes 10 updates instead of 9)
  - Invalidate
    - Iteration 1: N regular cache misses (70 bytes each)
    - Remaining iterations: P1 generates an upgrade (6 bytes); the 15 others each read-miss (70 bytes)
    - Total invalidate traffic = 16*70 + 9*6 + 15*9*70 = 10,624 bytes
- Pattern 2 (many writes before reads)
  - Update = 1,400 bytes
  - Invalidate = 824 bytes
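The traffic totals above can be checked with a few lines of arithmetic (per-event byte costs from the slide; the accounting mirrors the slide's, counting 9 post-miss iterations in Pattern 1 and every write as an update in Pattern 2):

```python
# Per-event bus costs from the slide: a miss moves a 6-byte command/address
# plus a 64-byte block; an update is 6 control + 8 data bytes; an upgrade is 6.
MISS, UPDATE, UPGRADE = 70, 14, 6
N, K, M = 16, 10, 10   # processors; outer iterations; writes per iteration

# Pattern 1: P1 writes once, P2..PN-1 read, repeated K times.
p1_update = N * MISS + (K - 1) * UPDATE
p1_inval  = N * MISS + (K - 1) * (UPGRADE + (N - 1) * MISS)
print(p1_update, p1_inval)          # 1246 10624

# Pattern 2: P1 writes M times, then P2 reads once, repeated K times.
p2_update = K * M * UPDATE                          # one update per write
p2_inval  = 2 * MISS + (K - 1) * (UPGRADE + MISS)   # P1 upgrades, P2 re-misses
print(p2_update, p2_inval)          # 1400 824
```

The asymmetry is the whole study: update wins when readers consume each write (Pattern 1), invalidate wins when many writes go by between reads (Pattern 2).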

Invalidate vs. Update, cont.
- What about real workloads?
  - Update can generate too much traffic
  - Must selectively limit it
- Current assessment
  - Update is very hard to implement correctly (because of the consistency discussion coming in a couple of weeks)
  - Rarely done
- Future assessment
  - May be the same as current, or
  - Chip multiprocessors may revive update protocols
    - More intra-chip bandwidth
    - Easier to have predictable timing paths?

Outline

- Motivation for Cache-Coherent Shared Memory
- Snooping Cache Coherence (Chapter 5)
- Implementing Snooping Systems (Chapter 6)
- Advanced Snooping Systems

Review: Symmetric Multiprocessors (SMP)
- Multiple (micro-)processors
- Each has a cache (today, a cache hierarchy)
- Connect with a logical bus (totally-ordered broadcast)
- Implement a snooping cache coherence protocol
  - Broadcast all cache misses on the bus
  - All caches snoop the bus and may act
  - Memory responds otherwise

Review: MSI State Diagram

[Same MSI state-transition diagram as shown earlier; recall that we never take any action on an OtherBusWB.]

Implementation Issues
- How does memory know another cache will respond, so it doesn't have to?
- Is it okay if a cache miss is not an atomic event (check tags, queue for bus, get bus, etc.)?

- What about L1/L2 caches & split-transaction buses?
- Is deadlock a problem?
- What happens on a PTE update with multiple TLBs?
- Can one use virtual caches in SMPs?
- This is why they pay architects the big bucks!

Outline for Implementing Snooping
- Coherence control implementation
- Writebacks, non-atomicity
- Hierarchical caches
- Split buses
- Deadlock, livelock, & starvation
- Three case studies
- TLB coherence
- Virtual cache issues

Snooping SMP Design Goals

- Goals
  - Correctness
  - High performance
  - Simple hardware (reduced complexity & cost)
- Conflicts between goals
  - High performance → multiple outstanding low-level events → more complex interactions → more potential correctness bugs

Base Cache Coherence Design
- Single-level write-back cache
- Invalidation protocol
- One outstanding memory request per processor
- Atomic memory bus transactions
  - No interleaving of transactions
- Atomic operations within a process

- One finishes before the next in program order
- Now, we're going to gradually add complexity
  - Why? Faster latencies and higher bandwidths!

Cache Controllers and Tags
- On a miss in a uniprocessor:
  - Assert request for the memory bus
  - Wait for bus grant
  - Drive address and command lines
  - Wait for the command to be accepted by the relevant device
  - Transfer data
- In a snoop-based multiprocessor, the cache controller must:
  - Monitor the bus and serve the processor
    - Can view it as two controllers: bus-side and processor-side
  - With a single-level cache: dual tags (not data) or a dual-ported tag RAM

- Synchronize tags on updates
- Respond to bus transactions when necessary

Reporting Snoop Results: How?
- The collective response from the caches must appear on the bus
- Wired-OR signals
  - Shared: asserted if any cache has a copy
  - Dirty/Inhibit: asserted if some cache has a dirty copy
    - Don't need to know which cache, since it will do what's necessary
  - Snoop-valid: asserted when it is OK to check the other two signals
- May require a priority scheme for cache-to-cache transfers
  - Which cache should supply data when in shared state?
  - Commercial implementations allow memory to provide data
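A sketch of how the wired-OR snoop result is formed — Python standing in for what is really combinational logic, with function and signal names of my own choosing:

```python
# Wired-OR snoop result: every cache drives its line; the bus sees the OR.
def snoop_result(states):
    """states: per-cache MOESI-style state for the requested block."""
    shared = any(s in ("M", "O", "E", "S") for s in states)  # some valid copy
    dirty  = any(s in ("M", "O") for s in states)            # inhibit memory
    return shared, dirty

# Memory supplies the data only if no cache asserts dirty/inhibit.
states = ["I", "S", "I", "S"]
shared, dirty = snoop_result(states)
print(shared, dirty)  # True False -> requestor loads Shared, memory responds
```

When `dirty` is asserted, memory backs off and the owning cache sources the block; the requestor also uses `shared` to pick S vs. E on a read miss.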

Reporting Snoop Results: When?
- Memory needs to know what, if anything, to do
- Static delay: a fixed number of clocks from the address appearing on the bus
  - Dual tags required to reduce contention with the processor
  - Still must be conservative (update both on write: E -> M)
  - Pentium Pro, HP servers, Sun Enterprise (pre E-10K)
- Variable delay
  - Memory assumes a cache will supply the data until all say "sorry"
  - Less conservative, more flexible, more complex
  - Memory can fetch the data early and hold it (SGI Challenge)
- Immediately: bit-per-block state in memory
  - HW complexity in a commodity main memory system

Writebacks
- Must allow the processor to proceed on a miss
  - Fetch the block
  - Perform the writeback later
- Need a writeback buffer
  - Must handle bus transactions in the writeback buffer

  - Snoop the writeback buffer
- Must care about the order of reads and writes
- Affects the memory consistency model

Base Organization
[Figure: the processor sits above a cache data RAM whose tags and state are kept twice — one copy for the processor and one for snooping. A processor-side controller and a bus-side controller each have their own comparator; a write-back buffer and a data buffer sit between the cache and the system bus (Addr, Cmd, Data lines).]

Optimization #1: Non-Atomic State Transitions
- Operations involve multiple actions
  - Look up cache tags
  - Bus arbitration
  - Check for writeback
  - Even if the bus is atomic, the overall set of actions is not
- Race conditions among multiple operations
  - Suppose P1 and P2 attempt to write cached block A
  - Each decides to issue BusUpgr to upgrade from S -> M

- Issues
  - Handle requests for other blocks while waiting to acquire the bus
  - Must handle requests for this block A

Non-Atomicity → Transient States
- Two types of states
  - Stable (e.g., MOESI)
  - Transient or intermediate
- Increases complexity
[Figure: MESI state diagram extended with transient states such as S -> M, I -> M, and I -> S,E. A write in S issues a BusReq and waits in S -> M until BusGrant, then issues BusUpgr; a BusRdX observed while waiting forces a fallback (to I -> M, issuing BusRdX on grant); a read miss waits in I -> S,E and moves to S or E when its BusRd is granted, based on the shared signal; BusRd/BusRdX observed in M or E cause a Flush.]

Optimization #2: Multi-Level Cache Hierarchies
- How to snoop with multi-level caches?
  - Independent bus snooping at every level?
  - Maintain cache inclusion?
- Requirements for Inclusion
  - Data in the higher level is a subset of data in the lower level
  - Modified in the higher level → marked modified in the lower level
- Now only need to snoop the lowest-level cache
  - If L2 says not present (or not modified), then the same holds in L1
- Is inclusion automatically preserved?
  - Replacements: all higher-level misses go to the lower level
  - Modifications
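The inclusion requirement can be stated as a quick check. This is a sketch with hypothetical cache contents; in hardware it is an invariant the controllers maintain, not a runtime test:

```python
# Inclusion: every block in L1 is also in L2, and an L1-modified block is
# marked modified in L2 so that snooping L2 alone gives the right answer.
def inclusion_holds(l1, l2):
    """l1, l2: dicts mapping block address -> state ('M', 'S', ...)."""
    for addr, state in l1.items():
        if addr not in l2:
            return False                  # L1 holds a block L2 evicted
        if state == "M" and l2[addr] != "M":
            return False                  # L2 would give a stale snoop answer
    return True

l1 = {0x40: "S", 0x80: "M"}
l2 = {0x40: "S", 0x80: "M", 0xC0: "S"}
print(inclusion_holds(l1, l2))   # True
```

Each violation case in the next slide maps to one of the two `False` branches: a replacement decision that evicts from L2 but not L1 breaks the subset condition, and a write that modifies L1 without updating L2's state breaks the second.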

Violations of Inclusion
- The L1 and L2 may choose to replace different blocks
  - Differences in reference history
    - e.g., a set-associative first-level cache with LRU replacement
  - Split higher-level caches
    - Instruction & data blocks go in different caches at L1, but collide in L2
    - What if L2 is set-associative?
  - Differences in block size
- But a common case works automatically
  - L1 direct-mapped, L1 has fewer sets than L2, and L1 and L2 have the same block size

Inclusion: To Be or Not To Be
- Most common inclusion solution
  - Ensure L2 holds a superset of L1-I and L1-D
  - On an L2 replacement or coherence request that must source data or invalidate, forward the action to the L1 caches
  - Can maintain bits in the L2 cache to filter some actions from forwarding
- But inclusion may not be ideal
  - Restricted associativity in a unified L2 can limit the blocks in the split L1s
  - Not that hard to always snoop the L1s
  - If L2 isn't much bigger than L1, then inclusion is wasteful
- Thus, many new designs don't maintain inclusion
  - Exclusion: no block is in more than one cache
  - Not Inclusive != Exclusive, and Not Exclusive != Inclusive

Optimization #3: Split-Transaction (Pipelined) Bus
- Supports multiple simultaneous transactions
- Higher throughput!! (perhaps worse latency)
[Figure: timeline contrasting an atomic bus — request, delay, and response occupy the bus end-to-end — with a split-transaction bus that overlaps the phases of different transactions.]
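The throughput claim can be illustrated with back-of-the-envelope arithmetic; the cycle counts below are hypothetical, chosen only to show the shape of the comparison:

```python
# Hypothetical timing: a transaction = 5 bus cycles of request,
# 40 cycles of memory delay, 5 cycles of response.
REQ, DELAY, RESP = 5, 40, 5
N = 8  # outstanding transactions

# Atomic bus: the bus is held idle through the memory delay.
atomic_cycles = N * (REQ + DELAY + RESP)

# Split-transaction bus: the bus is released during the memory delay, so
# the requests/responses of other transactions overlap it (ideal pipelining).
split_cycles = N * (REQ + RESP) + DELAY  # only the first delay is exposed

print(atomic_cycles, split_cycles)  # 400 vs 120 cycles for 8 transactions
```

Note that each individual transaction still takes at least REQ + DELAY + RESP cycles, which is the "perhaps worse latency" caveat: splitting helps throughput, not the latency of any one miss.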

Potential Problems
- Two transactions to the same block (conflicting)
- Mid-transaction snoop hits
  - E.g., in S, going to M, observe a BusRdX
- Buffering requests and responses
  - Need flow control to prevent deadlock
- Ordering of snoop responses
  - When does the snoop response appear with respect to the data response?

One Solution (like the SGI PowerPath-2)
- NACK (Negative ACKnowledgment) for flow control
  - A snooper can NACK a transaction if it can't buffer it
- Out-of-order responses

- Snoop results presented with the data response
- Disallow multiple concurrent transactions to one line
  - Not necessary, but it can improve designer sanity

Serialization Point in Split-Transaction Buses
- Is the bus still the serialization point?
  - Yes! When a request wins the bus, it is serialized (unless NACKed)
  - The data and snoop responses can show up much later
  - Snoop decisions are made based on what has been serialized
- Example (allows multiple outstanding requests to the same block)
  - Initially: block B is Invalid in all caches
  - P1 issues BusRdX for B, waits for the bus
  - P2 issues BusRdX for B, waits for the bus
  - P2's request wins the bus (but no data from memory until later)
  - P1's request wins the bus — who responds?
  - P2 will respond, since P2 is the owner (even before the data arrives!)
  - P2 receives the data from memory
  - P2 sends the data to P1

A More General Split-Transaction Bus Design
- 4 buses + flow control and snoop results
  - Command (type of transaction)
  - Address
  - Tag (unique identifier for the response)
  - Data (doesn't require the address)
- Forms of transactions
  - BusRd, BusRdX (request + response)
  - Writeback (request + data)
  - Upgrade (request only)

- A per-processor request table tracks all transactions

Multi-Level Caches with Split Bus
[Figure: two processors, each with an L1 and L2 cache connected to the bus; numbered queues carry responses/requests between processor and L1, between L1 and L2, and between L2 and the bus, in both directions.]

Multi-Level Caches with Split-Transaction Bus
- The general structure uses queues between
  - the bus and the L2 cache
  - the L2 cache and the L1 cache
- Many potential deadlock problems
- Classify all messages to break cyclic dependences
  - Requests only generate responses
  - Responses don't generate any other messages
- The requestor guarantees space for all responses
- Use separate request and response queues

More on Correctness
- Partial correctness (never wrong): maintain coherence and consistency
- Full correctness (always right): prevent:
- Deadlock:

  - All system activity ceases
  - Cycle of resource dependences [Figure: A and B each waiting on the other]
- Livelock:
  - No processor makes forward progress
  - Constant ongoing transactions at the hardware level
  - E.g., simultaneous writes in an invalidation-based protocol
- Starvation:
  - Some processors make no forward progress
  - E.g., an interleaved memory system with NACK on bank busy

Deadlock, Livelock, Starvation
- Deadlock: can be caused by request-reply protocols
  - When issuing requests, must service incoming transactions
  - E.g., a cache awaiting a bus grant must snoop & flush blocks
  - Else it may not respond to the request that will release the bus: deadlock
- Livelock:

  - Window of vulnerability problem [Kubi et al., MIT]
  - Handling invalidations between obtaining ownership & the write
  - Solution: don't let exclusive ownership be stolen before the write
- Starvation:
  - Solve by using fair arbitration on the bus and FIFO buffers

Deadlock Avoidance
- Responses are never delayed by requests waiting for a response
- Responses are guaranteed to be sunk
- Requests will eventually be serviced, since the number of responses is bounded by the number of outstanding requests
- Must classify messages according to deadlock and coherence semantics
  - If type 1 messages (requests) spawn type 2 messages (responses), then type 2 messages can't be allowed to spawn type 1 messages
  - More generally, must avoid cyclic dependences among messages
  - We will see that directory protocols often have 3 message types: Request, ForwardedRequest, Response

SGI Challenge Overview
[Figure: (a) a four-processor board with R4400 CPUs and caches; (b) machine organization — interleaved memory (16 GB maximum) and an I/O subsystem (HPPI, graphics, SCSI-2, VME-64) on the Powerpath-2 bus (256 data bits, 40 address bits, 47.6 MHz).]
- 36 MIPS R4400s (peak 2.7 GFLOPS, 4 per board) or 18 MIPS R8000s (peak 5.4 GFLOPS, 2 per board)
- 8-way interleaved memory (up to 16 GB)
- 1.2 GB/s Powerpath-2 bus @ 47.6 MHz, 16 slots, 329 signals
- 128-byte lines (1 + 4 cycles)
- Split-transaction with up to 8 outstanding reads
  - All transactions take five cycles
- Miss latency nearly 1 us (mostly on the CPU board, not the bus)

Processor and Memory Systems
[Figure: four MIPS R4400s with L2 caches on one board; each processor has a CC-chip with duplicate tags, all sharing one A-chip and four bit-sliced D-chips (slices 1-4) onto the Powerpath-2 bus.]
- 4 MIPS R4400 processors per board share the A and D chips
- The A chip has the address bus interface, request table, and control logic
- One CC chip per processor holds a duplicate set of tags
- Processor requests go from the CC chip to the A chip to the bus
- 4 bit-sliced D chips interface the CC chip to the bus

SGI Powerpath-2 Bus
- Non-multiplexed (i.e., separate address and data), 256-bit data / 40-bit address, 47.6 MHz, 8 outstanding requests
- Wide → more interface chips → higher latency, but more bandwidth at a slower clock
- Large block size also calls for a wide bus
- Uses the Illinois MESI protocol (cache-to-cache sharing)
- More detail in the chapter
[Figure: bus state machine — 1. Arbitration (at least one requestor), 2. Resolution, 3. Address, 4. Decode, 5. Acknowledge; idle when there are no requestors]

Bus Design and Request-Response Matching
- Essentially two separate buses, arbitrated independently
  - Request bus for command and address
  - Response bus for data
- Out-of-order responses imply the need to match each request with its corresponding response
  - A request gets a 3-bit tag when it wins arbitration (8 outstanding max)
  - A response includes the data as well as the corresponding request tag
  - Tags let a response skip the address bus, leaving it free
- Separate bus lines for arbitration and for snoop results

Bus Design (continued)
- Each of the request and response phases is 5 bus cycles
  - Response: 4 cycles for data (128 bytes on a 256-bit bus), 1 turnaround
  - Request phase: arbitration, resolution, address, decode, ack
  - A request-response transaction takes 3 or more of these phases
[Figure: pipelined timing of two read operations — each phase cycles through Arb, Rslv, Addr, Dcd, Ack on the address bus while data transfers (D0-D3) and tag checks proceed on the data bus.]
- Cache tags are looked up in the decode cycle; extend the ack cycle if that is not possible
  - Determine who will respond, if any
  - The actual response comes later, with re-arbitration
- Write-backs have a request phase only: arbitrate for both the data and address buses

Bus Design (continued)
- Flow control through negative acknowledgement (NACK)
- No conflicting requests for the same block allowed on the bus
  - 8 outstanding requests total makes conflict detection tractable
- Eight-entry request table in each cache controller
  - A new request on the bus is added to every table at the same index, determined by its tag
  - An entry holds the address, request type, state in that cache (if determined already), ...
  - All entries are checked on bus or processor accesses for a match, so the table is fully associative
  - An entry is freed when the response appears, so its tag can be reassigned by the bus
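A sketch of such a request table — a Python stand-in for the hardware structure, with field and method names of my own:

```python
# Eight-entry request table, indexed by the 3-bit tag assigned at arbitration.
# Every cache controller keeps an identical copy, updated by watching the bus.
class RequestTable:
    def __init__(self, entries=8):
        self.table = [None] * entries            # None = free entry

    def allocate(self, tag, addr, kind):
        """Record a new bus request under its tag (same index in every copy)."""
        assert self.table[tag] is None, "tag still in flight"
        self.table[tag] = {"addr": addr, "kind": kind}

    def match(self, addr):
        """Fully-associative check: is a request to this block outstanding?"""
        return any(e is not None and e["addr"] == addr for e in self.table)

    def free(self, tag):
        """Response observed: entry freed, tag reusable by the bus."""
        self.table[tag] = None

rt = RequestTable()
rt.allocate(3, 0x1000, "BusRd")
print(rt.match(0x1000))   # True -> a conflicting request would be NACKed
rt.free(3)
print(rt.match(0x1000))   # False
```

The `match` check is what enforces the no-conflicting-requests rule above: a controller about to issue a request to a block with a table hit must wait (or NACK), and with only 8 entries the associative search stays cheap.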

Bus Interface with Request Table
[Figure: the bus interface contains the eight-entry request table (address, tag, originator, "my response" flag, miscellaneous information), request and response queues, snoop state from the cache, a write-back buffer, a data buffer, comparators, and issue + merge-check logic between the cache and the addr+cmd and data+tag buses.]

Memory Access Latency
- 250 ns access time from address on the bus to data on the bus
- But the overall latency seen by the processor is 1000 ns!
  - 300 ns for the request to get from the processor to the bus
    - Down through the cache hierarchy, CC chip, and A chip
  - 400 ns later, the data gets to the D chips
    - 3 bus cycles to the address phase of the request transaction, 12 to access main memory, 5 to deliver the data across the bus to the D chips
  - 300 ns more for the data to get to the processor chip
    - Up through the D chips, the CC chip, and the 64-bit-wide interface to the processor chip; load the data into the primary cache, restart the pipeline
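The breakdown sums back to the headline number, and the bus-cycle counts are consistent with the 47.6 MHz clock (about 21 ns per cycle):

```python
# Where the ~1000 ns miss latency goes (component numbers from the slide).
request_to_bus = 300   # processor -> cache hierarchy -> CC chip -> A chip
bus_and_memory = 400   # 3 + 12 + 5 bus cycles: address, DRAM access, transfer
data_to_cpu    = 300   # D chips -> CC chip -> 64-bit CPU interface -> restart

total = request_to_bus + bus_and_memory + data_to_cpu
print(total)           # 1000 ns, vs. only 250 ns of bus-visible access time

cycle_ns = 1000 / 47.6                 # ~21 ns per Powerpath-2 bus cycle
print(round((3 + 12 + 5) * cycle_ns))  # ~420 ns, matching the slide's ~400 ns
```

The point of the slide survives the arithmetic: three quarters of the miss latency is spent on the CPU board's chip crossings, not on the bus or in DRAM.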

Challenge I/O Subsystem
[Figure: multiple I/O cards on the system bus; each card's HIO bus (320 MB/s) connects personality ASICs (HIO SCSI, HIO VME, HIO HPPI, HIO graphics, HIO peripherals) through an address map, datapath, and system-bus interface to the system address bus and the 1.2 GB/s system data bus.]
- Multiple I/O cards on the system bus, each with a 320 MB/s HIO bus
  - Personality ASICs connect these to devices (standard and graphics)
- Proprietary HIO bus
  - 64-bit multiplexed address/data, split read transactions, up to 4 per device
  - Pipelined, but centralized arbitration, with several transaction lengths
  - Address translation via a mapping RAM in the system bus interface
- An I/O board acts like a processor to the memory system

SUN Enterprise 6000 Overview
[Figure: CPU/memory cards (two processors with external L2 caches, a memory controller, and a bus interface/switch) and I/O cards on the Gigaplane bus (256 data bits, 41 address bits, 83 MHz).]
- Up to 30 UltraSPARC processors, MOESI protocol
- The Gigaplane bus has a peak bandwidth of 2.67 GB/s and 300 ns latency
- Up to 112 outstanding transactions (max 7 per board)
- 16 bus slots, for processing or I/O boards
  - 2 CPUs and 1 GB of memory per board
  - Memory is distributed, but the protocol treats it as centralized (UMA)

Sun Gigaplane Bus
- Non-multiplexed, split-transaction, 256-bit data / 41-bit address, 83.5 MHz (plus 32 ECC lines, 7 tag, 18 arbitration, etc.; 388 signals total)
- Cards plug in on both sides: 8 per side
- 112 outstanding transactions, up to 7 from each board
  - Designed for multiple outstanding transactions per processor
- Emphasis on reducing latency, unlike the Challenge
  - Speculative arbitration if the address bus is not scheduled from the previous cycle
  - Else regular 1-cycle arbitration, with a 7-bit tag assigned in the next cycle
- Snoop result associated with the request (5 cycles later)
- Main memory can stake a claim to the data bus 3 cycles into this and start the memory access speculatively
  - Two cycles later, it asserts the tag bus to inform others of the coming transfer
- MOESI protocol

Gigaplane Bus Timing

[Figure: Gigaplane bus timing diagram over cycles 0-14 for two reads (Rd A, Rd B), showing the address, state (share, own, ~own), arbitration, status, tag (cancel/OK), and data (D0, D1) buses.]

Enterprise Processor and Memory System
- 2 processors per board, external L2 caches, 2 memory banks with crossbar
- Data lines buffered through the UDB to drive the internal 1.3 GB/s UPA bus
- Wide path to memory, so a full 64-byte line transfers in 2 bus cycles
[Figure: board datapath: two UltraSparc processors with L2 caches/tags and UDBs, an address controller with D-tags, a data controller (crossbar) with 288-bit internal paths and a 576-bit path to memory (16 72-bit SIMMs), two FiberChannel modules, SysIO ASICs with 64-bit 25 MHz SBUS slots, fast wide SCSI, 10/100 Ethernet, and Gigaplane connectors.]

Enterprise I/O System
- I/O board has the same bus interface ASICs as processor boards
  - But the internal bus is half as wide, and there is no memory path
- Only cache-block-sized transactions, like processing boards
  - Uniformity simplifies design
  - ASICs implement a single-block cache that follows the coherence protocol
- Two independent 64-bit, 25 MHz SBuses
  - One for two dedicated FiberChannel modules connected to disk
  - One for Ethernet and fast wide SCSI
  - Can also support three SBUS interface cards for arbitrary peripherals
- Performance and cost of I/O scale with the number of I/O boards

Sun Enterprise 10000 (aka E10K or Starfire)
- How far can you go with snooping coherence?
- Quadruple request/snoop bandwidth using four logical address buses
  - Each handles 1/4 of the physical address space
  - Impose a logical ordering for consistency: for writes on the same cycle, those on bus 0 occur before bus 1, etc.
- Get rid of the data bandwidth problem: use a network
  - E10K uses a 16x16 crossbar between CPU boards and memory boards
- Each CPU board has up to 4 CPUs: max 64 CPUs total
- 10.7 GB/s max bandwidth, 468 ns unloaded miss latency
- We will discuss a paper on the E10K later

Outline for Implementing Snooping Coherence
- Control Implementation
- Writebacks, Non-Atomicity, & Serialization/Order
- Hierarchical Cache
- Split Buses
- Deadlock, Livelock, & Starvation
- Three Case Studies
- TLB Coherence
- Virtual Cache Issues
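The E10K's four-logical-bus idea can be illustrated with a short sketch. The choice of low block-address bits for interleaving is an assumption for illustration; the slides say only that each bus handles 1/4 of the physical address space and that same-cycle writes are ordered by bus number.

```python
BLOCK_BITS = 6  # assume 64-byte coherence blocks

def address_bus_for(paddr):
    # Pick one of the four logical address buses by interleaving on the
    # low block-address bits (which bits the real E10K uses is an
    # assumption here; each bus then covers 1/4 of the address space).
    return (paddr >> BLOCK_BITS) & 0x3

def order_same_cycle_writes(paddrs):
    # Consistency rule from above: writes observed in the same cycle are
    # logically ordered by bus number (bus 0 before bus 1, etc.).
    return sorted(paddrs, key=address_bus_for)

# Consecutive blocks spread across the four buses, quadrupling
# request/snoop bandwidth.
print([address_bus_for(a) for a in (0x0000, 0x0040, 0x0080, 0x00C0)])  # [0, 1, 2, 3]
print(order_same_cycle_writes([0x00C0, 0x0040]))  # bus-1 write ordered before bus-3 write
```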

Translation Lookaside Buffer
- Cache of Page Table Entries
- Page table maps virtual pages to physical frames
[Figure: example mapping from virtual address space to physical address space: virtual page 0 → frame 3, page 4 → frame 4, page 7 → frame 7.]

The TLB Coherence Problem
- Since the TLB is a cache, it must be kept coherent
- A change to a PTE on one processor must be seen by all processors

- Why might a PTE be cached in more than one TLB?
  - Actual sharing of data
  - Process migration
- Changes are infrequent, so get the OS to do the work

TLB Shootdown
- To modify a TLB entry, the modifying processor must:
  - LOCK the page table
  - Flush TLB entries
  - Queue TLB operations
  - Send an inter-processor interrupt
  - Spin until the other processors are done
  - UNLOCK the page table

- SLOW! But the most common solution today

TLB Shootdown Improvements
- Evolutionary Changes
  - Keep track of which processors ever had the mapping & shoot down only those
  - Defer shootdowns on upgrade changes (e.g., page from read-only to read-write)
  - SGI Origin poison bit for also deferring downgrades
- Revolutionary Changes
  - Invalidate-TLB-entry instruction (e.g., PowerPC)
  - No TLB (e.g., Berkeley SPUR)
    - Use virtual L1 caches so address translation happens only on a miss
    - On a miss, walk the page table (whose entries will often be cached normally)
    - PTE changes are kept coherent by normal cache coherence
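The shootdown sequence described above can be sketched as a simplified single-threaded model (all names here are hypothetical; a real kernel uses a page-table lock, hardware IPIs, and per-CPU acknowledgement flags):

```python
class Processor:
    """Toy model of one CPU's TLB (hypothetical structure)."""
    def __init__(self, pid):
        self.pid = pid
        self.tlb = {}     # virtual page -> cached PTE
        self.done = False

    def handle_shootdown_ipi(self, vpage):
        # On the inter-processor interrupt: drop the stale entry, then ack.
        self.tlb.pop(vpage, None)
        self.done = True

def tlb_shootdown(initiator, others, page_table, vpage, new_pte):
    # 1) LOCK page table (implicit in this single-threaded sketch)
    # 2) Flush the initiator's own TLB entry
    initiator.tlb.pop(vpage, None)
    # 3) Update the PTE (the queued TLB operation)
    page_table[vpage] = new_pte
    # 4) Send inter-processor interrupts...
    for p in others:
        p.done = False
        p.handle_shootdown_ipi(vpage)  # delivered by hardware in reality
    # 5) ...and spin until the other processors are done
    while not all(p.done for p in others):
        pass
    # 6) UNLOCK page table

# Usage: processor 1 still caches a stale mapping for virtual page 7.
p0, p1 = Processor(0), Processor(1)
page_table = {7: "frame 7, read-only"}
p1.tlb[7] = page_table[7]
tlb_shootdown(p0, [p1], page_table, 7, "frame 7, read-write")
print(7 in p1.tlb)  # False: the stale entry is gone everywhere
```

Even in this sketch, the initiator is serialized behind every other processor's acknowledgement, which is why the slide calls the scheme slow.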

Virtual Caches & Synonyms
- Problem
  - Synonyms: V0 & V1 both map to P1
  - When doing coherence on a block in P1, how do you find V0 & V1?
- Option 1: Don't do virtual caches (most common today)
- Option 2: Don't allow synonyms
  - Probably use a segmented global address space
    - E.g., Berkeley SPUR had each process pick 4 of 256 1GB segments
  - Still requires reverse address translation
- Option 3: Allow virtual caches & synonyms
  - How do we implement reverse address translation? See Wang et al. next

Wang et al. [ISCA89] Basic Idea

- Extended Goodman's one-level cache idea [ASPLOS87]
- Virtual L1 and physical L2; do coherence on physical addresses
- Each L2 block maintains a pointer to the corresponding L1 block (if any)
  - Requires log2(#L1 blocks) - log2(page_size / block_size) bits
- Never allow a block to be simultaneously cached under synonyms
- Example where V0 & V1 map to P2
  - Initially V1 is in L1, and P2 in L2 points to V1
  - Processor references V0 → L1 miss
  - L2 detects the synonym in L1
  - Change the L1 tag and the L2 pointer so that L1 holds V0 instead of V1
  - Resume
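The V0/V1 synonym example above can be modeled in a few lines (a sketch with assumed data structures; in the actual design the back-pointer is a hardware field per L2 block):

```python
class L2Block:
    """Physical L2 block with a back-pointer to its L1 copy (if any)."""
    def __init__(self, paddr):
        self.paddr = paddr
        self.l1_vtag = None  # virtual tag under which this block sits in L1

l1 = {}                               # virtual tag -> data (the virtual L1)
l2 = {"P2": L2Block("P2")}            # physical L2
translate = {"V0": "P2", "V1": "P2"}  # V0 and V1 are synonyms for P2

def l1_access(vtag):
    if vtag in l1:
        return "L1 hit"
    block = l2[translate[vtag]]   # L1 miss: look up the physical L2
    if block.l1_vtag is not None: # L2 detects a synonym already in L1
        l1.pop(block.l1_vtag)     # evict the old virtual tag...
    l1[vtag] = "data"             # ...so L1 now holds this vtag instead
    block.l1_vtag = vtag          # update the L2 back-pointer
    return "L1 miss, synonym handled"

# Initially V1 is in L1 and P2's pointer names V1.
l1["V1"] = "data"
l2["P2"].l1_vtag = "V1"
print(l1_access("V0"))          # miss; L2 re-points so L1 holds V0, not V1
print("V1" in l1, "V0" in l1)   # False True
```

The invariant the code maintains is the slide's rule: a physical block is never simultaneously cached under two synonyms, so coherence on P2 only ever has one L1 copy to find.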

Virtual Caches & Homonyms
- Homonym: one name, two meanings ("pool" of water and "pool" the game)
  - V0 of one process maps to P2, while V0 of another process maps to P3
- Option 1: Flush the cache on context switch
  - Simple, but performs poorly
- Option 2: Address-space IDs (ASIDs)
  - In the architecture & part of context state
- Option 3: Mapping-valid bit of Wang et al.
  - Add mapping-valid as a second valid bit on each L1 cache block
  - On context switch, do a flash clear of the mapping-valid bits
  - Interesting case: a valid block whose mapping is invalid
    - On processor access, re-validate the mapping
    - On replacement (i.e., writeback), treat it as a valid block

Outline for Implementing Snooping Coherence
- Control Implementation
- Writebacks, Non-Atomicity, & Serialization/Order
- Hierarchical Cache
- Split Buses

- Deadlock, Livelock, & Starvation
- Three Case Studies
- TLB Coherence
- Virtual Cache Issues

Outline
- Motivation for Cache-Coherent Shared Memory
- Snooping Cache Coherence
- Implementing Snooping Systems
- Advanced Snooping Systems
  - Sun UltraEnterprise 10000
  - Multicast Snooping (Wisconsin)

Sun UltraEnterprise 10000 (Starfire)
- Shared-wire bus is the bottleneck in snooping systems
  - Tough to implement at high speed
  - Centralized shared resource

- Solution: multiple logical buses

PRESENTATION

Multicast Snooping
- Bus is the bottleneck in snooping systems
- But why broadcast requests when we can multicast?

PRESENTATION

Outline
- Motivation for Cache-Coherent Shared Memory
- Snooping Cache Coherence
- Implementing Snooping Systems
- Advanced Snooping Systems

