Thermal Management Issues (MICRO-35 Tutorial)


ISCA 2004 Tutorial
Thermal Issues for Temperature-Aware Computer Systems
Saturday, June 19th, 8:00am - 5:00pm

Presenters:
- Kevin Skadron ([email protected]), CS Department, University of Virginia
- David Brooks ([email protected]), CS Department, Harvard University
- Antonio Gonzalez ([email protected]), UPC-Barcelona and Intel Barcelona Research Center
- Lev Finkelstein ([email protected]), Intel Haifa
- Mircea Stan ([email protected]), ECE Department, University of Virginia

Overview
1. Motivation (Kevin)
2. Thermal issues (Kevin) (items 1-2: 1.5 hrs)
3. Power modeling (David)
4. Thermal management (David) (items 3-4: 1.5 hrs)
5. Optimal DTM (Lev), 0.5 hrs
6. Clustering (Antonio), 1 hr
7. Power distribution (David), 15 min
8. What current chips do (Lev), 45 min
9. HotSpot and sensors (Kevin), 1 hr

Motivation
- Power consumption: first-order design constraint
  - unconstrained power is a theoretical max
  - peak (instantaneous) power limits power delivery
  - sustained power limits thermal design/packaging
  - max sustained power: thermal virus, same as thermal design power
  - average active power and idle power limit mobile battery life, etc.
- Common fallacy: equating instantaneous power with temperature
- Power density is increasing exponentially
  - Unfortunate corollary of Moore's Law
  - thermal effects become more problematic
- Need power/temperature-aware computing!

Power Dissipation

Processor          Clock rate   Power (max)
Alpha 21364        1.15 GHz     110 W
AMD Opteron        2.2 GHz
HP PA8700          870 MHz
IBM Power 4        1.7 GHz
Intel Itanium 2    1.5 GHz
Intel Xeon         3.2 GHz
MIPS R14000        600 MHz

Remaining power figures from the slide (row alignment lost): 86 W, 75 W, 130 W, 86 W, 16 W, 100 W
Source: Microprocessor Report

Effects of Technology Scaling on Power Dissipation
- Feature size is scaling down 30%
- Frequency is increasing ~2x
- Area increases due to microarchitecture improvements: 25% (ideal scaling: decreases by 50%)
- Active capacitance increases at least 30% (ideal scaling: decreases by 30%)
- Vdd is not scaled down at the same rate as feature size: 0-10% (ideal scaling: 30%)
- Ideal scaling: P ∝ C V^2 f, a reduction to 0.7^2 ≈ 0.5
- Observed scaling: 2-2.5x increase
- Power density becomes a problem!
  - Especially since the power density is non-uniform
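The per-generation arithmetic on the "Effects of Technology Scaling" slide can be checked with a few lines of Python. This is a minimal sketch using only the dynamic-power relation P ∝ C V^2 f from the slide. The ideal frequency gain of 1/0.7 is an assumption (classical constant-field scaling) and is not stated in the slides, and `power_scale` is a hypothetical helper name.

```python
def power_scale(cap, vdd, freq):
    """Relative dynamic power after one generation, from P ~ C * Vdd^2 * f."""
    return cap * vdd ** 2 * freq

# Ideal scaling: C and Vdd both shrink by 30%; frequency gain of 1/0.7 is assumed.
ideal = power_scale(cap=0.7, vdd=0.7, freq=1 / 0.7)

# Observed: C up at least 30%, Vdd down only 0-10%, frequency about 2x.
observed_low = power_scale(cap=1.3, vdd=0.9, freq=2.0)
observed_high = power_scale(cap=1.3, vdd=1.0, freq=2.0)

print(f"ideal:    {ideal:.2f}x")
print(f"observed: {observed_low:.2f}x to {observed_high:.2f}x")
```

Under these assumptions the ideal case comes out near 0.5x and the observed case a little above 2x, which is roughly the range the slide quotes.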

Trends in Power Density
[Figure: power density (W/cm^2, log scale from 1 to 1000) for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4, with reference levels for a hot plate, a nuclear reactor, and a rocket nozzle]
Source: Fred Pollack, Intel Corp., "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies", Micro32 conference keynote, 1999.

ITRS Projections

Year                           2003   2006   2010   2013   2016
Tech node (nm)                  100     70     45     32     22
Vdd (high perf) (V)             1.0    0.9    0.6    0.5    0.4
Vdd (low power) (V)             1.1    1.0    0.8    0.7    0.6
Frequency (high perf) (GHz)     3.1    5.6   11.5   19.3   28.8
Max power (W):
  High-perf w/ heatsink         160    180    218    251    288
  Cost-performance               85     98    120    138    158
  Hand-held                     3.2    3.5    3.0    3.0    3.0

Source: ITRS 2001
- These are targets
- Power-density problem is still getting worse
- Intel papers suggest that in the 45-75W range, cooling costs $1/W; beyond that the rate of increase goes up: $2/W, $3/W, probably more! (Borkar, IEEE Micro 99; Gunther et al., ITJ 01)

Leakage Power
- The fraction of leakage power is increasing exponentially with each generation
- Also exponentially dependent on temperature

[Figure: static power / dynamic power (percentage, 0 to 70) versus temperature (298 K to 373 K) for 180nm, 130nm, 100nm, 90nm, 80nm, and 70nm technologies; the ratio increases across generations. Source: Sankaranarayanan et al., University of Virginia]

Power-aware figures of merit
- Power (P): (1/W)
  - battery time (mobile)
  - packaging (high-performance)
- Energy (PD): (MIPS/W)
  - battery life (mobile)
  - fundamental limits (kT)
- Energy-delay (PD^2): (MIPS^2/W)
  - performance and low power
- Energy-delay^2 (PD^3): (MIPS^3/W)
  - indep. of Vdd
  - emphasis on performance
- Power-aware is not the same as low power
- Similar to old VLSI complexity measures (A, AD, AD^2)
- None of these are appropriate for thermal
  - This is a problem
Refs:
- R. Gonzalez et al., "Supply and threshold voltage scaling for low power CMOS", JSSC, Aug. 1997
- A. Martin et al., "Design of an Asynchronous MIPS R3000", ARVLSI 97
- J. Ullman, "Computational Aspects of VLSI", CS Press, 1984
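The claim that energy-delay^2 (PD^3) is independent of Vdd can be illustrated with a first-order model. The sketch below assumes that frequency scales linearly with Vdd and that only dynamic power matters. Both are simplifying assumptions rather than statements from the slides, and `metrics`, `cap`, `k_freq`, and `work` are hypothetical names in arbitrary units.

```python
def metrics(vdd, cap=1.0, k_freq=1.0, work=1.0):
    """Return (P, PD, PD^2, PD^3) for a fixed amount of work at supply vdd.

    Assumed first-order model: f = k_freq * vdd, P = cap * vdd^2 * f,
    and delay D = work / f. Leakage is ignored.
    """
    freq = k_freq * vdd
    power = cap * vdd ** 2 * freq
    delay = work / freq
    return power, power * delay, power * delay ** 2, power * delay ** 3

for vdd in (0.8, 1.0, 1.2):
    p, pd, pd2, pd3 = metrics(vdd)
    print(f"Vdd={vdd:.1f}  P={p:.3f}  PD={pd:.3f}  PD^2={pd2:.3f}  PD^3={pd3:.3f}")
```

In this model P grows as Vdd^3 and D shrinks as 1/Vdd, so P, PD, and PD^2 all move with the supply voltage while PD^3 stays constant. That is why PD^3 is useful for comparing designs regardless of the voltage they happen to run at.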

Cooking-aware computing
- Some chips rated for 100C+

Power and temperature are BAD and can be EVIL
Source: Tom's Hardware Guide, http://www6.tomshardware.com/cpu/01q3/010917/heatvideo-01.html

Other Costs of High Heat Flux
- Some chips may already be underclocked due to thermal constraints! (especially mobile and sealed systems)

Temporal, Spatial Variations
- Temperature variation of SPEC applu over time
- Hot spots increase cooling costs: must cool for the hot spot

Application Variations
- Wide variation across applications
- Architectural and technology trends are making it worse, e.g. simultaneous multithreading (SMT)
- Leakage is an especially severe problem: exponentially dependent on temperature!
[Figure: two panels, ST and SMT, showing temperature (370 to 420 Kelvin) for gzip, mcf, swim, mgrid, applu, eon, and mesa]

Heat vs. Temperature
- Different time scales
- Heat: no notion of spatial locality
- Does architecture have a role?
- Temperature-aware computing: optimize performance subject to a temperature constraint

Thermal issues
Temperature affects:
- Circuit performance
- Circuit power (leakage)
- IC reliability
- IC and system packaging cost
- Environment

Performance and leakage
Temperature affects:
- Transistor threshold and mobility
- Subthreshold leakage, gate leakage
- Ion, Ioff, Igate, delay
- ITRS: 85C for high-performance, 110C for embedded!
[Figure: NMOS transistor with Ion and Ioff]

Temperature-aware circuits
- Robustness constraint: sets Ion/Ioff ratio
- Robustness and reliability: Ion/Igate ratio
- Idea: keep ratios constant with T: trade leakage for performance!
Refs: Ghoshal et al., "Refrigeration Technologies", ISSCC 2000; Garrett et al., "T3", ISCAS 2001

Resulting performance
- 25% - 30% extra performance (110°C to 0°C)
[Figure: TAC vs. regular]

Reliability
The Arrhenius Equation: MTF = A * exp(Ea / (K * T))
- MTF: mean time to failure at T
- A: empirical constant
- Ea: activation energy
- K: Boltzmann's constant
- T: absolute temperature
Failure mechanisms:
- Die metalization (corrosion, electromigration, contact spiking)
- Oxide (charge trapping, gate oxide breakdown, hot electrons)
- Device (ionic contamination, second breakdown, surface-charge)
- Die attach (fracture, thermal breakdown, adhesion fatigue)
- Interconnect (wirebond failure, flip-chip joint failure)
- Package (cracking, whisker and dendritic growth, lid seal failure)
- Most of the above increase with T (Arrhenius)
- Notable exception: hot electrons are worse at low temperatures
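As a rough illustration of the Arrhenius relation, the sketch below compares mean time to failure at two temperatures. The empirical constant A cancels in the ratio, so only the activation energy matters. The value Ea = 0.7 eV is an assumed, illustrative number (it varies by failure mechanism and is not given in the slides), and `mtf` and `acceleration` are hypothetical helper names.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann's constant in eV/K

def mtf(temp_k, ea_ev, a=1.0):
    """Arrhenius mean time to failure: MTF = A * exp(Ea / (K * T))."""
    return a * math.exp(ea_ev / (BOLTZMANN_EV * temp_k))

def acceleration(t_cool_k, t_hot_k, ea_ev):
    """Factor by which MTF shrinks when moving from t_cool_k to t_hot_k."""
    return mtf(t_cool_k, ea_ev) / mtf(t_hot_k, ea_ev)

# Ea = 0.7 eV is an assumed activation energy, for illustration only.
factor = acceleration(358.15, 383.15, ea_ev=0.7)  # 85 C vs. 110 C
print(f"MTF at 85 C is {factor:.1f}x the MTF at 110 C")
```

Because temperature sits in the exponent, a change of a few tens of degrees shifts the estimated lifetime by a multiple rather than by a few percent.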

More on reliability later.

Packaging cost
- From Cray (local power generator and refrigeration)
Source: Gordon Bell, "A Seymour Cray perspective", http://www.research.microsoft.com/users/gbell/craytalk/

Packaging cost
- To today
- Grid computing: power plants co-located near compute farms
- IBM S/390: refrigeration
Source: R. R. Schmidt, B. D. Notohardjono, "High-end server low temperature cooling", IBM Journal of R&D

IBM S/390 refrigeration
- Complex and expensive
Source: R. R. Schmidt, B. D. Notohardjono, "High-end server low temperature cooling", IBM Journal of R&D

IBM S/390 processor packaging
- Processor subassembly: complex!
- C4: Controlled Collapse Chip Connection (flip-chip)
Source: R. R. Schmidt, B. D. Notohardjono, "High-end server low temperature cooling", IBM Journal of R&D

Intel Itanium packaging
- Complex and expensive (note heatpipe)
Source: H. Xie et al., "Packaging the Itanium Microprocessor", Electronic Components and Technology Conference 2002

Intel Pentium 4 packaging
- Simpler, but still
Source: Intel web site

Graphics Cards
- Nvidia GeForce 5900 card
Source: Tech-Report.com

More Graphics Cards

Under/Overclocking
- Some chips need to be underclocked
  - Especially true in constrained form factors
- Ultra model of Gigabyte's 3D Cooler Series (Source: Tom's Hardware Guide)
  - Try fitting this in a laptop or Gameboy!

Apple G5 liquid cooling
- Don't know details
- Lots of people in the thermal engineering community think liquid is inevitable, especially for server rooms
- But others say no:
  - This introduces a whole new kind of leakage problem
  - Water and electronics don't mix!

Environment
- Environmental Protection Agency (EPA): computers account for 10% of commercial electricity consumption
  - This incl. peripherals, possibly also manufacturing
  - A DOE report suggested this percentage is much lower
  - No consensus, but it's still a lot
- Equivalent power (with only 30% efficiency) for AC
- CFCs used for refrigeration
- Lap burn
- Fan noise

Heat mechanisms
- Conduction
- Convection
- Radiation
- Phase change
- Heat storage

Conduction
- Similar to electrical conduction (e.g. metals are good conductors)
- Heat flow from high energy to low energy
- Microscopic (vibration, adjacent molecules, electron transport)
- No major displacement of molecules
- Need a material: typically in solids (fluids: distance between molecules)
- Typical example: thermal slug, spreader, heatsink
Source: CRC Press, R. Remsburg Ed., "Thermal Design of Electronic Equipment", 2001

Conduction
- Not a strong function of temperature
- But for the high temp. variations on high-perf. chips (30+ degrees), it matters
- Note esp. Si vs. Al, Cu
Source: CRC Press, R. Remsburg Ed., "Thermal Design of Electronic Equipment", 2001

Convection
- Macroscopic (bulk transport, mix of hot and cold, energy storage)
- Need material (typically in fluids: liquid, gas)
- Natural vs. forced (gas or liquid)
- Typical example: heatsink (fan), liquid cooling
- Note that convection is profoundly affected by board layout
Source: CRC Press, R. Remsburg Ed., "Thermal Design of Electronic Equipment", 2001

Radiation
- Electromagnetic waves (can occur in vacuum)
- Negligible in typical applications
- Sometimes the only mechanism (e.g. in space)
Source: CRC Press, R. Remsburg Ed., "Thermal Design of Electronic Equipment", 2001

Carnot Efficiency
- Note that in all cases, heat transfer is proportional to ΔT
- This is also one of the reasons energy harvesting in computers is probably not cost-effective
  - ΔT w.r.t. ambient is << 100
  - For example, with a 25W processor, thermoelectric effect yields only ~50mW (Solbrekken et al., ITHERM 04)
- This is also why Peltier coolers are not energy efficient
  - 10% eff., vs. 30% for a refrigerator

Surface-to-surface contacts
- Not negligible, heat crowding
- Thermal greases/epoxy (can pump-out)
- Phase Change Films (undergo a transition from solid to semi-solid with the application of heat)
Source: CRC Press, R. Remsburg Ed., "Thermal Design of Electronic Equipment", 2001

Phase-change
Thermal solutions evolution:
- Natural air cooling
- Forced-air cooling
- Liquid cooling
- Phase change (e.g. heat pipe)
- Refrigeration
Phase change:
a. Solid changing to a liquid: fusion, or melting
b. Liquid changing to a vapor: evaporation, also boiling
c. Vapor changing to a liquid: condensation
d. Liquid changing to a solid: crystallization, or freezing
e. Solid changing to a vapor: sublimation
f. Vapor changing to a solid: deposition

Thermal resistance

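$R_{th} = r\,t / A = t / (kA)$
(t: thickness, A: cross-sectional area, k: thermal conductivity, r = 1/k: thermal resistivity)

Thermal capacitance
$C_{th} = \rho V C_p$
- $\rho$ (Aluminum) = 2,710 kg/m^3
- $C_p$ (Aluminum) = 875 J/(kg-C)
- V = t A = 0.000025 m^3
- $C_{bulk} = \rho V C_p$ = 59.28 J/C

Refrigeration
- Conventional vs. thermo-electric (TEC)
- Can get T < T_amb (negative Rth!)
- TEC: Peltier effect (can use for local cooling)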
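The thermal resistance and capacitance formulas above translate directly into code. The sketch below reproduces the slide's aluminum capacitance figure and pairs it with a conduction resistance. The slab geometry (10 mm thick, 50 mm x 50 mm, which happens to give the slide's volume) and the conductivity of 200 W/(m K) are hypothetical values chosen for illustration, as are the function names.

```python
def thermal_resistance(thickness_m, conductivity_w_mk, area_m2):
    """Conduction resistance of a slab, R_th = t / (k * A), in K/W."""
    return thickness_m / (conductivity_w_mk * area_m2)

def thermal_capacitance(density_kg_m3, volume_m3, specific_heat_j_kgk):
    """Bulk heat capacity, C_th = rho * V * C_p, in J/K."""
    return density_kg_m3 * volume_m3 * specific_heat_j_kgk

# Aluminum density, volume, and specific heat as given on the slide.
c_bulk = thermal_capacitance(2710.0, 0.000025, 875.0)
print(f"C_bulk = {c_bulk:.2f} J/K")

# Hypothetical slab: 10 mm thick, 50 mm x 50 mm, k = 200 W/(m K) (assumed).
r_th = thermal_resistance(0.010, 200.0, 0.05 * 0.05)
print(f"R_th = {r_th:.3f} K/W, RC time constant = {r_th * c_bulk:.2f} s")
```

The product of the two gives the RC time constant, which sets how quickly such a slab responds to a change in power.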

TEC electro-thermal model

Simplistic steady-state model
- All thermal transfer: R = k/A
  - Power density matters!
- Ohm's law for thermals (steady-state): V = I R -> ΔT = P R
- T_hot = P Rth + T_amb
Ways to reduce T_hot:
- reduce P (power-aware)
- reduce Rth (packaging)
- reduce T_amb (Alaska?)
- maybe also take advantage of transients (Cth)

Simplistic dynamic thermal model
Electrical-thermal duality:
- V ≈ temp (T)
- I ≈ power (P)
- R ≈ thermal resistance (Rth)
- C ≈ thermal capacitance (Cth)
- RC ≈ time constant
KCL:
- differential eq.: $I = C\,dV/dt + V/R$
- difference eq.: $\Delta V = \frac{I}{C}\,\Delta t - \frac{V}{RC}\,\Delta t$
- thermal domain: $\Delta T = \frac{P}{C}\,\Delta t - \frac{T}{RC}\,\Delta t$, where $T = T_{hot} - T_{amb}$
One can compute stepwise changes in temperature for any granularity at which one can get P, T, R, C

Combined package model
- Steady-state
- Tj: junction temperature
- Tc: case temperature
- Ts: heatsink temperature
- Ta: ambient temperature
- Note: θja is meaningless! What exactly is Ta?
- θjc is better but still sketchy
[Figure: guts of the component]
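The steady-state and dynamic models from the preceding slides can be stepped numerically. The sketch below treats the chip as a single lumped RC node and applies the difference equation with forward Euler integration, so the time step should be much smaller than R*C. All numeric values and the names `steady_state_temp` and `simulate` are hypothetical, chosen for illustration only.

```python
def steady_state_temp(power_w, r_th, t_amb):
    """Steady-state hot-spot temperature: T_hot = P * Rth + T_amb."""
    return power_w * r_th + t_amb

def simulate(power_trace_w, r_th, c_th, t_amb, dt_s, t_start=None):
    """Step dT = (P/C) dt - (T/(RC)) dt, with T measured above ambient.

    Forward Euler integration; dt_s should be much smaller than r_th * c_th.
    Returns the absolute temperature after each step.
    """
    rise = 0.0 if t_start is None else t_start - t_amb
    temps = []
    for power in power_trace_w:
        rise += (power / c_th) * dt_s - (rise / (r_th * c_th)) * dt_s
        temps.append(t_amb + rise)
    return temps

# Hypothetical values for illustration only.
R_TH, C_TH, T_AMB = 0.8, 50.0, 45.0    # K/W, J/K, deg C
trace = [40.0] * 2000 + [10.0] * 2000   # watts: a hot phase, then a cool phase
temps = simulate(trace, R_TH, C_TH, T_AMB, dt_s=0.1)

print(f"steady state at 40 W: {steady_state_temp(40.0, R_TH, T_AMB):.1f} C")
print(f"after hot phase:      {temps[1999]:.1f} C")
print(f"after cool phase:     {temps[-1]:.1f} C")
```

With these numbers the time constant is 40 s, so the temperature approaches the steady-state value only after several time constants. This lag is the transient (Cth) headroom mentioned in the list of ways to reduce T_hot.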

Source: CRC Press, R. Remsburg Ed., "Thermal Design of Electronic Equipment", 2001

Reliability as f(T)
- Reliability criteria (e.g., DTM thresholds) are typically based on worst-case assumptions
- But actual behavior is often not worst case
- So aging occurs more slowly
- This means the DTM design is over-engineered!
- We can exploit this, e.g. for DTM or frequency
[Figure labels: Spend, Bank]

EM Model
$$\int_0^{t_{failure}} \frac{1}{kT(t)}\, e^{-\frac{E_a}{kT(t)}}\, dt = \varphi_{th}, \qquad \varphi_{th} = \text{const}$$
Life consumption rate:
$$R(t) = \frac{1}{kT(t)}\, e^{-\frac{E_a}{kT(t)}}$$
- Apply in a lumped fashion at the granularity of microarchitecture units, just like RAMP [Srinivasan et al.]

Reliability-Aware DTM
[Figure: average slowdown (0.00 to 0.16) for DTM_controller and DTM_reliability across several configurations]
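The EM life-consumption model above lends itself to a simple accounting sketch: integrate the consumption rate over a sampled temperature trace and compare it with what would have been consumed at a constant worst-case temperature. The activation energy, the worst-case temperature, and the trace below are assumed values for illustration, and the function names are hypothetical. This is not the tutorial's or RAMP's implementation.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann's constant in eV/K

def consumption_rate(temp_k, ea_ev):
    """Life consumption rate R = (1 / (k T)) * exp(-Ea / (k T))."""
    kt = BOLTZMANN_EV * temp_k
    return math.exp(-ea_ev / kt) / kt

def consumed_life(temps_k, dt_s, ea_ev):
    """Approximate the integral of R(t) dt over a sampled temperature trace."""
    return sum(consumption_rate(t, ea_ev) for t in temps_k) * dt_s

EA_EV = 0.7          # assumed activation energy, for illustration
T_WORST_K = 383.15   # assumed worst-case design temperature (110 C)

# Hypothetical trace: mostly 85 C with occasional excursions to 110 C.
trace = [358.15] * 900 + [383.15] * 100
actual = consumed_life(trace, dt_s=1.0, ea_ev=EA_EV)
budget = consumed_life([T_WORST_K] * len(trace), dt_s=1.0, ea_ev=EA_EV)

print(f"fraction of worst-case life budget used: {actual / budget:.2f}")
```

The unused part of the budget is the "bank" that a reliability-aware DTM policy could later spend by tolerating higher temperatures.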

Temperature limits
- Temperature limits for circuit performance can be measured
- Temperature limits for reliability are at best an estimate
  - 150° is a reasonable rule of thumb for when immediate damage might occur
  - Chips are typically specified at lower temperatures, 100-125°, for both performance and long-term reliability
- Rule of thumb that every 10° halves circuit lifetime is false
  - Originates from a mil-spec that has been debunked

Thermal issues summary
- Temperature affects performance, power, and reliability
- Architecture-level: conduction only
  - Very crude approximation of convection as equivalent resistance
  - Convection: too complicated
    - Need CFD!
  - Radiation: can be ignored
- Use compact models for package
- Power density is key
- Temporal, spatial variation are key
- Hot spots drive thermal design

Review of Thermal Issues
- From ITHERM04 keynote by Ken Goodson, Stanford/Cooligy
