Equalized 4Gb/s Signalling

The Imagine Stream Processor
Flexibility with Performance

William J. Dally
Computer Systems Laboratory
Stanford University
[email protected]
March 30, 2001, Convergence Workshop

Outline
  • Motivation
    • We need low-power, programmable TeraOps
  • The problem is bandwidth
    • Growing gap between special-purpose and general-purpose hardware
    • It's easy to make ALUs, hard to keep them fed
  • A stream processor gives programmable bandwidth
    • Streams expose locality and concurrency in the application
    • A bandwidth hierarchy exploits this
  • Imagine is a 20GFLOPS prototype stream processor
  • Many opportunities to do better
    • Scaling up
    • Simplifying programming

Motivation
  • Some things I'd like to do with a few TeraOps
    • Have a realistic face-to-face meeting with someone in Boston without riding an airplane
      • 4-8 cameras, extract depth, fit model, compress, render to several screens
    • High-quality rendering at video rates
      • Ray tracing a 2K x 4K image with 10^5 objects at 60 frames/s

The good news
  • FLOPS are cheap, OPS are cheaper
    • 32-bit FPU: 2GFLOPS/mm^2, 400GFLOPS/chip
    • 16-bit add: 40GOPS/mm^2, 8TOPS/chip
  [Figure: layout detail labeled "Local RF" and "Integer Adder", with dimension labels 460 µm and 146.7 µm]

The bad news
  • General-purpose processors can't harness this
  [Figure: FLOPS (log scale, 1e+8 to 1e+15) versus year (2001 to 2011), with series GP-Peak and GP-Useful]
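
The figures on "The good news" and "Motivation" slides can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes that "2K x 4K" means 2048 x 4096 pixels and that the renderer casts one primary ray per pixel, neither of which the slides state, and the function names are made up for this example.

```python
# Back-of-the-envelope arithmetic from the slide figures.
# Assumptions (not stated in the slides): "2K x 4K" = 2048 x 4096 pixels,
# and one primary ray per pixel per frame.

def implied_area_mm2(rate_per_chip, rate_per_mm2):
    """Die area implied by a per-chip rate and a per-mm^2 rate (same units)."""
    return rate_per_chip / rate_per_mm2

def primary_rays_per_second(width, height, fps):
    """Primary rays per second at one ray per pixel."""
    return width * height * fps

fpu_area = implied_area_mm2(400.0, 2.0)      # 400 GFLOPS/chip at 2 GFLOPS/mm^2
adder_area = implied_area_mm2(8000.0, 40.0)  # 8 TOPS/chip at 40 GOPS/mm^2
rays = primary_rays_per_second(2048, 4096, 60)

print(f"implied area, 32-bit FPUs:   {fpu_area:.0f} mm^2")
print(f"implied area, 16-bit adders: {adder_area:.0f} mm^2")
print(f"primary rays per second:     {rays:,}")
```

Both per-chip figures correspond to the same 200 mm^2 of arithmetic, and under the stated assumptions the rendering target needs about 5 x 10^8 primary rays per second, each of which may be tested against many of the 10^5 objects.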

Why do Special-Purpose Processors Perform Well?
  • Lots (100s) of ALUs
  • Fed by dedicated wires/memories

Care and Feeding of ALUs
  • Feeding structure dwarfs ALU
  [Figure: processor datapath with labels IP, Instr. Cache, IR, and Regs; annotations "Instruction Bandwidth" and "Data Bandwidth"]

The problem is bandwidth
  • Can we solve this bandwidth problem without sacrificing programmability?

Streams expose locality and concurrency
  • Operations within a kernel operate on local data
  • Kernels can be partitioned across chips to exploit control parallelism
  • Streams expose data parallelism
  [Figure: stream graph with nodes Image 0, Image 1, four convolve kernels, a SAD kernel, and the output Depth Map]

A Bandwidth Hierarchy exploits locality and concurrency
  • VLIW clusters with shared control
  • 41.2 32-bit operations per word of memory bandwidth
  [Figure: four SDRAMs, the Stream Register File, and the ALU Clusters, with bandwidth labels 2GB/s, 32GB/s, and 544GB/s]

Bandwidth Usage
  [Figure: the same hierarchy of SDRAMs, Stream Register File, and ALU Clusters, labeled 2GB/s, 32GB/s, and 544GB/s]

  Application         Memory BW    Global RF BW   Local RF BW
  Depth Extractor     0.80 GB/s    18.45 GB/s     210.85 GB/s
  MPEG Encoder        0.47 GB/s     2.46 GB/s     121.05 GB/s
  Polygon Rendering   0.78 GB/s     4.06 GB/s     102.46 GB/s
  QR Decomposition    0.46 GB/s     3.67 GB/s     234.57 GB/s

The Imagine Stream Processor
  [Figure: block diagram of the Imagine Stream Processor; labels: Host Processor, Stream Controller, Microcontroller, ALU Cluster 0 to ALU Cluster 7, Stream Register File, Streaming Memory System, four SDRAMs, Network Interface, Network]
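
The peak bandwidths of the hierarchy and the per-application numbers in the Bandwidth Usage table can be compared directly. The following is a small sketch that expresses every level as a multiple of memory bandwidth; the numbers are copied from the slides, while the dictionary keys and the function name are illustrative choices, and the pairing of 2, 32, and 544 GB/s with memory, global RF, and local RF follows the column order of the table.

```python
# Peak bandwidth of each hierarchy level, in GB/s (from the slides).
PEAK = {"memory": 2.0, "global_rf": 32.0, "local_rf": 544.0}

# Measured bandwidth per application, in GB/s (from the Bandwidth Usage table).
USAGE = {
    "Depth Extractor":   {"memory": 0.80, "global_rf": 18.45, "local_rf": 210.85},
    "MPEG Encoder":      {"memory": 0.47, "global_rf": 2.46,  "local_rf": 121.05},
    "Polygon Rendering": {"memory": 0.78, "global_rf": 4.06,  "local_rf": 102.46},
    "QR Decomposition":  {"memory": 0.46, "global_rf": 3.67,  "local_rf": 234.57},
}

def ratios_to_memory(levels):
    """Express each level's bandwidth as a multiple of the memory bandwidth."""
    return {name: bw / levels["memory"] for name, bw in levels.items()}

peak_ratios = ratios_to_memory(PEAK)
usage_ratios = {app: ratios_to_memory(bw) for app, bw in USAGE.items()}

print("peak:", peak_ratios)
for app, r in usage_ratios.items():
    print(f"{app}: global RF {r['global_rf']:.1f}x, local RF {r['local_rf']:.1f}x memory")
```

The peak ratios come out as 1 : 16 : 272. In the measured data every application moves more than a hundred times as many bytes through the local register files as through memory, which is the locality the hierarchy is meant to capture.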

Arithmetic Clusters
  [Figure: one arithmetic cluster; labels: Local Register File, units + + + * * / and CU, Cross Point, From SRF, To SRF, Intercluster Network]

Performance
  [Figure: bar chart of GOPS (axis 0 to 30) for depth, mpeg, qrd, dct, convolve, and fft; bars grouped as 16-bit applications, floating-point application, 16-bit kernels, and floating-point kernel; bar labels shown: 23.9, 25.6, 19.8, 12.1, 11.0, 7.0]

Power
  [Figure: stacked bar chart of Watts (axis 0.0 to 3.0) for depth, mpeg, qrd, dct, convolve, and fft; legend: Other, Mem Sys, Pins, SRF, Clust, Clock; percentage labels shown: 6%, 1%, 2%, 5%, 23%, 63%]

  GOPS/W by benchmark:
    depth      4.6
    mpeg      10.7
    qrd        4.1
    dct       10.2
    convolve   9.6
    fft        2.4
    average    6.9

A Look Inside an Application
  • Stereo Depth Extraction
    • 320x240 8-bit grayscale images
    • 30 disparity search
    • 220 frames/second
    • 12.7 GOPS
    • 5.7 GOPS/W

Stereo Depth Extractor
  [Figure: execution timelines for the clusters and the two memory streams (Mem_0, Mem_1) in two phases, Convolutions and Disparity Search; operations shown: LOAD, UNPACK, CONV 7x7, CONV 3x3, STORE, BlockSAD]
  • Convolutions
    • Load original packed row
    • Unpack (8-bit -> 16-bit)
    • 7x7 Convolve
    • 3x3 Convolve
    • Store convolved row
  • Disparity Search
    • Load convolved rows
    • Calculate BlockSADs at different disparities
    • Store best disparity values

7x7 Convolve Kernel
  [Figure: VLIW schedule of the 7x7 convolve kernel across the cluster units ADD0, ADD1, ADD2, MUL0, MUL1, DIV0, INP0 to INP3, OUT0, OUT1, SP_0, COM0, MC_0, JUK0, and VAL0; the slots are filled mostly with IADDS16 and IMULRND16 operations, along with SHUFFLE, SELECT, NSELECT, PASS, SHIFTA16, COMMUCPERM, COMMUCDATA, DATA_IN, DATA_OUT, and loop-control operations]
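
The disparity-search stage of the stereo depth extractor can be illustrated in a few lines. This is a minimal sketch of the idea only, not the Imagine kernel code: it works on one row at a time with a 1-D block, skips the 7x7 and 3x3 convolutions, and uses tiny parameters in place of the 320x240 images and 30 disparities on the slide. The function names and the convention that a left-image pixel at x matches the right-image pixel at x - d are assumptions made for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def best_disparities(left_row, right_row, block=3, max_disparity=4):
    """For each block in left_row, find the shift of right_row with the lowest SAD."""
    result = []
    for x in range(len(left_row) - block + 1):
        ref = left_row[x:x + block]
        best_d, best_cost = 0, None
        for d in range(max_disparity + 1):
            if x - d < 0:
                break  # shifted block would fall off the left edge
            cost = sad(ref, right_row[x - d:x - d + block])
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        result.append(best_d)
    return result

def depth_stream(row_pairs, **params):
    """Kernel applied to a stream: consumes (left, right) rows, yields disparity rows."""
    for left, right in row_pairs:
        yield best_disparities(left, right, **params)

# Synthetic input: the left row is the right row shifted by two pixels.
right = [10, 50, 90, 30, 70, 20, 60, 0, 0]
left = [0, 0] + right[:-2]
rows = list(depth_stream([(left, right)], block=3, max_disparity=4))
print(rows[0])
```

Away from the left edge the search recovers the two-pixel shift. Each output row depends only on one pair of input rows, so rows can be processed independently, which is the data parallelism the stream formulation exposes.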

Imagine gives high performance with low power and flexible programming
  • Matches capabilities of communication-limited technology to demands of signal and image processing applications
  • Performance
    • Compound stream operations realize >10GOPS on key applications
    • Can be extended by partitioning an application across several Imagines (TFLOPS on a circuit board)
  • Power
    • Three-level register hierarchy gives 2-10GOPS/W
  • Flexibility
    • Programmed in C
    • Streaming model
    • Conditional stream operations enable applications like sort

A look forward
  • Next steps
    • Build some Imagine prototypes
    • Dual-processor 40GFLOPS systems, 64-processor TeraFLOPS systems
  • Longer term
    • Industrial Strength Imagine: 100-200GFLOPS/chip
      • Multiple sets of arithmetic clusters per chip, higher clock rate, on-chip cache, more off-chip bandwidth
    • Graphics extensions
      • Texture cache, raster unit as SRF clients
    • A streaming supercomputer
      • 64-bit FP, high-bandwidth global memory, MIMD extensions
    • Simplified stream programming
      • Automate inter-cluster communication, partitioning into kernels, sub-word arithmetic, staging of data

Take home message
  • VLSI technology enables us to put TeraOPS on a chip
    • Conventional general-purpose architecture cannot exploit this
    • The problem is bandwidth
  • Casting an application as kernels operating on streams exposes locality and concurrency
  • A stream architecture exploits this locality and concurrency to achieve high arithmetic rates with limited bandwidth
    • Bandwidth hierarchy, compound stream operations
  • Imagine is a prototype stream processor
    • One chip: 20GFLOPS peak, 10GFLOPS sustained, 4W
  • Systems scale to TeraFLOPS and more
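
The system sizes quoted above follow from the single-chip figures if peak performance is assumed to scale linearly with chip count. That assumption is an idealization made for this sketch, since it ignores inter-chip communication, and the names below are illustrative.

```python
# Single-chip figures quoted on the slides.
PEAK_GFLOPS_PER_CHIP = 20.0
SUSTAINED_GFLOPS_PER_CHIP = 10.0
CHIP_POWER_W = 4.0

def system_peak_gflops(num_chips, per_chip=PEAK_GFLOPS_PER_CHIP):
    """Aggregate peak rate, assuming perfectly linear scaling across chips."""
    return num_chips * per_chip

dual = system_peak_gflops(2)    # dual-processor system
board = system_peak_gflops(64)  # 64-processor system
sustained_per_watt = SUSTAINED_GFLOPS_PER_CHIP / CHIP_POWER_W

print(f"dual-processor peak: {dual:.0f} GFLOPS")
print(f"64-processor peak:   {board / 1000:.2f} TFLOPS")
print(f"sustained per watt:  {sustained_per_watt:.1f} GFLOPS/W")
```

Under this assumption two chips give the 40GFLOPS quoted for the dual-processor system and 64 chips give 1.28 TFLOPS, consistent with the "64-processor TeraFLOPS systems" bullet.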
