
A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications
Jaewoong Sim, Aniruddha Dasgupta, Hyesoon Kim, Richard Vuduc

Outline (2/26)
| Motivation
| GPUPerf: Performance analysis framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion

GPGPU Programming (3/26)
| GPGPU architectures have become very powerful.
| Programmers want to convert CPU applications to GPGPU applications.
  Case 1: 10x speed-up (CPU version -> GPGPU version)
  Case 2: 1.1x speed-up (CPU version -> GPGPU version)
| For case 2, programmers might wonder why the benefit is so poor:
  Maybe the algorithm is not parallelizable.
  Maybe the GPGPU code is not well optimized.
| For case 1, programmers might wonder whether 10x is the best achievable speed-up.
| Either way, programmers want to optimize code whenever possible!

GPGPU Optimizations (4/26)
| Optimizing parallel programs is difficult^100!
| [Chart: normalized performance of several optimization variants; annotations "Best for this kernel" and "Still the best!"]
| Most programmers apply optimization techniques one by one.
  Try one more optimization, e.g., with shared memory. Which one to choose?
| Programmers want to understand the performance benefit!

GPGPU Performance Guidance (5/26)
| Providing performance guidance is not easy.
  Program analysis: obtain as much program information as possible.
  Performance modeling: have a sophisticated analytical model.
  User-friendly metrics: convert the performance analysis information into performance guidance.
| We propose GPUPerf, a performance analysis framework that quantitatively predicts potential performance benefits.
| In this talk, we focus mostly on performance modeling and the potential benefit metrics.

Outline (6/26)
| Motivation
| GPUPerf: Performance Analysis Framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion

GPUPerf Overview (7/26)
| What is required for performance guidance?
  Program analysis -> Frontend Data Collector (produces ILP, #insts, ...)
  Performance modeling -> Analytical Model (produces the model output)
  User-friendly metrics -> Performance Advisor (produces the benefit metrics)
| These three components, applied in sequence to a GPGPU kernel, form GPUPerf.
| For clarity, each component is explained in reverse order.

Performance Advisor (8/26)
| Goal of the performance advisor:
  Convey performance bottleneck information.
  Estimate the potential gains from reducing the bottlenecks.
| The performance advisor provides four potential benefit metrics, computed from the outputs of our analytical model:
  Bitilp: benefit of increasing ITILP (inter-thread instruction-level parallelism)
  Bmemlp: benefit of increasing MLP (memory-level parallelism)
  Bserial: benefit of removing serialization effects
  Bfp: benefit of improving computing inefficiency
| With these metrics, programmers can get an idea of the potential benefit of a GPGPU kernel.

Previous Work: The MWP-CWP Model [Hong and Kim, ISCA'09] (9/26)
| MWP (Memory Warp Parallelism): an indicator of memory-level parallelism.
| CWP (Compute Warp Parallelism): the compute-side counterpart.
| [Diagram: 8 warps interleaving Mem and Comp periods; MWP = 4, CWP = 3.]
| Depending on MWP and CWP, the execution time is predicted by the model.
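To make the advisor's output concrete, here is a minimal Python sketch of the four metrics as a record plus a helper that turns them into a tuning hint. This is an illustration only; the class and function names are hypothetical and not part of GPUPerf.

    from dataclasses import dataclass

    @dataclass
    class BenefitMetrics:
        """The four potential-benefit metrics named on the slide."""
        b_itilp: float   # benefit of increasing inter-thread ILP
        b_memlp: float   # benefit of increasing memory-level parallelism
        b_serial: float  # benefit of removing serialization effects
        b_fp: float      # benefit of improving computing inefficiency

    def dominant_bottleneck(m: BenefitMetrics) -> str:
        """Hypothetical helper: report the largest potential benefit,
        i.e., where tuning effort is likely to pay off first."""
        metrics = {
            "increase ITILP": m.b_itilp,
            "increase MLP": m.b_memlp,
            "remove serialization effects": m.b_serial,
            "improve computing efficiency": m.b_fp,
        }
        return max(metrics, key=metrics.get)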

| The MWP-CWP model can predict general cases.
| Problem: it does not model corner cases, which is critical for predicting the benefits of different program optimizations!

Analytical Model (10/26)
| Our analytical model follows a top-down approach:
  The model components are easy to interpret.
  They relate directly to performance bottlenecks.
| Texec = Tcomp + Tmem - Toverlap
  Texec: final execution time
  Tcomp: computation time
  Tmem: memory time
  Toverlap: time during which memory accesses are overlapped with computation
| [Timeline diagram: 4 warps interleaving Comp and Mem periods; MWP = 2.]
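As a worked instance of the top-level equation (the numbers are made up, purely for illustration): if a kernel spends Tcomp = 60 us on computation and Tmem = 50 us on memory, and 40 us of the memory time overlaps with computation, then

    Texec = Tcomp + Tmem - Toverlap = 60 + 50 - 40 = 70 us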

Analytical Model (11/26)
| Tcomp is the amount of time to execute compute instructions:
  Tcomp = Wparallel + Wserial
  Wparallel: work executed in parallel (useful work)
  Wserial: overhead due to serialization effects

Analytical Model (12/26)
| Wparallel is the amount of work that can be executed in parallel:
  Wparallel = Total insts / Effective inst. throughput
  Effective inst. throughput = f(warp_size, SIMD_width, # pipeline stages)
| ITILP represents the number of instructions that can be executed in parallel in the pipeline.

ITILP (Inter-thread ILP) (13/26)
| ITILP is inter-thread ILP: the independent instructions available across all concurrently executing warps.
  ITILP = MIN(ILP x N, ITILPmax), where N is the number of warps (TLP) and ITILPmax is an upper bound determined by the average instruction latency (avg_inst_lat).
| Example from the slide: with per-warp ILP = 4/3, TLP = 1 gives ITILP = 4/3, TLP = 2 gives ITILP = 8/3, and TLP = 3 reaches ITILP = ITILPmax.
| As TLP increases, stall time is reduced; once ITILP reaches ITILPmax, the execution latency is already fully hidden.
| [Chart: execution time (msec) vs. TLP, comparing Actual, Model (MWP-CWP), and Model (New).]
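The two formulas above can be written down directly; the sketch below is illustrative Python, and the exact form of the effective instruction throughput (and of ITILPmax) is an input here rather than something the sketch derives.

    def itilp(ilp_per_warp: float, n_warps: int, itilp_max: float) -> float:
        """Slide formula: ITILP = MIN(ILP x N, ITILPmax), with N the number of
        concurrently running warps (TLP)."""
        return min(ilp_per_warp * n_warps, itilp_max)

    def w_parallel(total_insts: int, effective_throughput: float) -> float:
        """Slide formula: Wparallel = total insts / effective inst. throughput,
        where the throughput is some function of warp size, SIMD width,
        and the number of pipeline stages."""
        return total_insts / effective_throughput

    # Matching the slide's illustration: per-warp ILP = 4/3.
    print(itilp(4/3, 1, itilp_max=8.0))  # 1.33... (TLP = 1)
    print(itilp(4/3, 2, itilp_max=8.0))  # 2.66... (TLP = 2)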

Analytical Model (14/26)
| Wserial represents the overhead due to serialization effects:
  Wserial = Osync + OSFU + OCFDiv + Obank
  Osync: synchronization overhead
  OSFU: SFU contention overhead
  OCFDiv: branch-divergence overhead
  Obank: bank-conflict overhead

SFU Overhead (15/26)
| GPGPUs have SFUs (special function units) where expensive operations can be executed.
| With a high ratio of regular instructions to SFU instructions, the SFU execution cost can be hidden; with a low ratio, the OSFU overhead appears.
| [Chart: execution time (msec) vs. # of SFU insts. per eight FMA insts. (1-7), Actual vs. Model (MWP-CWP); the latency of SFU instructions is not completely hidden in this case.]
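A minimal sketch of how the serialization term composes with the earlier Tcomp equation; how each overhead term is actually measured or modeled is beyond this sketch.

    def w_serial(o_sync: float, o_sfu: float, o_cfdiv: float, o_bank: float) -> float:
        """Slide formula: Wserial = Osync + OSFU + OCFDiv + Obank.
        OSFU, for example, becomes significant when the ratio of regular
        instructions to SFU instructions is too low to hide SFU latency."""
        return o_sync + o_sfu + o_cfdiv + o_bank

    def t_comp(w_par: float, w_ser: float) -> float:
        """Earlier slide formula: Tcomp = Wparallel + Wserial."""
        return w_par + w_ser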

Analytical Model (16/26)
| Tmem represents the amount of time spent on memory requests and transfers:
  Tmem = Effective mem. requests x AMAT
| The effective number of requests accounts for MWP, the number of memory requests that can be overlapped:
  MWP = 2: Tmem = 4 MEM / 2
  MWP = 1: Tmem = 4 MEM / 1
  (four memory requests, each with latency MEM)
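A small sketch of the memory-time formula, assuming (as the 4MEM/2 and 4MEM/1 cases in the diagram suggest) that the effective request count is the raw request count divided by MWP; the latency value below is made up.

    def t_mem(n_requests: int, amat: float, mwp: float) -> float:
        """Slide formula: Tmem = effective mem. requests x AMAT,
        with the effective count assumed to be n_requests / MWP."""
        return (n_requests / mwp) * amat

    MEM = 400.0  # hypothetical average memory access time (cycles)
    print(t_mem(4, MEM, mwp=2))  # 800.0  = 4*MEM/2
    print(t_mem(4, MEM, mwp=1))  # 1600.0 = 4*MEM/1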

Analytical Model (17/26)
| Toverlap represents how much of the memory cost can be hidden by multi-threading.
| Case MWP >= CWP (e.g., MWP = 3, CWP = 3): all the memory costs are overlapped with computation, so
  Toverlap = Tmem

Analytical Model (18/26)
| Case CWP > MWP (e.g., MWP = 2, CWP = 4): the computation cost is hidden by the memory cost, so
  Toverlap = Tcomp
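The overlap rule on these two slides reduces to a simple case split, which closes the top-level equation; a compact sketch in Python (the full model has more cases, so treat this as an illustration):

    def t_overlap(t_comp: float, t_mem: float, mwp: float, cwp: float) -> float:
        """When MWP >= CWP, the memory cost is fully hidden by computation
        (Toverlap = Tmem); when CWP > MWP, the computation cost is hidden
        by memory (Toverlap = Tcomp)."""
        return t_mem if mwp >= cwp else t_comp

    def t_exec(t_comp: float, t_mem: float, mwp: float, cwp: float) -> float:
        """Top-level model: Texec = Tcomp + Tmem - Toverlap."""
        return t_comp + t_mem - t_overlap(t_comp, t_mem, mwp, cwp)

    # Made-up numbers for the two cases on the slides:
    print(t_exec(t_comp=60.0, t_mem=50.0, mwp=3, cwp=3))  # 60.0: memory fully hidden
    print(t_exec(t_comp=30.0, t_mem=80.0, mwp=2, cwp=4))  # 80.0: computation hidden by memory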

Benefit Chart (19/26)
| The time metrics are converted into potential benefit metrics.
| The potential benefit chart plots computation cost against memory cost: the optimized kernel sits at its measured computation and memory cost, while the ideal kernel sits at (Tfp, Tmem_min).
  Tfp: ideal computation cost
  Tmem_min: ideal memory cost
| Benefit metrics:
  Bmemlp: benefit of increasing MLP
  Bserial: benefit of removing serialization effects
  Bitilp: benefit of increasing inter-thread ILP
  Bfp: benefit of improving computing efficiency
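To make the chart concrete, here is one simple way to read it in code. The decomposition is an assumption for illustration (the paper defines the exact formulas, the slide does not); in particular, the compute-side split between Bitilp and Bfp is lumped together here.

    def potential_benefits(t_comp: float, t_mem: float, w_ser: float,
                           t_fp: float, t_mem_min: float) -> dict:
        """Hypothetical reading of the benefit chart: the kernel sits at
        (comp cost, mem cost) = (t_comp, t_mem), the ideal kernel at
        (t_fp, t_mem_min); the gaps between the two points are reported
        as potential benefits."""
        return {
            "Bmemlp": max(t_mem - t_mem_min, 0.0),           # memory-side gap
            "Bserial": w_ser,                                # removable serialization overhead
            "Bitilp + Bfp": max(t_comp - w_ser - t_fp, 0.0), # remaining compute-side gap
        }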

Frontend Data Collector (20/26)
| The CUDA executable is run through the Compute Visual Profiler and through Ocelot [Diamos et al., PACT10] with an Instruction Analyzer (IA); static analysis tools process the CUDA binary (CUBIN).
| Architecture-related information from H/W counters: #insts, global LD/ST requests, cache info.
| Detailed information from emulating PTX executions: #SFU insts, #sync insts, loop counters.
| Information from the CUDA binaries instead of PTX, i.e., after low-level compiler optimizations: ILP, MLP.
| The collected information (#insts, occupancy, ILP, MLP, ...) is fed into our analytical model.
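A hypothetical container for what the frontend gathers, just to show how the pieces fit together before they reach the analytical model; the field names follow the slide, but the structure itself is invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class KernelProfile:
        # From hardware counters (Compute Visual Profiler)
        n_insts: int
        global_ld_st_requests: int
        cache_info: dict
        occupancy: float
        # From emulating PTX executions (Ocelot / Instruction Analyzer)
        n_sfu_insts: int
        n_sync_insts: int
        loop_counters: dict
        # From static analysis of the CUDA binary (after low-level compiler optimizations)
        ilp: float
        mlp: float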

Outline (21/26)
| Motivation
| GPUPerf: A Performance Analysis Framework
  Performance Advisor
  Analytical Model
  Frontend Data Collector
| Evaluations
| Conclusion

Evaluation (22/26)
| NVIDIA C2050 (Fermi architecture).
| FMM (Fast Multipole Method): an approximation of the n-body problem [winner of the 2010 Gordon Bell Prize at Supercomputing].
| Optimizations: prefetching, SFU, vector packing, loop unrolling (unroll-jam), shared memory, and loop optimization -- 44 optimization combinations in total.
| Parboil benchmarks and Reduction are also evaluated (in the paper).

Results - FMM (23/26)
| [Chart: actual speedup over the unoptimized kernel for the 44 optimization combinations.]
| The vector packing + shared memory + unroll-jam + SFU combination shows the best performance.

Results - FMM (24/26)
| [Chart: predicted vs. actual speedup over the unoptimized kernel for the 44 optimization combinations.]
| Our model follows the actual speed-up trend quite well.
| Our model correctly pinpoints the best optimization combination for the kernel.

Results - Potential Benefits (25/26)
| [Chart: speedup (left axis) and normalized Bfp (right axis, higher is worse) for baseline, vecpack, vecpack_rsqrt, vecpack_rsqrt_shmem, vecpack_rsqrt_shmem_ujam, and vecpack_rsqrt_shmem_ujam_pref.]
| Bfp (computing inefficiency) implies that the kernel could be improved via optimizations.
| The small Bfp value indicates that adding prefetching (pref) does not lead to further performance improvement.

Conclusion (26/26)
| We propose GPUPerf, a performance analysis framework consisting of a frontend data collector, an analytical model, and a performance advisor.
| The performance advisor provides potential benefit metrics (Bmemlp, Bserial, Bitilp, Bfp), which can guide performance tuning of GPGPU code.
| The 44 optimization combinations in FMM are predicted well.
| Future work: the potential benefit metrics can serve as inputs to compilers.
