Performance Implications of FIM Indirect Indexing


Performance Optimizations for running NIM on GPUs

Jacques Middlecoff, NOAA/OAR/ESRL/GSD/AB, [email protected]
Mark Govett, Tom Henderson, Jim Rosinski
01/17/20

Goal for NIM

Optimizations to be discussed

- NIM: the halo to be communicated between processors is packed and unpacked on the GPU
  - No copy of the entire variable to and from the CPU
  - About the same speed as the CPU
- Halo computation
- Overlapping communication with computation
- Mapped, pinned memory
- NVIDIA GPUDirect technology

Halo Computation

- Redundant computation to avoid communication: calculate values in the halo instead of receiving them via an MPI send
- Trades computation time for communication time
- GPUs create more opportunity for halo computation
- NIM already uses halo computation everywhere it does not require extra communication
- The next step for NIM is to look at halo computations that require new, but less frequent, communication

Overlapping Communication with Computation

- Works best with a co-processor to handle the communication
- Overlap the communication with other calculations between the point where a variable is set and the point where it is used

- Not enough computation time on the GPU to hide the communication
- Approach: calculate the perimeter first, then do the communication while calculating the interior
- Loop level: not enough computation on the GPU
- Subroutine level: not enough computation time
- Entire dynamics: not feasible for NIM

Overlapping Communication with Computation: Entire Dynamics

(Diagram: perimeter and interior regions of each processor's subdomain)

- 14 exchanges per time step
- 3-iteration Runge-Kutta loop, with the exchanges inside the RK loop
- Deferring the exchanges results in a 7-deep halo: way too much communication
- More halo computation? Move the exchanges out of the RK loop? Considerable code restructuring required either way.

Mapped, Pinned Memory: Theory

- Mapped, pinned memory is CPU memory:
  - Mapped so the GPU can access it across the PCIe bus
  - Page-locked so the OS can't swap it out
  - Available only in a limited amount
- Integrated GPUs: always a performance gain
- Discrete GPUs (what we have): advantageous only in certain cases

- The data is not cached on the GPU, so global loads and stores must be coalesced
- Zero-copy: both the GPU and the CPU can access the data

Mapped, Pinned Memory: Practice

- Packing the halo on the GPU (SendBuf = VAR) with zero-copy is 2.7x slower. Why?
- Unpacking the halo on the GPU (VAR = RecvBuf) with zero-copy is the same speed, with no copy needed
- Using mapped, pinned memory for a fast copy instead:
  - SendBuf is mapped and pinned
  - A regular GPU array (d_buff) is packed on the GPU
  - d_buff is copied to SendBuf
  - Twice as fast as copying d_buff to an ordinary CPU array

Mapped, Pinned Memory: Results

- NIM at 10242 horizontal points, 96 vertical levels, 10 processors
- Lowest value selected to avoid timing skew
- (Results charts not reproduced in this transcript)

NVIDIA GPUDirect Technology

- Eliminates the CPU from interprocessor communication
- Based on an interface between the GPU and InfiniBand
- Both devices share pinned memory buffers
- Data written by the GPU can be sent immediately by InfiniBand
- Overlapping communication with computation? With the CPU out of the path, there is no longer a co-processor to do the communication
- We have this technology but have yet to install it

Questions?
