SPIRAL: Automatic Implementation of Signal Processing Algorithms

SPIRAL: Automatic Implementation of Signal Processing Algorithms

Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science Drexel University www.spiral.net Motivation On modern machines operation count is not always the most important performance metric. Effective utilization of the memory hierarchy, pipelining, and Instruction Level Parallelism is important, and it is not easy to determine such utilization from source code. Automatic Performance Tuning and Architecture Adaptation Generate and Test FFT, Matrix Multiplication, Explain performance distribution LACSI 2006 Automatic Tuning of Libraries and Applications Outline Space of WHT Algorithms WHT Package and Performance Distribution Performance Model Instruction Count Cache

LACSI 2006 Automatic Tuning of Libraries and Applications Walsh-Hadamard Transform y = WHTN x, WHT N N = 2n n WHT2 ... WHT2 WHT WHT WHT 4 2 1 1 1 1 1 1 1 1 1 1 WHT2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 LACSI 2006 Automatic Tuning of Libraries and Applications Factoring the WHT Matrix AC DCD A A C) = (A C Im nmn 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 1 1 WHT 4 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 0 0

0 1 1 1 1 WHT2 WHT2WHT2WHT2 LACSI 2006 Automatic Tuning of Libraries and Applications Recursive Algorithm 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (WHT2 I4)(I2 (WHT2 I2) (I2 WHT2)) LACSI 2006 Automatic Tuning of Libraries and Applications 1 1 1 1

Iterative Algorithm 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (WHT2 I4)(I2 WHT2 I2) (I4 WHT2)) LACSI 2006 Automatic Tuning of Libraries and Applications 1

1 1 1 1 1 1 1 WHT Algorithms Recursive WHT N WHT2 IN / 2 I2WHT N / 2 Iterative n WHTN I2 i 1 i 1 WHT I n i 2 2 General t WHT2 I2 n

n1ni 1 i 1 where WHT n 2i I ni1nt , 2 n n1 nt LACSI 2006 Automatic Tuning of Libraries and Applications WHT Implementation N = N1* N2**Nt x = WHTN*x x Ni=2ni M =(x(b),x(b+s),x(b+(M-1)s)) b,s Implementation(nested loop) R=N; S=1; for i=t,,1 R=R/Ni for j=0,,R-1 for k=0,,S-1 x NjNi i S k , S WHTNi x NjNi i S k ,S S=S* Ni; t WHT2n (I 2 n1+ + ni-1WHT2n iI 2 ni+1+ + nt) i1

LACSI 2006 Automatic Tuning of Libraries and Applications Partition Trees 9 Left Recursive 4 3 4 1 1 2 1 3 2 1 1 Right Recursive 4 2 1 1 1 1 Balanced 1 1 1 2 1 1

2 1 1 LACSI 2006 Automatic Tuning of Libraries and Applications 2 1 4 Iterative 4 1 3 1 1 Number of Algorithms 1 T , n 1 T n n 1 t n n n Tn 1 t 1, n 1 T( z) T z z n n n 0

T( z) T 3 4 2 z 6 z 24 z 1 1 8z 8 z z T( z ) (1 z ) (1 T( z )) 2( 2 2 z ) n n 2 3/ 2 ( / n ), 6.8 LACSI 2006 Automatic Tuning of Libraries and Applications 2 Outline WHT Algorithms WHT Package and Performance Distribution Performance Model Instruction Count Cache

LACSI 2006 Automatic Tuning of Libraries and Applications WHT Package Pschel & Johnson (ICASSP 00) Allows easy implementation of any of the possible WHT algorithms Partition tree representation W(n)=small[n] | split[W(n1),W(nt)] Tools Measure runtime of any algorithm Measure hardware events (coupled with PCL/PAPI) Search for good implementation Dynamic programming Evolutionary algorithm LACSI 2006 Automatic Tuning of Libraries and Applications Algorithm Comparison Rec &Bal/It Instruction Count 2.00E+00 1.80E+00 1.60E+00 1.40E+00 1.20E+00 2.5 2 r1/i1 1.00E+00 8.00E-01 6.00E-01 4.00E-01 2.00E-01 0.00E+00 1.5 rr1/i1

1 lr1/i1 bal1/i1 0.5 19 17 15 13 11 9 7 1 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 3 ratio Recursive/Iterative Runtime WHT size(2^n) Rec&It/Best Runtime Small/It Runtime 1.20E+01 r1/b r3/b i1/b i3/b b/b ratio

8.00E+00 6.00E+00 4.00E+00 2.00E+00 ratio 1.00E+01 1.20E+01 1.00E+01 8.00E+00 6.00E+00 4.00E+00 2.00E+00 0.00E+00 I_1/rt r_1/rt 1 0.00E+00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 2 3 4 5 WHT size(2^n) WHT size(2^n) LACSI 2006 Automatic Tuning of Libraries and Applications 6 7 8 Cache Miss Data Recursive vs. Iterative

Recursive vs. Iterative Normalized to Best 1.60E+00 1.40E+00 1.00E+01 8.00E+00 Recursive Time 6.00E+00 Iterative Time 4.00E+00 2.00E+00 Ratio Recursive/Iterative 1.20E+00 Instructions 1.00E+00 L1 Data Cache Misses 8.00E-01 6.00E-01 L2 Data Cache Misses 4.00E-01 2.00E-01 22 19 16 size size Recursive vs. Best Iterative vs. Best

6.00E+00 9.00E+00 7.00E+00 Instructions 6.00E+00 5.00E+00 4.00E+00 L1 Data Cache Misses 3.00E+00 L2 Cache Misses 2.00E+00 1.00E+00 Ratio Recursive/Iterative 8.00E+00 5.00E+00 Instructions 4.00E+00 3.00E+00 L1 Data Cache Misses 2.00E+00 L2 Cache Misses 1.00E+00 size 22 19 16

13 10 7 4 22 19 16 13 7 4 10 size 1 0.00E+00 0.00E+00 1 Ratio Iterative/Best 13 10 1 22 19 16 13 10

7 4 1 7 0.00E+00 0.00E+00 4 Ratio Alg Time/Best Time 1.20E+01 LACSI 2006 Automatic Tuning of Libraries and Applications Histogram (n = 16, 10,000 samples) Wide range in performance despite equal number of arithmetic operations (n2n flops) Pentium III vs. UltraSPARC II LACSI 2006 Automatic Tuning of Libraries and Applications Outline WHT Algorithms WHT Package and Performance Distribution Performance Model Instruction Count Cache LACSI 2006 Automatic Tuning of Libraries and Applications WHT Implementation N = N1* N2Nt x = WHTN*x x

Ni=2ni M =(x(b),x(b+s),x(b+(M-1)s)) b,s Implementation(nested loop) R=N; S=1; for i=t,,1 R=R/Ni for j=0,,R-1 for k=0,,S-1 x NjNi i S k , S WHTNi x NjNi i S k ,S S=S* Ni; t WHT2n (I 2 n1+ + ni-1WHT2n iI 2 ni+1+ + nt) i1 LACSI 2006 Automatic Tuning of Libraries and Applications Instruction Count Model 3 IC(n) A(n) i 1 3 ( n) l A ( n) L i i l 1 l A(n) = number of calls to WHT procedure = number of instructions outside loops Al(n) = Number of calls to base case of size l l = number of instructions in base case of size l Li = number of iterations of outer (i=1), middle (i=2), and

outer (i=3) loop i = number of instructions in outer (i=1), middle (i=2), and outer (i=3) loop body LACSI 2006 Automatic Tuning of Libraries and Applications Small[1] .file "s_1.c" .version "01.01" gcc2_compiled.: .text .align 4 .globl apply_small1 .type apply_small1,@function apply_small1: movl 8(%esp),%edx //load stride S to EDX movl 12(%esp),%eax //load x array's base address to EAX fldl (%eax) // st(0)=R7=x[0] fldl (%eax,%edx,8) //st(0)=R6=x[S] fld %st(1) //st(0)=R5=x[0] fadd %st(1),%st // R5=x[0]+x[S] fxch %st(2) //st(0)=R5=x[0],s(2)=R7=x[0]+x[S] fsubp %st,%st(1) //st(0)=R6=x[S]-x[0] ????? fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[S]-x[0] fstpl (%eax) //store x[0]=x[0]+x[S] fstpl (%eax,%edx,8) //store x[0]=x[0]-x[S] ret LACSI 2006 Automatic Tuning of Libraries and Applications Recurrences t n ni A(

A(n) 1 2 i 1 n ), n n ... n i 1 t n a leaf n l Al (n) l 2 , where l number of leaves l n n ( n ) t 2 L ( n ), n n ... n L A(n) 0, t 1 1 i 1 t n ni L ( n ) 2 2 i i 1 i

... n 1 n i 1 , L2 ( ni) 2 t n ni n ni , ( n ) ( ) 2 L2 ni 2 L3 i 1 L (n) 0, n i 1 t n n ... n 1 t n n ... n 1 a leaf LACSI 2006 Automatic Tuning of Libraries and Applications t Histogram using Instruction Model (P3) l = 12, l = 34, and l = 106 = 27 1 = 18, 2 = 18, and 1 = 20 LACSI 2006 Automatic Tuning of Libraries and Applications

Cache Model Different WHT algorithms access data in different patterns All algorithms with the same set of leaf nodes have the same number of memory accesses Count misses for accesses to data array Parameterized by cache size, associativity, and block size simulate using program traces (restrict to data vector accesses) Analytic formula? LACSI 2006 Automatic Tuning of Libraries and Applications Blocked Access 4 1 1 0 WHT16 (WHT2 I 8 )( I 2 WHT8 ) 3 1 (WHT2 I 8 )( I 2 (WHT2 I 4 )( I 2 WHT4 )) 2 2 3 4 5 6 7 8 9 10 11

12 13 LACSI 2006 Automatic Tuning of Libraries and Applications 14 15 Interleaved Access 4 WHT16 (WHT8 I 2 )( I 8 WHT2 ) 3 2 0 ((WHT4 I 2 )( I 4 WHT2 ) I 2 )( I 8 WHT2 ) 1 1 1 2 3 4 5 6 7 8 9 10 11

12 LACSI 2006 Automatic Tuning of Libraries and Applications 13 14 15 Cache Simulator 4 3 2 4 1 1 1 3 1 144 memory accesses C = 4, A = 1, B = 1 (80, 112) C = 4, A = 4, B = 1 (48, 48) C = 4, A = 1, B = 2 (72, 88) Iterative vs. Recursive (192 memory accesses) C = 4, A = 1, B = 1 (128, 112) LACSI 2006 Automatic Tuning of Libraries and Applications 2 Cache Misses as a Function of Cache Size C=22 C=24

C=23 C=25 LACSI 2006 Automatic Tuning of Libraries and Applications Formula for Cache Misses M(L,WN,R) = Number of misses for LWHTN R l r M(2 ,W N , 2 ) N if r N C/ 2 3N if W N is a leaf t i 1 2n niM(n1 ... ni 1 ,W Ni , r ni ... nt) LACSI 2006 Automatic Tuning of Libraries and Applications Closed Form M(L,WN,R) = Number of misses for LWHTN R M(0,W_n,0) = 3(n-c)*2n + k*2n C = 2c, k = number of parts in the rightmost c positions c = 3, n = 4

Iterative Balanced 4 1 1 4 4 1 k=3 Right Recursive 1 2 1 1 2 1 1 1 k=2 LACSI 2006 Automatic Tuning of Libraries and Applications k=1 3 1 2 1 1 Summary of Results and Future Work

Instruction Count Model min, max, expected value, variance, limiting distribution Cache Model Direct mapped (closed form solution, distribution, expected value, and variance) Combine models Extend cache formula to include A and B Use as heuristic to limit search and predict performance LACSI 2006 Automatic Tuning of Libraries and Applications Sponsors Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting, DESA: Intelligent HW-SW Compilers for Signal Processing Applications, and NSF ITR/ NGS #0325687: Intelligent HW/SW Compilers for DSP. LACSI 2006 Automatic Tuning of Libraries and Applications

Recently Viewed Presentations

  • Meeting the Challenge of Common Core: Planning Close

    Meeting the Challenge of Common Core: Planning Close

    Reader Response (Louise Rosenblatt): The readers' feelings/ meanings are what matters. New Criticism (Robert Penn Warren): The meaning is in the text and the text must be read closely to get it to give up its meaning. Many versions of...
  • Effects of Micronized Cartilage Matrix on Cartilage Repair

    Effects of Micronized Cartilage Matrix on Cartilage Repair

    Giza E, Sullivan M, Ocel D, Lundeen G, Mitchell ME, Veris L, et al. Matrix-induced autologous chondrocyte implantation of talus articular defects. Foot & ankle international/ American Orthopaedic Foot and Ankle Society [and] Swiss Foot and Ankle Society. 2010 Sep;...
  • Regular -er Verbs

    Regular -er Verbs

    AIMER is one of a large group of verbs in French called -er verbs. All these verbs end in -er in the infinitive. Can you think of any other -er verbs you know? All -er verbs share the same conjugation....
  • Fashion History B.C. to Present Times B.C.  Fashions

    Fashion History B.C. to Present Times B.C. Fashions

    1400-1500's the 15th & 16th Century. Fashion in this era was greatly influenced by the high class society and the European Kings and Queens. The Surcoat was a popular fashion accessory for the men of this era. Peasants had a...
  • ソフトウェア工学 - Oasis

    ソフトウェア工学 - Oasis

    Candidate measures must be selected, with clear links to the information needs. Measures must then be selected based on the priorities of the information needs and other criteria such as cost of collection, degree of process disruption during collection, ease...
  • Understanding Our Environment - Kenyon College

    Understanding Our Environment - Kenyon College

    Biotic Potential - Maximum reproductive rate of an organism (births minus deaths in optimal conditions). Carrying Capacity - Maximum number of individuals of any species that can be indefinitely supported. Growth to a Stable Population.
  • Slope of a Line - City University of New York

    Slope of a Line - City University of New York

    When the terms of a sequence are added, the resulting expression is a series. A series can be infinite or finite. Finite series: Infinite series: The sum of the first terms of the sequence is called the kth partial sum...
  • FINANCE BILL 2019 - baroda-icai.org

    FINANCE BILL 2019 - baroda-icai.org

    Goods and Services Tax. Procedure for Registration [Sec. 25] Undergo authentication or furnish Aadhaar No. Facility of Digital Payment [Sec. 31A] Online Invoicing - < Rs.5 Crores