Performance Analysis of Divide and Conquer Algorithms for the WHT Jeremy Johnson Mihai Furis, Pawel Hitczenko, Hung-Jen Huang Dept. of Computer Science Drexel University www.spiral.net Motivation On modern machines operation count is not always the most important performance metric. Effective utilization of the memory hierarchy, pipelining, and Instruction Level Parallelism is important, and it is not easy to determine such utilization from source code. Automatic Performance Tuning and Architecture Adaptation Generate and Test FFT, Matrix Multiplication, Explain performance distribution LACSI 2006 Automatic Tuning of Libraries and Applications Outline Space of WHT Algorithms WHT Package and Performance Distribution Performance Model Instruction Count Cache

LACSI 2006 Automatic Tuning of Libraries and Applications Walsh-Hadamard Transform y = WHTN x, WHT N N = 2n n WHT2 ... WHT2 WHT WHT WHT 4 2 1 1 1 1 1 1 1 1 1 1 WHT2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 LACSI 2006 Automatic Tuning of Libraries and Applications Factoring the WHT Matrix AC DCD A A C) = (A C Im nmn 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 1 1 WHT 4 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 0 1 0 0 0 0 0

0 1 1 1 1 WHT2 WHT2WHT2WHT2 LACSI 2006 Automatic Tuning of Libraries and Applications Recursive Algorithm 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (WHT2 I4)(I2 (WHT2 I2) (I2 WHT2)) LACSI 2006 Automatic Tuning of Libraries and Applications 1 1 1 1

Iterative Algorithm 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (WHT2 I4)(I2 WHT2 I2) (I4 WHT2)) LACSI 2006 Automatic Tuning of Libraries and Applications 1

1 1 1 1 1 1 1 WHT Algorithms Recursive WHT N WHT2 IN / 2 I2WHT N / 2 Iterative n WHTN I2 i 1 i 1 WHT I n i 2 2 General t WHT2 I2 n

n1ni 1 i 1 where WHT n 2i I ni1nt , 2 n n1 nt LACSI 2006 Automatic Tuning of Libraries and Applications WHT Implementation N = N1* N2**Nt x = WHTN*x x Ni=2ni M =(x(b),x(b+s),x(b+(M-1)s)) b,s Implementation(nested loop) R=N; S=1; for i=t,,1 R=R/Ni for j=0,,R-1 for k=0,,S-1 x NjNi i S k , S WHTNi x NjNi i S k ,S S=S* Ni; t WHT2n (I 2 n1+ + ni-1WHT2n iI 2 ni+1+ + nt) i1

LACSI 2006 Automatic Tuning of Libraries and Applications Partition Trees 9 Left Recursive 4 3 4 1 1 2 1 3 2 1 1 Right Recursive 4 2 1 1 1 1 Balanced 1 1 1 2 1 1

2 1 1 LACSI 2006 Automatic Tuning of Libraries and Applications 2 1 4 Iterative 4 1 3 1 1 Number of Algorithms 1 T , n 1 T n n 1 t n n n Tn 1 t 1, n 1 T( z) T z z n n n 0

T( z) T 3 4 2 z 6 z 24 z 1 1 8z 8 z z T( z ) (1 z ) (1 T( z )) 2( 2 2 z ) n n 2 3/ 2 ( / n ), 6.8 LACSI 2006 Automatic Tuning of Libraries and Applications 2 Outline WHT Algorithms WHT Package and Performance Distribution Performance Model Instruction Count Cache

LACSI 2006 Automatic Tuning of Libraries and Applications WHT Package Pschel & Johnson (ICASSP 00) Allows easy implementation of any of the possible WHT algorithms Partition tree representation W(n)=small[n] | split[W(n1),W(nt)] Tools Measure runtime of any algorithm Measure hardware events (coupled with PCL/PAPI) Search for good implementation Dynamic programming Evolutionary algorithm LACSI 2006 Automatic Tuning of Libraries and Applications Algorithm Comparison Rec &Bal/It Instruction Count 2.00E+00 1.80E+00 1.60E+00 1.40E+00 1.20E+00 2.5 2 r1/i1 1.00E+00 8.00E-01 6.00E-01 4.00E-01 2.00E-01 0.00E+00 1.5 rr1/i1

1 lr1/i1 bal1/i1 0.5 19 17 15 13 11 9 7 1 5 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 3 ratio Recursive/Iterative Runtime WHT size(2^n) Rec&It/Best Runtime Small/It Runtime 1.20E+01 r1/b r3/b i1/b i3/b b/b ratio

8.00E+00 6.00E+00 4.00E+00 2.00E+00 ratio 1.00E+01 1.20E+01 1.00E+01 8.00E+00 6.00E+00 4.00E+00 2.00E+00 0.00E+00 I_1/rt r_1/rt 1 0.00E+00 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 2 3 4 5 WHT size(2^n) WHT size(2^n) LACSI 2006 Automatic Tuning of Libraries and Applications 6 7 8 Cache Miss Data Recursive vs. Iterative

Recursive vs. Iterative Normalized to Best 1.60E+00 1.40E+00 1.00E+01 8.00E+00 Recursive Time 6.00E+00 Iterative Time 4.00E+00 2.00E+00 Ratio Recursive/Iterative 1.20E+00 Instructions 1.00E+00 L1 Data Cache Misses 8.00E-01 6.00E-01 L2 Data Cache Misses 4.00E-01 2.00E-01 22 19 16 size size Recursive vs. Best Iterative vs. Best

6.00E+00 9.00E+00 7.00E+00 Instructions 6.00E+00 5.00E+00 4.00E+00 L1 Data Cache Misses 3.00E+00 L2 Cache Misses 2.00E+00 1.00E+00 Ratio Recursive/Iterative 8.00E+00 5.00E+00 Instructions 4.00E+00 3.00E+00 L1 Data Cache Misses 2.00E+00 L2 Cache Misses 1.00E+00 size 22 19 16

13 10 7 4 22 19 16 13 7 4 10 size 1 0.00E+00 0.00E+00 1 Ratio Iterative/Best 13 10 1 22 19 16 13 10

7 4 1 7 0.00E+00 0.00E+00 4 Ratio Alg Time/Best Time 1.20E+01 LACSI 2006 Automatic Tuning of Libraries and Applications Histogram (n = 16, 10,000 samples) Wide range in performance despite equal number of arithmetic operations (n2n flops) Pentium III vs. UltraSPARC II LACSI 2006 Automatic Tuning of Libraries and Applications Outline WHT Algorithms WHT Package and Performance Distribution Performance Model Instruction Count Cache LACSI 2006 Automatic Tuning of Libraries and Applications WHT Implementation N = N1* N2Nt x = WHTN*x x

Ni=2ni M =(x(b),x(b+s),x(b+(M-1)s)) b,s Implementation(nested loop) R=N; S=1; for i=t,,1 R=R/Ni for j=0,,R-1 for k=0,,S-1 x NjNi i S k , S WHTNi x NjNi i S k ,S S=S* Ni; t WHT2n (I 2 n1+ + ni-1WHT2n iI 2 ni+1+ + nt) i1 LACSI 2006 Automatic Tuning of Libraries and Applications Instruction Count Model 3 IC(n) A(n) i 1 3 ( n) l A ( n) L i i l 1 l A(n) = number of calls to WHT procedure = number of instructions outside loops Al(n) = Number of calls to base case of size l l = number of instructions in base case of size l Li = number of iterations of outer (i=1), middle (i=2), and

outer (i=3) loop i = number of instructions in outer (i=1), middle (i=2), and outer (i=3) loop body LACSI 2006 Automatic Tuning of Libraries and Applications Small[1] .file "s_1.c" .version "01.01" gcc2_compiled.: .text .align 4 .globl apply_small1 .type apply_small1,@function apply_small1: movl 8(%esp),%edx //load stride S to EDX movl 12(%esp),%eax //load x array's base address to EAX fldl (%eax) // st(0)=R7=x[0] fldl (%eax,%edx,8) //st(0)=R6=x[S] fld %st(1) //st(0)=R5=x[0] fadd %st(1),%st // R5=x[0]+x[S] fxch %st(2) //st(0)=R5=x[0],s(2)=R7=x[0]+x[S] fsubp %st,%st(1) //st(0)=R6=x[S]-x[0] ????? fxch %st(1) //st(0)=R6=x[0]+x[S],st(1)=R7=x[S]-x[0] fstpl (%eax) //store x[0]=x[0]+x[S] fstpl (%eax,%edx,8) //store x[0]=x[0]-x[S] ret LACSI 2006 Automatic Tuning of Libraries and Applications Recurrences t n ni A(

A(n) 1 2 i 1 n ), n n ... n i 1 t n a leaf n l Al (n) l 2 , where l number of leaves l n n ( n ) t 2 L ( n ), n n ... n L A(n) 0, t 1 1 i 1 t n ni L ( n ) 2 2 i i 1 i

... n 1 n i 1 , L2 ( ni) 2 t n ni n ni , ( n ) ( ) 2 L2 ni 2 L3 i 1 L (n) 0, n i 1 t n n ... n 1 t n n ... n 1 a leaf LACSI 2006 Automatic Tuning of Libraries and Applications t Histogram using Instruction Model (P3) l = 12, l = 34, and l = 106 = 27 1 = 18, 2 = 18, and 1 = 20 LACSI 2006 Automatic Tuning of Libraries and Applications

Cache Model Different WHT algorithms access data in different patterns All algorithms with the same set of leaf nodes have the same number of memory accesses Count misses for accesses to data array Parameterized by cache size, associativity, and block size simulate using program traces (restrict to data vector accesses) Analytic formula? LACSI 2006 Automatic Tuning of Libraries and Applications Blocked Access 4 1 1 0 WHT16 (WHT2 I 8 )( I 2 WHT8 ) 3 1 (WHT2 I 8 )( I 2 (WHT2 I 4 )( I 2 WHT4 )) 2 2 3 4 5 6 7 8 9 10 11

12 13 LACSI 2006 Automatic Tuning of Libraries and Applications 14 15 Interleaved Access 4 WHT16 (WHT8 I 2 )( I 8 WHT2 ) 3 2 0 ((WHT4 I 2 )( I 4 WHT2 ) I 2 )( I 8 WHT2 ) 1 1 1 2 3 4 5 6 7 8 9 10 11

12 LACSI 2006 Automatic Tuning of Libraries and Applications 13 14 15 Cache Simulator 4 3 2 4 1 1 1 3 1 144 memory accesses C = 4, A = 1, B = 1 (80, 112) C = 4, A = 4, B = 1 (48, 48) C = 4, A = 1, B = 2 (72, 88) Iterative vs. Recursive (192 memory accesses) C = 4, A = 1, B = 1 (128, 112) LACSI 2006 Automatic Tuning of Libraries and Applications 2 Cache Misses as a Function of Cache Size C=22 C=24

C=23 C=25 LACSI 2006 Automatic Tuning of Libraries and Applications Formula for Cache Misses M(L,WN,R) = Number of misses for LWHTN R l r M(2 ,W N , 2 ) N if r N C/ 2 3N if W N is a leaf t i 1 2n niM(n1 ... ni 1 ,W Ni , r ni ... nt) LACSI 2006 Automatic Tuning of Libraries and Applications Closed Form M(L,WN,R) = Number of misses for LWHTN R M(0,W_n,0) = 3(n-c)*2n + k*2n C = 2c, k = number of parts in the rightmost c positions c = 3, n = 4

Iterative Balanced 4 1 1 4 4 1 k=3 Right Recursive 1 2 1 1 2 1 1 1 k=2 LACSI 2006 Automatic Tuning of Libraries and Applications k=1 3 1 2 1 1 Summary of Results and Future Work

Instruction Count Model min, max, expected value, variance, limiting distribution Cache Model Direct mapped (closed form solution, distribution, expected value, and variance) Combine models Extend cache formula to include A and B Use as heuristic to limit search and predict performance LACSI 2006 Automatic Tuning of Libraries and Applications Sponsors Work supported by DARPA (DSO), Applied & Computational Mathematics Program, OPAL, through grant managed by research grant DABT63-98-1-0004 administered by the Army Directorate of Contracting, DESA: Intelligent HW-SW Compilers for Signal Processing Applications, and NSF ITR/ NGS #0325687: Intelligent HW/SW Compilers for DSP. LACSI 2006 Automatic Tuning of Libraries and Applications