MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions
Johanna Sommer (IBM Germany), Matthias Boehm (Graz University of Technology),
Alexandre V. Evfimievski (IBM Research Almaden), Berthold Reinwald
(IBM Research Almaden), Peter J. Haas (UMass Amherst)
SIGMOD 2019, Amsterdam, Netherlands -- Research 16: Machine Learning

Motivation: Sparsity Estimation

Ubiquitous Sparse Matrices
- Sparse inputs: NLP, graph analytics, recommenders, scientific computing

- Data preprocessing: binning/recoding + one-hot encoding/feature hashing
- Sparse intermediates: selection predicates, dropout
- Transformation matrices: random reshuffling, sampling
- Automatic application of sparse operations

Problem of Sparsity Estimation
- Compilation: memory and cost estimation (local vs. distributed, rewrites,
  resource allocation, operator fusion)
- Runtime: format decisions and memory pre-allocation
- Example: an MM chain with n=20 matrices has 1.8B alternative plans, and
  sparsity-unaware plan choices can be 99x slower

A Case for Exploiting Structural Properties
- Use case NLP sentence encoding: encoding a sequence of words (padded to
  max sentence length) into a sequence of word embeddings (SentenceCNN in
  IBM Watson C&C)
- 1) Matrix multiplication, structural property:

  1 non-zero per row -> exact sparsity estimate possible
- 2) Matrix reshape
- Research question: principled but practical sparsity estimation?

Existing MM Sparsity Estimators
- Common assumptions: A1: no cancellation errors; A2: no Not-a-Number
  values (NaN * 0 = NaN); Boolean matrix product, m x n by n x l
- Average-case output sparsity: s_C = 1 - (1 - s_A * s_B)^n
- #1 Naive metadata estimators: derive the output sparsity solely from the
  sparsity of the inputs (e.g., SystemML):
  s_C = min(1, s_A * n) * min(1, s_B * n)

- #2 Naive bitset estimator: convert inputs to bitsets and perform Boolean
  matrix multiplication (examples: SciDB [SSDBM'11], NVIDIA cuSPARSE,
  Intel MKL)
- #3 Sampling: take a sample of aligned columns of A and rows of B;
  sparsity estimated via the max of count products (examples: MatFast
  [ICDE'17], improvements in the paper)

Existing MM Sparsity Estimators, cont.
- #4 Density map: store the sparsity per b x b block (default b = 256);
  MM-like estimator (average-case estimator for *, probabilistic
  propagation for +) (examples: SpMacho [EDBT'15], AT Matrix [ICDE'16])
- #5 Layered graph [J.Comb.Opt.'98]: nodes are the rows/columns in the mm
  chain, edges are the non-zeros connecting rows/columns; assign
  exponentially distributed r-vectors and propagate them via min

Design Goals
- #1 Exploitation of structural properties (e.g., 1 non-zero per row, row
  sparsity)
- #2 Practical sparsity estimator
- #3 Support for matrix expressions
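For intuition, two of the baseline estimators above (#1 naive metadata and #3 sampling) might be sketched in pure Python as follows. This is a hedged, illustrative reading of the slides, not the paper's reference implementation; matrices are plain lists of lists, and `f` is the sampling fraction.

```python
import random

def est_meta_ac(s_A, s_B, n):
    # average-case metadata estimate: P(c_ij != 0) = 1 - (1 - s_A*s_B)^n
    return 1.0 - (1.0 - s_A * s_B) ** n

def est_meta_wc(s_A, s_B, n):
    # worst-case metadata estimate: min(1, s_A*n) * min(1, s_B*n)
    return min(1.0, s_A * n) * min(1.0, s_B * n)

def est_sample(A, B, f=0.05, seed=7):
    # sample aligned column/row pairs (column k of A, row k of B) and
    # extrapolate output density from the max of the count products
    m, n, l = len(A), len(A[0]), len(B[0])
    rng = random.Random(seed)
    ks = rng.sample(range(n), max(1, int(f * n)))
    best = 0
    for k in ks:
        col_nnz = sum(1 for i in range(m) if A[i][k] != 0)
        row_nnz = sum(1 for j in range(l) if B[k][j] != 0)
        best = max(best, col_nnz * row_nnz)
    return best / (m * l)
```

Note how the metadata estimates use only input sparsities, which is exactly why they cannot see structure such as one non-zero per row.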

MNC (Matrix Non-zero Count) Sketch
- #1 Row/column NNZ counts -- count vectors of non-zeros per row/column:
  hr = rowSums(A != 0);  hc = colSums(A != 0)
- #2 Extended row/column NNZ counts:
  her = rowSums((A != 0) * (hc == 1));  hec = colSums((A != 0) * (hr == 1))
- #3 Summary statistics: max nnz per row/column, # of non-empty
  rows/columns, # of half-full rows/columns
- Construction in O(nnz(A)):
  1st pass: compute the basic row/column count vectors
  2nd pass: compute the extended count vectors and summary statistics

MNC Sparsity Estimation -- Basic Estimator
- DensityMap-like estimator over column/row slices
- #1 Exact sparsity estimation: under assumptions A1 and A2, exact
  estimation is possible, e.g., if max(hr_A) <= 1 or max(hc_B) <= 1
- Intuition: a dot product of counts, because there is no aggregation and
  no collisions
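A minimal pure-Python stand-in for the sketch construction above, following the slide's rowSums/colSums definitions (matrices as lists of lists; a real implementation computes this in O(nnz(A)) over a sparse representation):

```python
def mnc_sketch(A):
    m, n = len(A), len(A[0])
    # 1st pass: basic row/column count vectors
    hr = [sum(1 for j in range(n) if A[i][j] != 0) for i in range(m)]
    hc = [sum(1 for i in range(m) if A[i][j] != 0) for j in range(n)]
    # 2nd pass: extended counts -- non-zeros falling into columns (rows)
    # with exactly one non-zero -- plus summary statistics
    her = [sum(1 for j in range(n) if A[i][j] != 0 and hc[j] == 1)
           for i in range(m)]
    hec = [sum(1 for i in range(m) if A[i][j] != 0 and hr[i] == 1)
           for j in range(n)]
    summary = {
        "max_hr": max(hr), "max_hc": max(hc),
        "nonempty_rows": sum(v > 0 for v in hr),
        "nonempty_cols": sum(v > 0 for v in hc),
    }
    return hr, hc, her, hec, summary
```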

[Figure: worked example of the basic estimator -- the estimate is the dot
product of A's column counts hc_A and B's row counts hr_B, normalized by
the output size m*l:
s_C = (0*3 + 1*3 + 0*3 + 2*3 + 0*3 + 1*3 + 1*3 + 1*3 + 0*3 + 9*0) / 45
    = 18/45 = 0.4]
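The dot-product estimate of that worked example, as a small sketch (exact under A1/A2 when no two count products collide in the same output cell):

```python
def est_mnc_basic(hc_A, hr_B, m, l):
    # dot product of A's column counts and B's matching row counts,
    # normalized by the output size m*l
    dot = sum(c * r for c, r in zip(hc_A, hr_B))
    return min(1.0, dot / (m * l))

# reproducing the example's numbers (column counts of A, row counts of B)
s = est_mnc_basic([0, 1, 0, 2, 0, 1, 1, 1, 0, 9],
                  [3, 3, 3, 3, 3, 3, 3, 3, 3, 0], m=9, l=5)  # 18/45 = 0.4
```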

- #2 Lower and upper bounds -- lower bound: clipping of bad estimates;
  upper bound: improved estimates; intuition: counts of 0 and counts of
  dim/2 exactly describe part of the target area
- #3 Exploiting extended NNZ counts -- intuition: compose the estimate
  from exactly known and estimated quantities (i.e., numbers of non-zeros)
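To see why count vectors also yield bounds, here is a hedged illustration using the generic union-of-supports argument (these are not necessarily the paper's exact bound formulas). Under A1/A2, the non-zero pattern of output row i is the union of the patterns of rows k of B with A[i,k] != 0, so its nnz is bounded by the largest single contribution below and by the (capped) sum of contributions above:

```python
def row_nnz_bounds(A_row, hr_B, l):
    # ks: columns k where row i of A is non-zero; hr_B: row counts of B;
    # l: number of output columns
    ks = [k for k, v in enumerate(A_row) if v != 0]
    if not ks:
        return 0, 0
    lo = max(hr_B[k] for k in ks)          # at least the biggest contributor
    hi = min(l, sum(hr_B[k] for k in ks))  # at most all contributions, capped
    return lo, hi
```

Clipping an estimate into such an interval is what turns a bad point estimate into a safe one.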

MNC Sketch Propagation
- Basic sketch propagation: estimate the output sparsity, then propagate
  the row counts of A and the column counts of B scaled to that sparsity
  (assumption: structure preserving)
- Probabilistic rounding: for ultra-sparse matrices, plain rounding of the
  scaled counts would introduce bias, so round up with a probability equal
  to the fractional part
- Exact sketch propagation: if A or B is fully diagonal, propagate the
  sketch of B or A, respectively

MNC Additional Operations
- Criticism: long chains of pure matrix products are rare in ML workloads
- #1 Reorganization operations -- operations that change the position of
  non-zero values; examples: transpose, reshape, diag, cbind and rbind
- #2 Element-wise operations -- cell-wise operations with and without
  broadcasting
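Two of these mechanisms can be sketched compactly; this is a hedged, illustrative rendering (the paper's propagation rules are more refined): (1) probabilistic rounding when scaling a count vector to a target sparsity, and (2) element-wise sparsity propagation under a uniform independence assumption with no cancellation (A1).

```python
import random

def scale_counts(h, scale, seed=7):
    # round up with probability equal to the fractional part, so the
    # expected total nnz is unbiased even for ultra-sparse matrices
    rng = random.Random(seed)
    out = []
    for v in h:
        x = v * scale
        base = int(x)
        out.append(base + (1 if rng.random() < x - base else 0))
    return out

def est_ewise(s_A, s_B=None, op="*"):
    if op == "*":    # non-zero only where both inputs are
        return s_A * s_B
    if op == "+":    # union of the two non-zero patterns
        return s_A + s_B - s_A * s_B
    if op == "!=0":  # keeps the non-zero pattern
        return s_A
    if op == "==0":  # flips the pattern (densifying!)
        return 1.0 - s_A
```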

  Examples: A == 0, A != 0, A + B, A * B
- Details of sparsity estimation and sketch propagation in the paper

SparsEst: Sparsity Estimation Benchmark

B1 Structured Matrix Products (Struct)
  Name        Expression            Data
  B1.1 NLP    X %*% W               Syn1
  B1.2 Scale  diag(...) %*% X       Syn2
  B1.3 Perm   table(s1,s2) %*% X    Syn2
  B1.4 Outer  C %*% R               Syn3/Syn4
  B1.5 Inner  R %*% C               Syn4/Syn3

B2 Real Matrix Operations (Real)
  Name          Expression   Data
  B2.1 NLP      X %*% W      AMin A
  B2.2 Project  X %*% P      Cov
  B2.3 CoRefG   G %*% t(G)   AMin R
  B2.4 EmailG   G %*% G      Email
  B2.5 Mask     M * X        Mnist1m

B3 Real Matrix Expressions (Chain)
  Name              Expression                                      Data
  B3.1 NLP          reshape(X %*% W)                                AMin A
  B3.2 Scale&Shift  t(S) %*% t(X) %*% diag(w) %*% X %*% S %*% B     Mnist1m
  B3.3 Graph        P %*% G %*% G %*% G %*% G                       AMin R
  B3.4 Recomm.      (P %*% X != 0) * (P %*% L %*% t(R))             Amazon
  B3.5 Predicate    X * (R %*% S + T) != 0                          Mnist1m

Dataset sizes (rows x cols, sparsity): 25.1M x 2.5M (3.9e-7);
1M x 784 (0.25, Mnist1m); 3.1M x 3.1M (2.6e-6); 8M x 2.3M (1.2e-6)

Experimental Setup
HW and SW Setting
- 2+10 node cluster w/ 2-socket Xeon E5-2620 (24 vcores, 128 GB memory)
- CentOS 7.2, OpenJDK 1.8.0_151, Apache Hadoop 2.7.3 with 80 GB heap
Baselines (available at https://github.com/apache/systemml)
- FP64 dense / sparse matrix multiplication (ground truth)
- Metadata estimators: MetaWC, MetaAC
- Bitset estimator: Bitset

- Sampling-based estimator: Sample (f = 5%)
- Density map estimator: DMap (b = 256, FP64)
- Layered graph estimator: LGraph (r = 32)
- MNC sparsity estimators: MNC basic, MNC
- Single-threaded estimators (multi-threaded experiments in the paper)

Baselines for Word Embeddings
- Primitives in TensorFlow and PyTorch only return dense tensors

[Figure: construction and estimation performance -- construction on
[20K,20K] x [20K,20K] inputs is O(nnz(A)); estimation on [10K,x] x
[x,10K]; Bitset/DMap construction is densifying; MNC is close to Sampling]

- Construct once, estimate many times

Accuracy on the SparsEst Benchmark
[Figure: estimation accuracy per benchmark -- B1 Struct: B1.1 NLP, B1.3
Perm, B1.4 Outer exact, B1.5 Inner exact upper bound; B2 Real: B2.1 NLP,
B2.2 Project exact, B2.3 CoRefG, B2.4 EmailG; B3 Chain: B3.1 NLP exact,
B3.4 Rec, B3.5 Pred]
- But: also negative results, e.g., for B3.3 with vanishing structure
  (in the paper)

Conclusions
Summary
- Analysis of existing sparsity estimators
- MNC Sketch: a simple, count-based sketch for matrix expressions
- SparsEst: a new sparsity estimation benchmark (struct, real, chains)
Conclusions
- Exploitation of structural properties: good accuracy at low overhead
- Versatile MNC sketch: broadly applicable in ML systems
- Lots of future work: bounds, operations, optimizer integration,
  distributed
- Available open source in SystemML/SystemDS:
  https://github.com/apache/systemml
  https://github.com/tugraz-isds/systemds

Backup: Accuracy B3.2 -- All Intermediates
Background: Deferred Standardization
- Mean subtraction is densifying

- Substitute the standardized X with X %*% S and optimize the resulting
  mm chain t(X) %*% diag(w) %*% X %*% B
- S is a shift matrix built from the negated column means; the output is
  dense if the column means are non-zero

Accuracy B3.2: t(S) %*% t(X) %*% diag(w) %*% X %*% S %*% B on Mnist1m
[Figure: estimation accuracy for all intermediates -- MNC Sketch vs.
Density Map, MMChain optimizer memo table]
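A hedged toy illustration of the deferred-standardization rewrite (pure Python stand-in for the `%*%` notation, with a hypothetical 2 x 2 input): rather than materializing the densifying X - colMeans(X), append a ones column to X and multiply by a shift matrix S = rbind(I, -colMeans(X)); the product equals the mean-subtracted X, and S can then be optimized as part of the mm chain.

```python
def matmul(A, B):
    # plain triple-loop matrix product over lists of lists
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

X = [[1.0, 2.0], [3.0, 6.0]]
mu = [sum(col) / len(X) for col in zip(*X)]     # column means [2.0, 4.0]
Xp = [row + [1.0] for row in X]                 # cbind(X, 1)
S = [[1.0, 0.0], [0.0, 1.0], [-mu[0], -mu[1]]]  # rbind(I, -mu)
shifted = matmul(Xp, S)                         # equals X - colMeans(X)
```

Because S stays tiny and sparse while X - colMeans(X) would be dense, the chain can keep operating on the sparse X.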