CS239-Lecture 7UDF Optimizations - CS | Computer Science

CS239-Lecture 7UDF Optimizations - CS | Computer Science

CS239-Lecture 7 UDF Optimizations Madan Musuvathi Visiting Professor, UCLA Principal Researcher, Microsoft Research Schedule update Approximation May 2 Guest Lecture Andy Konwinski of Databricks May 4 a) BlinkDB

: Queries with Bounded Errors and Bounded Response Times on Very Large Data b) Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters Graph Analytics May 9 a) Distributed GraphLab : A Framework for Machine Learning and Data Mining in the Cloud b) Scalability! But at what COST? May a) Latency-Tolerant Software Distributed Shared Memory Zhaoxing Bu

Qi Zhao Tushar Sudhakar Jee Zhouyihai Ding David Hong 11 b) Arabesque: A System for Distributed Graph Mining Curation & Transactions May a)

Wrangler: Interactive Visual Specication of Data Transformation Scripts Yushu Qin b) May a) Saswat Padhi Lun Liu Sakshi Goplani

16 Data Curation at Scale: The Data Tamer System 18 No compromises: distributed transactions with consistency, availability, and performa Nisarg Shah b) nceType-Aware Transactions for Faster Concurrent Code Large Scale Machine Learning + Project Presentations

May a) Liqiang Yu Large Scale Distributed Deep Networks 23 b) Scaling Distributed Machine Learning with the Parameter Server May Project Presentations 25 Hao Chen

Paper presentations read your paper well before the presentation (ideally, at least a week before) - consult with me on the difficult parts assume everyone has read the paper - your goal is to quickly summarize the paper - bring out the key points for discussion in the class - cull insights about the topic by reading related work plan on 15-20 minute talk with 25 minutes discussion

Pop Quiz The State of California recently decided to charge sales taxes on Amazon shipments to California residents. The following query calculates the additional taxes incurred by this change: purchases: a set of (user_id, purchased_product, price_before_taxes) addresses: a set of (user_id, address, zipcode) A = join(purchases, addresses, purchases.user_id == addresses.user_id) B = reduce(A, user_id, sum(price_before_taxes)) C = lter(B, if_in_california(zipcode)) Output C Name two optimizations that you can perform on this query. Justify why these optimizations will

produce a faster plan Opening the Black Boxes in Data Flow Optimization Yi Ding [email protected] Overview What is a Black Box? User Define Functions!

What does open mean? Static Code Analysis! No Knowledge of Algebraic Properties! Background Stratosphere: Nephele engine + PACT complier PACT Map-Reduce

Map, Reduce, Cross, Match, CoGroup Map, Reduce Arbitrary acyclic data flow graphs MapReduce Operator: Second-Order Function + First-Order User-Defined Function(UDF).

Background Example 1 Reorder! is better!

Wrong answer! We dont have any knowledge of , and ! Review Example Problem!!!!! 2, 3 1 2, 3 2 2, 3 3 5, 3

Wrong answer! Conditions Global record: Read set: all attributes that might influence the operators output. Write set: all attributes whose values change after the operator.

= { } , ={ } , ={ } , = , ={ , } , ={ } , 1 1 2 2 3

3 Conflict one attribute appears in both operators write set or in one operators read set and in anothers write set. Conflict!!! Condition ROC Condition Example 2

= = , = { , } , ={ , } ROC Holds! Condition KGP condition function , when applied to a set of records with the same value for , either emits or filters all these records.

Together with ROC & KGP, we can determine almost all UDFs whether they can be reordered! hh ( hh ( , ) , ) hh ( ( ) , ) ( ( ) )

( ) = ( ( ) ) ( ( ) ) ) =

( ( ( ) )

( hh ( , ) ) hh ( , hh ( , ) ) Implementation

Read set Write set , Enumeration Evaluation Relational Online Analysis Processing Text mining Non-relational Clickstream Processing New Model:

Cost Model: Static Code Analysis, Enumeration Network IO, Disk IO, CPU Cost Evaluation

Future Work Enumeration costs time To identify beneficial reorderings Wider range of optimizations Semantic information Spotting Code Optimization in Data Parallel Pipelines through PeriSCOPE MUHAMMAD ALI GULZAR Outline

Introduction Running Example Optimizations using PeriSCOPE Column Reduction Early Filtering Smart Cut Evaluation Data Shuffling is Expensive

Disconnect b/w Pipeline and Code Optimizers Pipeline Optimizers Logical optimizers work on high level pipeline topology Unable to consider procedural code (black box) while optimizing Compiler Optimizations Works only on user defined functions Unaware of the pipeline -> No inter-stage optimizations Manual Optimizations Tedious

Error-prone Spark, MapReduce etc , being unaware of the UDFs, apply only logical optimizations on the workflow Example If filter GetLength(query) > 4 was written in the procedural code PScoreReducer then this optimization could not be performed PeriSCOPE -- Overview Optimizes procedural code of SCOOPE programs

Connects data-flow of the program to the high-level pipeline Searches for code candidates susceptible to optimizations Three core optimizations: Column reduction Early filtering Smart cut Running Example

Column Reduction Problem Unused column in data parallel programs consume high network I/O Well established concept i.e. Early Projection and also applied on procedural code (UDFs) [Manimal '11] Solution by PeriSCOPE Analyze dependency between input and output columns of all operators Remove unused columns and associated computations Limitation in PeriSCOPE's column reduction

No column reduction can be applied if the column name is unresolved Column Reduction in Example Code Motion

Moving code from user defined functions across stages Goal is to reduce the number and size of inter stage rows. Program correctness has to be maintained --- Safe Code Motion Safety Rules Rule 1: Must not move statement if it generates shuffle-key column Rule 2: Must not move stateful statement across shuffle Rule 3: Must not filtering statement if it is, or is reachable from , a stateful statement Perform loop dependency analysis to confront dependencies in different iterations of loop

Early Filtering Filtering earlier reduces the I/O burden on shuffling Similar to Early Selection Related works [Manimal etc] perform the same optimizationbut can not rewrite UDFs How ?

First UDF after shuffling is analyzed for filtering statements Stateful filtering statements are removed Backward slice to collect the filter and the statements that can affect it Removal and addition of code slice into appropriate earlier location (Rewriting UDFs) Early Filtering -- Example Smart Cut Minimize cross stage data flow Similar to combiners but more fine grained instruction level How ?

Move row-local computations upstream to leverage parallelism Builds a data dependency graph -- Vertices are instructions, Edges are RAW Edge weight is name and size of the dependent variable Applies a minimum cut algorithm (one directional, same variables are counted once) Smart Cut -- Example Evaluation Optimization coverage evaluation on 28K jobs Nature and type of jobs are not clear

Case Study Performance improvements I/O reduction at shuffling phase Thoughts

Intelligent tool to perform optimizations from within UDFs. Not platform specific Can be generalized to other frameworks Early filtering and column reduction is already well established optimizations in Databases Column reduction can only be applied only to columnar data engines In the context of Spark, users are bound to use specific APIs designed for filters. Early filtering is not needed.

Recently Viewed Presentations

  • Evidence-Based Practice and Advocacy

    Evidence-Based Practice and Advocacy

    If so, actions are taken to put the change into practice, this is when a plan is developed PICO for practice question Patient, population, or problem Intervention Comparison with other treatments Outcomes Searching for Evidence PubMed (MEDLINE) CINAHL (The Cumulative...
  • More Large-Scale Machine Learning Perceptrons Support-Vector Machines Jeffrey

    More Large-Scale Machine Learning Perceptrons Support-Vector Machines Jeffrey

    Possibly w and do not exist, since there is no guarantee that the points are linearly separable. Example: Kernel Functions Can Linearize. Sometimes, we can transform points that are not linearly separable into a space where they are linearly separable.
  • Identifying Disadvantaged Children: Comparing Alternative Approaches Melissa Wong

    Identifying Disadvantaged Children: Comparing Alternative Approaches Melissa Wong

    Arial MS Pゴシック Calibri Times New Roman News Gothic MT Wingdings SPRC_visually_impaired_powerpoint Microsoft Office Excel Chart Slide 1 Slide 2 Slide 3 Slide 4 Slide 5 Slide 6 Slide 7 Slide 8 Slide 9 Slide 10 Slide 11 Slide 12...
  • IBM Services for ManagedSAP Applications

    IBM Services for ManagedSAP Applications

    IBM Managed Services for SAP Applications- service options. Standard. For clients that want IBMto manage SAP Basis and database, managed IaaS, OS and security runningon SAP-certified IaaSwith optional HA andDR for SAP applications
  • Charting the Course towards Permanency for Children in

    Charting the Course towards Permanency for Children in

    Collaborative Partnership of The Pennsylvania Child Welfare Training Program. The Pennsylvania Children and Youth Administrators. Department of Human Service's Office of Children, Youth and Families. The University of Pittsburgh, School of . Social Work. The Pennsylvania Child Welfare Training Program
  • The Incident Command System and National Incident Management

    The Incident Command System and National Incident Management

    Course Approval and Certification The California Governor's Office of Emergency Services (OES) has approved this course as a NIMS Integration Center "equivalent" course. Kai
  • gacollegecounseling.org

    gacollegecounseling.org

    It promotes beard and body hair growth. Male pattern baldness may develop. The clitoris increases slightly in size. Libido may be heightened. Muscle bulk increases. The voice deepens, but not usually to the pitch of other men. Periods will stop,...
  • Elements of Literature Sixth Course

    Elements of Literature Sixth Course

    The Restoration and the Eighteenth Century Introduction to the Literary Period Feature Menu Interactive Time Line Milestone: Cromwell and the Commonwealth