Star Schema Primer - PASS

Star Schema Primer - PASS

Kimball Method Dimensional Modeling Clayton Groom Covenant Technology Partners Intro Clayton Founding Groom Partner of CTP [email protected] Twitter: cgroom BI professional for 15 years MCP BI, MS Visual Technology Specialist (VTS) Agenda Technical architecture Fundamentals Dimensional modeling rules

Conformed dimensions ETL considerations Questions Credits & Thanks Jim Ronan for getting me started Ralph Kimball for making the way clear Joy Mundy - from whom Ive shamelessly copied slides Operational vs. Analytic systems Operational System Analytic system Measurement of a Execution of a business process business process Purpose Primary Interaction Style Insert, Update, Query, Delete Scope of Interaction Individual transactions

Query Patterns Temporal Focus Design Optimization Design Principle Also Known as Query Aggregated transactions Unpredictable and Predictable and stable changing Current Current and historic High-performance Update Concurrency query Entity-Relationship (ER) Design Dimensional design in third normal form (3NF) (Star Schema or Cube) Transaction system On Line Transaction Processing Data Warehouse System (OLTP)

System Source System Data Mart Source: Star Schema, The complete reference By: Christopher Adamson The DW/BI Technical Architecture Metadata Standard Standard Reports Reports OLAP RDBMS Analytic Analytic Apps Apps (incl. (incl. Data

Data Mining) Mining) Business Business Users Users Ad Ad Hoc Hoc Queries Queries Presentatio Presentatio n n Servers Servers BI BI Portal Portal

Dimensionalization Data Quality Business/Extract Rules Source Systems ETL ETL Operational Operational BI BI Integration Integration Services Services RDBMS RDBMS

Analysi Analysi s s Service Service s s Reporting Reporting Services Services Data Data Mining Mining Visual Visual Studio Studio

2007 2007 Microsoft Microsoft Office Office system system SharePoint SharePoint Performance Performance Pnt Pnt Report Report Builder Builder

Kimball Method Fundamentals Focus on the business Build a dimensional data warehouse / business intelligence system Dimension tables such as Customer contain descriptive information Fact tables such as Sales contain detailed transactions, and link to the dimensions Dimension attribute changes are managed and conformed across the enterprise Excellent user experience is paramount Kimball Business Dimensional Lifecycle

Technical track Data track Project Planning BI Application track Business Requirements Definition Technical Architecture Design Product Selection & installation Dimensional Modelling Physical Design BI Application

Specification Growth ETL Design & Development BI Application Development Project Management Deployment Maintenance Relational Dimensional Model Product Date Date Key Date Month Name Year Month Calendar Qtr

other attributes Other dims Promotion Key PromoName Sales Rep Key PromoContact Sales Rep Name Store Key PromoCity Sales Rep Contact Store Name Sales Rep City Store Contact Store City Sales Fact Product Key Customer Key Date Key other keys Sales Amount Sales Quantity other measures

Product Key Product SKU Product Name Product Brand Product Family other attributes Customer Customer Key Account Number Customer Name Customer Zip Customer City other attributes Its not Rocket Science But there is a science to it The goal is to provide information to end users in a way that they dont have to be rocket scientists to use it.

Kimballs Ten Essential rules of Dimensional Modeling Rule #1: Load detailed atomic data into dimensional Structures Avoid summarizing data at all costs Users will want the details You can always roll up, but cant drill down if the detail is not there Kimballs Ten Essential rules of Dimensional Modeling Rule #2: Structure Dimensional models around business processes

Business processes capture metrics associated with measured business events Metrics translate into facts, stored in a processspecific atomic fact table Accounts Payable Accounts Receivable General ledger Orders Shipments Kimballs Ten Essential rules of Dimensional Modeling Rule #3: Every Fact table has a date dimension table associated with it Measurement events will always have a date associated with them

Day grain at a minimum Separate Time of Day dimension for time Multiple date/time dimension aliases on any given fact table Order Date Promise Date Ship date Kimballs Ten Essential rules of Dimensional Modeling Rule #4: Ensure that all facts in a single fact table are at the same grain or level of detail Possible grain types: Transactional (GL, Orders, etc.) Periodic Snapshot (Bank Balance, Inventory levels) Accumulating Snapshot (To date statuses, Fact table updated)

Kimballs Ten Essential rules of Dimensional Modeling Rule #5: Resolve many-to-many relationships in fact tables Use M:M Bridge tables when appropriate Multiple diagnoses associated with a single event Kimballs Ten Essential rules of Dimensional Modeling Rule #6: Resolve many-to-one relationships in dimension tables Resist the urge to snowflake Know what exceptions are allowable

Large product catalog: Size + Color + Style = SKU Kimballs Ten Essential rules of Dimensional Modeling Rule #7: Store report labels and filter domain values in dimension tables Keep the end user in mind Proper case attributes so they dont SHOUT Combines code/description attributes Division Code = 001 Division Name = Springfield Division Label = Springfield (001) or (001) Springfield Kimballs Ten Essential rules of Dimensional Modeling

Rule #8: Make sure dimension tables use a surrogate key Dont forget to have alternate keys on business key column(s)! Required to support slowly changing dimensions (SCDs) You cannot rely on source systems for business keys. Kimballs Ten Essential rules of Dimensional Modeling Rule #9: Create conformed dimension to integrate data across the enterprise Common reference dimensions are key to:

Integrating data across multiple source systems Required for cross-drill from one fact table to another Enterprise DW Bus Matrix the blue print for your EDW Reuse enables rapid development Require a commitment to data stewardship DW Bus Matrix Sample SCD2 Web Ad Affi liate Rank Affi liate Assignment

Campaign Cap Goal Campus Cap Goal Prospect x x x x x x x x x x x Exemption Billing Type SCD1 SCD1

InquryEvent Dimension Calendar Time of Day Accounting Customer Affiliate Agency CA Calling Campaign CA List Campaign CC Agent Client Curriculum Event Type Inquiry Geography Measure Groups Inqury DW Bus matrix x

x x x x x x x x x x x x x

x x x x x x x x x x x x Kimballs Ten Essential rules of Dimensional Modeling Rule #10: Continuously balance the requirements and realities to deliver a DW/BI solution thats

accepted by the business users and supports their decision making. Dont forget your audience Deliver incrementally Surrogate Keys Dimension PKs should be surrogate (meaningless) keys Managed by the DW Usually an integer type Usually populated via IDENTITY keyword in dimension table definition Why?

Small (int) keys are vital for performance The source system WILL screw you if you dont manage the keys yourself Enables dimension attribute change tracking Surrogate Keys and ETL Dimensions Carry source system key(s) as non-key attributes in the dimension New rows automatically get a new surrogate key Facts

Fact table usually does not contain source system keys Final step of fact processing is to exchange the source system keys for DW surrogate keys Lookup to dimension tables based on source key, returning surrogate key Conformed Dimensions There is one master dimension table that all fact tables subscribe to Get agreement organization-wide on: What the dimensions are called Which hierarchies you have Similar-but-different attributes and hierarchies have different names Which attributes are managed by restating history and which by tracking history Create two sets of attributes if you need it both ways

Why? Single version of the truth Flexibility of basic design Dimensional Modeling Myths and Misconceptions Dimensional means summary Dimensional models are built to support specific applications (or departments) The dimensional model is less flexible than a third normal form model in DW/BI systems The dimensional approach is not Enterprise oriented e s l a F 26

Gather Business Requirements Gather detailed requirements from the business users This means you have to talk to them Document requirements, or youll have to do it again Talk to IT (data experts) too Multi-step process [Sometimes] Overview of the landscape. Where should we begin? Overview of data realities Detailed requirements and data realities

Profile the Data Early and often Does the data exist to support the required analysis? Where are the problems affecting ETL design Primary keys Referential integrity NULL values Junk values The dreaded Notes field Hijacked fields (My Favorite!) Design the Dimensional Model

Dimensions are the key UI element to your organizations information Dimension grain Dimension hierarchies Dimension attribute changes What does a row in the dimension table represent? Real vs. navigational Type 1: Restate history Type 2: Track history Changing Attributes Each attribute (column) in the dimension table is subject to change What do the business users want? Sometimes they want it both ways

Type 1 (Update) Easiest for base ETL, simply update column in place Potentially problematic for aggregate management Type 2 (Track history) Add a new row for each new set of attributes Facts tie in to the set of attributes in effect at trxn time Track row effective date range, row is current, potentially row change reason Data Extraction Best Practices Minimize impact on the source system Save a copy of the untransformed data Often makes sense to separate E and TL into

separate packages Keep source queries simple Often, stage to file or table rather than ETL in stream Avoid transforming data upon extract For archival purposes (Internal Audit will love you, insofar as they are capable of emotion) Design packages with a restart point here Extracting Data Relational Usually sources pull from sources (use a query) Keep the source query simple Implement source query in the SSIS Data Flow task, Data Reader Source Non-relational Usually sources

pushed from source system Flat files common Third-party connectors can allow pull from within SSIS Compute Data Validity Measures Compute and check measures of data validity as early in the process as possible Rowcounts Reasonableness counts Income=Expense Sold Store more than N products to X customers data validity measures in processing metadata Evaluate data validity before moving forward

Dimension Hierarchy Management Get religious about hierarchies A true hierarchy has referential integrity between levels Anything else is simply a reporting relationship Do you need (or want) a key for hierarchy levels? The payoff comes to users, especially in SSAS The ETL system is not the right place to manage true hierarchies People need to be involved

Do it before ETL This is a job for Master Data Management or correctly designed source systems Dimension Loading during Fact Processing Normal case Process dimensions first, then facts Fact table transformation ends with surrogate key pipeline Look up each source system key for its DW surrogate key Join to populated dimension tables on source system key Return surrogate key Fairly

common exception case Load dimension row as observed in fact stream Early arriving facts Many-to-many dimensions Data Loading Best Practices SQL Destination vs. OLEDB Always use Error Flow on load Add a Data Viewer for the first time you try to load data Fast load restrictions limit incremental fast loads to partitioned tables in most scenarios Updates and deletes: Good design usually forbids actual deletes, instead set flag Ledger transaction-grain facts, even if your source

system does not OLEDB Command transform MERGE SQL statement Set-based UPDATE SQL statement Overall Design Considerations ETL application must understand Analysis Services requirements The only way to process a Facts update or delete in SSAS is to fully process the partition containing that fact ETL needs to know which SSAS partitions to process ETL developers arent allowed to compromise business user requirements without a cost/benefit analysis and signoff

Conclusion ETL system is harder to design and develop than most people think SQL Server Integration Services contains functionality you need to build an enterprise-class ETL system Must add some scripting Its not hard! Some problems are intractably difficult, and no tool will make them magically go away Dont skimp on ETL! Great ETL makes Analysis Services & Reporting easy. Additional Resources Books

The Microsoft Data Warehouse Toolkit, J. Mundy and W. Thornthwaite, Wiley (2006) The Data Warehouse ETL Toolkit, R. Kimball and J. Caserta, Wiley (2005) The Data Warehouse Toolkit, 2nd Edition, R. Kimball and M. Ross, Wiley (2002) The Data Warehouse Toolkit, 2nd Edition, R. Kimball and M. Ross, Wiley (2002) Even More Resources Websites All sample packages are at www.MsftDWToolkit.com See fourteen years of free articles at www.KimballGroup.com

Classes Microsoft Data Warehouse in Depth, a 4-day class in the SQL Server product set at http://www.kimballgroup.com/html/kucourseMDWD.ht ml Questions?

Recently Viewed Presentations

  • INTRODUCTION

    INTRODUCTION

    MMU & Referral Transport. HR to be included under respective heads i.eFMR B.11 & FMR B.12. 17. Innovation. All HR under innovations to be included under the head for General HR in RCH Flexible Pool FMR A.8. 18. New Initiatives....
  • Agenda Kepala Perwakilan (Januari 2018)

    Agenda Kepala Perwakilan (Januari 2018)

    Fun Bike BPK PerwakilanProvinsi Aceh dalamRangkaMemperingati HUT BPK RI Ke-72 dan HUT BACC Ke-3. Jumat / 11 Januari 2019. 07.00 - 11.00 WIB. Kantor BPK PerwakilanProvinsi Aceh. 4. PengambilanSumpah CPNS menjadi PNS Golongan III Tahun 2018 BPK Perwakilan Prov. Aceh.
  • Information and Telecommunication Technologies in the CR

    Information and Telecommunication Technologies in the CR

    Seneca Občan Rozpočet Legislativa Organizace Úředník Technologie Efektivní veřejná správa a přátelské veřejné služby Vzájemně vyváženo www.smartadministration.cz eGovernment - atraktivní obchodní příležitosti Symbol českého eGovernmentu Miloslav Marčan [email protected] www.mpo.cz * ã Ministry of Industry and Trade of ...
  • Enzymes - Weber State University

    Enzymes - Weber State University

    Biochemistry 3070 - Enzymes * Enzymes: Rate Enhancement Biochemistry 3070 - Enzymes * Enzymes - Turnover Number How fast can an enzyme produce products? The "turnover number" is used to rate the effeciency of an enzyme. This number tells how...
  • Rencontres européennes: L'accompagnement de la jeune création

    Rencontres européennes: L'accompagnement de la jeune création

    An online guide in progress by Marie Le Sourd and Mary Helen Young ... ECAS and PEARLE*) A set of information, web-references for the funding of culture related projects in Europe and beyond! The story behind the online guide (to...
  • Date: ___________ Time: _________ Name: ___________________ Location: ________________________

    Date: ___________ Time: _________ Name: ___________________ Location: ________________________

    Author: Sigler, W Adam Created Date: 08/29/2014 19:50:20 Title: PowerPoint Presentation Last modified by: Kleehammer, Katie Company: Montana State University
  • William Stallings, Cryptography and Network Security 5/e

    William Stallings, Cryptography and Network Security 5/e

    Cryptography and Network SecurityChapter 5. Fifth Edition. by William Stallings Lecture slides by Lawrie Brown. Lecture slides by Lawrie Brown for "Cryptography and Network Security", 5/e, by William Stallings, Chapter Chapter 5 -"Advanced Encryption Standard".
  • Prenatal Diagnosis and Genetic Counseling

    Prenatal Diagnosis and Genetic Counseling

    Bioethical Challenges for the Rehabilitation Counselor Bioethical Challenges - Current Advance Directives Physician Assisted Suicide Reproductive Technologies Prenatal Diagnosis and Selective Abortion Prenatal Diagnosis and Genetic Counseling Ethical and Human Rights Concerns Prenatal Testing Used to detect genetic disorders during...