SA475: Trends in Technology - Furman


Big Data and Data Analytics

CONTENTS
- Module 1: Big Data
- Module 2: Business Intelligence/Analytics
- Module 3: Visualization
- Module 4: Data Mining

MODULE 1

What is Big Data?
- Massive sets of unstructured/semi-structured data from Web traffic, social media, sensors, etc.
- Petabytes, exabytes of data
- Volumes too great for typical DBMS
- Information from multiple internal and external sources:
  - Transactions
  - Social media
  - Enterprise content
  - Sensors
  - Mobile devices

In the last minute there were:
- 204 million emails sent
- 61,000 hours of music listened to on Pandora
- 20 million photo views
- 100,000 tweets
- 6 million views and 277,000 Facebook logins
- 2+ million Google searches
- 3 million uploads on Flickr

What is Big Data? (continued)
Companies leverage data to adapt products and services to:

- Meet customer needs
- Optimize operations
- Optimize infrastructure
- Find new sources of revenue

Can reveal more patterns and anomalies.
IBM estimates that by 2015, 4.4 million jobs will be created globally to support big data; 1.9 million of these jobs will be in the United States.

Where does Big Data come from?
[Figure: sources of big data. Recoverable labels: Email, Enterprise, Dark Data, Partner, Employee, Customer, Supplier, Monitoring, Contracts, Public, Credit, Commercial, Sensor, Weather, Population, Economic, Transactions, Industry, Social Media, Sentiment, Network]

Types of Data
When gathering data, we collect data from individual cases on particular variables. A variable is a unit of data collection whose value can vary. Variables can be classified into types according to the level of mathematical scaling that can be carried out on the data. There are four types of data, or levels of measurement:
1. Categorical (Nominal)
2. Ordinal
3. Interval
4. Ratio

Categorical (Nominal) data
Nominal or categorical data comprises categories that cannot be rank ordered; each category is just different.

The categories cannot be placed in any order, and no judgement can be made about the relative size or distance from one category to another. Categories bear no quantitative relationship to one another.
Examples:
- customer's location (America, Europe, Asia)
- employee classification (manager, supervisor, associate)

What does this mean? No mathematical operations can be performed on the categories relative to each other. Therefore, nominal data reflect qualitative differences rather than quantitative ones.

Nominal data examples:
- What is your gender? (please tick) Male / Female
- Did you enjoy the film? (please tick) Yes / No

Systems for measuring nominal data must ensure that each category is mutually exclusive, and the system of measurement needs to be exhaustive.
- Exhaustive: the system of categories should have enough categories for all the observations.
- Variables that have only two responses, i.e. Yes or No, are known as dichotomies.
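
The two requirements above (mutually exclusive, exhaustive) can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `check_nominal_coding`, the region lists, and the sample responses are all hypothetical and chosen only to illustrate the idea.

```python
def check_nominal_coding(observations, categories):
    """Find observations that break a nominal category system.

    `categories` maps each category label to the set of raw values it covers.
    Returns (unmatched, ambiguous):
      unmatched - values no category covers (system is not exhaustive)
      ambiguous - values covered by several categories (not mutually exclusive)
    """
    unmatched = []
    ambiguous = []
    for value in observations:
        hits = [label for label, members in categories.items() if value in members]
        if not hits:
            unmatched.append(value)
        elif len(hits) > 1:
            ambiguous.append(value)
    return unmatched, ambiguous


# Hypothetical coding scheme for the "customer's location" example.
regions = {
    "America": {"US", "Canada", "Brazil"},
    "Europe": {"France", "Germany", "Turkey"},
    "Asia": {"Japan", "India", "Turkey"},  # Turkey appears twice on purpose
}
responses = ["US", "Japan", "Turkey", "Kenya"]

unmatched, ambiguous = check_nominal_coding(responses, regions)
print("Not covered by any category:", unmatched)
print("Covered by more than one category:", ambiguous)
```

An empty `unmatched` list suggests the system is exhaustive for the data seen so far; an empty `ambiguous` list suggests the categories are mutually exclusive.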

Ordinal data
Ordinal data comprises categories that can be rank ordered. As with nominal data, the distance between categories cannot be calculated, but the categories can be ranked above or below each other.
- No fixed units of measurement
Example: How satisfied are you with the level of service you have received? (please tick)
Very satisfied / Somewhat satisfied / Neutral / Somewhat dissatisfied / Very dissatisfied
Examples:
- college football rankings
- survey responses (poor, average, good, very good, excellent)

Interval and ratio data
Both interval and ratio data are examples of scale data. Scale data:
- is in numeric format ($50, $100, $150)
- can be measured on a continuous scale
- has distances between values that can be observed and, as a result, measured
- can be placed in rank order

Interval data
- Ordinal data but with constant differences between observations
- Ratios are not meaningful
Examples:
- Time moves along a continuous measure of seconds, minutes and so on, and is without a zero point of time.
- Temperature (in degrees Fahrenheit or Celsius) moves along a continuous measure of degrees and is without a true zero.
- SAT scores

Ratio data
- Ratio data is measured on a continuous scale and does have a natural zero point.
- Ratios are meaningful

Examples:
- monthly sales
- delivery times
- weight
- height
- age

Data for Business Analytics (continued)
Classifying Data Elements in a Purchasing Database
[Figure 1.2: columns of a purchasing database, each labeled with its data type (Categorical, Ratio or Interval)]
If there were a field (column) for Supplier Rating (Excellent, Good, Acceptable, Bad), that data would be classified as Ordinal.

Big Data Characteristics (Gartner, Feb 2001)
- VOLUME: growing quantity of data, e.g. social media, behavioral, video
- VELOCITY: quickening speed of data, e.g. smart meters, process monitoring
- VARIETY: increase in types of data, e.g. app data, unstructured data

Which Big Data characteristic is the biggest issue for your organization?

- Velocity of data: 16.35%
- Variety of data: 48.43%
- Volume of data: 35.22%
Source: Getting Value from Big Data, Gartner Webinar, May 2012

Volume
- Petabytes, exabytes of data
- Volumes too great for typical DBMS

Volume - Bytes Defined
- eBay data warehouse (2010) = 10 PB
- eBay will increase this 2.5 times by 2011
- Teradata > 10 PB
- Megabyte: 2^20 bytes or, loosely, one million bytes
- Gigabyte: 2^30 bytes or, loosely, one billion bytes

Velocity
- Massive amount of streaming data

Variety
- Massive sets of unstructured/semi-structured data from Web traffic, social media, sensors, and so on

Which source of data represents the most immediate opportunity?

Source: Getting Value from Big Data, Gartner Webinar, May 2012

Big Data Opportunities
- Making better informed decisions, e.g. strategies, recommendations
- Discovering hidden insights, e.g. anomalies, forensics, patterns, trends
- Automating business processes, e.g. complex events, translation

Which is the biggest opportunity for Big Data in your organization?

Through 2017:
- 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage.
- Business analytics needs will drive 70% of investments in the expansion and modernization of information infrastructure.
Source: Getting Value from Big Data, Gartner Webinar, May 2012

Identifying Insurance Fraud: Auto Insurance
Opportunity
- Save and make money by reducing fraudulent auto insurance claims
Data & Analytics
- Predictive analytics against years of historical claims and coverage data
- Text mining adjuster reports for hidden clues, e.g. missing facts, inconsistencies, changed stories
Results
- Improved success rate in pursuing fraudulent claims from 50% to 88%; reduced fraudulent claim investigation time by 95%
- Marketing to individuals with low propensity for fraud

What **dark data is just lying around that can transform business processes?
**Dark data: operational data that is not being used. Consulting and market research company Gartner Inc. describes dark data as "information assets that organizations collect, process and store in the course of their regular business activity, but generally fail to use for other purposes."

Quality Improvement
Opportunity
- Move from manual to automated inspection of burger bun production to ensure and improve quality
Data & Analytics
- Photo-analyze over 1000 buns per minute for color, shape and seed distribution
- Continually adjust ovens and process automatically
Result
- Eliminate 1000s of pounds of wasted product per year; speed production; save energy; reduce manual labor costs
Is the company using all of its senses to observe, measure and optimize business processes?

Improving Corporate Image
Opportunity
- Improve reputation, brand and buzz by tapping social media
Data & Analytics
- Continually scanning the twitterverse for mentions of their business
- Integrating tweeters with their robust customer management system
Results
- Saw a tweet from a top customer lamenting a late flight: no time to dine at Morton's
- A tuxedo-clad waiter was waiting for him when he landed, with a bag containing his favorite steak, prepared the way he normally likes it, with all the fixin's
How can the company listen, analyze and respond in real-time?

Big Data, Big Rewards
Interactive Session: Technology
- Read the case study "Big Data, Big Rewards"
- Read the Interactive Session and discuss the following questions

MODULE 2
Business Analytics

Business Analytics/Business Intelligence
Business Analytics/Business intelligence (BI) is a broad category of applications, technologies, and processes for gathering, storing, accessing, and analyzing data to help business users make better decisions.

Things Are Getting More Complex
- Many companies are performing new kinds of analytics (**sentiment analysis, etc.) to better and more quickly understand and respond to what customers are saying about them and their products.
- The cloud and appliances are being used as data stores
- Advanced analytics are growing in popularity and importance
**Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.

Uncertainty of Data

Analytics Models
[Figure: analytics models plotted by VALUE against DIFFICULTY, moving from information to optimization; curve labels: hindsight, insight, foresight]
- Descriptive Analytics: What happened?
- Diagnostic Analytics: Why did it happen?
- Predictive Analytics: What will happen?
- Prescriptive Analytics: How can we make it happen?

Descriptive Analytics
- What has occurred?
- Descriptive analytics, such as reporting/OLAP, dashboards, and data visualization, have been widely used for some time.
- They are the core of traditional BI.
- Descriptive analytics, such as data visualization, is important in helping users interpret the output from predictive and prescriptive analytics.

Predictive Analytics
- What will occur?
- Algorithms for predictive analytics, such as regression analysis, machine learning, and neural networks, have also been around for some time.
- Marketing is the target for many predictive analytics applications.

Prescriptive Analytics
- What should occur?
- Prescriptive analytics are often referred to as advanced analytics.
- Often used for the allocation of scarce resources
- Optimization
- Prescriptive analytics can benefit healthcare strategic planning by combining operational and usage data with data on external factors, such as economic data, population demographic trends and population health trends. This supports more accurate planning of future capital investments, such as new facilities and equipment utilization, and a better understanding of the trade-offs between adding beds and expanding an existing facility versus building a new one.

Organizational Transformation
Analytics are a competitive requirement
- For BI-based organizations, the use of BI/analytics is a requirement for successfully competing in the marketplace.
- A TDWI report on Big Data Analytics found that 85% of respondents indicated that their firms would be using advanced analytics within three years.

- An IBM/MIT Sloan Management Review research study found that top-performing companies in their industry are much more likely to use analytics rather than intuition across the widest range of possible decisions.

Complex Systems Require Analytics
- Tackle complex problems and provide individualized solutions
- Products and services are organized around the needs of individual customers
- Dollar value of interactions with each customer is high
- There is a high level of interaction with each customer
- Examples: IBM, World Bank, Halliburton

Volume Operations Require Analytics
- Serve high-volume markets through standardized products and services
- Each customer interaction has a low dollar value
- Customer interactions are generally conducted through technology rather than person-to-person
- Are likely to be analytics-based
- Examples: Amazon.com, eBay, Hertz

The Nature of the Industry

Online retailers like Amazon.com and Overstock.com are high-volume operations that rely on analytics to compete. When you enter their sites, a cookie is placed on your PC and all clicks are recorded. Based on your clicks and any search terms, recommendation engines decide what products to display. After you purchase an item, they have additional information that is used in marketing campaigns. Customer segmentation analysis is used in deciding what promotions to send you. How profitable you are influences how the customer care center treats you. A pricing team helps set prices and decides what prices are needed to clear out merchandise. Forecasting models are used to decide how many items to order for inventory. Dashboards monitor all aspects of organizational performance.

Knowledge Requirements for Advanced Analytics
Business Domain, Data, Modeling
- Choosing the right data to include in models is important.
- It is important to have some thoughts as to what variables might be related.
- Domain knowledge is necessary to understand how they can be used.
- The role of the Business Analyst is crucial.

Consider the story of the relationship between beer and diapers in the market basket of young males in convenience stores. You still have to decide (or experiment to discover) whether it is better to put them together or spread them across the store (in the hope that other things will be bought while walking the aisles).

The findings were that men between 30 and 40 years of age, shopping between 5pm and 7pm on Fridays, who purchased diapers were most likely to also have beer in their carts. This motivated the grocery store to move the beer aisle closer to the diaper aisle and, instantaneously, sales of both increased by 35%!

MODULE 3
Visualization

Visualization: Acquisition of Insight
Many people and institutions possess data that may hide fundamental relations:

- Realtors
- Bankers
- Air traffic controllers
- Fraud investigators
- Engineers
They want to be able to view some graphical representation of that data, maybe interact with it, and then be able to say... "aha!"

Example: Fraud Detection
- The Serious Fraud Office (SFO) suspected mortgage fraud
- The SFO provided 12 filing cabinets of data
- After 12 person-years a suspect was identified
- The suspect was arrested, tried and convicted

Example: Fraud Detection (continued)
- The data was supplied in electronic form
- A visualization tool (Netmap) was used to examine the data
- After 4 person-weeks the same suspect was identified
- A master criminal behind the fraud was also identified

Is Information Visualization Useful?

Drugs and Chips
Texas Instruments
- Manufactures microprocessors on silicon wafers that are routed through 400 steps over many weeks. This process is monitored, gathering 140,000 pieces of information about each wafer.
- Somewhere in that heap of data can be warnings about things going wrong. Detect a bug early, before bad chips are made.
- TI uses visualization tools to make the detection process easier.
Eli Lilly
- Has 1500 scientists using an advanced information visualization tool (Spotfire) for decision making.
- "With its ability to represent multiple sources of information and interactively change your view, it's helpful for homing in on specific molecules and deciding whether we should be doing further testing on them." (Sheldon Ort of Eli Lilly, speaking to Fortune)

The Cholera Epidemic, London 1854
Dr. John Snow, medical officer for London, investigated the cholera epidemic of 1854 in Soho. He mapped the deaths and noted that the deaths, indicated by points, tended to occur near the Broad Street pump. Closure of the pump coincided with a reduction in cholera.

Challenger Disaster
On 28th January 1986 the space shuttle Challenger exploded, and seven astronauts died, because two rubber O-rings leaked. The previous day, engineers who designed the rocket opposed the launch, concerned that the O-rings would not seal at the forecast temperature (25 to 29°F). After much discussion, the decision was taken to go ahead.
Cause of the accident: an inability to assess the link between cool temperature and O-ring damage on earlier flights. Many charts were poorly presented.

Visualization

- Refers to the innovative use of images and interactive technology to explore large, high-density datasets
- Helps users see patterns and relationships that would be difficult to see in text lists
  - Rich graphs, charts
  - Dashboards
  - Maps
- Increasingly being used to identify insights into both structured and unstructured data for such areas as:
  - operational efficiencies
  - profitability
  - strategic planning

Video: Tableau

Examples (Introduction to Information Visualization - Fall 2012)
- Geo data mapping (demo)
- Treemap (demo)
- Population Trendalyzer (demo)

Video on the use of visualization to extract knowledge from data: watch Gary Flake on extreme visualization.

MODULE 4
Data Mining

What is Data Mining?
The process of semi-automatically analyzing large databases to find useful patterns (Silberschatz)

Areas of Use

- Sales/Marketing
  - Diversify target market
  - Identify clients' needs to increase response rates
- Risk Assessment
  - Identify customers that pose high credit risk
- Fraud Detection
  - Identify people misusing the system, e.g. people who have two Social Security Numbers
  - Credit card fraud detection
  - Detect significant deviations from normal behavior: network intrusion detection
- Customer Care
  - Identify customers likely to change providers
  - Identify customer needs
- Medicine
  - Match patients with similar problems --> cure

Data Mining Techniques
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]

Classification
Classification is the process of predicting the class of a new item: categorize the new item and identify the class to which it belongs.
Example: A bank wants to classify its Home Loan Customers into groups according to their response to bank advertisements. The bank might use the classifications "Responds Rarely", "Responds Sometimes" and "Responds Frequently". The bank will then attempt to find rules about the customers that respond Frequently and Sometimes. The rules could be used to predict needs of potential customers.

Technique for Classification: Decision-Tree Classifiers
[Figure: decision tree predicting the credit risk of a person from job and income. The root splits on Job (Engineer, Carpenter, Doctor); each job branch then splits on Income. Recoverable income branches: <30K Bad, >50K Good; <40K Bad, >90K Good; <50K Bad, >100K Good]
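
A decision tree like the one in the figure can be read as nested if/then rules: test the job first, then the income. The sketch below is illustrative only. The thresholds are the ones recoverable from the slide, but which pair of thresholds belongs to which job is an assumption, and the function name `credit_risk` is hypothetical.

```python
# Income thresholds in thousands of dollars: (bad_below, good_above).
# ASSUMPTION: the pairing of thresholds with jobs is a guess; the figure
# text does not make the pairing certain.
RULES = {
    "engineer": (30, 50),
    "carpenter": (40, 90),
    "doctor": (50, 100),
}


def credit_risk(job, income_k):
    """Walk a two-level tree: split on job first, then on income."""
    if job not in RULES:
        return "unknown"          # job not covered by the tree
    bad_below, good_above = RULES[job]
    if income_k < bad_below:
        return "bad"
    if income_k > good_above:
        return "good"
    return "unknown"              # the slide gives no label between thresholds


for job, income in [("engineer", 25), ("doctor", 120), ("carpenter", 60)]:
    print(job, income, "->", credit_risk(job, income))
```

In practice the rules are not written by hand; a learning algorithm derives the splits from labeled historical records, which is what the applications on the following slides describe.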

Classification: Application 1
Direct Marketing
- Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  - Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.

Classification: Application 2
Fraud Detection
- Goal: Predict fraudulent cases in credit card transactions.
- Approach:
  - Use credit card transactions and the information on the account-holder as attributes: when does a customer buy, what does he buy, how often he pays on time, etc.
  - Label past transactions as fraud or fair transactions. This forms the class attribute.
  - Learn a model for the class of the transactions.
  - Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3
Customer Attrition/Churn
- Goal: To predict whether a customer is likely to be lost to a competitor.
- Approach:
  - Use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.

  - Label the customers as loyal or disloyal.
  - Find a model for loyalty.

Clustering
Clustering algorithms find groups of items that are similar. A clustering algorithm divides a data set so that records with similar content are in the same group, and groups are as different as possible from each other.
Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased.
The categories are unspecified, and this is referred to as unsupervised learning.

Clustering (continued)
Group data into clusters
- Similar data is grouped in the same cluster
- Dissimilar data is grouped in a different cluster
How is this achieved?
- Hierarchical: group data into t-trees
- K-Nearest Neighbor: a classification method that classifies a point by calculating the distances between the point and points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer).

Clustering: Application 1
Document Clustering
- Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.

- Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Clustering: Application 2
Market Segmentation
- Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
- Approach:
  - Collect different attributes of customers based on their geographical and lifestyle related information.
  - Find clusters of similar customers.
  - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Association Rule: Definition
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
Example: When a customer buys a hammer, then 90% of the time they will buy nails.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

Association Rule Discovery: Application 1
Marketing and Sales Promotion
Let the rule discovered be {Bagels, ...} --> {Potato Chips}
- Potato Chips as consequent => Can be used to determine what should be done to boost its sales.

- Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
- Bagels in antecedent and Potato Chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato Chips!

Association Rule Discovery: Application 2
Supermarket shelf management
- Goal: To identify items that are bought together by sufficiently many customers.
- Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
- A classic rule: If a customer buys diapers and milk, then he is very likely to buy beer. So, don't be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application 3
Inventory Management
- Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
- Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Sequential Pattern Discovery
Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
- Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
- In telecommunications alarm logs:
  (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
- In point-of-sale transaction sequences:
  Computer Bookstore:

  (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
  Athletic Apparel Store:
  (Shoes) (Racket, Racketball) --> (Sports_Jacket)

Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Greatly studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.

Using Databases to Improve Business Performance and Decision Making
Web mining
- Discovery and analysis of useful patterns and information from the Web
  - Understand customer behavior
  - Evaluate effectiveness of Web site, and so on
- Web content mining: mines content of Web pages
- Web structure mining: analyzes links to and from Web page
- Web usage mining: mines user interaction data recorded by Web server

Text mining
- Extracts key elements from large unstructured data sets
  - Stored e-mails
  - Call center transcripts
  - Legal cases
  - Patent descriptions
  - Service reports, and so on
- Sentiment analysis software
  - Mines e-mails, blogs, social media to detect opinions
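
To make the idea of sentiment analysis concrete, here is a minimal sketch of one very simple approach: counting words from a positive list and a negative list. This is an illustration under stated assumptions, not how any particular commercial product works. The word lists, the function names `sentiment_score` and `label`, and the sample messages are all hypothetical.

```python
import re

# Tiny hand-made word lists (assumption). Real tools rely on much larger
# lexicons or on trained natural language processing models.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "late", "poor", "terrible", "hate"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def label(text):
    """Map the score to a coarse opinion label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


messages = [
    "Love the steak, excellent service!",
    "Flight was late and the food was terrible.",
    "Landed in Newark.",
]
for message in messages:
    print(label(message), "-", message)
```

A word-counting approach like this misses negation ("not good") and sarcasm, which is one reason production systems use the richer language processing techniques mentioned in Module 2.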
