Database Design
(constructed by Hanh Pham based on slides from books and websites listed under the References section)

Big Data

Outline
1. Introduction
2. Characteristics
3. Architectures
4. Technologies
5. Applications

Big Data
1. Introduction

What's Big Data?
Big data = a collection of data sets so large and complex that traditional data processing methods (including relational databases) are inadequate.
Size: terabytes (TB, 10^12 bytes, or strictly 2^40 bytes) to petabytes (PB, 10^15 bytes, or 1,000 TB, or 1,000,000 GB).

Example of Big Data: Wikipedia has TBs of data
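The size definitions above mix decimal (SI) and binary units. The following is a minimal Python sketch of that arithmetic, using only the figures quoted on the slide; the constant names are chosen here for illustration.

```python
# Decimal (SI) units, as used on the slide.
GB = 10**9
TB = 10**12
PB = 10**15

# Binary counterpart of the terabyte: 2^40 bytes (formally a tebibyte).
TIB = 2**40

print(PB // TB)            # terabytes per petabyte
print(PB // GB)            # gigabytes per petabyte
print(round(TIB / TB, 4))  # ratio of 2^40 bytes to 10^12 bytes
```

The last line shows that the binary terabyte is roughly 10% larger than the decimal one, which is why the slide qualifies 2^40 with "strictly".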

Sources & Growth of Big Data
Sources:
- information-sensing mobile devices
- aerial (remote sensing)
- cameras, microphones
- software logs
- radio-frequency identification (RFID) readers
- wireless sensor networks
Growth (recently):
- capacity to store information has doubled every 40 months
- every day, 2.5 exabytes (2.5 x 10^18 bytes) of data are created
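To get a feel for what "doubling every 40 months" implies, the sketch below converts the doubling period into yearly and ten-year multipliers. It assumes steady exponential growth, which the slide does not state explicitly; the function name is illustrative.

```python
DOUBLING_MONTHS = 40  # doubling period quoted on the slide


def growth_factor(months: float) -> float:
    """Capacity multiplier after `months`, assuming steady doubling every 40 months."""
    return 2 ** (months / DOUBLING_MONTHS)


annual = growth_factor(12)
decade = growth_factor(120)
print(f"per year:   x{annual:.3f}")
print(f"per decade: x{decade:.0f}")
```

Under that assumption, storage capacity grows by a bit less than a quarter each year and eightfold per decade.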

Growth of Big Data

How much data?
- Google processes 20 PB a day (2008)
- Wayback Machine has 3 PB + 100 TB/month (3/2009)
- Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
- eBay has 6.5 PB of user data + 50 TB/day (5/2009)
- CERN's Large Hadron Collider (LHC) generates 15 PB a year
"640K ought to be enough for anybody."

Data Formats / Data Types
- Relational Data (Tables/Transaction/Legacy Data)
- Text Data (Web)
- Semi-structured Data (XML)
- Graph Data: Social Network, Semantic Web (RDF), ...
- Streaming Data: you can only scan the data once

Big Data
2. Characteristics

Big Data Characteristics
- Volume: The quantity of data generated. Size determines the value and potential of the data under consideration, and whether it can be considered Big Data at all. The name "Big Data" itself refers to size.
- Variety: The category to which the data belongs. Analysts need to know it in order to use the data effectively and to their advantage.
- Velocity: The speed at which data is generated and processed to meet the demands and challenges of growth and development.
- Variability: The inconsistency that data can show at times, which hampers handling and managing it effectively.
- Veracity: The quality of the data being captured can vary greatly. The accuracy of analysis depends on the veracity of the source data.
- Complexity: Data management can become very complex, especially when large volumes of data come from multiple sources. The data must be linked, connected, and correlated in order to grasp the information it is supposed to convey.

Big Data: 3Vs

Volume (Scale)
Data Volume
- 44x increase from 2009 to 2020
- From 0.8 zettabytes to 35 ZB
Data volume is increasing exponentially
Exponential increase in collected/generated data

- 12+ TBs of tweet data every day
- 30 billion RFID tags today (1.3B in 2005)
- 4.6 billion camera phones worldwide
- ? TBs of data every day
- 100s of millions of GPS-enabled devices sold annually
- 25+ TBs of log data every day
- 76 million smart meters in 2009, 200M by 2014
- 2+ billion people on the Web by end of 2011
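The Volume (Scale) slide quotes a 44x increase, from 0.8 to 35 zettabytes between 2009 and 2020. The sketch below checks that multiple and derives the compound annual growth rate it implies. The steady-rate assumption is ours, not the slide's.

```python
start_zb, end_zb = 0.8, 35.0       # zettabytes, as quoted on the slide
start_year, end_year = 2009, 2020

ratio = end_zb / start_zb                # overall growth multiple
years = end_year - start_year
annual_rate = ratio ** (1 / years) - 1   # compound annual growth rate

print(f"overall: {ratio:.2f}x over {years} years")
print(f"implied annual growth: {annual_rate:.1%}")
```

The multiple comes out just under 44x, and the implied growth is about 41% a year, which is what "increasing exponentially" means in practice here.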

CERN's Large Hadron Collider (LHC) generates 15 PB a year
(Maximilien Brice, CERN)

The Earthscope
The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 TB of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_sciencefuture_of_technology/#.TmetOdQ--uI)

Variety (Complexity)

- Relational Data (Tables/Transaction/Legacy Data)
- Text Data (Web)
- Semi-structured Data (XML)
- Graph Data: Social Network, Semantic Web (RDF), ...
- Streaming Data: you can only scan the data once

A single application can be generating/collecting many types of data.
Big Public Data (online, weather, finance, etc.)
To extract knowledge, all these types of data need to be linked together.

A Single View to the Customer

[Diagram: the Customer at the center, linked to Banking, Finance, Social Media, Our Known History, Gaming, Entertainment, and Purchase data]
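Building a single view of the customer means linking records from separate systems on a shared key. The following is a minimal sketch of that idea; the sample records, the `customer_id` field, and the `build_customer_view` helper are all hypothetical and stand in for real banking, social media, and purchase feeds.

```python
from collections import defaultdict

# Hypothetical records from separate systems, keyed by a shared customer ID.
banking = [{"customer_id": 1, "balance": 2500.0}]
social = [{"customer_id": 1, "handle": "@ann"}, {"customer_id": 2, "handle": "@bob"}]
purchases = [
    {"customer_id": 1, "item": "laptop"},
    {"customer_id": 1, "item": "mouse"},
    {"customer_id": 2, "item": "phone"},
]


def build_customer_view(**sources):
    """Group records from every named source under their customer ID."""
    view = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            # Keep every field except the join key itself.
            fields = {k: v for k, v in record.items() if k != "customer_id"}
            view[record["customer_id"]][source_name].append(fields)
    # Convert nested defaultdicts to plain dicts for predictable lookups.
    return {cid: dict(by_source) for cid, by_source in view.items()}


customers = build_customer_view(banking=banking, social=social, purchases=purchases)
print(customers[1]["purchases"])
```

In practice the hard part is that different sources rarely share a clean key, so record matching usually comes before a join like this one.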

Velocity (Speed)
- Data is being generated fast and needs to be processed fast
- Online Data Analytics
- Late decisions mean missed opportunities
Examples
- E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
- Healthcare monitoring: sensors monitor your activities and body; any abnormal measurement requires immediate reaction
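The healthcare monitoring example can be sketched as a one-pass computation: each reading is seen once, compared with running statistics, and then discarded. The sketch below uses Welford's online algorithm for the running mean and variance. This technique is our choice for illustration and is not named in the slides; the threshold, warm-up length, and sample readings are hypothetical.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in a single pass."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        # Sample standard deviation; reported as 0.0 for fewer than two readings.
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0


def detect_anomalies(readings, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running mean.

    Each reading is examined once and then discarded, as in a stream.
    `threshold` and `warmup` are illustrative defaults, not clinical values.
    """
    stats = RunningStats()
    for index, value in enumerate(readings):
        if stats.count >= warmup and stats.stddev > 0:
            if abs(value - stats.mean) > threshold * stats.stddev:
                yield index, value
                continue  # keep outliers out of the baseline
        stats.update(value)


heart_rate = [72, 74, 71, 73, 75, 72, 74, 140, 73, 72]  # hypothetical sensor feed
alerts = list(detect_anomalies(heart_rate))
print(alerts)  # [(7, 140)]
```

Because only three numbers are kept per stream, memory use stays constant no matter how long the sensor runs.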

Real-time/Fast Data
- Mobile devices (tracking all objects all the time)
- Social media and networks (all of us are generating data)
- Scientific instruments (collecting all sorts of data)
- Sensor technology and networks (measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.

Some Make it 4Vs

Big Data
3. Architectures

Big Data Architectures

- HPCC (open-source, 2004): distributed file-sharing framework for storing and querying structured, semi-structured, and/or unstructured data; a publicly available platform capable of analyzing EBs of data.
- MapReduce (Google, 2004) and Hadoop (Apache open-source): parallel processing model and associated implementation. Queries are split, distributed across parallel nodes, and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step).
- MIKE2.0: revised for big data implications; handles big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records.
- 5C (connection, conversion, cyber, cognition, and configuration): for manufacturing applications.
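The Map and Reduce steps described above can be illustrated with the classic word-count example. The sketch below runs in a single Python process and does not use the Hadoop API; the function names and sample chunks are illustrative, and a real framework would run the map calls on separate nodes.

```python
from collections import defaultdict
from functools import reduce


def map_step(chunk: str):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


# Each chunk stands in for the share of the input handled by one node.
chunks = ["big data is big", "data is everywhere"]

mapped = [pair for chunk in chunks for pair in map_step(chunk)]  # parallel in a real cluster
counts = dict(reduce_step(key, values) for key, values in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The map calls are independent of one another, which is what lets a framework spread them across many machines before the results are gathered.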

Processing Big Data
- OLTP: Online Transaction Processing (DBMSs)
- OLAP: Online Analytical Processing (Data Warehousing)
- RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
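The difference between the first two processing styles can be seen in the shape of their queries. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a DBMS; the `orders` table and its columns are hypothetical. RTAP is not shown, since it operates on live streams rather than stored tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: short transactions that insert or update individual rows.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("north", 120.0), ("south", 80.0), ("north", 200.0)],
    )
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (90.0, 2))

# OLAP-style work: a read-only query that scans and aggregates many rows.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('north', 320.0), ('south', 90.0)]
conn.close()
```

OLTP systems are tuned for many small writes like the first two statements, while data warehouses are tuned for large scans like the last one.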

The Model Has Changed
The model of generating/consuming data has changed.
- Old Model: few companies are generating data, all others are consuming data
- New Model: all of us are generating data, and all of us are consuming data

What's driving Big Data

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

Big Data
4. Technologies

Layers of Technologies for Big Data
- Database (NoSQL): MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
- Schema (MapReduce): Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
- Storage: S3, Hadoop Distributed File System
- Servers: EC2, Google App Engine, Elastic Beanstalk, Heroku
- Processing: R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop

Big Data Technology

Big Data
5. Applications

Big Data is Everywhere

- Communication (texts, Social Network)
- Government (citizen records, tax, ...)
- Manufacturing
- Media (Internet of Things, ...)
- Commerce
- Retail
- Banking
- Science (human genome, climate simulation, ...)

References
This slide deck was constructed using slides and materials from the following books and websites:

- Nathan Marz and James Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems, 2015 edition
- Pramod J. Sadalage and Martin Fowler, NoSQL Distilled
- Dan Sullivan, NoSQL for Mere Mortals
- Michael Stonebraker, http://www.nist.gov/
- Marko Grobelnik, Big-Data Tutorial
- Ruoming Jin, http://www.cs.kent.edu/~jin/
- Wikipedia
