    Big data overview

    Published: August 10, 2018

    Big data overview from Ibm Learning

    • 1. Big Data Overview IBM Skills Academy
    • 2. Agenda © Copyright IBM Corporation 2018 Introduction to Big Data and Data Analytics •Introduction to Big Data •Growth of Interconnected Devices •Examples of Big Data •Types of Big Data •Big Data in the Industry •Big Data in Healthcare (use cases) •Big Data in Finance (use cases) •Big Data in Telecommunication (use cases) •Big Data in Retail and Social Media (use cases) •Big Data in Manufacturing (use cases) •Big Data and Internet of Things
    • 3. Introduction to Big Data •A tsunami of Big Data •The Vs of Big Data (3Vs, 4Vs, 5Vs, …) §The count depends on who does the counting •The Ecosystem §Apache open source §The distributions §The add-ons §Open Data Platform initiative (ODPi.org) •Some basic terminology
    • 4. Big Data - a tsunami that is hitting us already •We are witnessing a tsunami of data: §Huge volumes §Data of different types and formats §Impacting the business at new and ever increasing speeds •The challenges: §Capturing, transporting, and moving the data §Managing - the data, the hardware involved, and the software (open source and not) §Processing - from munging the raw data to programming to provide insight into the data §Storing - safeguarding and securing −“Big Data refers to non-conventional strategies and innovative technologies used by businesses and organizations to capture, manage, process, and make sense of a large volume of data” •The industries involved •The futures
    • 5. Data has an intrinsic property…it grows and grows •1 in 2 business leaders don’t have access to the data they need •83% of CIOs cited BI and analytics as part of their visionary plan •Top performers are 5.4x more likely to use business analytics •80% of the world’s data today is unstructured •90% of the world’s data was created in the last two years •20% of available data can be processed by traditional systems
    • 6. Growing interconnected & instrumented world
    • 7. Growth in Internet traffic (PCs, smartphones, IoT,…)
    • 8. Some examples of Big Data •Science •Astronomy •Atmospheric science •Genomics •Biogeochemical •Biological −and other complex / interdisciplinary scientific research •Social •Social networks •Social data −Person to person (P2P, C2C): •Wish Lists on Amazon.com •Craig’s List •Person to world (P2W, C2W): •Twitter •Facebook •LinkedIn •Medical records §Commercial §Web / event / database logs §"Digital exhaust" - result of human interaction with the Internet §Sensor networks §RFID §Internet text and documents §Internet search indexing §Call detail records (CDR) §Photographic archives §Video / audio archives §Large scale eCommerce §Government §Regular government business and commerce needs §Military and homeland security surveillance
    • 9. Types of Big Data •Structured §Data that can be stored and processed in a fixed format, i.e., a schema •Semi-structured §Data that lacks the formal structure of a data model (such as a table definition in a relational DBMS) but still has organizational properties, such as tags and other markers that separate semantic elements, which make it easier to analyze, e.g., XML or JSON •Unstructured §Data whose form is unknown; it cannot be stored in an RDBMS and cannot be analyzed until it is transformed into a structured format §Text files and multimedia content such as images, audio, and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that 80 percent of the data in an organization is unstructured
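The three types of data on this slide can be illustrated with a small sketch. The sample records below are hypothetical and exist only for illustration; the point is how much structure each form gives a program to work with.

```python
import csv
import io
import json

# Structured: a fixed schema, every row has the same columns (hypothetical sample data).
structured = io.StringIO("patient_id,heart_rate\n17,72\n18,95\n")
rows = list(csv.DictReader(structured))

# Semi-structured: no table definition, but keys/tags mark the semantic elements.
semi_structured = json.loads('{"patient_id": 17, "notes": {"allergies": ["latex"]}}')

# Unstructured: free text; it has to be transformed before it can be queried.
unstructured = "Patient 17 reports mild dizziness after the morning walk."
tokens = unstructured.lower().rstrip(".").split()

print(rows[1]["heart_rate"])                  # field access by column name
print(semi_structured["notes"]["allergies"])  # field access by nested key
print("dizziness" in tokens)                  # only a crude keyword test is possible
```

Note that the CSV reader returns every value as a string; applying column types is part of what a schema adds.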
    • 10. Big Data Use Cases •Healthcare •Financial •Industry •Agriculture …and many others
    • 11. Use cases for a Big Data platform: Healthcare and Life Sciences •Problem: §Vast quantities of real-time information are starting to come from wireless monitoring devices that postoperative patients and those with chronic diseases are wearing at home and in their daily lives. •How big data analytics can help: §Epidemic early warning §Intensive Care Unit and remote monitoring
    • 12. Healthcare •How Big Data Is Quietly Fighting Diseases and Illnesses §http://dataconomy.com/how-big-data-is-quietly-fighting-diseases-and-illnesses •The Data Is In: 3 Ways Analytics Will Improve Healthcare §http://dataconomy.com/the-data-is-in-3-ways-analytics-will-improve-healthcare
    • 13. Big Data and complexity in healthcare •Medical information is doubling every 5 years, much of which is unstructured •81% of physicians report spending 5 hours or less per month reading medical journals
    • 14. Precision Medicine Initiative (PMI) & Big Data •Precision Medicine §A medical model that proposes the customization of healthcare, with medical decisions, practices, and/or products being tailored to the individual patient - Wikipedia §Diagnostic testing is often employed for selecting appropriate and optimal therapies based on the context of a patient’s genetic content or other molecular or cellular analysis §Tools employed in PM can include molecular diagnostics, imaging, and analytics/software •The Precision Medicine Initiative (PMI) §A $215 million investment in President Obama’s Fiscal Year 2016 Budget to accelerate biomedical research and provide clinicians with new tools to select the therapies that will work best in individual patients
    • 15. Use cases for a Big Data platform: Financial Services •Problem: §Manage several petabytes of data, growing at 40-100% per year, under increasing pressure to prevent fraud and complaints to regulators •How big data analytics can help: §Fraud detection §Credit issuance §Risk management §360° view of the Customer
    • 16. Financial marketplace example: Visa •Problem §Credit card fraud costs up to 7 cents per 100 dollars – billions of dollars per year §Fraud schemes are constantly changing §Understanding the fraud pattern months after the fact is only partially helpful - fraud detection models need to evolve faster •If only Visa could … §Reinvent how to detect the fraud patterns §Stop new fraud patterns before they can rack up significant losses •Solution §Revolutionize the speed of detection §Visa loaded two years of test records, or 73 billion transactions, amounting to 36 terabytes of data into Hadoop - the processing time fell from one month with traditional methods to a mere 13 minutes
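The Visa figures on this slide can be put into perspective with some quick arithmetic. This is a rough sketch under two assumptions that the slide does not state: a terabyte is taken as 10^12 bytes, and "one month" is taken as 30 days.

```python
# Figures quoted on the slide.
transactions = 73e9          # 73 billion transactions
data_bytes = 36e12           # 36 terabytes (assumption: decimal TB, 10**12 bytes)
hadoop_minutes = 13          # processing time on Hadoop

# Assumption: "one month" of traditional processing means 30 days.
month_minutes = 30 * 24 * 60

bytes_per_txn = data_bytes / transactions
speedup = month_minutes / hadoop_minutes
fraud_rate = 0.07 / 100      # 7 cents per 100 dollars

print(f"average record size: {bytes_per_txn:.0f} bytes")
print(f"speedup: about {speedup:.0f}x")
print(f"fraud rate: {fraud_rate:.2%}")
```

Under these assumptions each transaction record averages roughly 500 bytes, and the quoted improvement is on the order of a few thousand times faster.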
    • 17. Financial •Big data is overhauling credit scores §http://dataconomy.com/big-data-overhauling-credit-scores-2 •Top 10 Big Data Trends in 2016 for Financial Services §https://www.mapr.com/blog/top-10-big-data-trends-2016-financial-services
    • 18. Use cases for a Big Data platform: Telecommunications Services •Problem: §Legacy systems used to gain insights from internally generated data face high storage costs, long data loading times, and long administration processing times •How big data analytics can help: §CDR processing §Combat fraud §Churn prediction §Geomapping / marketing §Network monitoring
    • 19. Use cases for a Big Data platform: Transportation Services •Problem: §Traffic congestion has been increasing worldwide as a result of increased urbanization and population growth, reducing the efficiency of transportation infrastructure and increasing travel time and fuel consumption. •How big data analytics can help: §Urban planning & monitoring §Real-time analysis of weather and traffic congestion data streams to identify traffic patterns and reduce transportation costs.
    • 20. Use cases for a Big Data platform: Retailers & Social Media •Problem: §Savvy retailers want to use “big data” to predict trends, prepare for demand, pinpoint customers, optimize pricing & promotions, and monitor real-time analytics & results - by combining data from web browsing patterns, social media, industry forecasts, existing customer records, etc. •How big data analytics can help: §Access social media to gain insight §Federate data between Big Data and RDBMSs §Apply graph analysis to the available data §Work to understand demand and engage customers
    • 21. Graph analytics •Path analysis •Connectivity analysis •Community analysis •Centrality analysis •Source: www.ibmbigdatahub.com/blog/what-graph-analytics
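Three of the four kinds of graph analysis named on this slide can be sketched in a few lines of plain Python. The social graph and the function names below are hypothetical, chosen only for illustration; community analysis is left out because it usually needs a dedicated algorithm, and real workloads would use a graph library or a distributed engine rather than in-memory dictionaries.

```python
from collections import deque

# Hypothetical undirected "who knows whom" graph as adjacency sets.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat"},
    "eve": {"fay"},
    "fay": {"eve"},
}

def shortest_path(g, start, goal):
    """Path analysis: breadth-first search for a fewest-hops path."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(g[node]):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # goal is not reachable from start

def components(g):
    """Connectivity analysis: group nodes that can reach each other."""
    seen, groups = set(), []
    for start in sorted(g):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(g[node] - group)
        seen |= group
        groups.append(group)
    return groups

def degree_centrality(g):
    """Centrality analysis: fraction of the other nodes each node touches."""
    others = len(g) - 1
    return {node: len(nbrs) / others for node, nbrs in g.items()}

print(shortest_path(graph, "ann", "dan"))
print(components(graph))
print(max(degree_centrality(graph).items(), key=lambda kv: kv[1]))
```

Degree centrality is only the simplest centrality measure; others, such as betweenness or PageRank, weigh a node's position in the graph differently.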
    • 22. Behavioral segmentation & analytics
    • 23. Use cases for a Big Data platform: Manufacturing •Problem: §The world of production will become more and more networked until everything is interlinked with everything else. The complexity of production and supplier networks has grown enormously. Previously, networks and processes were limited to one factory, but the boundaries of individual factories will most likely no longer exist, in favor of the interconnection of multiple factories or even geographical regions. •How big data analytics can help: §The Internet of Things (IoT) §Industry 4.0
    • 24. Articles on IoT - and Industry 4.0 •GE & Siemens - the American industrial giant is sprinting towards its goal; the German firm is taking a more deliberate approach http://www.economist.com/news/business/21711079-american-industrial-giant-sprinting-towards-its-goal-german-firm-taking-more •Plus many more articles •And articles specific to the various industry sectors
    • 25. Data 3.0 - my view on the data landscape •The Eras of Data §0 Flat files §1 Relational Databases (RDBMS) - 1970s - OLTP (Online Transaction Processing) §2 Data Warehouses - 1990s - OLAP (Online Analytical Processing) or DSS (Decision Support Systems) workloads §3 Big Data - 2000s - Batch, with a movement towards real-time •Some terminology of Big Data §Oceans of data (data at rest) vs. streams of data (data in motion) §Data Lake (a large storage repository and processing engine) •James Dixon of Pentaho used the term initially to contrast with “data mart,” which is a smaller repository of interesting attributes extracted from the raw data. He wrote: “If you think of a datamart as a store of bottled water - cleansed and packaged and structured for easy consumption - the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” (Wikipedia, Data Lake) §NoSQL (“not only SQL”)
    • 26. Realtime is a definite future direction for analytics •IBM System S was the precursor of IBM InfoSphere Streams; open-source competitors now include Apache Storm, Apache Kafka, etc.
    • 27. Industry 4.0 •“Industry 4.0” was the brainchild of the German government, and describes the next phase in manufacturing - a so-called fourth industrial revolution §Industry 1.0: Water/steam power §Industry 2.0: Electric power §Industry 3.0: Computing power §Industry 4.0: Internet of Things (IoT) power •Meaning −Characteristic for industrial production in an Industry 4.0 environment are the strong customization of products under the conditions of highly flexibilized (mass-) production. The required automation technology is improved by the introduction of methods of self-optimization, self-configuration, self-diagnosis, cognition and intelligent support of workers in their increasingly complex work. −Wikipedia: https://en.wikipedia.org/wiki/Industry_4.0
    • 28. The six design principles in Industry 4.0 •Interoperability §The ability of cyber-physical systems (i.e. workpiece carriers, assembly stations, and products), humans and Smart Factories to connect/communicate with each other via the Internet of Things (IoT) and the Internet of Services (IoS) •Virtualization §A virtual copy of the Smart Factory which is created by linking sensor data (from monitoring physical processes) with virtual plant models and simulation models •Decentralization §The ability of cyber-physical systems within Smart Factories to make decisions on their own •Real-time Capability §The capability to collect and analyze data and provide the derived insights immediately •Service Orientation §Offering of services (of cyber-physical systems, humans or Smart Factories) via the Internet of Services •Modularity §Flexible adaptation of Smart Factories to changing requirements by replacing or expanding individual modules
    • 29. Internet of Things (IoT) •The number of connected devices that can share data is exploding, with estimates of 50-200 billion devices being connected to the Internet by 2020 - a transformative change for our industrial society •With a dramatic growth in connections: §new devices §legacy infrastructures …triggering an unprecedented spike in data volumes, devices & data •That data represents §untapped production efficiencies §competitive business insights §new, brand-differentiating services * but only if the data can be effectively analyzed, and its value unlocked Photo: Siemens
    • 30. The future of IoT and the connected world
    • 31. Sensors & software for the automobile
    • 32. Technology & tomorrow: Future directions (IEEE) •IEEE New Technologies Connections is a resource for emerging technologies: •Big Data •Brain •Cybersecurity Initiative •Digital Senses •Green ICT •Internet of Things (IoT) •Rebooting Computing •Smart Cities •Smart Materials •Software Defined Networks (SDN) •Cloud Computing •Life Sciences •Smart Grid •Transportation Electrification •www.ieee.org/about/technologies
    • 33. An example of a big data platform in practice (IBM) [Architecture diagram; labels only] •Ingestion and Real-time Analytic Zone: Streaming Data, Connectors •Landing and Analytics Sandbox Zone: Hive/HBase Col Stores, Documents in variety of formats, MapReduce, Hadoop •Warehousing Zone: Enterprise Warehouse, Data Marts •Analytics and Reporting Zone: BI & Reporting, Predictive Analytics, Visualization & Discovery •Metadata and Governance Zone: ETL, MDM, Data Governance
    • 34. Big Data & Analytics architecture - A broader picture
    • 35. Big Data scenarios span many industries •Identify criminals and threats from disparate video, audio, and data feeds •Make risk decisions based on real-time transactional data •Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement •Detect life-threatening conditions at hospitals in time to intervene •Multi-channel customer sentiment and experience analysis
    • 36. Big Data adoption (emphasis on Hadoop)
    • 37. Factors driving interest in Big Data Analysis
    • 38. Impact of Big Data analytics in the next 5 years
    • 39. Big Data: Outcomes & data sources
    • 40. Data & Data Science •The direction of this course is to head towards a study of Data Science - and that means we will have to study the nature of data, including how it is found, what formats it occurs in, wrangling it, … •During the journey we will study: §The basics of the technology: Hadoop & HDFS, MapReduce & YARN, Spark §Data formats & data movement §The role & work of the Data Scientist §Programming for Big Data §The Hadoop ecosystem - both open source and proprietary approaches §Data governance & data security
    • 41. Facets of data In data science and big data, you will come across many different types of data, and each of them requires different tools and techniques. The main categories of data are: •Structured •Unstructured •Natural language •Machine-generated •Graph-based •Audio, video, and image •Streaming Cielen, D., Meysman, A. D. B., & Ali, M. (2016). Introducing data science: Big data, machine learning, and more, using Python tools. Shelter Island, NY: Manning Publications, pp. 4-8.
    • 42. System of Units / Binary System of Units
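The slide's table is not reproduced in this transcript, so the following is only a sketch of the distinction its title points at: decimal (SI) prefixes step by 1000, while binary (IEC) prefixes step by 1024. The helper name `humanize` is hypothetical.

```python
DECIMAL = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]      # powers of 1000 (SI)
BINARY = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]  # powers of 1024 (IEC)

def humanize(n_bytes, base=1000):
    """Format a byte count using decimal (base=1000) or binary (base=1024) prefixes."""
    units = DECIMAL if base == 1000 else BINARY
    value = float(n_bytes)
    for unit in units:
        # Stop at the first unit that fits, or at the largest unit available.
        if value < base or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= base

size = 36 * 10**12  # the 36 TB quoted in the Visa example, read as decimal terabytes
print(humanize(size))        # decimal prefixes
print(humanize(size, 1024))  # binary prefixes
```

The same byte count reads as 36.00 TB in decimal units but only about 32.74 TiB in binary units, a gap of roughly 9% that grows with each prefix step.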
    • 43. Introduction to Hadoop & the Hadoop Ecosystem •Why? When? Where? §Origins / History §The Why of Hadoop §The When of Hadoop §The Where of Hadoop •Hadoop Basics §Comparison with RDBMS •Hadoop architecture §MapReduce §HDFS §Hadoop Common
    • 44. What is Hadoop? •Apache open source software framework for reliable, scalable, distributed computing on massive amounts of data §Hides underlying system details and complexities from the user §Developed in Java •Consists of 3 sub projects: §MapReduce §Hadoop Distributed File System (aka HDFS) §Hadoop Common •Has a large ecosystem with both open-source & proprietary Hadoop-related projects §HBase / ZooKeeper / Avro / etc. •Meant for heterogeneous commodity hardware
    • 45. A large (and growing) Ecosystem
    • 46. Who uses Hadoop?
    • 47. Why & where Hadoop is used / not used •What Hadoop is good for: §Processing massive amounts of data through parallelism §A variety of data (structured, unstructured, semi-structured) §Inexpensive commodity hardware •What Hadoop is not good for: §Processing transactions (random access) §Work that cannot be parallelized §Low-latency data access §Processing lots of small files §Intensive calculations with little data
    • 48. Hadoop / MapReduce timeline
    • 49. Many contributors to Hadoop (e.g., 2006-2011)
    • 50. The two key components of Hadoop •Hadoop Distributed File System = HDFS §Where Hadoop stores data §A file system that spans all the nodes in a Hadoop cluster §It links together the file systems on many local nodes to make them into one big file system •MapReduce framework §How Hadoop understands and assigns work to the nodes (machines)
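The MapReduce idea named on this slide can be sketched with the classic word count. This is a single-process illustration in plain Python, not the Hadoop API: the function names and the two input splits are hypothetical, and in a real cluster the map and reduce calls would run on different nodes, with the framework doing the shuffle.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

# Two hypothetical input splits, standing in for blocks stored on different nodes.
splits = [["big data big"], ["data lake"]]

mapped = chain.from_iterable(map_phase(line) for split in splits for line in split)
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call looks at only one line and each reduce call at only one key, both phases can be spread across many machines without the workers needing to talk to each other.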
    • 51. Think differently As we start to work with Hadoop, we need to think differently: •Different processing paradigms •Different approaches to storing data •Think ELT (extract-load-transform) rather than ETL (extract-transform-load) …and understanding the Hadoop Ecosystem means embarking on a continuing learning process…self-education is an ongoing requirement
    • 52. Core Hadoop concepts •Applications are written in high-level language code •Work is performed in a cluster of commodity machines §Nodes talk to each other as little as possible •Data is distributed in advance §Bring the computation to the data •Data is replicated for increased availability and reliability •Hadoop is fully scalable and fault-tolerant
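The replication point on this slide can be made concrete with a toy model. The round-robin placement below is a deliberate simplification assumed for illustration (real HDFS placement is rack-aware), and the node names and helper functions are hypothetical; the replication factor of 3 matches the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def readable_blocks(placement, failed):
    """Blocks that still have at least one replica on a surviving node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    ]

nodes = ["node1", "node2", "node3", "node4"]  # hypothetical four-node cluster
placement = place_blocks(6, nodes)

print(readable_blocks(placement, {"node1"}))                    # one node lost
print(readable_blocks(placement, {"node1", "node2"}))           # two nodes lost
print(readable_blocks(placement, {"node1", "node2", "node3"}))  # three nodes lost
```

With three replicas per block, every block stays readable after any two node failures; data is lost only when all three holders of some block fail at once.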
    • 53. Differences between RDBMS and Hadoop/HDFS
    • 54. Requirements for this new approach •Partial Failure Support •Data Recoverability •Component Recovery •Consistency •Scalability •Hadoop is based on work done by Google in the late 1990s/early 2000s: Specifically, on papers describing the Google File System (GFS) (published in 2003), and MapReduce (published in 2004)
    • 55. Some terminology…to get you started •75 Big Data Terms Everyone Should Know (July 2017) http://dataconomy.com/2017/07/75-big-data-terms-everyone-know •But these are just the beginning of a terminological dictionary that you should develop for yourself