    Bridging the Gap between ML and Analytics with Data Lakes - 20 September - 16:00

    Published: October 13, 2019

    AWS Loft Istanbul 2019 Bridging the Gap between ML and Analytics with Data Lakes - 20 September - 16:00


    • 1. Slide1 © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    • 2. Bridging the Gap between ML & Analytics with Data Lakes: A Modern Data Platform Architecture. Hasan-Basri AKIRMAK, MSc., Exec-MBA, Amazon Web Services. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    • 3. Agenda: Business & Technical Challenges; Serverless Data Lake Architectures; ML with Amazon SageMaker; Integrating ML with Data Pipelines; Demo
    • 4. ML @ AWS. Our mission: put machine learning in the hands of every developer and data scientist.
    • 5. AWS ML Stack. Application Services: API-driven services (Vision & Language Services, Conversational Chatbots). Platform Services: deploy machine learning models with high-performance machine learning algorithms, broad framework support, and one-click training, tuning, and inference. Frameworks & Infrastructure: develop sophisticated models with any framework, create managed, auto-scaling clusters of GPUs for large-scale training, or run inference on trained models.
    • 6. The Machine Learning Process: Business Problem → ML problem framing → Data Collection → Data Integration → Data Preparation & Cleaning → Data Visualization & Analysis → Feature Engineering → Model Training & Parameter Tuning → Model Evaluation → Are Business Goals met? Yes: Model Deployment, Monitoring & Debugging, Predictions, Re-training. No: Data Augmentation, Feature Augmentation.
    • 7. The Machine Learning Process (same diagram as slide 6), with the callout "Problem discovery": help formulate the right questions; domain knowledge.
    • 8. The Machine Learning Process (same diagram as slide 6), with the callout "Integration": Need a data platform? Amazon S3; AWS Glue; Amazon Athena; Amazon EMR; Amazon Redshift Spectrum.
    • 9. The Machine Learning Process (same diagram as slide 6), with the callout "Model Training": set up and manage notebook environments; set up and manage training clusters; write data connectors; scale ML algorithms to large datasets; distribute the ML training algorithm across multiple machines; secure model artifacts.
    • 10. The Machine Learning Process (same diagram as slide 6), with the callout "Model Deployment": set up and manage model inference clusters; manage and scale model inference APIs; monitor and debug model predictions; model versioning and performance tracking; automate promotion of new model versions to production (A/B testing).
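The A/B-testing item on slide 10 can be illustrated with a weighted traffic split between two model versions. The following is a minimal sketch in plain Python, not the presenter's implementation: the variant names and the 90/10 weights are hypothetical, and a managed endpoint would normally do this routing for you.

```python
import random
from collections import Counter


def pick_variant(weights, rng):
    """Choose one variant name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical variants: the current model keeps most traffic while the
# candidate model receives a small share for comparison.
weights = {"model-v1": 0.9, "model-v2": 0.1}
rng = random.Random(42)  # fixed seed so the run is repeatable

counts = Counter(pick_variant(weights, rng) for _ in range(10_000))
print(counts)
```

Promotion then amounts to shifting the weights toward the candidate once its tracked performance is at least as good as the current version.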
    • 11. Challenge of Data Preparation: washing, chopping, slicing
    • 12. Data preparation accounts for ~80% of the work. A recent CrowdFlower study surveyed ~80 data scientists about their jobs. It found that data scientists spend 60% of their time cleaning and organizing data. Collecting data sets comes second at 19% of their time, meaning data scientists spend around 80% of their time preparing and managing data for analysis. And it is not fun at all! Chart categories: Building training sets; Cleaning and organizing data; Collecting data sets; Mining data for patterns; Refining algorithms; Other. Source: https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#6493d6c76f63
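The ~80% headline is the sum of the two shares quoted on slide 12. A quick check using only those two figures (the other chart categories are not quantified in this transcript, so they are left out):

```python
# Shares of a data scientist's time as quoted on the slide, in percent.
cleaning_and_organizing = 60
collecting_data_sets = 19

data_preparation_total = cleaning_and_organizing + collecting_data_sets
print(f"Preparing and managing data: {data_preparation_total}% of the time")
```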
    • 13. Challenge of integrating ML with Enterprise Architecture. Diagram labels: Business Application; API Access; AI Service; Data in Data Lake. Use cases: real-time fraud detection; automatic loan approval; bank direct marketing.
    • 14. AI/ML integrated with business application. Diagram labels: Client; Mobile client; Business Application; API Access; AI Service; Data Lake. Use case: automatic loan approval.
    • 15. Challenges of Transformations at Scale. Diagram labels: Development Environment; Raw Data. The data scientist keeps development in a Jupyter Notebook. Model training and tuning are executed iteratively.
    • 16. Customers want more value from their data. Data is: growing exponentially; from new sources; increasingly diverse; used by many people; analyzed by many applications.
    • 17. Cloud Data Lake Architectures. Customers want: to move to a single store, i.e., a data lake in the cloud; to store data securely in standard formats; to grow to any scale, with low costs; to analyze their data in a variety of ways; to democratize data access and analysis.
    • 18. Building Data Lakes on AWS: broadest and deepest portfolio, purpose-built for builders. Data Movement: Migration & Streaming Services. Data Lake Infrastructure & Management: Infrastructure; Data Preparation & Catalog & ETL; Security & Management. Analytics: Data Warehousing; Big Data Processing; Interactive Query; Operational Analytics; Real-time Analytics; Serverless Data Processing. Visualization & Machine Learning: Dashboards; Predictive Analytics.
    • 19. Building Data Lakes on AWS: broadest and deepest portfolio, purpose-built for builders. Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Managed Streaming for Kafka. Data Lake Infrastructure & Management: S3/Glacier; Glue; Lake Formation. Analytics: Redshift; EMR (Spark & Hadoop); Athena; Elasticsearch Service; Kinesis Data Analytics; Glue (Spark & Python). Visualization & Machine Learning: QuickSight; SageMaker; Comprehend; Lex; Polly; Rekognition; Translate; Transcribe; Deep Learning AMIs; + 10 more. One service carries a "NEW" badge on the slide.
    • 21. The Ad-Hoc Method
    • 30. The Reusable & Scalable Method
    • 39. Challenge of Circular Dependency of Data Lakes & ML
    • 40. AWS Glue: Data Catalog & ETL Service. Data Catalog (discover data and extract schema): automatically discovers data and stores schema; data is immediately searchable and available to extract, transform, and load (ETL). ETL job authoring (generates customizable ETL code in Python or Scala): automatically generates customizable ETL code; schedules and runs your ETL jobs; serverless.
    • 41. AWS Glue: components. Data Catalog: Hive metastore-compatible metadata repository of data sources; crawls data sources to infer table, data type, and partition format. Job Authoring: generates Python code to move data from source to destination; edit with your favorite IDE and share code snippets using Git. Job Execution: runs jobs in Spark containers, with automatic scaling based on SLA; Glue is serverless, so you pay only for the resources you consume.
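To make "crawls data sources to infer table, data type" concrete, here is a toy schema-inference sketch over a small CSV sample. It is only an illustration of the idea, not Glue's crawler logic: the sample data is invented, and the Hive-style type names (`bigint`, `double`, `string`) are an assumption chosen to resemble what a Hive-compatible catalog stores.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of 'bigint', 'double', 'string' that fits all values."""
    def fits(cast):
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "string"


def infer_schema(csv_text):
    """Infer {column name: type} from CSV text that starts with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    return {name: infer_type([row[name] for row in rows]) for name in columns}


# Hypothetical sample standing in for a file found in the raw data tier.
sample = "event_id,amount,country\n1,10.5,TR\n2,7,DE\n"
print(infer_schema(sample))
```

A real crawler also has to sample large files, detect formats and partitions, and reconcile schema changes over time, none of which this sketch attempts.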
    • 42. Amazon Athena: Interactive Analysis. Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon). Query instantly: zero setup cost; just point to S3 and start querying. SQL: open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Easy: serverless, with zero infrastructure and zero administration; integrated with QuickSight. Pay per query: pay only for queries run; save 30–90% on per-query costs through compression.
    • 43. Ad-Hoc Query from the Data Lake: finding the aggregate number of events for the last 25+ years. Notice the amount of data scanned? The results are returned by scanning 170+ GB of data from 4000+ uncompressed CSV files on S3. That's the power of Hive, Presto, and other Hadoop technologies, simplified by the Athena service.
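Because Athena is billed per query on the amount of data scanned, the 170 GB figure on slide 43 translates directly into cost. The sketch below works through the arithmetic. The price of 5 USD per TB scanned is an assumption used for illustration (check current pricing for your region), and the 30% and 90% savings are the range quoted on slide 42.

```python
PRICE_PER_TB_USD = 5.0  # assumption: illustrative price per TB scanned
GB_PER_TB = 1024


def query_cost_usd(scanned_gb, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query that scans `scanned_gb` gigabytes."""
    return scanned_gb / GB_PER_TB * price_per_tb


uncompressed_cost = query_cost_usd(170)  # the scan size quoted on the slide
print(f"Uncompressed CSV scan: ${uncompressed_cost:.2f}")

# Savings range quoted for compression on the Athena slide.
for saving in (0.30, 0.90):
    print(f"With {saving:.0%} saving: ${uncompressed_cost * (1 - saving):.2f}")
```

The same reasoning explains why the reusable pipeline writes compressed data to the analytics tier: every later query scans fewer bytes.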
    • 44. Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy for you to deliver insights to everyone in your organization. Data Scientist (Author): give power users and analysts the freedom to do their own self-serve data discovery and analysis on governed data you control. Dashboard Creator (Author): create and publish rich, interactive dashboards to all of your users. End User (Reader): with the new Reader role, you can provide everyone in your organization secure, easy access to interactive dashboards and reports, on any device.
    • 45. Why Amazon QuickSight? Easily scale from 10 users to 10,000: QuickSight automatically scales with your usage and activity, with no need for additional infrastructure, and grows with your organization's needs from a few users to tens of thousands of users. No servers to manage: Amazon QuickSight has no servers or software to manage, maintain, deploy, upgrade or migrate. We do the heavy lifting so you don't have to. Native AWS integration: Amazon QuickSight securely integrates with your data sources and AWS services like Amazon Simple Storage Service (Amazon S3), Redshift, Amazon Athena, Amazon Aurora, Amazon Relational Database Service (Amazon RDS), AWS Identity and Access Management (IAM), AWS CloudTrail, Amazon Cloud Directory and more, providing you with everything you need to build an end-to-end BI solution. Pay only for what you use: provide read-only access to interactive dashboards and pay only when your users access them with Pay-per-Session pricing. With Amazon QuickSight there are no upfront costs, no annual commitments and no charges for inactive users.
    • 46. Options for Data Preparation: SageMaker (Ad-Hoc); Glue (Reusable)
    • 47. Data Lake Design Patterns: ML Predictions on Batch & Streaming Data
    • 48. Machine Learning: Batch training pipeline. Steps: Data Preparation; Training Step; Model Deployment. Components: Tier 1 S3 Data Lake (Raw Data); Glue ETL (three jobs); Tier 2 S3 Data Lake (Analytics); Amazon SageMaker Batch Training; S3 Model Artifacts; Amazon SageMaker Endpoint.
    • 49. Machine Learning: Predictions on streaming data. Components: Kinesis Firehose; Lambda; Amazon SageMaker Endpoints; Tier 1 S3 Data Lake (Raw Data); Glue ETL; Tier 2 S3 Data Lake (Analytics); Athena; Presto/Spark on EMR; Amazon Redshift Data Warehouse; Databases.
    • 50. Machine Learning: Predictions on streaming data. Components: Kinesis Firehose; Amazon SageMaker Endpoints; Tier 1 S3 Data Lake (Raw Data); Glue ETL; Tier 2 S3 Data Lake (Analytics); Athena; Presto/Spark on EMR; Amazon Redshift Data Warehouse; Databases.
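One plausible reading of the streaming pattern on slide 49 is a Lambda function attached to Kinesis Firehose as a record transformer, which calls a model endpoint and adds the prediction to each record before it lands in the data lake. The sketch below follows the Firehose transformation contract (base64 `data`, a `recordId`, and a `result` per record). The record layout with a `features` field and the `fake_predict` stand-in are hypothetical; in a real deployment `predict` would wrap a call such as boto3's `sagemaker-runtime` `invoke_endpoint`.

```python
import base64
import json


def fake_predict(features):
    """Stand-in for a model endpoint call; returns a dummy score."""
    return {"score": round(sum(features) / len(features), 3)}


def handler(event, context=None, predict=fake_predict):
    """Enrich each Firehose record with a prediction and hand it back."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["prediction"] = predict(payload["features"])
        enriched = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "result": "Ok",
            "data": base64.b64encode(enriched).decode("utf-8"),
        })
    return {"records": output}


# Hypothetical incoming event with a single record.
raw = json.dumps({"features": [0.2, 0.4]}).encode("utf-8")
event = {"records": [{"recordId": "1", "data": base64.b64encode(raw).decode("utf-8")}]}
result = handler(event)
print(base64.b64decode(result["records"][0]["data"]).decode("utf-8"))
```

Passing `predict` as a parameter keeps the transformation testable without AWS credentials, which is the main reason for that design choice here.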
    • 51. ML Architectures. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    • 52. SageMaker Sample End-to-End Architecture: Real-Time. Stages: Build; Train; Deploy; Evaluate. Diagram labels: Sherpa Raw Input from Customers; Data Pre-Processing; Data Transformation; Amazon S3; Sherpa Data Store; Exploratory Analysis & Model Training; SageMaker Notebooks; SageMaker Training; Training Algorithm; Model File; Amazon ECR; CI/CD Deployment; CodeCommit; CodePipeline; Model Artifact & Logic; Orchestration: Continuous Iteration; SageMaker Managed Hosting: Persistent API; Low Latency Near Real-Time Prediction Service; AWS Lambda; API Gateway; Model Performance Statistics; Visualize Performance Metrics; Amazon QuickSight; Update Model based on Performance Evaluation.
    • 53. SageMaker Sample End-to-End Architecture: Batch. Stages: Build; Train; Deploy; Evaluate. Diagram labels: Raw Data sources (requires additional automation effort): Aspect, AWD, etc.; Data Pre-Processing; Amazon S3; Amazon Athena; Exploratory Analysis & Model Training; SageMaker Notebooks; SageMaker Training; Training Algorithm; Model File; Amazon ECR; CI/CD Deployment; CodeCommit; CodePipeline; Model Artifact & Logic; Orchestration: Continuous Iteration; Batch Prediction Service; SageMaker Batch Transform; CloudWatch Daily Event Trigger; Lambda Kicks off Batch Transform Job; Updated Predictions Written to S3 Daily; End Users Interact with Forecast; Model Performance Statistics; Visualize Performance Metrics; Amazon QuickSight; Update Model based on Performance Evaluation.
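The "Lambda Kicks off Batch Transform Job" step on slide 53 can be sketched as a function that builds the request for a daily job. This sketch only assembles the parameters and never calls AWS. The model name, bucket paths, and instance type are hypothetical; the dictionary keys follow the SageMaker `create_transform_job` request shape, so in a real Lambda it could be passed as `boto3.client("sagemaker").create_transform_job(**request)`.

```python
from datetime import date


def build_transform_job_request(run_date, model_name, input_uri, output_uri):
    """Assemble a batch transform request with a date-stamped job name."""
    job_name = f"{model_name}-batch-{run_date:%Y-%m-%d}"
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        # One output prefix per day keeps daily predictions separate.
        "TransformOutput": {"S3OutputPath": f"{output_uri}/{run_date:%Y/%m/%d}"},
        "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
    }


# Hypothetical names and buckets, for illustration only.
request = build_transform_job_request(
    run_date=date(2019, 9, 20),
    model_name="demo-model",
    input_uri="s3://example-analytics-tier/features",
    output_uri="s3://example-analytics-tier/predictions",
)
print(request["TransformJobName"])
print(request["TransformOutput"]["S3OutputPath"])
```

Writing predictions under a dated prefix means the analytics tier can expose them as a partitioned table for Athena and QuickSight.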
    • 54. Key Takeaways: ML on AWS
        1. Model training is completely separate from deployment.
            • From both a process and a physical perspective (executed on different infrastructure).
            • The output of the training process (a model file) powers the deployment process.
        2. Deployment is more than just a model file.
            • Requires seamless integration into the existing application architecture.
            • Deployment closely mirrors traditional software development; training is a more iterative and experimental process.
        3. Service requirements extend beyond just SageMaker for deployment.
            • Lambda + API Gateway on top of a SageMaker Endpoint is a common deployment pattern. Inference logic can also be implemented directly in the inference image via ECR.
            • Database services (RDS, DynamoDB) may be required, especially if a model requires complex joins across various data stores.
        4. There is no customer-facing output from the model training process.
            • When the training process satisfies performance criteria, the deployment process begins.
            • The training process is continuously iterative: always strive for model improvement.
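The "Lambda + API Gateway on top of a SageMaker Endpoint" pattern from the takeaways can be sketched as a Lambda handler for an API Gateway proxy integration: parse the HTTP body, call the model, and return a response with `statusCode` and `body`. This is a minimal sketch under assumptions: the `features` request field and the `fake_predict` loan-approval stand-in are hypothetical, and a real handler would replace `fake_predict` with a call to the endpoint (for example boto3's `sagemaker-runtime` `invoke_endpoint`).

```python
import json


def fake_predict(features):
    """Stand-in for the model call; returns a dummy approval decision."""
    return {"approved": sum(features) > 1.0}


def handler(event, context=None, predict=fake_predict):
    """Validate the request body, call the model, and shape the HTTP response."""
    try:
        features = json.loads(event.get("body") or "{}")["features"]
    except (ValueError, KeyError, TypeError):
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "expected a JSON body with 'features'"}),
        }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(predict(features)),
    }


ok = handler({"body": json.dumps({"features": [0.7, 0.6]})})
bad = handler({"body": "not json"})
print(ok["statusCode"], ok["body"])
print(bad["statusCode"], bad["body"])
```

Keeping validation and response shaping in the Lambda layer is what lets the business application treat the model as an ordinary HTTP API.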
    • 55. Summary: Serverless Data Lake on AWS for Advanced Analytics and ML. Machine Learning: Amazon SageMaker; AWS Deep Learning AMIs; Amazon Rekognition; Amazon Lex; AWS DeepLens; Amazon Comprehend; Amazon Translate; Amazon Transcribe; Amazon Polly. Analytics: Amazon Athena; Amazon EMR; Amazon Redshift; Amazon Elasticsearch Service; Amazon Kinesis; Amazon QuickSight. On-premises Data Movement: AWS Direct Connect; AWS Snowball; AWS Snowmobile; AWS Database Migration Service. Real-time Data Movement: AWS IoT Core; Amazon Kinesis Data Firehose; Amazon Kinesis Data Streams; Amazon Kinesis Video Streams. Data Lake on AWS: Storage | Archival Storage | Data Catalog.
    • 56. Please help us improve: Complete the survey! Bridging the Gap between Machine Learning and Analytics with Data Lakes: A Modern Data Platform Architecture. 20th of September 2019, 16:00-17:00 @Istanbul. https://bit.ly/2mqUvzm
    • 58. Thank you! Please help us by providing feedback in the evaluation survey…