    Deep Dive Disaster Recovery in the Cloud - 26 September - 11:00

    Published: October 16, 2019

    AWS Loft Istanbul 2019 Deep Dive Disaster Recovery in the Cloud - 26 September - 11:00

    • 1. Slide93
    • 2. Deep Dive: Disaster Recovery IN the Cloud. Serdar Nevruzoglu, Solutions Architect, serdarn@amazon.com. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    • 3. What to expect from this session
        • 300-level content
        • Disaster recovery as defined IN the cloud
        • ~40 minutes of presentation
    • 4. Agenda
        • First principles
        • Availability goals
        • AWS design for reliability
        • Failover considerations
        • Disaster recovery scenarios
        • Recommendations
    • 5. First principles
        • Failures are a given and everything will eventually fail over time.
        • Expect the unexpected.
        - Werner Vogels, from "10 Lessons from 10 Years of Amazon Web Services"
    • 6. What are we really planning for?
    • 7. Distributed system design best practices (Reliability Pillar whitepaper, Sept 2018)
        • Eventual consistency
        • Idempotency
        • Static stability
        • Throttling
        • Exponential backoff
        • Circuit breaking
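One of the practices listed on this slide, retrying with exponential backoff, can be sketched as follows. This is a minimal illustration and not code from the talk: the function names, the default `base` and `cap` values, and the use of full jitter are all assumptions.

```python
import random
import time


def backoff_ceiling(attempt, base=0.1, cap=5.0):
    """Upper bound in seconds for the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_backoff(operation, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` calls have failed."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter (an assumed choice): wait a random time up to the ceiling.
            sleep(random.uniform(0, backoff_ceiling(attempt, base, cap)))
```

The `sleep` parameter is injectable so that the retry logic can be exercised without real delays.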
    • 8. Calculating availability with hard dependencies: the application depends on Dependency 1 (99%), Dependency 2 (95%), and Dependency 3 (90%), so application availability is <= 90%
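The bound on this slide follows from multiplying the availabilities of hard dependencies. The sketch below assumes the dependencies fail independently; the function name is illustrative.

```python
from math import prod


def serial_availability(*availabilities):
    """Availability of a system that needs every hard dependency to be up."""
    return prod(availabilities)


overall = serial_availability(0.99, 0.95, 0.90)
print(f"{overall:.1%}")  # 84.6%, below the weakest dependency (90%)
```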
    • 9. Calculating availability with redundant components: Component 1 (90%) plus a Component 1 replica (90%) gives the application ~99% with redundancy, assuming instantaneous failover
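The ~99% figure can be reproduced by treating the replicas as independent, so the system is down only when every copy is down. This is a sketch under that assumption, with instantaneous failover as the slide states.

```python
def redundant_availability(availability, copies=2):
    """Availability of `copies` independent replicas where any one can serve."""
    return 1 - (1 - availability) ** copies


print(f"{redundant_availability(0.90):.0%}")  # 99%
```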
    • 10. Application availability goals: common application design goals and the annual length of interruption represented by each availability percentage. Consider multi-region design. Source: Reliability Pillar whitepaper (Sep 2018)
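The relationship between an availability percentage and the annual length of interruption is simple arithmetic. The sketch below assumes a 365-day year; the percentages in the loop are illustrative inputs, not values copied from the slide's table.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a non-leap year


def annual_downtime_hours(availability_percent):
    """Hours per year a service may be unavailable at the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {annual_downtime_hours(pct):.2f} hours/year")
```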
    • 11. AWS design for reliability
    • 12. Slide1210
    • 13. Slide1211
    • 14. Region-wide AWS services: Amazon DynamoDB, Amazon RDS, Amazon ElastiCache, Amazon S3, Amazon EFS, Amazon SQS, Amazon Kinesis, Amazon Elasticsearch. Legend: Default; Configurable for multi-AZ deployment
    • 15. Availability design goals – AWS: 99.9958% - 15 minutes 99.9975% 99.9999% 100.0000% 99.9975% 99.9999% 99.9999% 99.9999% Multi(2)-region active-active Availability N/A N/A N/A N/A N/A N/A
    • 16. Availability design goals – AWS (cont): 99.9958% - 15 minutes 99.9999% Multi(2)-region active-active Availability 99.9975% 99.9999% N/A N/A N/A N/A N/A N/A. Source: Reliability Pillar, AWS Well-Architected Framework, Appendix A: Designed-For Availability for Select AWS Services, Sept 2018, https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf
    • 17. Failover considerations
    • 18. Anatomy of a disaster – recovery point. Timeline markers: Recovery point
    • 19. Anatomy of a disaster – data loss period. Timeline markers: Recovery point, Disaster. Interval: Data loss (recovery point to disaster)
    • 20. Anatomy of a disaster – recovery time. Timeline markers: Recovery point, Disaster, Recovery time. Intervals: Data loss (recovery point to disaster), Down time (disaster to recovery time)
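The two intervals in the "anatomy of a disaster" timeline can be computed directly from three timestamps. This is a minimal sketch; the function name and the dictionary keys are illustrative.

```python
from datetime import datetime


def disaster_intervals(recovery_point, disaster, recovered):
    """Return the data loss period and the down time for one incident."""
    return {
        # Writes made after the last recoverable state are lost.
        "data_loss": disaster - recovery_point,
        # The service is unavailable from the disaster until recovery completes.
        "down_time": recovered - disaster,
    }


incident = disaster_intervals(
    recovery_point=datetime(2019, 9, 26, 10, 0),
    disaster=datetime(2019, 9, 26, 10, 45),
    recovered=datetime(2019, 9, 26, 13, 0),
)
print(incident["data_loss"], incident["down_time"])  # 0:45:00 2:15:00
```

Comparing `data_loss` with the RPO and `down_time` with the RTO shows whether an incident stayed within its objectives.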
    • 21. Failover considerations - applying decision theory
        - Minimize your maximum regret
        - Reputational damage
        - Lost revenue
        - Lost customer loyalty
        - Given a specific maximum regret, what is the value of avoidance?
        - How much are you willing to invest in avoidance?
        - Will there be an ROI when an event occurs?
        - What are your regulatory and compliance requirements?
        - Multi-AZ
        - Multi-region
        - On-premises cold cloud backup
        - Multi-cloud?
    • 22. Recovery potential versus cost. Chart: RPO/RTO (0, Minutes, Hours, Days) against Cost ($ to $$$$), showing recovery potential
    • 23. NOTE: Disaster recovery is not just about making good backups.
    • 24. Cost effective DR: Why not use DR all the time? DR environments that don’t get used:
        1. Fall out of sync, eventually
        2. Waste money
    • 25. Disaster recovery scenarios (in cloud)
    • 26. Disaster recovery scenarios: Amazon EC2 instance failure, Amazon S3 failure or slowdown, Amazon DynamoDB failure, Amazon RDS DB failure, AWS region failure, AWS AZ failure, Amazon CloudFront failure, AWS Direct Connect failure, Loss of command/control, AWS multi-region failure, AWS Lambda failure, Amazon Elasticsearch failure
    • 27. Amazon EC2 instance crashed!
    • 28. High availability for Amazon EC2 – instance recovery. Triggers: loss of network connectivity; loss of system power; software issues on the physical host; hardware issues on the physical host that impact network reachability. Result: a new instance, identical to the old instance, with the same instance ID, private IP addresses, Elastic IP addresses, and all instance metadata. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html
    • 29. Amazon EC2 Auto Recovery. Metric = StatusCheckFailed_System. Set your failed check threshold; choose 1-minute period and statistic minimum; choose recover action
    • 30. Amazon EC2 Auto Reboot. Metric = StatusCheckFailed_Instance. Choose reboot action
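The alarm settings on the two slides above (metric, 1-minute period, minimum statistic, recover or reboot action) can be collected into a parameter dictionary. This is a sketch: the helper name, the alarm naming scheme, and the threshold and comparison choices are assumptions. The dictionary keys follow the CloudWatch `PutMetricAlarm` parameters, so it could be passed to boto3 as `cloudwatch.put_metric_alarm(**params)`; no AWS call is made here.

```python
def status_check_alarm(instance_id, region, action="recover", failed_periods=2):
    """Build CloudWatch alarm parameters for EC2 auto recovery or auto reboot."""
    metric = {
        "recover": "StatusCheckFailed_System",    # host-level problem
        "reboot": "StatusCheckFailed_Instance",   # instance-level problem
    }[action]
    return {
        "AlarmName": f"{instance_id}-auto-{action}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,                         # 1-minute period, as on the slide
        "EvaluationPeriods": failed_periods,  # the failed check threshold
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }
```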
    • 31. High availability for Amazon EC2 – Auto Scaling. Auto Scaling group across Availability Zone 1 and Availability Zone 2; fresh server from AMI
    • 32. Amazon S3 failure/slowdown
    • 33. Amazon S3 – Cross-Region replication
    • 34. Monitoring Amazon S3 file replication progress: info on new files, info on replicated files. https://aws.amazon.com/answers/infrastructure-management/crr-monitor/
    • 35. Slide1182 Amazon
    • 36. Amazon S3 Bucket lost/deleted
    • 37. Amazon S3 data loss concerns
        • S3 is built for 11 9’s of durability
        • If you store 10,000 objects, you can on average expect to incur a loss of a single object once every 10,000,000 years.
        • S3 supports cross region replication
        • S3 supports versioning
        • S3 supports MFA delete
        • IAM roles can also be used to limit access to S3
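The "once every 10,000,000 years" figure follows from reading 11 9's as 99.999999999% annual durability per object, which is an annual loss probability of 1 in 10^11. The sketch below uses exact fractions to avoid floating-point noise; the function name is illustrative.

```python
from fractions import Fraction


def years_per_lost_object(object_count, nines=11):
    """Average number of years between single-object losses."""
    annual_loss_probability = Fraction(1, 10 ** nines)  # per object, per year
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


print(years_per_lost_object(10_000))  # 10000000
```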
    • 38. Amazon S3 bucket backup and restore: AWS CLI-based backup, manual DR failover. Diagram: a remote location syncs to the S3 bucket /mybucket (S3 STANDARD_IA), with a lifecycle policy to Amazon Glacier (1, PREP); Amazon EC2 in the AWS DR Region restores from the bucket (2, RESTORE)
        $ aws s3 sync /backups s3://mybucket    # Back up and sync the backup folder
        $ aws s3 sync /backups s3://mybucket --delete    # Like the preceding, but also delete files in the bucket that are not present in /backups
        $ aws s3 sync /backups s3://mybucket --delete --storage-class STANDARD_IA    # Like the preceding, but store objects in the Infrequent Access class
    • 39. Regional AWS Lambda failure
    • 40. Multi-region AWS Lambda deployment – AWS CodePipeline (NEW!) https://docs.aws.amazon.com/codepipeline/latest/userguide/actions-create-cross-region.html
    • 41. Multi-region pilot light
    • 42. Pilot light concept. Diagram regions: us-east-1, us-east-2
    • 43. Multi-region active-active
    • 44. Serving a geographically distributed customer base. Regions: China, USA West, USA East, EU-East. Users from San Francisco, New York, London, Shanghai
    • 45. Guarding against failure of your applications in one region. Applications in US West and applications in US East, each with AWS Service 1, AWS Service 2, AWS Service 3, AWS Service 4; users from San Francisco and users from New York
    • 46. Minimal data replication requirements
        • Does all data need to be replicated?
        • If yes, does it need to be replicated synchronously?
        • Does all data need to be replicated continuously?
    • 47. Shared services VPC. Oregon Region and Ireland Region each have a shared services VPC and multiple VPCs, connected by a VPC peer over the Amazon backbone
        • No overlapping IP address space
        • Cross-region connection encrypted
    • 48. Traffic segregation & management
        • Segregation options: explicit, with different URLs (e.g. east.abc-corp.com and west.abc-corp.com), or implicit at the DNS level, with the same URL (e.g. www.abc-corp.com)
        • Traffic management infrastructure: throttling, internal redirecting, external redirecting
    • 49. Tolerance for network partitioning (Region A and Region B connected over the backbone)
        • Failure of one region should not lead to failure of applications in another
        • Regional independence for request serving: no API calls from one region to another
    • 50. Loss of AWS Direct Connect
    • 51. Redundant Direct Connects https://aws.amazon.com/answers/networking/aws-multiple-data-center-ha-network-connectivity/
    • 52. Use multiple Direct Connect Gateways https://docs.aws.amazon.com/directconnect/latest/UserGuide/direct-connect-gateways.html
    • 53. Amazon Elasticsearch failure
    • 54. Amazon Elasticsearch cross region replication. The source posts through Amazon Kinesis Firehose to Amazon ES in both Region A and Region B. The source needs tracking to confirm successful posting to both regions.
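The tracking requirement on this slide can be sketched as a publisher that remembers which regions have not yet confirmed each record. This is an illustration only: the class, the method names, and the `senders` callables (stand-ins for a per-region Firehose client) are hypothetical.

```python
class DualRegionPublisher:
    """Post each record to every region and track unconfirmed deliveries."""

    def __init__(self, senders):
        self.senders = dict(senders)  # region -> callable(record), raises on failure
        self.records = {}             # record id -> record awaiting delivery
        self.pending = {}             # record id -> regions not yet confirmed

    def publish(self, record_id, record):
        self.records[record_id] = record
        self.pending[record_id] = set(self.senders)
        return self.retry(record_id)

    def retry(self, record_id):
        """Send to regions still pending; return True once all have confirmed."""
        if record_id not in self.pending:
            return True
        for region in sorted(self.pending[record_id]):
            try:
                self.senders[region](self.records[record_id])
            except Exception:
                continue  # leave the region pending for a later retry
            self.pending[record_id].discard(region)
        if self.pending[record_id]:
            return False
        del self.pending[record_id]
        del self.records[record_id]
        return True
```

Only regions that are still pending are retried, so a region that already confirmed a record does not receive it twice from this publisher.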
    • 55. Amazon RDS failover
    • 56. Amazon RDS Multi-AZ deployment. Synchronous physical replication between VPC subnets in Availability Zone 1 and Availability Zone 2, within one security group, behind the endpoint mydb1.abc45345.eu-west-1.rds.amazonaws.com:3306
        • Standbys ensure zero data loss in the event of the master’s failure. Always have a standby.
        • Cross-AZ failovers are automatic and fast, whereas cross-region failovers take time and need nontrivial planning.
    • 57. Amazon RDS reliability services
        • RDS automatic backup / snapshots
        • RDS supports cross region read replicas for MySQL, PostgreSQL, MariaDB, and Amazon Aurora MySQL
    • 58. AWS Database Migration Service (DMS)
        • Continuous or one-time DB replication to Amazon EC2, Amazon RDS, Amazon S3, or Amazon Elasticsearch
        • Leverage DMS to replicate your database to AWS or even change your schema from one engine to another.
    • 59. AWS Database Migration Service for granular replication. A DMS replication instance sits between Source and Target; transactions (t1, t2) updated at the source are applied to the target after the bulk load. Change data capture is supported for MySQL, MariaDB, Aurora, PostgreSQL. Details: http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.html
    • 60. Multi-region monitoring Considerations
    • 61. Not all metrics are created equal!
        • High level metric (high importance, low volume): user experience, user count, transaction count, replication status
        • Low level metric (relatively low importance, high volume): HTTP request count, read vs. write throughput, cache hit vs. miss
    • 62. High level metrics monitoring. Region A and Region B each run app monitoring agents, low level metrics monitoring, and high level metrics monitoring. Replicate only high level metrics. Use region tags.
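The advice to replicate only high level metrics and tag them by region can be sketched as a small filter. The metric names and the dictionary layout below are assumptions chosen to mirror the categories named in the talk.

```python
# Hypothetical metric names for the high level category.
HIGH_LEVEL_METRICS = {
    "user_experience",
    "user_count",
    "transaction_count",
    "replication_status",
}


def metrics_to_replicate(local_metrics, region):
    """Keep only high level metrics and tag each one with its source region."""
    return [
        dict(metric, region=region)
        for metric in local_metrics
        if metric["name"] in HIGH_LEVEL_METRICS
    ]


sample = [
    {"name": "user_count", "value": 1200},
    {"name": "http_request_count", "value": 950000},
]
print(metrics_to_replicate(sample, "region-a"))
# [{'name': 'user_count', 'value': 1200, 'region': 'region-a'}]
```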
    • 63. Redshift failure
    • 64. Redshift cross region replication. The source posts through Amazon Kinesis Firehose to Amazon Redshift in both Region A and Region B. The source needs tracking to confirm successful posting to both regions.
    • 65. Amazon DynamoDB failover
    • 66. Amazon DynamoDB Global Tables: first fully managed, multi-master, multi-region database. Globally dispersed users reach a global app backed by a global table with replicas in N. America, Europe, and Asia
        • Build high performance, globally distributed applications
        • Low latency reads & writes to locally available tables
        • Disaster proof with multi-region redundancy
        • Easy to set up and no application rewrites required
    • 67. Recommendations
    • 68. Lessons from history: Plan for more than just what you expect to happen.
    • 69. Lessons from history: Test your execution plan before you think you can implement it.
    • 70. Words of advice: People generally don’t do well under pressure
        • Automate as much as you can (IMPORTANT)
        • Table top exercises can really help you understand roles and responsibility
        • Not all services require the same RTO/RPO
        • If you don’t have a run book, it’s time to make one
        • If you have one, have you tested it?
    • 71. Conclusions
        • Avoid synchronous replication & simultaneous deployments
        • Design applications for idempotency & eventual consistency
        • Closely monitor replication & code sync delays
        • Have push buttons ready to switch traffic
        • Make high level metrics monitoring systems also multi-region
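Designing for idempotency, as the conclusions recommend, means a request that is retried or replayed in another region must not be applied twice. A minimal sketch, assuming each request carries a unique id (the class and method names are illustrative):

```python
class IdempotentLedger:
    """Apply each credit request at most once, keyed by request id."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request id -> balance returned for that request

    def credit(self, request_id, amount):
        if request_id in self._results:
            # Duplicate delivery: return the earlier result, change nothing.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


ledger = IdempotentLedger()
ledger.credit("req-1", 50)
ledger.credit("req-1", 50)  # retry of the same request
print(ledger.balance)  # 50
```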
    • 72. Slide1176 Please complete the session survey!
    • 73. Slide1241 Please complete the session survey!
    • 74. Thank you! Serdar Nevruzoglu, serdarn@amazon.com
    • 75. Slide1178