November 12th, 2015

Purpose and Agenda

Purpose

Amazon Web Services is a cloud platform that can add to the flexibility and work capacity of any data analytics team.

Our presentation aims to give you:

  • Insight into when cloud solutions are worth pursuing for client work,
  • A review of specific relevant Amazon services and products, and
  • The ability to talk coherently about "the cloud" in non-technical professional settings.

Agenda

Analysis

  • EC2 - Elastic Compute Cloud
  • RDS - Relational Database Service
  • EMR - Elastic MapReduce

Security

  • Cloud Compliance
  • VPC - Virtual Private Cloud
  • IAM - Identity and Access Management

AWS

A Cloud Platform

The cloud is computers you don't own.

Amazon Web Services


11 regions,
28 availability zones,
50 services,
$1.6 billion in revenue.


AWS's capacity is estimated to be four times that of its nearest ten competitors combined, including Microsoft Azure, Google Cloud, and IBM Cloud Services.

Analysis

EC2, RDS, and EMR

EC2

Virtual Servers in the Cloud

What you buy: rentable laptops.

What Amazon says: "Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale cloud computing easier for developers."

When should you consider EC2?

PERFORMANCE COMPUTING
Projects that have a short-term or intermittent need for resource-intensive processing.

HIGH AVAILABILITY SERVERS
Server-dependent systems that require 99.99%* uptime.

What is on offer?

Rent instances by the hour, priced by the number of CPUs, amount of RAM, and data storage capacity. Choose the Amazon Machine Image (AMI) loaded onto each instance; an AMI comes with a pre-loaded operating system and software configuration.

  • CPUs: 1 - 40
  • RAM: 1 GB - 2 TB (yes, really)
  • Cost per hour: 1 cent - 7 dollars

Instance hardware configurations are optimized for general purpose, memory, processor, GPU, or storage use.

Example workflow

  1. Start up an EC2 instance.
  2. Install your software, load your code, and load your data.
  3. (Optional) Save your configuration as an AMI for reuse.
  4. Run your model.
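
Steps 1 and 3 of the workflow above can be scripted. The following is a minimal sketch, assuming the boto3 Python SDK (not mentioned in these slides) and an already configured AWS account. The helper names `start_instance` and `save_as_ami` are hypothetical; `run_instances` and `create_image` are the EC2 client calls they wrap. The AMI ID and instance type in the comments are placeholders.

```python
def start_instance(ec2_client, image_id, instance_type):
    """Step 1: start one EC2 instance from a machine image and return its ID.

    `ec2_client` is assumed to behave like boto3.client("ec2").
    """
    response = ec2_client.run_instances(
        ImageId=image_id,
        InstanceType=instance_type,
        MinCount=1,  # launch exactly one instance
        MaxCount=1,
    )
    return response["Instances"][0]["InstanceId"]


def save_as_ami(ec2_client, instance_id, name):
    """Step 3: save a configured instance as a reusable AMI and return its ID."""
    response = ec2_client.create_image(InstanceId=instance_id, Name=name)
    return response["ImageId"]


# Example use with the real SDK (requires AWS credentials; IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   instance_id = start_instance(ec2, "ami-xxxxxxxx", "t2.micro")
#   ... install software, load code and data on the instance (step 2) ...
#   ami_id = save_as_ami(ec2, instance_id, "my-model-environment")
```

Passing the client in as an argument keeps the helpers easy to try out against a stand-in object before pointing them at a real account.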

Limits

  • Licensing costs
  • Configuration
  • Training

RDS

Managed Relational Database Service

What it's for: setting up and running your database for you.

What Amazon says: "Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale relational databases in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business."

When should you consider RDS?

ANALYZE LARGE DATA
We have a 100 GB database extract (i.e. structured data) and want to analyze it using SQL.

SINGLE SOURCE OF TRUTH
We want to give multiple people read-only access to the same data at the same time.

DATA SOURCE FOR EC2
Accessing data stored on AWS is faster than accessing data stored on Summit's local network.

What is on offer?

  • CPUs: 1 - 32
  • RAM: 1 - 244 GB
  • Databases: PostgreSQL, MySQL, Microsoft SQL Server, etc.
  • Cost per hour: 2 cents - 2 dollars

Choose the amount of storage, and how fast it is, separately from CPU and RAM.

You can tailor an RDS instance to different workloads. You can also change the configuration on the fly to respond to fluctuating demands and requirements.

Why use RDS instead of a database on your own machines?

  • Higher performance
  • As much capacity as we need, instead of whatever is available
  • We're not database administrators

Example workflow

  1. Start up your RDS instance.
  2. Set up tables for your data within the database running on RDS.
  3. Load your data into the database.
  4. Connect to the database with a SQL client, R, Python, etc. from your local machine.
  5. Submit queries to the database.
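
Steps 2, 3, and 5 look the same from Python regardless of where the database runs. The sketch below uses Python's built-in sqlite3 module as a local stand-in, so it runs without an AWS account; with RDS you would instead connect to your instance's endpoint using a driver for the engine you chose. The `claims` table and its three rows are made up for illustration.

```python
import sqlite3

# Stand-in connection: an in-memory SQLite database instead of an RDS endpoint.
conn = sqlite3.connect(":memory:")

# Step 2: set up a table for the data.
conn.execute(
    "CREATE TABLE claims (claim_id INTEGER PRIMARY KEY, state TEXT, amount REAL)"
)

# Step 3: load the data (hypothetical rows).
rows = [(1, "CA", 120.0), (2, "CA", 80.0), (3, "NY", 50.0)]
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)
conn.commit()

# Step 5: submit a query and read back the result.
query = "SELECT state, SUM(amount) FROM claims GROUP BY state ORDER BY state"
totals = conn.execute(query).fetchall()
print(totals)  # [('CA', 200.0), ('NY', 50.0)]
```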

EMR

Managed Cluster Framework

What it's for: processing and analyzing massive amounts of data.

What Amazon says: "Amazon Elastic MapReduce (EMR) simplifies big data processing, providing a managed Hadoop/Spark framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of your data across dynamically scalable Amazon EC2 instances."

When should you consider EMR?

MASSIVE DATA SETS
Process, store, and produce summaries on hundreds of gigabytes of data, quickly.

MACHINE LEARNING
Modeling techniques that would require a single instance to run for weeks.

What is on offer?

Hardware

  • Clusters of multiple (2 - unlimited) EC2 instances
  • Resize clusters while they're running

Software

  • Spark SQL allows you to use SQL to define the summaries you want to compute
  • Spark MLlib allows you to estimate linear and logistic regression models, decision trees, boosted trees, clustering, etc.

Cost: 1 cent to 27 cents per machine per hour, on top of the cost of the EC2 instances themselves.
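
Because the EMR charge is added to the EC2 charge for every machine, a rough estimate multiplies the sum of both hourly rates by the cluster size and the running time. The sketch below shows that arithmetic; the function name `emr_cluster_cost` and the example rates are assumptions for illustration, not quoted prices.

```python
def emr_cluster_cost(instances, hours, ec2_rate, emr_rate):
    """Estimated cost in dollars: every machine pays both hourly rates."""
    return instances * hours * (ec2_rate + emr_rate)


# Hypothetical job: 10 machines for 8 hours, at assumed rates of
# $0.50 per hour for EC2 plus $0.12 per hour for EMR.
estimate = emr_cluster_cost(instances=10, hours=8, ec2_rate=0.50, emr_rate=0.12)
print(f"${estimate:.2f}")
```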

Example workflow

  1. Set up your cluster of 10 instances.
  2. Upload your data to Amazon.
  3. Submit a job to the cluster via Python, Java, R, etc.
  4. Retrieve the job outputs from Amazon storage.
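
To give a feel for what a submitted job does, here is a small sketch of the map, shuffle, and reduce pattern in plain Python. It is not the EMR, Hadoop, or Spark API; it only imitates, on one machine and with made-up records, how work is split into chunks and the partial results are combined.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical records: (state, amount).
records = [("CA", 120.0), ("NY", 50.0), ("CA", 80.0), ("TX", 30.0)]


def map_phase(chunk):
    """Emit (key, value) pairs from one chunk of records."""
    return [(state, amount) for state, amount in chunk]


def shuffle(pairs):
    """Group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine each key's values into a single total."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


# Split the data into two chunks, as if each went to a different machine.
chunks = [records[:2], records[2:]]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
summary = reduce_phase(shuffle(mapped))
print(summary)  # {'CA': 200.0, 'NY': 50.0, 'TX': 30.0}
```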

Limits

  • Only appropriate for really, really large data sets or really, really complex modeling.
  • Requires using technologies that are non-standard for data analysis.

Security

Cloud Compliance, VPC, and IAM

Cloud Compliance

AWS Security and Privacy Compliance

What you need to know: AWS services can be used in systems that comply with HIPAA, PCI, and many other security standards.

What Amazon says: "Amazon Web Services Cloud Compliance enables customers to understand the robust controls in place at AWS to maintain security and data protection in the cloud. As systems are built on top of AWS cloud infrastructure, compliance responsibilities will be shared. By tying together governance-focused, audit-friendly service features with applicable compliance or audit standards, AWS Compliance enablers build on traditional programs; helping customers to establish and operate in an AWS security control environment."

Shared Responsibility


SECURITY OF THE CLOUD
Amazon's guarantees to its customers about the security of its systems as sold.

SECURITY IN THE CLOUD
Our responsibilities as cloud users in securing our systems.

GovCloud

GovCloud is a special FedRAMP- and ITAR-compliant region of AWS, physically and logically accessible only from within the United States of America.

Use of GovCloud requires authorization from Amazon.


VPC

Isolated Cloud Resources

What you need to know: only your team can access your AWS resources.

What Amazon says: "Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the Amazon Web Services (AWS) Cloud where you can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways."

What is on offer?

Every AWS product we have discussed relies on an explicit VPC configuration.

For many configurations, AWS VPC is free. The exception is when a configuration requires a router internal to the VPC. Each router costs 8 cents an hour, or about $58 a month when left running continuously.

  • Configure a private network for analysis,
  • With a direct connection to your internal VPN, and
  • A secure public-facing network for client-accessible services.
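
Planning the address ranges for such a layout can be done before touching AWS. The sketch below uses Python's standard ipaddress module to carve one address block into a private analysis subnet and a public-facing subnet, and checks that neither collides with the office network on the other end of the VPN. All three address ranges are assumptions chosen for illustration.

```python
import ipaddress

# Assumed address ranges; substitute your own.
vpc = ipaddress.ip_network("10.0.0.0/16")
office = ipaddress.ip_network("192.168.0.0/16")

# Carve the VPC range into /24 subnets and take the first two.
subnets = vpc.subnets(new_prefix=24)
private_analysis = next(subnets)  # 10.0.0.0/24, for analysis instances
public_services = next(subnets)   # 10.0.1.0/24, for client-facing services

# The subnets must not overlap each other, and the VPC range must not
# overlap the office range, or traffic over the VPN cannot be routed cleanly.
checks = {
    "subnets_disjoint": not private_analysis.overlaps(public_services),
    "vpc_clear_of_office": not vpc.overlaps(office),
}
print(private_analysis, public_services, checks)
```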

Limits


Configuring your VPC to be secure requires technical understanding of network security. Assume you will need to budget for several hours of professional IT consultation to review your security, at a minimum, before putting any client work into the cloud.

IAM

Manage User Access and Encryption Keys

What you should hear: you can give your team access to AWS without worrying someone will launch ten $7-per-hour instances before everyone leaves for holiday vacation.

What Amazon says: "AWS Identity and Access Management (IAM) enables you to securely control access to AWS services and resources for your users. Using IAM, you can create and manage AWS users and groups, and use permissions to allow and deny their access to AWS resources."

What is on offer?

Users can be given unique logins, including MFA tokens.

Users can be assigned permissions.

Users can be lumped into groups, which can be assigned permissions.

Permissions define whether users can log in to the AWS console, launch different AWS products, access AWS resources, and more.
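
Permissions are written as JSON policy documents. As a sketch of the kind of guardrail described earlier, the policy below denies launching any EC2 instance whose type is not on a short approved list. The list in `ALLOWED_TYPES` is an assumption; a policy like this would be attached to a group alongside whatever permissions that group is otherwise allowed.

```python
import json

# Assumed list of inexpensive instance types the team may launch.
ALLOWED_TYPES = ["t2.micro", "t2.small"]

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Deny launching any instance whose type is not on the list.
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            "Condition": {
                "StringNotEquals": {"ec2:InstanceType": ALLOWED_TYPES}
            },
        }
    ],
}

# The JSON text is what gets pasted into, or uploaded to, IAM.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```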

Review

Analysis

  • EC2 - Elastic Compute Cloud
  • RDS - Relational Database Service
  • EMR - Elastic MapReduce

Review

Security

  • Cloud Compliance
  • VPC - Virtual Private Cloud
  • IAM - Identity and Access Management

Questions?