Automated R Clusters on AWS

name: Automated R Clusters on AWS
class: transition no-number split-50
count: false
# {{ name }}

.column[.left[
  2016-02-17  
  \#datadc
]]
.column[.right[
  Elizabeth Byerly  
  [@ByerlyElizabeth](https://twitter.com/ByerlyElizabeth)  
  [GitHub](https://github.com/ElizabethAB) -
  [LinkedIn](https://www.linkedin.com/in/elizabethbyerly)
]]

---

This presentation describes a minimum viable R cluster solution for AWS.

1. The Solution
2. Code
3. Next Steps

???

The solution: what motivated me to build out and automate an R cluster on AWS,
what technologies are employed, what does it allow me to do?

Code: switching to looking at the code base underlying the solution, comprised
primarily of Python (for its excellent boto3 API into AWS) and shell scripting.

Next steps: The minimum solution was built for a particular project, but small
additions would allow it to generalize and be more accessible.

---

* Problem
* Technologies

???

Hard work rarely comes out of a vacuum! I'm going to outline the problem I
needed to solve, the constraints under which I operated, and the highest level
summary of the technologies used in my solution.

---

An analytical model that takes 19+ hours to run on AWS's largest single compute
instance, `c4.8xlarge` (36 cores).

### Constraints:
* No change to the core model code
  - Vanilla R running a `parallel::parLapply()` operation
* Analysts should not have to learn or interact with new languages
* Solution must be available to external analysts

???

I work in litigation analytics. We submit our work as part of expert witness
material, so the ability to cite previous use is important to establish the
credibility of our findings. Building bespoke analytical software or
implementations is bad.

The company I work for mostly uses Stata; R represents the bleeding edge of
for them. If using the solution required understanding Java, Python, or Scala,
I'd be the only one who could use it!

Finally, we are working with external teaming partners who would need to be able
to run the same system. A local setup accessible over our VPN was out.

---

* Networking: AWS VPC
* Communication: Sockets
* Data Sharing: NFS
* Access: RStudio Server

???

AWS VPC: Virtual Private Cloud, a shared security group, and a shared placement
group allow for a relatively good proxy for a local area network setup.  
The limitation is that units are not guaranteed to be "close" on the network.

Network sockets are endpoints on a computer network, a handle or reference local
computer programs can access via an API (application programming interface) to
send or receive data. Our master node is going to communicate to its worker
nodes via sockets.

Network file system is a venerable and battle-tested distributed file system
protocol (1984), open standard. We're going to use it to make all of our cluster
nodes share the same home directory.

RStudio Server is easy to setup and use on a remote computer, as long as network
traffic over port 8787 is open.

---

* Setup
* Launching a cluster
* Accessing the cluster

---

---

* Limitations
* Accessibility

???

What Have I Skipped?
 
* Getting data onto AWS
* Running analysis
* Retrieving outputs
 
If you're interested in actually using this solution, check out the GitHub repo
or come and talk to me. If you need to make alterations for your particular use
case and aren't sure where to start, come talk to me! I built it for my needs,
but I would love if the community could derive some additional utility.

Limitations
 
* Not as fast as LAN; not appropriate for data-transfer heavy work
* Not secure
* Clusters aren't saved

Accessibility
 
* Users' ability to specify R packages
* User must have Python 3, boto3, and paramiko installed