Running Apache Spark on AWS


Apache Spark is being adopted at a rapid pace by organizations big and small to speed up and simplify big data mining and analytics architectures. Originally created by researchers at UC Berkeley's AMPLab, the Spark codebase is now worked on by hundreds of open source contributors, and development is happening at breakneck speed. Keeping up with the latest stable releases has not been easy for organizations set up on AWS and leveraging its vast infrastructure. In this post we want to share how we got it going, and we hope setting up Spark 1.0 on AWS is a breeze for you after this. Here it is in 5 simple steps.

Step 1) Log in to your favorite EC2 instance and install the latest AWS CLI (Command Line Interface) using pip, if you don't have it yet

$ sudo pip install awscli
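
A quick way to confirm the install worked is to ask the CLI for its version (the exact version string will vary with your setup):

$ aws --version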

Step 2) Configure the AWS CLI with the access key and secret key of your AWS account

$ aws configure
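
The command prompts for your credentials, default region, and output format. The values below are placeholders (the access key and secret key shown are the example credentials from the AWS documentation); substitute your own:

AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json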

Step 3) The latest AWS CLI comes with many options to configure and bring up Elastic MapReduce (EMR) clusters with Hive, Pig, Mahout, and Cascading pre-installed. Read this article by the AWS team on setting up an older version, Spark 0.8.1, to get an understanding of what's involved. We need to replace the bootstrap script and use the latest Amazon AMI to be able to install Spark 1.0. You might also want to restrict cluster access to your VPC and add your SSH keys. Combining all that, the install command looks like the one below. Feel free to read up on the options of the EMR CLI invocation and edit the command to fit your needs.

$ aws emr create-cluster --no-auto-terminate \
    --ec2-attributes KeyName=key-name,SubnetId=subnet-XXXXXXX \
    --bootstrap-actions Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name="Spark/Shark" \
    --ami-version 3.1.0 \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
    --log-uri s3://bucket-name/log-directory-path \
    --name "Spark 1.0 Test Cluster" \
    --use-default-roles
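
The command prints the ID of the new cluster, something like j-XXXXXXXXXXXXX. While the cluster is being provisioned you can poll its state from the CLI; a minimal check, reusing that placeholder ID:

$ aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'

The state typically moves from STARTING through BOOTSTRAPPING to WAITING once the cluster is ready.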

Step 4) The setup takes anywhere from 10 to 20 minutes depending upon the installed packages. Using the cluster ID (job flow ID) printed in the console output of the previous step, you can log in to the master node of the Spark cluster with the following CLI command

$ aws emr ssh --cluster-id value --key-pair-file filename

OR

$ ssh -i key-pair-file hadoop@master-hostname
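
If you go the plain ssh route, you can look up the master node's public hostname with the same describe-cluster call; for example, reusing the placeholder cluster ID:

$ aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.MasterPublicDnsName'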

Step 5) After you log in, you will see soft links in the /home/hadoop directory to many of the packages you need. Try

$ ./hive

OR

$ ./spark/bin/spark-shell
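
Once the Spark shell comes up, it binds a SparkContext to the variable sc. A trivial sanity check, just to confirm the cluster actually runs tasks (it should come back with a count of 1000):

scala> sc.parallelize(1 to 1000).count()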

You are all set. To terminate the cluster, use the following command. Let us know in the comments if you found this information helpful, and share your experiences.

$ aws emr terminate-clusters --cluster-ids cluster-id
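
If you have lost track of the cluster ID, you can list the clusters that are still running and then pass the ID (e.g. j-XXXXXXXXXXXXX) to the terminate-clusters command above:

$ aws emr list-clusters --active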