
Running Apache Spark on AWS

Development & Engineering

Apache Spark is being adopted at a rapid pace by organizations big and small to speed up and simplify big data mining and analytics architectures. First invented by researchers at AMPLab at UC Berkeley, the Spark codebase is now worked on by hundreds of open source contributors, and development is happening at breakneck speed. Keeping up with the latest stable releases has not been easy for organizations set up on AWS and leveraging its vast infrastructure. In this post we want to share how we got it going; we hope you find setting up Spark 1.0 on AWS a breeze after this. Here it is in five simple steps.

Step 1) Log in to your favorite EC2 instance and install the latest AWS CLI (Command Line Interface) using pip, if you don’t have it yet:

$ sudo pip install awscli

Step 2) Configure AWS and set up the secret key and access key of your AWS account:

$ aws configure
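The configure command prompts for your access key, secret key, default region, and output format, and stores them in a config file under your home directory. A sketch of what the resulting file looks like, with placeholder values (depending on your CLI version, the keys may live in ~/.aws/credentials instead):

```ini
# ~/.aws/config -- all values below are placeholders
[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
region = us-east-1
output = json
```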

Step 3) The latest AWS CLI comes with many options to configure and bring up Elastic MapReduce (EMR) clusters with Hive, Pig, Mahout, and Cascading pre-installed. Read this article by the AWS team on setting up an older version, Spark 0.8.1, to get an understanding of what’s involved. We will need to replace the bootstrap script and use the latest Amazon AMI to be able to install Spark 1.0. You might also want to restrict cluster access to your VPC and add your SSH keys. Combining all of that, the install command looks like the one below. Feel free to read up on the options of the EMR CLI invocation and edit the command to fit your needs.

$ aws emr create-cluster --no-auto-terminate --ec2-attributes KeyName=key-name,SubnetId=subnet-XXXXXXX --bootstrap-actions Path=s3://elasticmapreduce/samples/spark/1.0.0/install-spark-shark.rb,Name="Spark/Shark" --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --log-uri s3://bucket-name/log-directory-path --name "Spark 1.0 Test Cluster" --use-default-roles
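The create-cluster call prints a JSON blob containing the new cluster’s id, which the later ssh and terminate commands need. A small sketch of pulling it out with sed; the sample string below stands in for real output, which in practice you would capture with `output=$(aws emr create-cluster ...)`:

```shell
# Sample of the JSON that create-cluster prints (stand-in for real output)
output='{ "ClusterId": "j-1ABCDEFGHIJKL" }'

# Extract the ClusterId value for reuse in later aws emr commands
cluster_id=$(echo "$output" | sed -n 's/.*"ClusterId": *"\([^"]*\)".*/\1/p')
echo "$cluster_id"
```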

Step 4) The setup takes anywhere from 10 to 20 minutes depending on the installed packages. You will see the cluster’s job flow id in the console output. You can log in to the master node of the Spark cluster using the following CLI command:

$ aws emr ssh --cluster-id value --key-pair-file filename

OR

$ ssh -i key-pair-file hadoop@master-hostname

Step 5) After you log in, you will see soft links in the /home/hadoop directory to many of the packages you need. Try:

$ ./hive

OR

$ ./spark/bin/spark-shell

You are all set. To terminate the cluster, use the following command. Let us know in the comments if you found this information helpful, and share your experiences.

$ aws emr terminate-clusters --cluster-ids cluster-id

August 13, 2014/by vijay
Tags: apache spark, aws
