Connecting Hive and Spark on AWS in five easy steps

Development & Engineering

Hive and Spark are great tools for storing, processing, and mining big data. In many organizations they are deployed individually, and while each is useful on its own, the combination is even more powerful. Here is the missing HOWTO on connecting the two and turbocharging your big data adventures.

Step 1 : Set up Spark

If you have not set up Spark 1.x on AWS yet, see this blog post for a how-to. Alternatively, you can download and compile a standalone version directly from the main Spark site. Don't forget to compile Spark with the -Phive option. On a standalone machine, the following command compiles Spark with Hive support:

$ cd $SPARK_HOME; ./sbt/sbt -Phive assembly
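Once the build finishes, it is worth confirming that the Hive classes actually landed in the assembly jar before moving on. A minimal sketch, assuming a Spark 1.1 / Scala 2.10 layout (the jar path varies with your Scala and Hadoop versions):

```shell
# Look for Hive support inside the assembly jar.
# The jar path below is an assumption; adjust it to your build output.
check_hive_in_jar() {
  unzip -l "$1" 2>/dev/null | grep -q 'org/apache/spark/sql/hive/' \
    && echo "Hive support compiled in" || echo "Hive classes missing"
}
check_hive_in_jar assembly/target/scala-2.10/spark-assembly-1.1.0-hadoop2.4.0.jar
```

If the check reports the classes are missing, re-run the assembly build and make sure the -Phive flag was actually passed.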

Step 2 : You presumably have a Hive cluster set up using the EMR, Cloudera, or Hortonworks distribution. Copy hive-site.xml from your Hive cluster to your $SPARK_HOME/conf/ directory. Edit the XML file and add these properties:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://MYSQL_HOST:3306/hive_{version}</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>XXXXXXXX</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>XXXXXXXX</value>
  <description>password to use against metastore database</description>
</property>

Step 3 : Download the MySQL JDBC connector and add it to the Spark classpath. The easiest way is to edit the bin/compute-classpath.sh script and add this line:

CLASSPATH="$CLASSPATH:$PATH_TO_mysql-connector-java-5.1.10.jar"
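An alternative that survives Spark upgrades is to set the classpath properties in conf/spark-defaults.conf instead of editing the script. A sketch, where the jar path is an assumption to adjust for your machines:

```
spark.driver.extraClassPath    /path/to/mysql-connector-java-5.1.10.jar
spark.executor.extraClassPath  /path/to/mysql-connector-java-5.1.10.jar
```

The executor entry matters on a real cluster: each worker needs the driver jar on its own classpath, not just the machine running the shell.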

Step 4 : Check your AWS security groups to make sure the Spark instances and the Hive cluster can talk to each other.
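A quick way to confirm the rules are right is to probe the metastore's MySQL port from a Spark host. A sketch, where MYSQL_HOST and the 2-second timeout are assumptions to adjust for your setup:

```shell
# Probe a TCP port using bash's /dev/tcp pseudo-device.
# Prints "open" if the connection succeeds, "closed" otherwise.
check_port() {
  timeout 2 bash -c "echo > /dev/tcp/$1/$2" 2>/dev/null \
    && echo "open: $1:$2" || echo "closed: $1:$2"
}
check_port "${MYSQL_HOST:-127.0.0.1}" 3306
```

If the port shows closed, add an inbound rule on the Hive cluster's security group allowing TCP 3306 from the Spark instances' security group.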

Step 5 : Run the Spark shell to check that you can see your Hive databases and tables.

$ cd $SPARK_HOME; ./bin/spark-shell
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
scala> sqlContext.sql("show databases").collect().foreach(println);

Voilà, you have connected two of the most powerful tools for data analysis and mining. Let us know about your adventures with Spark, and if you are interested in going on an adventure with us, we are hiring.

October 27, 2014/by Vijay Chittoor
Tags: apache spark, aws, hive

