Connecting Hive and Spark on AWS in five easy steps


Hive and Spark are great tools for storing, processing, and mining big data. In many organizations they are deployed individually, and while each is useful on its own, the combination of the two is even more powerful. Here is the missing HOWTO on connecting them and turbocharging your big data adventures.

Step 1 : Set up Spark

If you have not set up Spark 1.x on AWS yet, see this blog post for a how-to. Alternatively, you can download and compile a standalone version directly from the main Spark site. Don't forget to compile Spark with the -Phive option. On a standalone machine, the following command compiles them together.

$ cd $SPARK_HOME; ./sbt/sbt -Phive assembly
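If you prefer Maven over sbt, an equivalent build for Spark 1.x looks roughly like the following sketch; treat the profile set as an assumption and match it to your Hadoop version:

```shell
# Build Spark 1.x with Hive support via Maven (profiles are an assumption;
# add a -Phadoop-* profile matching your cluster's Hadoop version)
cd $SPARK_HOME
mvn -Phive -Phive-thriftserver -DskipTests clean package
```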

Step 2 : You presumably have a Hive cluster set up using the EMR, Cloudera, or Hortonworks distributions. Copy hive-site.xml from your Hive cluster to your $SPARK_HOME/conf/ directory. Edit the XML file and add these properties:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://&lt;metastore-host&gt;:3306/&lt;metastore-db&gt;</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value> XXXXXXXX </value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value> XXXXXXXX </value>
  <description>password to use against metastore database</description>
</property>
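The copy in Step 2 can be done with scp. The remote user and path below are assumptions, since distributions differ in where they keep hive-site.xml:

```shell
# Copy hive-site.xml from the Hive master node to the local Spark conf dir
# (hostname, user, and remote path are assumptions; adjust for your cluster)
scp hadoop@hive-master:/etc/hive/conf/hive-site.xml $SPARK_HOME/conf/
```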

Step 3 : Download the MySQL JDBC connector and add it to the Spark classpath. The easiest way is to edit the
bin/ script and add a line appending the connector jar to the classpath.


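The added line could look like the following sketch; the jar location and version are assumptions, so point it at wherever you saved the connector:

```shell
# Hypothetical location of the downloaded connector jar (an assumption)
MYSQL_JAR="/opt/jars/mysql-connector-java-5.1.34-bin.jar"
# Append it to the classpath the script builds up
CLASSPATH="$CLASSPATH:$MYSQL_JAR"
```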
Step 4 : Check AWS security groups to make sure Spark instances and Hive Cluster can talk to each other.
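You can inspect a group's rules from the AWS CLI; the group ID below is a placeholder:

```shell
# Show the rules of the security group attached to the Hive cluster
# (sg-xxxxxxxx is a placeholder). Spark instances typically need to reach
# the Hive metastore service (9083 by default) and/or MySQL (3306).
aws ec2 describe-security-groups --group-ids sg-xxxxxxxx
```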

Step 5 : Run Spark Shell to check if you are able to see Hive databases and tables.

$ cd $SPARK_HOME; ./bin/spark-shell
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> sqlContext.sql("show databases").collect().foreach(println)
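If the databases list correctly, you can go one step further and query a table; the database and table names below are hypothetical:

```scala
// Query a Hive table through the HiveContext (mydb.mytable is hypothetical)
scala> sqlContext.sql("SELECT * FROM mydb.mytable LIMIT 10").collect().foreach(println)
```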

Voilà, you have connected two of the most powerful tools for data analysis and mining. Let us know about your adventures with Spark, and if you are interested in going on an adventure with us, we are hiring.