Monday, April 6, 2015

Holy momentum Batman! Spark and Cassandra (circa 2015) w/ Datastax Connector and Java

Over a year ago, I did a post on Spark and Cassandra.  At the time, Calliope was your best best.  Since then, Spark has exploded in popularity.

Check out this Google Trends chart.  That's quite a hockey stick for Spark.
Also notice their github project, which has almost 500 contributors and 3000 forks!!

Datastax is riding that curve.  And just in the time since my last post, Datastax developed and released their own spark-cassandra connector.  

I am putting together some materials for Philly Tech Week, where I'll be presenting on Cassandra development, so I thought I would give it a spin to see how things have evolved.  Here is a follow-along preview of the Spark piece of what I'll be presenting.

First, get Spark running:

Download Spark 1.2.1 (this is the version that works with the latest Datastax connector)

Next, unpack it and build it.  (mvn install)

And that is the first thing I noticed...

[INFO] Spark Project Parent POM
[INFO] Spark Project Networking
[INFO] Spark Project Shuffle Streaming Service
[INFO] Spark Project Core
[INFO] Spark Project Bagel
[INFO] Spark Project GraphX
[INFO] Spark Project Streaming
[INFO] Spark Project Catalyst
[INFO] Spark Project SQL
[INFO] Spark Project ML Library
[INFO] Spark Project Tools
[INFO] Spark Project Hive
[INFO] Spark Project REPL
[INFO] Spark Project Assembly
[INFO] Spark Project External Twitter
[INFO] Spark Project External Flume Sink
[INFO] Spark Project External Flume
[INFO] Spark Project External MQTT
[INFO] Spark Project External ZeroMQ
[INFO] Spark Project External Kafka
[INFO] Spark Project Examples

Look at all these new cool toys?  GraphX?  SQL? Kafka? In my last post, I was using Spark 0.8.1.  I took a look at the 0.8 branch on github and sure enough, all of this stuff was built just in the last year! It is crazy what momentum can do. A

After you've built spark, go into the conf directory and copy the template environment file.


cp spark-env.sh.template spark-env.sh

Then, edit that file and add a line to configure the master IP/bind interface:

SPARK_MASTER_IP=127.0.0.1

(If you don't set the IP, the master may bind to the wrong interface, and your application won't be able to connect, which is what happened to me initially)

Next, launch the master:  (It gives you the path to the logs, which I recommend tailing)

sbin/start-master.sh 

In the logs, you should see:

15/04/06 12:46:39 INFO Master: Starting Spark master at spark://127.0.0.1:7077
15/04/06 12:46:39 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/04/06 12:46:39 INFO MasterWebUI: Started MasterWebUI at http://localhost:8080

Go hit that WebUI at http://localhost:8080.

Second,  get yourself some workers:

Spark has its own concept of workers.  To start one, run the following command:

bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077

After a few seconds, you should see the following:
15/04/06 13:54:23 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/04/06 13:54:23 INFO WorkerWebUI: Started WorkerWebUI at http://localhost:8081
15/04/06 13:54:23 INFO Worker: Connecting to master spark://127.0.0.1:7077...
15/04/06 13:54:25 INFO Worker: Successfully registered with master spark://127.0.0.1:7077

You can refresh the MasterWebUI and you should see the worker.

Third, sling some code:

This go around, I wanted to use Java instead of Scala.  (Sorry, but I'm still not on the Scala bandwagon, it feels like Java 8 is giving me what I needed with respect to functions and lambdas)

I found this Datastax post:

Which lead me to this code:

Kudos to Jacek, but that gist is directed at an old version of Spark.  It also rolled everything into a single class,  (which means it doesn't help you in a real-world situation where you have lots of classes and dependencies) In the end, I decided to update my quick start project so everyone can get up and running quickly.

Go clone this:
https://github.com/boneill42/spark-on-cassandra-quickstart

Build with maven:

mvn clean install

Then, have a look at the run.sh.  This actually submits the job to the Spark cluster (single node in our case).  The contents of that script are as follows:

spark-submit --class com.github.boneill42.JavaDemo --master spark://127.0.0.1:7077 target/spark-on-cassandra-0.0.1-SNAPSHOT-jar-with-dependencies.jar spark://127.0.0.1:7077 127.0.0.1

The --class parameter tells Spark which class to execute.  The --master parameter is the url that you see at the top of MasterWebUI, and tells spark to which master it should submit the job.  The jar file is the result of the build, which is a fat jar that includes the job (courtesy of the maven assembly plugin).  The last two parameters are the args for the program.  Spark passes those into the JavaDemo class.  After you run this, you should see the job process...

15/04/06 17:16:57 INFO DAGScheduler: Stage 8 (toArray at JavaDemo.java:185) finished in 0.028 s
15/04/06 17:16:57 INFO DAGScheduler: Job 3 finished: toArray at JavaDemo.java:185, took 0.125340 s
(Product{id=4, name='Product A1', parents=[0, 1]},Optional.of(Summary{product=4, summary=505.178}))
...
(Product{id=7, name='Product B2', parents=[0, 2]},Optional.of(Summary{product=7, summary=494.177}))
(Product{id=5, name='Product A2', parents=[0, 1]},Optional.of(Summary{product=5, summary=500.635}))
(Product{id=2, name='Product B', parents=[0]},Optional.of(Summary{product=2, summary=994.037}))

I'll go into the details of the example in my next post.
Or you can just come to my presentation at Philly Tech Week. =)

No comments: