Friday, January 23, 2015

Hadoop for Cassandra: CqlInputFormat != CqlPagingInputFormat != ColumnFamilyInputFormat


We haven't had cause to write a Hadoop job against Cassandra since the old days of Thrift (that is, since we introduced Elasticsearch into our system). But this week, we found ourselves needing to get some metrics on data stored in the actual C* tables.

I went to the documentation and found this page:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/configuration/configHadoop.html

That page references:
"CQL partition input format: ColumnFamilyInputFormat class"

I was familiar with the ColumnFamilyInputFormat class from the old Thrift days, and I was pretty sure that a newer, CQL-based InputFormat was available. I headed over to the code, switched to the cassandra-2.0 branch, and found this:
https://github.com/apache/cassandra/blob/cassandra-2.0/examples/hadoop_cql3_word_count/src/WordCount.java

Notice that WordCount.java imports:
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;

I went happily along my way and implemented the MapReduce code using this InputFormat, but the compiler kept complaining that CqlPagingInputFormat could not be found. After some investigation, it looks like that class was removed from cassandra-all sometime between 2.0.3 and 2.0.11. The jar listings below show the difference: 2.0.11 ships only CqlInputFormat, while 2.0.3 ships only CqlPagingInputFormat.

➜  tusk  unzip -l /Users/bone/.m2/repository/org/apache/cassandra/cassandra-all/2.0.11/cassandra-all-2.0.11.jar | grep Cql | grep Input
     2882  10-21-14 16:31   org/apache/cassandra/hadoop/cql3/CqlInputFormat.class
➜  tusk  unzip -l /Users/bone/.m2/repository/org/apache/cassandra/cassandra-all/2.0.3/cassandra-all-2.0.3.jar | grep Cql | grep Input
     1359  11-22-13 08:56   org/apache/cassandra/hadoop/cql3/CqlPagingInputFormat$1.class
     2875  11-22-13 08:56   org/apache/cassandra/hadoop/cql3/CqlPagingInputFormat.class
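
If you would rather run the same check from Java (say, in a build sanity test), here is a minimal sketch that roughly mirrors the unzip | grep pipeline above. It uses only the JDK's java.util.zip classes. JarClassFinder and findEntries are hypothetical names made up for this example; they are not part of Cassandra or Hadoop.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; not part of Cassandra or Hadoop.
class JarClassFinder {

    // Returns the names of all entries in the jar whose path contains every
    // one of the given fragments (case-sensitive), similar in spirit to
    // `unzip -l <jar> | grep Cql | grep Input`.
    static List<String> findEntries(Path jar, String... fragments) throws IOException {
        List<String> matches = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                boolean containsAll = true;
                for (String fragment : fragments) {
                    if (!name.contains(fragment)) {
                        containsAll = false;
                        break;
                    }
                }
                if (containsAll) {
                    matches.add(name);
                }
            }
        }
        return matches;
    }
}
```

Calling JarClassFinder.findEntries(pathToJar, "Cql", "Input") against each cassandra-all jar in your local repository should list the same class files as the terminal output above, assuming the jars match the ones shown.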

It looks like the crew is already addressing it: https://github.com/apache/cassandra/commit/e550ea60212e933f3849a11717ba4ef916fc4aa3

Hopefully no one else runs into this. ;)