Brian ONeill's Random Thoughts: July 2012

Monday, July 23, 2012

Polyglotism. Losing your (rubyist) religion in Philly.

An interesting email came across the Philly Ruby list posing the question, "Is Ruby dead in Philadelphia?", noting that there appear to be more Ruby jobs in SF and NYC than in Philly. It provoked a good conversation. Here's my take... (echoing a bit of Andrew Libby's sentiment)

Universally, people need their problems solved. The way they solve their problems depends on their priorities. Small budgets and tight timeframes prioritize time-to-market above all else. IMHO, RoR typically expedites time-to-market. Since startups often find themselves in this situation, they often pick up RoR because it aligns with their priorities. It follows that since NYC and SF both have stronger startup communities than Philadelphia, there is greater demand for Ruby in those areas.

Outside of that, however, I'd suggest that Ruby is alive and well in Philadelphia, but maybe not as visible. Its use may not manifest itself in job postings. More and more often, solutions comprise an assortment of technologies. Even the larger enterprises in Philly (Comcast, Pharma, etc.) are beginning to take a polyglot approach to development. They are using the "best tool for the job" to solve their problems. The days of endless debates over which language is better are waning.

Why compare apples to oranges, when really -- "Sometimes you feel like a nut, sometimes you don't"?

Losing your religious affiliation to a language opens up a world of possibilities.

We've got no religion at Health Market Science. We use Ruby, Perl, Java, Clojure, etc. We use Ruby, not for its rapid development characteristics, but because it is dynamic. There are pieces of our system that need to be flexible. Some of our code even needs to change at runtime. Ruby fit the bill. But our services platform is written in Java, where we get warm fuzzies from type-safety, etc.

Now, what's that mean for job descriptions? Well, I'm not sure. We still tend to put specific technologies in the descriptions, but the expectation is otherwise. We look to hire good software engineers. I'd expect any good software engineer could be productive in any language in short order. That's not to say that knowing the idiosyncrasies of languages isn't useful; it's just not of paramount importance in today's multi-language, multi-technology environments.

So... if you are looking for a Ruby job in Philly, you may want to look for a Java job in an environment that requires Ruby. =)

Wednesday, July 18, 2012

Spring Data w/ Cassandra using JPA

We recently adopted Spring Data. Spring Data provides a nice pattern/API that you can layer on top of JPA to eliminate boilerplate code.

With that adoption, we started looking at the DAO layer we use against Cassandra for some of our operations. Some of the data we store in Cassandra is simple. It does *not* leverage the flexible nature of NoSQL. In other words, we know all the table names and column names ahead of time, and we don't anticipate them changing all that often.

We could have stored this data in an RDBMS, using Hibernate to access it, but standing up another persistence mechanism seemed like overkill. For simplicity's sake, we preferred storing this data in Cassandra. That said, we want the flexibility to move this to an RDBMS if we need to.
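The storage independence we are after can be sketched in plain Java. This is only an illustration, not Spring Data or JPA itself: `SimpleRepository`, `InMemoryRepository`, `Account`, and `AccountService` are hypothetical names, and the in-memory map merely stands in for whatever store sits underneath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical repository contract, loosely in the spirit of a Spring Data
// repository. It is NOT the real Spring Data API.
interface SimpleRepository<T, ID> {
    T save(ID id, T entity);
    Optional<T> findById(ID id);
    long count();
}

// In-memory stand-in for a real backend (Cassandra today, maybe an RDBMS later).
class InMemoryRepository<T, ID> implements SimpleRepository<T, ID> {
    private final Map<ID, T> store = new LinkedHashMap<>();

    public T save(ID id, T entity) {
        store.put(id, entity);
        return entity;
    }

    public Optional<T> findById(ID id) {
        return Optional.ofNullable(store.get(id));
    }

    public long count() {
        return store.size();
    }
}

// Hypothetical entity with a fixed, known-ahead-of-time shape.
class Account {
    final String name;

    Account(String name) {
        this.name = name;
    }
}

// Application code depends only on the interface, never on the backend.
class AccountService {
    private final SimpleRepository<Account, String> repo;

    AccountService(SimpleRepository<Account, String> repo) {
        this.repo = repo;
    }

    void register(String id, String name) {
        repo.save(id, new Account(name));
    }

    String nameOf(String id) {
        return repo.findById(id).map(a -> a.name).orElse("unknown");
    }
}

class RepositoryDemo {
    public static void main(String[] args) {
        AccountService service = new AccountService(new InMemoryRepository<>());
        service.register("a1", "Alice");
        System.out.println(service.nameOf("a1"));
        System.out.println(service.nameOf("missing"));
    }
}
```

Swapping the backend then means handing `AccountService` a different `SimpleRepository` implementation; the service code itself would not need to change.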

Enter JPA.

JPA would provide us a nice layer of abstraction away from the underlying storage mechanism. Wouldn't it be great if we could annotate the objects with JPA annotations, and persist them to Cassandra?

Enter Kundera.

Kundera is a JPA implementation that supports Cassandra (among other storage mechanisms). OK -- so JPA is great, and would get us what we want, but we had just adopted the use of Spring Data. Could we use both?

The answer is "sort of".

I forked off SpringSource's spring-data-cassandra:
https://github.com/boneill42/spring-data-cassandra

And I started hacking on it. I managed to get an implementation of the PagingAndSortingRepository for which I wrote unit tests that worked, but I was duplicating a lot of what should have come for free in the SimpleJpaRepository. When I tried to substitute my CassandraJpaRepository for the SimpleJpaRepository, I ran into some trouble w/ Kundera. Specifically, the MetaModel implementation appeared to be incomplete. MetaModelImpl was returning null for all managedTypes(). SimpleJpa wasn't too happy with this.

Instead of wrangling with Kundera, we punted. We can achieve enough of the value leveraging JPA directly.

Perhaps more importantly, there is still an impedance mismatch between JPA and NoSQL. In our case, it would have been nice to get at Cassandra through Spring Data using JPA for a few cases in our app, but for the vast majority of the application, a straight-up ORM layer, whereby we know the table, row, and column names ahead of time, is insufficient.

For those cases where we don't know the schema ahead of time, we're going to need to leverage the converters pattern in Spring Data. So, I started hacking on a proper Spring Data layer using Astyanax as the client. Follow along here:
https://github.com/boneill42/spring-data-cassandra

More to come on that....

Saturday, July 7, 2012

Cassandra-Triggers upgraded to support Cassandra 1.1.2

We just released version 1.0.1 of Cassandra Triggers, upgraded to support Cassandra 1.1.2.
https://github.com/hmsonline/cassandra-triggers

With Cassandra 1.1's Schema Management Renaissance, we felt comfortable with run-time schema creation. Now, Cassandra Triggers automatically creates the requisite column families for you. The system creates the Configuration column family and a pair of column families per host to maintain the event log.

This makes it easier than ever to get triggers set up. The GettingStarted page should be all you need to get up and running with the TestTrigger bundled into the release.
https://github.com/hmsonline/cassandra-triggers/wiki/GettingStarted

As always, let us know if you have any trouble.

Thanks to Andrew Swan for his help uncovering issues and motivating the upgrade.

Tuesday, July 3, 2012

NoSQL/Cassandra Terminology : Risks and Rewards

Recently, there's been growing support for changing the terminology we use to describe the data model of Cassandra. This has people somewhat divided, and although I've gone on record as supporting the decision, I too am a bit torn. I can appreciate both perspectives, and there are both risks and rewards associated with the switch.

The two controversial terms are Keyspace and Column Family. The terms roughly correlate to the more familiar relational equivalents: Schema and Table. I think that it is a fairly easy transition to change from Keyspace to Schema. Logically speaking, in relational databases, a schema is a collection of tables. Likewise, in Cassandra, a Keyspace is a collection of Column Families.

The sticking point is Column Family. Conceptually, everyone can visualize a table as an n x m matrix of data. Although you can mentally map a Column Family into that same logical construct, buyer beware.

The Risks:

A data model for a column-oriented database is typically *much* different from an analogous model designed for an RDBMS. To support "standard" relational queries with the capabilities a relational database provides on tables, you need to model your data differently. Assuming a column family has the same capabilities as a table will lead you to all sorts of headaches (e.g. consider range queries and indexing).

When data modeling, I don't relate column families to tables at all. For me, it's easier to think of column families as a map of maps. Then just remember that the top-level map can be distributed across a set of machines. Using that mental model, you are more likely to create a data model that is compatible with a column-oriented database. Think of column families as tables, and you may get yourself into trouble that will require significant refactoring.
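To make the map-of-maps picture concrete, here is a minimal Java sketch. It is only a mental model, not a Cassandra client API: the class `ColumnFamilyModel`, its methods, the sample sensor data, and the modulo-hash placement in `nodeFor` are all hypothetical and invented for illustration, and real partitioners are more involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model of a column family as a map of maps.
class ColumnFamilyModel {
    // Outer map: row key -> row. Inner map: column name -> value,
    // kept sorted by column name.
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Slice of columns within ONE row: from (inclusive) up to to (exclusive).
    SortedMap<String, String> slice(String rowKey, String from, String to) {
        SortedMap<String, String> row = rows.get(rowKey);
        if (row == null) {
            return new TreeMap<>();
        }
        return new TreeMap<>(row.subMap(from, to));
    }

    // Toy placement rule (an assumption for illustration only): the top-level
    // map is spread over nodeCount machines by hashing the row key.
    static int nodeFor(String rowKey, int nodeCount) {
        return Math.floorMod(rowKey.hashCode(), nodeCount);
    }
}

class ColumnFamilyModelDemo {
    public static void main(String[] args) {
        ColumnFamilyModel readings = new ColumnFamilyModel();
        readings.put("sensor-1", "2012-07-01T10:00", "71.2");
        readings.put("sensor-1", "2012-07-01T18:30", "69.8");
        readings.put("sensor-1", "2012-07-02T09:00", "70.4");
        readings.put("sensor-2", "2012-07-01T10:00", "65.0");

        // All of sensor-1's columns for July 1st.
        System.out.println(readings.slice("sensor-1", "2012-07-01", "2012-07-02"));
        // Which of three hypothetical nodes would hold each row.
        System.out.println(ColumnFamilyModel.nodeFor("sensor-1", 3));
        System.out.println(ColumnFamilyModel.nodeFor("sensor-2", 3));
    }
}
```

In this model, slicing a range of columns inside one row is cheap because the inner map is sorted, while a range query across row keys has no such ordering to lean on when keys are placed by hash. That is the kind of difference that gets lost when a column family is treated as a table.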

The Rewards:

With a strong movement towards polyglot persistence architectures, and tools that need to span the different persistence mechanisms, I can see a strong motivation to align terminology. (Consider ETL tools (e.g. Talend), design tools (e.g. Erwin), even SQL clients (e.g. good old Toad))

The popularity of Cassandra's CQL is further evidence that people want to interact with NoSQL databases using tried-and-true SQL (ironically). And maybe we should "give the people what they want", especially if it simultaneously eases the transition for newcomers.

The Big Picture:

Theologically, and in an ideal world, I agree with Jonathan's point:
"The point is that thinking in terms of the storage engine is difficult and unnecessary. You can represent that data relationally, which is the Right Thing to do both because people are familiar with that world and because it decouples model from representation, which lets us change the latter if necessary"

Pragmatically, I've found that it is often necessary to consider the storage engine, at least until that engine has all the features and functions that allow me to ignore it.

Realistically, any terminology change is going to take a long time. The client APIs (Hector, Astyanax, etc.) probably aren't changing anytime soon, and the documentation still reflects the "legacy" terminology. It's only on my radar because we decided to evolve the terminology in the RefCard that we just released.

Only time will tell what will come of "The Great Cassandra Terminology Debates of 2012", but it's guaranteed that there will be people on both sides of the fence -- as I find myself occasionally straddling it. =)