Monday, October 28, 2013

Philly Cassandra User's Group : Real-time Analytics w/ Acunu


Tomorrow night, we are set to host another Cassandra meetup.  This time, we will focus on real-time analytics on Cassandra (using Acunu).

See the Philly meetup page for more info:
http://www.meetup.com/Philadelphia-Cassandra-Users/

Friday, October 18, 2013

Stoked to be named to the DataStax Cassandra Rebel Elite team!


Thanks to the Apache Cassandra community and to the crew at DataStax.


I challenge anyone to come up with a more badass sounding name for such a crew of tech-heads.

I feel compelled to go out and buy this guy:
http://www.amazon.com/Star-Wars-Miniatures-Elite-Trooper/dp/B005N569MA






FOR SALE: Content Management System built on Cassandra

Last spring, I attended the Philly Emerging Technology Event (ETE).   It is always a fantastic event, and this year was no different.  One of the new additions this year was an Enterprise Hackathon.  A few large enterprises brought real business problems to the table and offered prizes to help solve them.

The one that caught my eye was the problem of content management for Teva, a large pharmaceutical company.  In looking at the problem of content management, it occurred to me that enterprises these days are dealing as much with externally generated content as they are with internally generated content.

With that in mind, I thought it would be fun to build an application that added the dimension of externally generated content to a content management system.  I took a few weeks and built Skookle. (a play on "Schuylkill", which is the main river through Philly)

Skookle is a new perspective on CMS.  It has an HTML5 user interface that allows users to drag and drop files onto their browser (like dropbox, but through the browser).  The file is persisted, with versioning, into Cassandra using Astyanax chunked object storage, and the description is indexed using Elastic Search.  Skookle is also integrated with Twitter.  It watches for mentions of the company, which then show up directly on the user interface.
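For the technically curious, the Elastic Search piece is only a few lines.  Here is a minimal sketch of indexing a file's description with the Elastic Search Java API (the index/type names and fields are illustrative, not the actual Skookle schema):

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class DescriptionIndexer {
    // Connects to a local Elastic Search node and indexes a file's description
    // so it becomes searchable alongside the content stored in Cassandra.
    public static void indexDescription(String fileId, String description) throws Exception {
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        client.prepareIndex("skookle", "file", fileId)
              .setSource(jsonBuilder()
                      .startObject()
                      .field("description", description)
                      .endObject())
              .execute()
              .actionGet();
        client.close();
    }
}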

For a demo, check out this quick video:
http://bit.ly/18rTMOe

Here is the writeup I did that accompanied the submission:
http://skookle.com/skookle.pdf

That contains an estimate for the work that remains to make this a product.

I definitely think there is a lot of potential here for anyone that can re-envision content management to incorporate externally generated content.  Traditional solutions tend to be internally focused (e.g. SharePoint) or externally focused (e.g. sentiment analysis).  There is room for a tool that bridges that gap, allowing a user (e.g. a brand manager) to see what people (including competitors) are saying, echo/reinforce the good content, and counter the bad... in real-time.

Anyway...
If anyone wants to pick up where I left off, drop me a line.  Along with the vision, I'm selling the code base.  =)

I think it's a great little seed of a project/idea.
Email me: bone at alumni dot brown dot edu


Thursday, October 17, 2013

Crawling the Web with Cassandra and Nutch


So, you want to harvest a massive amount of data from the internet?  What better storage mechanism than Cassandra?  This is easy to do with Nutch.

Often people use HBase behind Nutch.  This works, but it may not be an ideal solution if you are (or want to be) a Cassandra shop.   Fortunately, Nutch 2+ uses the Gora abstraction layer to access its data storage mechanism.  Gora supports Cassandra.  Thus, with a few tweaks to the configuration, you can use Nutch to harvest content directly into Cassandra.

We'll start with Nutch 2.1...  I like to go directly from source:

$ git clone https://github.com/apache/nutch.git -b 2.1
...
$ ant


After the build, you will have a nutch/runtime/local directory, which contains the binaries for execution.  Now let's configure Nutch for Cassandra.

First we need to add an agent to Nutch by adding the following xml element to nutch/conf/nutch-site.xml:
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

Next we need to tell Nutch to use Gora Cassandra as its persistence mechanism. For that, we add the following element to nutch/conf/nutch-site.xml:
<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.cassandra.store.CassandraStore</value>
 <description>Default class for storing data</description>
</property>

Next, we need to tell Gora about Cassandra.  Edit the nutch/conf/gora.properties file.  Comment out the SQL entries, and uncomment the following line:
gora.cassandrastore.servers=localhost:9160

Additionally, we need to add a dependency for gora-cassandra.  Edit the ivy/ivy.xml file and uncomment the following line:
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2" conf="*->default" />

Finally, we want to re-generate the runtime with the new configuration and the additional dependency.  Do this with the following ant command:
ant runtime

Now we are ready to run!

Create a directory called "urls", with a file named seed.txt that contains the following line:
http://nutch.apache.org/

Next, update the regular expression url in conf/regex-urlfilter.txt to:
+^http://([a-z0-9]*\.)*nutch.apache.org/

Now, crawl!
bin/nutch crawl urls -dir crawl -depth 3 -topN 5

That will harvest webpages to Cassandra!!

Let's go look at the data model for a second...
You will notice that a new keyspace was created: webpage.  That keyspace contains three tables: f, p, and sc.

[cqlsh 2.3.0 | Cassandra 1.2.1 | CQL spec 3.0.0 | Thrift protocol 19.35.0]
Use HELP for help.
cqlsh> describe keyspaces;
system  webpage  druid  system_auth  system_traces
cqlsh> use webpage;
cqlsh:webpage> describe tables;
f  p  sc


Each of these tables is a pure key-value store.  To understand what is in each of them, take a look at the nutch/conf/gora-cassandra-mapping.xml file.  I've included a snippet below:
        <field name="baseUrl" family="f" qualifier="bas"/>
        <field name="status" family="f" qualifier="st"/>
        <field name="prevFetchTime" family="f" qualifier="pts"/>
        <field name="fetchTime" family="f" qualifier="ts"/>
        <field name="fetchInterval" family="f" qualifier="fi"/>
        <field name="retriesSinceFetch" family="f" qualifier="rsf"/>

From this mapping file, you can see what it puts in the table, but unfortunately the schema isn't really conducive to exploration from the CQL prompt.  (I think there is room for improvement here)  It would be nice if there were a CQL-friendly schema in place, but that may be difficult to achieve through Gora.  Alas, that is probably the price of abstraction.

So, the easiest thing is to use the nutch tooling to retrieve the data.  You can extract data with the following command:
runtime/local/bin/nutch readdb -dump data -content

When that completes, go into the data directory and you will see the output of the Hadoop job that was used to extract the data.  We can then use this for analysis.

I really wish Nutch used a better schema for C*.   It would be fantastic if that data was immediately usable from within C*.  If someone makes that enhancement, please let me know!

Thursday, September 5, 2013

The Economics of Open Source : Seek Free Code. Find Innovation.

In the past, many technologists had to evangelize open source within the enterprise. We had to justify its use, reassure executives about security, ability to support, etc.  Recently,  I believe those tables have turned.  More and more, businesses are asking the question, "Isn't there an open-source solution for this?".

Not only has open-source been accepted, but it's become the preferred solution.  We are certainly in this position.  Thus far, our technology stack is built entirely of open-source components.

I see the following motivators driving this direction:

Control Your Own Destiny

With open-source, you are in control.  You are free to fork the code, extend it, destroy it, whatever your pleasure.  If you need a specific feature, you can develop it yourself.  Your roadmaps and timelines are your own.

If there is a problem in production, you can diagnose and resolve your own issues.  In many cases, enterprises are larger than the emerging technology vendors.  The enterprises can bring more resources to bear than the technology vendor.  What they may lack in expertise, they can make up for in resources.  (and bringing in consultants if the need arises)

Innovation Enabler

Honestly, this one is always under-valued.  Open-source technologies enable innovation by attracting talent and seeding ideas.  Open-source technologies provide a treasure trove of ideas for tinkerers/inventors.  Developers can look under the covers, mine ideas, and come up with new ways to combine and extend the technology.

It is often impossible to predict innovation, but enabling the right talent with the right technology certainly improves your chances.  Talent often differentiates one company from another.  It is the crazy idea from the morning shower and the ability to execute on it that keeps technology companies innovating.  IMHO, open-source technology stacks attract the best talent.  Open-source communities are meritocracies that allow the best of the best to succeed, which results in commensurate accolades.  This attracts the best of the best, and that's who you want working/innovating for you.

Support Costs determined by Free Market

Open-source is the anti-monopoly.  With proprietary software vendors, you need to pay their licensing and support costs.  If they are entrenched, the vendor can theoretically charge a large percentage of whatever the replacement cost would be, and the enterprise has no other option but to comply.  That becomes a tough pill to swallow.

With open-source, support contracts are the new license fee.  But unlike license fees, an enterprise is free to shop around, selecting the best value for the dollar.  Companies providing support must compete on customer service and price point.

The Decision

In the end, an enterprise needs to make a decision.  Inevitably, there will be scenarios where time-to-market, project maturity, and/or core-competencies influence the decision to build, buy or leverage open-source, and those may prevent an enterprise from selecting an open-source solution. But let's take a quick look at the economics of such a decision, considering the motivations outlined above.

Let's assume there is an Open-Source project, OpenZee, that is deficient in features and functionality compared to a proprietary product, ClosedOrb.  Let's assume there is an annual license fee of $25K for ClosedOrb.  Over five years, that would cost an enterprise $125K.  Because time-to-market is a concern, for the first year it may make sense to go with ClosedOrb and shell out the $25K, but strategically, at the same time, it might make sense to dedicate a resource to OpenZee to close the feature/function gap and replace ClosedOrb in the stack.   And although it might not be feasible to close the gap with a single enterprise's cost-savings (e.g. $100K), if we recognize that other enterprises are in the same situation, it is likely that a collective effort *can* close the gap.  If there are a couple dozen companies all making the same trade-off, that would result in $2.4M annually of resources to apply to the project.  That rivals many small-company R&D budgets.

In this scenario, OpenZee did not yet have the minimal set of functionality to produce a viable product.   The project needed a set of early adopting companies that were willing to invest in the strategic vision.  But once the project has the minimal amount of functionality (the 80% in the 80/20 rule), the dynamics change quite a bit.   While the company behind ClosedOrb continues to build out the 20%, more and more companies begin using OpenZee because the criticality of the functional differences begins to diminish.  OpenZee eventually outpaces ClosedOrb and takes over the world. (okay -- that's hyperbole, but you get the point)

The Conclusion

Now, I want to be clear.  Although the scenario above focused on the financial motivation to invest in/contribute to an open-source project, the motivations outlined previously were not based on any financial incentive.  I don't believe enterprises are necessarily looking for "free software" as in "free beer".  The motivations above in fact are derived solely from the code being "free" as in "free speech".  (More on this topic here)

In many cases, enterprises will happily pay for support for software because that is a service that they would otherwise have to fill on their own.  In fact, at larger enterprises, being able to pay for commercial support is actually a requirement for technology adoption!  But as the world shifts from a "why open source" to a "why not open source", more companies are going to take a strategic view and demand open-source solutions to "free" their development/innovation.  They are also going to be more willing to invest in those projects that need it.

This bodes well for projects like Druid, which may just take over the world of Big Data Analytics. (yes, more hyperbole ... maybe... just to emphasize the point =)

I think companies like Datastax have played their cards right.   They understand that with community comes momentum and with momentum comes revenue.  Sometimes that revenue comes directly from licensing proprietary extensions, sometimes indirectly via support, services and education.  The color of the money is the same.

My advice:
Seek free code.  Find innovation.







Tuesday, July 30, 2013

Philly Cassandra User Group to host Matt Pfeil, Datastax Co-Founder


We're excited to host Matt Pfeil, co-founder of Datastax at the Philly Cassandra User Group in August.  Matt is going to talk about Cassandra in Mission Critical Applications.

If you are interested in attending, please RSVP:
http://www.meetup.com/Philadelphia-Cassandra-Users/events/132025492/

Wednesday, July 24, 2013

Broken Glass : Diagnosing Production Cassandra Issues


I just passed my second anniversary at Health Market Science (HMS), and we've been working with Cassandra for almost the entirety of my career here.   In that time, we have had remarkably few problems with it.  Like few other technologies I've worked with, Cassandra "just works".

But, as with *every* technology I've ever worked with, you eventually have some sort of issue, even if it is not with the technology itself, but rather your use of the technology.  And that was the situation here.  (gun? check. foot? check. aim... fire. =)

Here is our tale of when bullet met foot...

Our dependency on Cassandra has increased exponentially since it's been in production.  We've been adding product lines and clients to those product lines at an ever-increasing rate.  And with that success, we've had to evolve the architecture over time, but some parts of the system have remained untouched because they've been cruising along.  Over the last couple weeks, one of those parts reared its ugly head.

We've been scaling the nodes in our cluster vertically to accommodate demand.  Our cluster is entirely virtual, so this was always the path of least resistance. Need more memory?  No problem. Need more CPU? No problem.  Need space/disk?  We've got tons in our SAN.  You do that a few times and with increasing frequency, and you can start to see a trend that doesn't end well. =)

First, as we increased our memory footprint, we weren't paying close enough attention to the tuning parameters in: http://www.datastax.com/docs/1.0/operations/tuning

We had our heap size set too large given our system memory, and that started causing hiccups in Cassandra.  Once we brought that back in-line, we limped along for a few more weeks.

Then things came to a head last week.  We saw the cliff at the end of the road.  We found a "bug" in one of our client applications that was inadvertently introducing an artificial throttle.  Fantastic!  We make the code change (2 lines of code), do some testing, and release it to production.  Bam, we increased our concurrency by orders of magnitude. Uh oh, what's that?  Cassandra is choking?

Cassandra started to garbage collect rapidly.  We quickly consulted the google Gods and went to the Oracle (Matrix reference, NOT the DB manufacturer =) for advice:
http://www.slideshare.net/aaronmorton/cassandra-sf-2013-in-case-of-emergency-break-glass

If you have not read through that presentation, do so before it's too late.  For performance tuning and C* diagnosis, there is no kung fu stronger than that of Aaron Morton (@aaronmorton).

We started looking at tpstats and cfstats.  All seemed relatively okay.   What could be expanding our footprint?

Well, we have a boat-load of column families.  We've evolved the architecture and our data model, and in the newer applications we've taken a virtual-keyspaces approach, consolidating data into a single large column family using composite row keys.  But alas, the legacy data model remains in production.  Many of those column families see very little traffic, but Cassandra still reserves some memory for them.  That might have been the culprit, but those column families had been there since the beginning of time. We had to look deeper.

We had heard about Bloom Filter bloat, and we thought that might be the issue.  But looking at the on-disk size of the filters (ls -al **/*Filter.db), everything seemed hunky-dory and could fit well within our monstrous heap.  (in 1.2 these have been moved off heap)

Oh wait...
Way back when we had a brilliant idea to introduce some server-side AOP code to act as triggers. Initially, we used them to keep indexes in sync: wide-rows, and even at one point we kept Elastic Search up-to-date with server-side triggers.  This kept the client-side code simple-stupid.  The apps connecting to C* didn't need to know about any of our indexing mechanisms.

Eventually, we figured out that it was better to control that data flow in the app-layer (via Storm), but we still had AOP code server-side to manage the wide-rows.  And despite the fact that I've recently been speaking out against our previous approach, that code was still in there.  Could that be the root cause?  Our wide-rows were certainly getting wider... (into the millions of columns at this point)

One of our crew (kudos to sandrews) found JMeter Cassandra and started hammering away in a non-production environment.  We attached a profiler, which exposed our problem -- the AOP inside.  Fortunately, we had already been working on a patch that removed the AOP from C*. The patch moved the AOP code to the client-side (point-cutting Hector instead of Thrift/Cassandra). We applied the patch and tested away.
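For anyone curious what "point-cutting Hector" means in practice, here is a rough sketch (not our actual patch; the index maintenance is stubbed out, and the pointcut is just an AspectJ-style illustration around Hector's Mutator):

import me.prettyprint.hector.api.mutation.Mutator;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

// Client-side "trigger": intercept Hector mutation batches in the app tier,
// add the wide-row index mutations, then let the batch proceed to Cassandra.
// This keeps the trigger logic out of Cassandra's JVM entirely.
@Aspect
public class WideRowIndexAspect {

    @Around("execution(* me.prettyprint.hector.api.mutation.Mutator+.execute(..))")
    public Object maintainIndexes(ProceedingJoinPoint pjp) throws Throwable {
        Mutator<?> mutator = (Mutator<?>) pjp.getTarget();
        addWideRowIndexMutations(mutator);   // hypothetical helper
        return pjp.proceed();
    }

    private void addWideRowIndexMutations(Mutator<?> mutator) {
        // derive index rows from the pending mutations and add them to the batch
    }
}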

Voila, C* was humming again, and we all lived happily ever after.

A big thanks to +Aaron Morton again for the help.  You are a rock star.
And to the crew at HMS, it's an honor to work with such a talented, passionate team.
Good on ya.







Thursday, June 20, 2013

Evolutionary Architecture : stop divining your architecture, instead evolve it.


With all the activity in the Big Data arena, there are tons of great technical presentations out there.  (kudos to Acunu, Druid, Titan, Priam, Spark, etc.)  With all of the bright, shiny and new toys out there, it's easy to overlook the system-level innovations around processes and organizational dynamics enabling all that bright, shiny, and new.  The system-level innovations enable us to stitch the toys together into powerful, yet tenable systems that can evolve over time.

Two presentations in particular have stuck out in my mind:

These presentations resonated with me, because  we are beginning to roll our Big Data platform out across all of our product lines, and across the organization.  The platform is now impacting/enabling all aspects of the business, including: Analytics, Information Management, Infrastructure(IT), and Development Operations (DevOps).  

That roll-out has made it crystal clear that decisions made by software development have a ripple effect throughout the organization.  As that ripple flows through the organization, and we apply the platform to an increasing number of product lines, we are constantly finding new ways to apply the technologies.  Often however, that means changes to the system. (e.g. new data models, new interfaces, components, etc.)  One key to innovation is creating the processes and organizational dynamics that support those changes, some of which may be dramatic.  

And when you partner that with new and shiny technologies that also enable innovation...
See Billy Bosworth's quote from USA Today
"There is a new breed of software developer, and he could not care less about relational databases," Bosworth says. "This technology frees a developer to think differently."
You get a powerful combination that allows developers to integrate those new technologies and rapidly evolve their use of those technologies.  (boo yah!)

What's this mean?  Shift architectural focus to the things that will result in an Evolutionary Architecture, one capable of evolving with the business.

It means don't spend a lot of time trying to divine the perfect architecture upfront.  Chances are you'll be wrong.  BUT -  
  • Spend time creating the proper abstraction layers that will allow you to change your decisions down the line, BUT do this at a systems level! 
    • (IMHO -- you don't need to wrap every java module with interfaces and impls, abstractions you'll never use) 
  • Spend time enabling automation & continuous delivery
  • Spend time enabling monitoring and metrics gathering so you can rapidly react to operations
Evolutionary Architecture integrates these non-functional requirements/concepts into application design, standards, and the culture of the company, which means that "architecture" crosses organizational boundaries breaking down borders between Software Development, IT, DevOps. This may result in new org structures (as it did at Netflix)

In a nutshell, build the software from the ground up so:
  • You don't have to spend time supporting it.  (Traditional non-functional reqs: H/A, etc.) 
  • You don't have to spend time deploying it.  (Automation)
  • You can anticipate problems *before* they occur. (Metrics / Monitoring)
  • When problems do occur, you can remediate them quickly.  
    • Soft-Deployments / Roll-back, etc. (Infrastructure / Continuous Delivery)
Yes, these are traditional concepts, but Evolutionary Architecture pulls them forward in the life-cycle and makes them fundamental parts of the application design.  That is one of the reasons we like Dropwizard.  It comes out of the box with patterns for metrics, health-checks, and support for diagnosing issues.  (REST call for thread dumps)
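To make that concrete, a health check in Dropwizard is about a dozen lines.  Here is a minimal sketch (class and package names vary across Dropwizard versions, and the Cassandra probe is just an example):

import com.netflix.astyanax.Keyspace;
import com.yammer.metrics.core.HealthCheck;

// Registered at startup (e.g. environment.addHealthCheck(new CassandraHealthCheck(ks))),
// this shows up on Dropwizard's admin port so ops can see component health at a glance.
public class CassandraHealthCheck extends HealthCheck {

    private final Keyspace keyspace;

    public CassandraHealthCheck(Keyspace keyspace) {
        super("cassandra");
        this.keyspace = keyspace;
    }

    @Override
    protected Result check() throws Exception {
        keyspace.describeKeyspace();   // cheap round trip to the cluster
        return Result.healthy();
    }
}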

Anyway, incorporating the concept of an evolving architecture into the actual applications themselves enables the organization to focus on the new and shiny.... speaking of which... 





Wednesday, May 22, 2013

Big Data Overview and Cassandra Plunge at Philly JUG


Thanks everyone for coming out last night.  We plowed through a lot of material.

I posted the slides here:
http://www.slideshare.net/boneill42/big-data-phillyjug

Please feel free to ping me directly if you have questions. (@boneill42)

Monday, May 20, 2013

C* and the (Big Data Quadfecta)++ @ Philly JUG tomorrow (5/21)


I"m looking forward to presenting on our Big Data platform at the Philly JUG tomorrow.   I hope to give a high-level overview of our use case, with a deep dive into Cassandra, and an architectural overview of our Big Data Quadfecta.

I may even touch on the more recent Storm + C* + Druid integration I have in a proof-of-concept.

What comes after quadfecta anyway? =)

Friday, May 17, 2013

Cassandra as a Deep Storage Mechanism for Druid Real-Time Analytics Engine!


As I mentioned in previous posts, we've been evaluating real-time analytics engines.  Our short list included Vertica, Infobright, and Acunu.  You can read about the initial evaluation here.

Fortunately, during that evaluation, I bumped into Eric Tschetter at the phenomenally awesome Philly Emerging Technologies Event (ETE).  Eric is lead architect at MetaMarkets and heads up the Druid project.

From their white paper:
"Druid is an open source, real-time analytical data store that supports fast ad-hoc queries on large-scale data sets. The system combines a column-oriented data layout, a shared-nothing architecture, and an advanced indexing structure to allow for the arbitrary exploration of billion-row tables with sub-second latencies. Druid scales horizontally and is the core engine of the Metamarkets data analytics platform. "
http://static.druid.io/docs/druid.pdf

At a high-level, Druid collects event data into segments via real-time nodes.  The real-time nodes push those segments into deep storage.  Then a master node distributes those segments to compute nodes, which are capable of servicing queries.  A broker node sits in front of everything and distributes queries to the right compute nodes.  (See the diagram)

Out of the box, Druid had support for S3 and HDFS.   That's great, but we are a Cassandra shop. =)

Fortunately, Eric keeps a clean code-base (much like C*).  With a little elbow grease, I was able to implement a few interfaces and plug in Cassandra as a deep storage mechanism!    From a technical perspective, the integration was fairly straightforward.   One interesting challenge was the size of the segments.  Segments can be gigabytes in size.  Storing that blob in a single cell in Cassandra would limit the throughput of a write/fetch.

With a bit of googling, I stumbled on Astyanax's Chunked Object storage.  Even though we use Astyanax extensively at HMS,  we had never had the need for Chunked Object storage. (At HMS, we don't store binary blobs)  But Chunked Object Storage fits the bill perfectly!  Using Chunked Object storage, Astyanax multithreads the reads/writes.  Chunked Object Storage also spreads the blob across multiple rows, which means the read/write gets balanced across the cluster.  Astyanax FTW!
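For reference, the write and read paths end up being just a few lines with the recipe.  Here is roughly what the integration boils down to (the column family name, chunk size, and concurrency level are illustrative):

import java.io.InputStream;
import java.io.OutputStream;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.recipes.storage.CassandraChunkedStorageProvider;
import com.netflix.astyanax.recipes.storage.ChunkedStorage;
import com.netflix.astyanax.recipes.storage.ChunkedStorageProvider;
import com.netflix.astyanax.recipes.storage.ObjectMetadata;

public class CassandraSegmentStorage {

    // Write a (potentially multi-gigabyte) segment into Cassandra in chunks.
    // Chunks are written in parallel and spread across rows, so no single
    // read or write has to move the whole blob.
    public static ObjectMetadata writeSegment(Keyspace keyspace, String segmentKey,
                                              InputStream in) throws Exception {
        ChunkedStorageProvider provider =
                new CassandraChunkedStorageProvider(keyspace, "index_storage");
        return ChunkedStorage.newWriter(provider, segmentKey, in)
                .withChunkSize(0x40000)      // 256KB chunks
                .withConcurrencyLevel(8)     // parallel chunk writes
                .call();
    }

    // Pull the segment back down, multithreaded, into an output stream.
    public static void readSegment(Keyspace keyspace, String segmentKey,
                                   OutputStream out) throws Exception {
        ChunkedStorageProvider provider =
                new CassandraChunkedStorageProvider(keyspace, "index_storage");
        ChunkedStorage.newReader(provider, segmentKey, out)
                .withBatchSize(10)           // chunks fetched per request
                .withConcurrencyLevel(8)
                .call();
    }
}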

I submitted the integration to the main Druid code-base and it's been merged into master. (tnx fjy!)

Find getting started instructions here:
https://github.com/metamx/druid/tree/master/examples/cassandra

I'm eager to hear feedback.  So, please let me know if you run into any issues.
@boneill42


Tuesday, April 23, 2013

Book Review: Instant Apache Cassandra for Developers Starter from PACKT


My good friend Vivek Mishra asked me to review his new book, Instant Apache Cassandra for Developers Starter. (http://www.packtpub.com/apache-cassandra-for-developers/book)

Vivek is a rockstar, leading Kundera, where he cranks out code that allows people to access Cassandra via JPA. (See: https://github.com/impetus-opensource/Kundera)

His book is an excellent primer on Cassandra.  The initial sections are clear and concise, describing the necessary fundamentals required to get started.    IMHO, to be successful with Cassandra, you need to understand the distributed storage model.  Vivek does a great job of describing this, and the write path, another critical element.

About halfway through, the book transitions, focusing much more on example code.   Vivek's bias creeps in a bit here, focusing heavily on Kundera.   I have mixed emotions about accessing Cassandra from JPA.  But I think it's absolutely critical if you are attempting to consolidate storage into a single database.  If you are, Kundera is perfect.  It allows you to use Cassandra like any other relational store.

If, instead, you are taking a polyglot approach, or you are using Cassandra specifically for its "NoSQL-ness", then JPA access might obfuscate the power of the simple/scalable data model at the heart of C*.  That however may be changing, given the increased use of CQL, where C* has found a way to expose all the "NoSQL-ness" via a SQL-like interface...  provided you understand how to translate between the two!

Regardless, Vivek did a great job with the book.  You will easily save the cost of the book in time getting started with C* and JPA.

BUT...

Be sure to read it through (don't stop at the JPA example!).  Vivek saved the best for last.  I'd say the best nuggets in the book are in the aptly named section: "Top features you'll want to know about" (pg. 29) 

Cassandra's blessing, and its curse, is the wide variety of methods that you can use to access it.  (Hector & Astyanax (for Thrift), Virgil (for REST), CQL (for SQL), and Kundera (for JPA))  But you can't fault C* for that; it's a thriving, inventive community applying C* to all sorts of problems.  And given its growth, it may only get worse... but in a good way.  (I still hope to revive Spring Data for C* =)






Monday, April 1, 2013

BI/Analytics on Big Data/Cassandra: Vertica, Acunu and Intravert(!?)


As part of our presentation up at NYC* Big Data Tech Day, we noted that Hadoop didn't really work for us.  It was great for ingesting flat files from HDFS into Cassandra, but the map/reduce jobs that used Cassandra as input didn't cut it.  We found ourselves contorting our solutions to fit within the map/reduce framework, which required developer-level capabilities.  We had to add complexity into the system to do batch management/composition, and in the end the map/reduce jobs took too long to complete.

Eventually, we swapped out Hadoop for Storm.  That allowed us to do real-time cumulative analytics.  And most recently, we converted our topologies to Trident.  Handling all CRUD operations through Storm allowed us to perform roll-up metrics by different dimensions using Trident State.  (Additionally, we can write to wide-rows for indexing, etc.)
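To give a feel for the pattern, here is a stripped-down sketch of a Trident roll-up (the spout, field names, and the in-memory state factory are placeholders; in our case the state is backed by C*):

import backtype.storm.generated.StormTopology;
import backtype.storm.topology.IRichSpout;
import backtype.storm.tuple.Fields;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.MemoryMapState;

public class RollupTopology {

    // Group each incoming event by a dimension and maintain a persistent count.
    // Swapping MemoryMapState.Factory for a Cassandra-backed StateFactory lands
    // the aggregates directly in C*.
    public static StormTopology build(IRichSpout eventSpout) {
        TridentTopology topology = new TridentTopology();
        topology.newStream("events", eventSpout)
                .groupBy(new Fields("productLine"))
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(),
                                     new Fields("eventCount"));
        return topology.build();
    }
}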

This is working really well, but we are seeing increasing demand from our data scientists and customers to support "ad hoc" dimensional analysis, dashboards, and reporting.  Elastic Search keeps us covered on many of the ad hoc queries, but aside from facets, it has little support for real-time dimensional aggregations, and no support for dashboards and reports.

We turned to the industry to find the best of breed.  With some help from others that have traveled this road, (shout out to @elubow), we settled on Vertica, Infobright and Acunu as contenders.  I quickly grabbed VM's from each of them and went to work.

WARNING: What I'm about to say is based on a few days experimentation, and largely consists of initial first impressions.  It has no basis on real production experience. (yet =)

First up was Acunu.  Although each of the VMs functioned as an appliance, when logging into the VM and playing around with things, we were most at home with Acunu.  Acunu is backed by Cassandra.  Having C* installed and running as the persistence layer was like having an old friend playing wingman on an initial first date.  (they can bail you out if things start going south =)

Acunu had a nice REST API and a simple enough web-based UI to manage schemas and dimensions.  Within minutes, I was inserting data from a Ruby script and playing around with dashboards.... until something went wrong and the server started throwing OoM's.  After a restart, things cleared up, but it left me questioning the stability a bit.  (once again, this was a *single* vm running on my laptop, so it wasn't the most robust environment)

Next, I moved on to Vertica.  From a features and functions point of view, Vertica looked to be leaps and bounds ahead.  It had sophisticated support for R, which would make our data scientists happy.  It also has compression capabilities, which will make our IT/Ops guys happy.  And it looked to have some sophisticated integration with Hadoop, just in case we ever wanted/needed to support deep analytics that could leverage M/R.

That said, it was far more cumbersome to get up and running, and felt a bit like I went backwards in time.  I couldn't find a REST API. (please let me know if someone has one for Vertica)  So, I was left to go through the hoop-drill of getting a JDBC client driver, which was not available in public repos, etc.  When using the admin tool provided on the appliance, I felt like I was back in middle school (early 90's) installing Linux via an ANSI interface on an Intel 8080.  In the end however, I grew accustomed to their client (vsql) and was happily hacking away over the JDBC driver and it felt fairly solid.

Although we are still interested in pursuing both Acunu and Vertica, both experiences left me wanting.   What we really want is a fully open-source solution (preferably apache license) that we are free to enhance, supplement, etc.... with optional commercial support.

That got me thinking about Edward Capriolo's presentation on Intravert.   If I boil down our needs into "must-haves" and "nice-to-haves", what we really *need* is just an implementation of Rainbird.  (http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011)

AS AN ASIDE:
Does anyone know what happened to Rainbird?  I've been trying to get the answer, to no avail.
http://www.youtube.com/watch?v=84k7o4GdkQg

Now, time for crazy talk...
Intravert provides a slick REST API for CRUD operations on Cassandra.  As I said before, I'm a *huge* REST fan.  It provides the loose-coupling for everything in our polyglot persistence architecture.    Intravert also provides a loosely coupled eventing framework to which I can attach handlers.   What if I implemented a handler that took the CRUD events and updated additional column families with the dimensional counts/aggregations???    If I then combine that with a JavaScript framework for charting, how far would that get me?  (60-70% solution?)
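Purely as a thought experiment, the handler itself wouldn't amount to much more than counter increments.  Something along these lines, sketched with plain Astyanax counters (this is hand-waving, not Intravert's actual event API):

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;

// On every CRUD event, bump a counter per (dimension, value, time bucket).
// The "event" parameters here are hypothetical stand-ins for whatever the
// eventing framework hands us.
public class DimensionalCountHandler {

    private static final ColumnFamily<String, String> CF_COUNTS =
            new ColumnFamily<String, String>("dimensional_counts",
                    StringSerializer.get(), StringSerializer.get());

    private final Keyspace keyspace;

    public DimensionalCountHandler(Keyspace keyspace) {
        this.keyspace = keyspace;
    }

    public void onEvent(String dimension, String value, String timeBucket) throws Exception {
        MutationBatch m = keyspace.prepareMutationBatch();
        // row = dimension:timeBucket, column = value, counter += 1
        m.withRow(CF_COUNTS, dimension + ":" + timeBucket)
         .incrementCounterColumn(value, 1);
        m.execute();
    }
}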

To be clear, I'm not bashing Vertica or Acunu.  Both have solid value propositions and they are both contenders in our options analysis.  I'm just mourning the fact that there seems to be no good open-source solution in this space like there are in others.  (Neo4j/TitanDB for graphs, Elastic Search/SOLR for search, Kafka/Kestrel for queueing, Cassandra for Storage, etc.)

We are also considering Druid and Infobright, but I haven't gotten to them yet:
https://github.com/metamx/druid

Please don't bash me for early judgments.
I'm definitely interested in hearing people's thoughts.



Big Data Quadfecta @ Philly Emerging Technologies Event


For those that missed the presentation at NYC* Big Data Tech Day, I'll be giving an abbreviated version as a lightning talk at the Philly Emerging Technologies event on Wednesday.
(http://phillyemergingtech.com/2013)

It looks like they are going to have another strong line up this year, heavy on the language and framework wars.  Hit me up @boneill42 if you want to connect.


Monday, March 25, 2013

The Art of Platform Development: Scaling Agile with Open Source dynamics


I’ve spent nearly my entire career working on “platform teams”.   The teams went by different names including “Shared Services”,  “Framework”,  “Core”, etc.  But the goal was always the same:  to centralize capabilities development and leverage those capabilities across product lines.

Achieving that goal is incredibly valuable.  By sharing capabilities, you can eliminate code and infrastructure duplication, which decreases operations and maintenance costs.   It also ensures consistency and simplifies integration across the products, which decreases the overall system complexity and expedites the delivery of new products to market.  In this model, “product development” can become an exercise in capabilities composition.

Unfortunately, this model challenges traditional Agile development.  Typically, Agile works from a product backlog, which is controlled by the Product Owner.  The Product Owner is focused on the business value of the stories (typically functionality).    Priorities are often driven by market opportunity and customer value.   With multiple products and product owners, where does the platform live?   

Often, the drive to keep the teams isolated and focused on the customer functionality results in silo’d development and silo’d products.  One might argue that such dynamics will always result in a fractured architecture/platform.

Some enterprises solve this problem by creating a Platform backlog and Platform team, which takes on all common service development.  This can work, but it is a nightmare to coordinate and often bottlenecks development. 

Furthermore, since prioritization of functionality is done within each product backlog, the result is local optimization.   It would be better if the enterprise could prioritize work globally, across all products and then optimize the assignment of that work across all development teams.

In the slides below, I suggest a different model, whereby product demand is collapsed into a single pivoted backlog that focuses on capabilities instead of specific product functionality.  Then prioritization is driven by the collective value of that capability across product lines.

With this pivot however, we lose the affinity between a team and its Product Backlog.  To fix that, I suggest the teams take an open-source approach to development.  Any team can take on any story, and contribute capabilities back to the “platform” via git pull requests to the appropriate component.

In this model, “platform development” is no longer the bottleneck.  Instead, all the teams share the platform, which eliminates the “us vs. them” mentality that can develop and establishes the proper dynamics to support the development of a single cohesive platform.   (aligning with Conway’s law)

Anyway, it’s just some food for thought.  I’d love to hear what people think.
http://www.slideshare.net/boneill42/the-art-of-platform-development

Wednesday, March 13, 2013

Big Data goes to the Big Apple



Next week is the NYC* Big Data Tech Day.  It looks even bigger and more badass than last year's Cassandra Summit.  There is a good blend of use cases from people that already have Cassandra in production as well as cutting edge development that hasn't yet gone mainstream.

I'm really looking forward to John McCann's talk.  He is going to present on Comcast's use of Cassandra as the backend of their DVR system.   Netflix has been fairly open about their use, but guiding a ship as big as Comcast into the new era of NoSQL is a feat of shepherd-ship I'd love to hear about.

Likewise, on the more cutting-edge side of things, I'm looking forward to Thomas Pinkney's talk on graph-databases.  That may be the next ingredient in our architecture.  We were targeting Neo4j, but if a Cassandra-based graph database  is mature enough we would love to use it.  That would allow us to centralize on a single storage mechanism that scales. (FTW!)

Finally, Ed Capriolo's talk promises to be a good one.  As the veterans know, once you've got your CRUD operations down, there is a whole world of potential out there in data processing.  On our second try, we decided to go with Storm for our data processing layer, but I believe Ed has an innovative perspective on things.  (Hopefully, we'll see mention of intravert-ug, which is what happens when smart people like Nate McCall, Ed Anuff and Ed Capriolo have a baby.)

Also, I should mention that Taylor Goetz and I will be presenting on our Big Data journey, which has culminated in a Big Data platform that we are extremely happy with, where we've combined Storm, Kafka, Elastic Search and Cassandra into a slick/fast/scalable/flexible data processing machine.

I believe there is still room if you want to sign up.  If it is anything like last year,  not only will the talks be informative, but the collaboration sessions  before, in-between and after are worth their weight in gold.

Friday, February 8, 2013

Set to host the first Philly Cassandra Users Group!


We finally got around to arranging the first Philly Cassandra User's Group meetup. 

Sign up here:
http://www.meetup.com/Philadelphia-Cassandra-Users/events/103340882/

I plan to give a brief Cassandra overview.  Then I'll hand off to Taylor Goetz who will talk about storm-cassandra.  Food and drink will be provided by Datastax.

All are welcome.


Wednesday, February 6, 2013

InvalidRequestException(why:Too many bytes for comparator)


In the spirit of trying to save people time, I thought I would directly address the numerous "Too many bytes" errors that bubble up out of Astyanax when using Composites, especially when trying to perform range queries.

First, see my last two posts about connecting the dots:
http://brianoneill.blogspot.com/2012/09/composite-keys-connecting-dots-between.html
http://brianoneill.blogspot.com/2012/10/cql-astyanax-and-compoundcomposite-keys.html

After that, you'll probably still have issues.  If you see a "Too many bytes for comparator" it most likely means that you have a mismatch between your PRIMARY KEY declaration in your CQL CREATE TABLE statement and the composite you are using in Astyanax.

You have to be really careful that all of the components in the primary key are declared in your class *in order* and that all fields in your class are part of your primary key.  Otherwise, when Cassandra goes to compare two column keys/names, it will generate the bytes for that column name (from all the components) and it will end up with too many or too few.  (Hence the error you are seeing)

If you have too many, there is a good chance that you have a field declared in your annotated composite class that is not part of your primary key declaration.
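A concrete (completely hypothetical) example: for a CQL3 table, the column names Cassandra compares are composites built from the clustering columns plus the literal CQL column name, so the annotated class has to declare exactly those components, in order:

import com.netflix.astyanax.annotations.Component;

// Hypothetical table:
//   CREATE TABLE blog_entries (
//       userid      varchar,
//       category    varchar,
//       subcategory varchar,
//       body        varchar,
//       PRIMARY KEY (userid, category, subcategory)
//   );
//
// The row key is "userid".  The comparator sees (category, subcategory, column name).
// An extra field here, or a missing one, and the comparator ends up with too many
// (or too few) bytes.
public class BlogEntryColumn {
    @Component(ordinal = 0)
    public String category;

    @Component(ordinal = 1)
    public String subcategory;

    @Component(ordinal = 2)
    public String columnName;   // the literal CQL column name, e.g. "body"
}

You then wrap that class with an AnnotatedCompositeSerializer and use it as the column serializer for the column family.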

Hopefully that helps.  If you want the underlying theory/reasoning, see:
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts

If it still takes you a while to get sorted out accessing a CQL table from a Thrift-based client, no worries... you are in good company:
http://stackoverflow.com/questions/12360067/invalidrequestexceptionwhytoo-many-bytes-for-comparator-on-execute-query-to-c

Activiti TimerEvent only firing once.



This is just a mental note since I wasted hours on it a few months ago.  Then a colleague just did the same.

If you are developing a business process in Activiti, and your TimerEvent is only firing once, add the following parameter to your Activiti config:

<property name="jobExectuorActivate" value="true"/>

This is mentioned in the documentation, but only as a very small note:
http://www.activiti.org/userguide/index.html#timerEventDefinitions

"Note: timers are only fired when the job executor is enabled (i.e. jobExecutorActivate needs to be set to true in the activiti.cfg.xml, since the job executor is disabled by default)"

Hopefully this saves people some time.

Friday, January 25, 2013

Zen and the Art of Collaborative Software Development



Conway's law suggests that designs are constrained by organizational communication structures.  I've seen that law manifest itself over and over again and I'd assert that it is impossible to develop a cohesive software platform unless the proper collaborative dynamics exist.  Specifically, to develop a software platform that can satisfy the needs of many different product-lines, consumers, and/or dependent projects, you want those dependent projects to be able contribute back and co-develop the platform.  This approach shares ownership, shortens the development life-cycle, and enables innovation across the organization.

It follows that the dynamics required to develop a platform are different from normal silo'd team dynamics.  The dynamics you need mimic that of the open-source community.  Developers need to be good citizens in a larger community.  Here is what I think that means:

First.  Be Self-Aware. 

There are four stages to mastery: Unconscious Incompetence, Conscious Incompetence, Conscious Competence, and Unconscious Competence.   It is very important to know where you are on that progression before you interact with a community. 

If you aren’t self-aware, you run the risk of making an unfounded assertion when a question may have been more appropriate.  (We all know the A-hole that emails a discussion list making claims before doing his/her homework)   Thus, I’d recommend always starting from the conscious incompetence perspective and communicate with that tone.  If you are new to a project, communicate via questions to confirm assumptions before making assertions.

Once you’ve achieved conscious competence, help others out!  Take questions from others, and propose solutions to them politely and in an open audience.  Everyone will benefit from the ensuing discussion and it will enable innovation.  You may have a solution that others can improve upon, but the tone should remain propositional.

As you progress to Unconscious Competence, switch from proposing solutions to delivering them.  Instead of simply proposing solutions in email, you should be submitting pull requests with working code.

Second. Be aware of a project’s maturity.

Early on, projects are trying to pick up momentum.   They may be throwing stuff at a wall to see what sticks.  It is important to recognize that. Often, in the early stages of a project, the participants are trying to demonstrate the most value in the shortest amount of time, which is one way to get a project funded / off-the-ground.   If a project is in that state, complaining about configurability and elegance of interface might not be the best idea.

Third.  Be aware of others.

(IMHO) Passionate rock-star developers are often arrogant and obsessive-compulsive.  Those great developers want things their way, and they believe they have the best solution.  (Myself included, I must have been an asshole to work with early in my career)

As you start collaborating with larger communities of developers, you realize that beauty is in the eye of the beholder.  You can appreciate others' perspectives with an increased tolerance for other ways of doing things. (other coding styles, languages, and best practices)

Finally, I think you get to a place where you can listen to others’ ideas without feeling an immediate compulsion to improve upon them.  This is powerful, especially for seedling ideas.  Passion for an idea is a fickle thing.  Sometimes it's more important to keep your mouth shut and let a peer evolve an idea before suggesting improvements and vocalizing all the nuances, edge cases, and counter examples that might make it difficult.  You never know what might grow out of any random thought.

As a corollary, it’s important that people feel welcome to bring ideas out into the open.   If other people don’t feel that they can bring ideas to you, or you feel you cannot bring ideas to them, it is YOUR fault and no one else's.  It's important for each citizen to own that dynamic and ensure the atmosphere is conducive to innovation.

IMHO, these dynamics are essential in any successful collaborative community.  Furthermore, such dynamics are cultivated by successful benevolent dictators. (shout-out to @zznate and @spyced, two of the better dictators I’ve met.)  

We had a great discussion on this topic at Health Market Science (HMS), where we do a lot of open source work.   In case anyone is interested, I posted the slides that drove the conversation.
http://www.slideshare.net/boneill42/collaborative-software-development

Hopefully people find it useful.

(tnx to @jstogdill for the mentoring over the years)





Sunday, January 20, 2013

Webinar on Event Processing on Cassandra w/ Storm



Thanks again to all those that made it to the webinar on Thursday.  It was a lot of fun tag teaming with Taylor Goetz.  Storm-cassandra has come a long way.  The slides and video are now available.


Slides:

Video:

As always, please shout if you have any additional questions, or if we got something wrong.


Wednesday, January 2, 2013

Native support for collections in Cassandra 1.2! (no more JSON blobs?)


In case you haven't seen it yet,  Apache released Cassandra 1.2:


I'm stoked.   Presently we write maps/lists as text blobs in Cassandra using JSON.  This has obvious limitations.  In order to add something to the map/list, we need to read the list, append the data then write the data back to C*.  Read before write is not good, especially in a large distributed system. Two clients could read at the same time, append, and the second write would effectively remove the element added by the first.  Not good.

Although I think the Thrift support is a bit clunky (via JSON), CQL3 supports native operations on collections. 
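For example, appending to a list is now a single statement with no read required.  A quick sketch, using the DataStax Java driver here just for illustration (the schema and values are made up):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CollectionsExample {
    public static void main(String[] args) {
        // Assumed schema: CREATE TABLE users (id text PRIMARY KEY, emails list<text>);
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // Append without reading the existing list -- no read-before-write,
        // and no lost updates from concurrent appenders.
        session.execute("UPDATE users SET emails = emails + ['brian@example.com'] WHERE id = 'bone'");

        cluster.shutdown();
    }
}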

Now, we just need to figure out how to migrate all of our data. =)
-brian

Creating Your First Java Application w/ Cassandra


Looking to kick off 2013 with a fun project to get familiar with Cassandra?  Why don't you build a globally scalable naughty and nice list?  At least, that's what we did in the webinar a few weeks ago. 

If you would like to relive the webinar, Datastax posted the slides and video.

We went through two quick examples using Astyanax and CQL3 to manage Santa's naughty list.  As much as it was a demo of the APIs, it also showed how you can use cqlsh and cassandra-cli to get two perspectives on the data: a logical view via cqlsh and a physical view via cli.   It's important to keep both perspectives in mind if you want to build a scalable app.

You can find the code here:
https://github.com/boneill42/naughty-or-nice

Please let me know if you have any questions, or if I flubbed up on anything.

BTW, Jonathan has a webinar coming up on "What's new in 1.2?"  It should be a good one.