Perfect is the enemy of good enough. That's a big part of why people choose the simpler, functional, cheaper open source cousins of proprietary feature-function behemoths. Don't get me wrong – with too few features or crappy performance you lose customers, because you're not helping people solve their problems.
Recently, I came across a thread on Goban Saor's blog entitled "Open Source Metrics."
It has turned into the discussion that keeps creeping up: which tool is faster, Talend or Kettle? Which leads me to ask: Who Friggin' Cares?
I’m a Kettle Expert so I think Kettle is Wicked Fast.
If I were a Talend Expert I’d think Talend is Wicked Fast.
Performance, for customers who are focused on results and aren't technophiles, boils down to these two requirements:
- It has to meet my performance requirements for my project. If I have to load 1 million records per day and I have 10 minutes to do that, then the tool either does or does not meet that performance requirement.
- It has to allow me to grow beyond my current performance requirements. I am loading 1 million records now, but in 3 years I may be loading 100 million. Given the right investment in tuning and scaling, I don't want to have to change to a different tool when I go much bigger.
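The first requirement is really just arithmetic. A minimal sketch, using the hypothetical numbers from the example above (the function name and threshold check are my own, not from any particular tool):

```python
def meets_requirement(records: int, window_seconds: int, measured_rate: float) -> bool:
    """Check whether a tool's measured throughput fits the batch window."""
    required_rate = records / window_seconds
    return measured_rate >= required_rate

# 1 million records in a 10-minute window -> ~1,667 records/second needed
required = 1_000_000 / (10 * 60)
print(f"required: {required:.0f} rec/s")

# A tool measured at 20k rec/s clears that bar with plenty of headroom
print(meets_requirement(1_000_000, 600, 20_000))  # True
```

Either the tool clears the bar for your project or it doesn't; a rival tool's margin above that bar is academic.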
For Kettle the answer is pretty simple:
- I do a few simple mappings, hit run, do very little tuning or database optimization. Wham-o: 20k records/second throughput. Look closer and notice Kettle is simply sitting idle waiting on a database lookup. Add an index. Wham-o: 35k records/second. Have extra CPUs? Fire up a few extra threads of a calculation step. Wham-o: 40k/second. That comfortably surpasses the customer's batch window needs; enough said. Whether 35k records per second is slower or faster than someone else's tool is irrelevant. Requirement met.
- This usually comes down to outside validation: what are other people doing, and what are the proof points for performance? I personally have worked on a Kettle scale-out cluster with 5 nodes that reads, sorts, aggregates, and summarizes a billion FAT (wide character) records in an HOUR, and it scales almost perfectly linearly (no tool scales perfectly linearly). I can tell a customer: using the exact same binary you have there, you can scale out and process hundreds of millions to billions of records per hour. Requirement met – you can grow with your tool.
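The scale-out math above can be sketched the same way. A rough estimate, assuming the cluster numbers from the example (5 nodes, ~1 billion records/hour) and a hypothetical 90% scaling efficiency factor I'm adding to reflect that no tool is perfectly linear:

```python
import math

def nodes_needed(target_records: int, window_hours: float,
                 per_node_rate: float, efficiency: float = 0.9) -> int:
    """Estimate cluster size assuming near-linear (not perfect) scale-out."""
    required_rate = target_records / window_hours
    return math.ceil(required_rate / (per_node_rate * efficiency))

# 5 nodes processed ~1 billion records/hour -> ~200M records/hour per node
per_node = 1_000_000_000 / 5

# How many nodes for 3 billion records in an hour at 90% efficiency?
print(nodes_needed(3_000_000_000, 1, per_node))  # 17
```

The point isn't the exact count; it's that a near-linear tool lets you answer "can I grow?" with a napkin calculation and more hardware, rather than a migration.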
I think Kettle performance is superb. I’d welcome Talend folks to comment here and blog about their proof points for how Talend performance is superb. I believe that it is. Let’s just all consider the most important thing: open source ETL is about solving the ETL need well, not necessarily incremental performance differences.
It’s a debate with no winner. I don’t care if your tool is 2.5% faster at reading character text files than mine. I do care if it can scale out (requirement 2) and solves customer problems (requirement 1).