Perfect is the enemy of good enough. That's a big part of why people choose the simpler, functional, cheaper open source cousins of proprietary feature-function behemoths. Don't get me wrong – with too few features or crappy performance you lose customers, because you're not helping people solve their problems.
Recently, I came across a thread on Goban Saor's blog entitled "Open Source Metrics."
It has turned into the discussion that keeps creeping up: which tool is faster, Talend or Kettle? Which leads me to ask: Who Friggin' Cares?
I’m a Kettle Expert so I think Kettle is Wicked Fast.
If I were a Talend Expert I’d think Talend is Wicked Fast.
Performance, for customers who are focused on results and aren't technophiles, boils down to these two requirements:
- It has to meet my performance requirements for my project. If I have to load 1 million records per day and I have 10 minutes to do that, then the tool either does or does not meet that performance requirement.
- It has to allow me to grow beyond my current performance requirements. I am loading 1 million records now, but in 3 years I may be loading 100 million. Given the right investment in tuning and scaling, I don't want to have to change to a different tool when I go much bigger.
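The first requirement is really just arithmetic. A minimal sketch, using the hypothetical numbers from the example above (the function name and threshold check are my own, not from any particular tool):

```python
def meets_requirement(records: int, window_seconds: int, measured_rate: float) -> bool:
    """Check whether a tool's measured throughput fits the batch window."""
    required_rate = records / window_seconds
    return measured_rate >= required_rate

# 1 million records in a 10-minute window -> ~1,667 records/second needed
required = 1_000_000 / (10 * 60)
print(f"required: {required:.0f} rec/s")

# A tool measured at 20k rec/s clears that bar with plenty of headroom
print(meets_requirement(1_000_000, 600, 20_000))  # True
```

Either the tool clears the bar for your project or it doesn't; a rival tool's margin above that bar is academic.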
For Kettle the answer is pretty simple:
- I do a few simple mappings, hit run, do very little tuning or database optimization. Wham-o: 20k records/second throughput. Look closer and notice Kettle is simply sitting idle waiting on a database lookup. Add an index. Wham-o: 35k records/second. Have extra CPUs? Fire up a few extra threads of a calculation step. Wham-o: 40k/second. That comfortably surpasses the customer's batch window needs; enough said. Whether 35k records per second is slower or faster than someone else's tool is irrelevant. Requirement met.
- This usually comes down to outside validation: what are other people doing, and what are the proof points for performance? I personally have worked on a Kettle scale-out cluster with 5 nodes that reads, sorts, aggregates, and summarizes a billion FAT (wide character) records in an HOUR, and it scales almost perfectly linearly (no tool scales perfectly linearly). I can tell a customer: using the exact same binary you have there, you can scale out and process hundreds of millions to billions of records per hour. Requirement met – you can grow with your tool.
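The scale-out math above can be sketched the same way. A rough estimate, assuming the cluster numbers from the example (5 nodes, ~1 billion records/hour) and a hypothetical 90% scaling efficiency factor I'm adding to reflect that no tool is perfectly linear:

```python
import math

def nodes_needed(target_records: int, window_hours: float,
                 per_node_rate: float, efficiency: float = 0.9) -> int:
    """Estimate cluster size assuming near-linear (not perfect) scale-out."""
    required_rate = target_records / window_hours
    return math.ceil(required_rate / (per_node_rate * efficiency))

# 5 nodes processed ~1 billion records/hour -> ~200M records/hour per node
per_node = 1_000_000_000 / 5

# How many nodes for 3 billion records in an hour at 90% efficiency?
print(nodes_needed(3_000_000_000, 1, per_node))  # 17
```

The point isn't the exact count; it's that a near-linear tool lets you answer "can I grow?" with a napkin calculation and more hardware, rather than a migration.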
I think Kettle performance is superb. I’d welcome Talend folks to comment here and blog about their proof points for how Talend performance is superb. I believe that it is. Let’s just all consider the most important thing: open source ETL is about solving the ETL need well, not necessarily incremental performance differences.
It’s a debate with no winner. I don’t care if your tool is 2.5% faster at reading character text files than mine. I do care if it can scale out (requirement 2) and solves customer problems (requirement 1).