Archive for the ‘Grid/Distributed Computing’ Category

PDI Scale Out Whitepaper

April 21st, 2009

I’ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration. These customers have some big challenges - one customer was expecting to process 1 billion rows / day in their ETL environment. Some of these customers were rolling their own solutions; others had very expensive proprietary solutions (Ab Initio, I’m pretty sure, though they couldn’t say since Ab Initio contracts are bizarre). One thing was common: they all had billions of records, a batch window that remained the same, and software costs that were out of control.

None of these customer specifics are public, and they likely won’t be. That is difficult for Bayon / Pentaho, because sharing these top-level metrics would be helpful for anyone using or evaluating PDI. Key questions when evaluating a scale-out ETL tool: Does it scale with more nodes? Does it scale with more data?

I figured it was time to share some of my research and findings on how PDI scales out, and this takes the form of a whitepaper. Bayon is pleased to present this free whitepaper, Pentaho Data Integration: Scaling Out Large Data Volume Processing in the Cloud or on Premise. In the paper we cover a wide range of topics, including results from running transformations with up to 40 nodes and 1.8 billion rows.

Another interesting set of findings in the paper relates to a very pragmatic constraint in my research: I don’t have a spare 200k to simply buy 40 servers to run these tests. I have been using EC2 for quite a while now, and figured it was the perfect environment to see how PDI could scale on the cheapest of cheap servers ($0.10 / hour). Another interesting Cloud ETL metric is the top-level benchmark: a utility compute cost of 6 USD per billion rows processed, with zero long-term infrastructure commitments.
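To put those two numbers side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above ($0.10 per instance-hour and 6 USD per billion rows). The function names are made up for illustration. It also assumes that the 6 USD figure is made up entirely of instance-hours and that cost scales linearly with row count; the whitepaper may account for things differently, so treat the output as an estimate rather than a result from the paper.

```python
# Figures quoted in the post, kept in whole cents to avoid float rounding noise.
PRICE_CENTS_PER_INSTANCE_HOUR = 10   # $0.10 / hour per EC2 server
COST_CENTS_PER_BILLION_ROWS = 600    # 6 USD per billion rows processed
ROWS_PER_BILLION = 1_000_000_000


def implied_instance_hours_per_billion_rows():
    """Instance-hours that 6 USD buys at $0.10 / hour.

    ASSUMPTION: the whole 6 USD is instance time (no bandwidth or storage).
    """
    return COST_CENTS_PER_BILLION_ROWS / PRICE_CENTS_PER_INSTANCE_HOUR


def estimated_cost_usd(rows):
    """Estimated compute cost in USD for a given number of rows.

    ASSUMPTION: cost scales linearly with row count.
    """
    cents = rows * COST_CENTS_PER_BILLION_ROWS / ROWS_PER_BILLION
    return cents / 100


if __name__ == "__main__":
    print("Instance-hours per billion rows:",
          implied_instance_hours_per_billion_rows())
    print("Estimated cost for 1.8 billion rows (USD):",
          estimated_cost_usd(1_800_000_000))
```

Under the same assumptions, the 1 billion rows / day customer mentioned earlier would be looking at roughly 6 USD a day in compute.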

Matt Casters, Lance Walter, and I will also be presenting a free online webinar to go over the top-level results and discuss large data volume processing in the cloud:

High Performance ETL using Cloud- and Cluster-based Deployment
Tuesday, May 26, 2009 2:00 pm
Eastern Daylight Time (GMT -04:00, New York)

If you’re interested in processing lots of data with PDI, or want to deploy PDI to the cloud, please register for the webinar or contact me.

Data Integration (Kettle), General BI, Grid/Distributed Computing, Open Source, Pentaho

First 100 Million Rows done in the “cloud”

November 28th, 2006

My good friend, Matt Casters, posted his results from what we believe to be the first 100 Million Rows of data processed by an ETL tool in the new cloud computing paradigm. Matt ran a simple transformation of 100 Million rows through Kettle on Amazon EC2.

I should really do a write-up or review of EC2. I’m LOVIN’ it and others I’ve introduced to it are LOVIN’ it too! I just need some spare time (ha ha ha) to write it up.

Grid/Distributed Computing, Open Source

Jini, the silent coming of age

July 1st, 2004

I have always been fascinated by Jini. I’ve attended two of the Jini community meetings and kicked the tires on many of the research projects at jini.org. In many ways it was a technology ahead of its time, and since it didn’t make a huge SPLASH during the dot com boom, it hasn’t been adopted en masse.
NOTE: many of these links may require registration for the jini.org community and acceptance of the Sun SCSL (which is also a deterrent to the growth of this wonderful technology).

There are many community members much more familiar than I am with the state of Jini adoption. However, I do continue to hope for something to happen with regard to its uptake. It still seems to be very much on the fringe. Surprisingly, there are Fortune 100 companies using Jini.

There are more airline reservations made on Jini-based systems (Orbitz, aa.com, nwa.com) than on any other electronic system (according to some information from Orbitz). Orbitz even won a Duke’s Choice Award:

Duke’s Choice Awards: Orbitz has been selected as a winner of the 2004 Duke’s Choice Award, recognizing the “best of the best” from among all the cool projects going on in the world of Java technology. Orbitz team members are presenting TS-2614 at JavaOne. See why they won this award.

It’s a good technology that didn’t originally come with the whizbang set of installation wizards that the frenzy of the dot com era demanded. It originally required the skill and aptitude of distributed computing engineers to recognize its benefits, which kept it firmly on the fringe. Some of the projects being built on top of Jini offer great additions for the kinds of problems that Jini has a competitive advantage in solving.

My personal interest, if I ever have one of those things that consultants refer to as “Extended Research Periods”, would be in how it could address some of the data warehousing issues I face on a day-to-day basis. As someone knowledgeable in the problem domain (OLAP, huge fact tables, distributed query processing), could I use the current Jini technology and enhancements (such as rio, computefarm, etc.) to build a wonderful distributed BI infrastructure? :) Stay tuned… perhaps I’ll have one of those periods of time coming up!

Grid/Distributed Computing

Distributed Database Systems

June 25th, 2004

I am currently making my way through the book, “Principles of Distributed Database Systems.” It reads like an academic course textbook, as I imagine was the authors’ intent.

I find it fascinating… It is also a bit challenging to try and remember what it’s like to be learning new notations and abstracted academic concepts. My day to day is so grounded in building customer solutions (very practical, with good applied technique and concepts) that I have to be deliberate to keep the mind sharp.

One thing that I’m particularly enjoying about this book is that I am seeing some of the concepts that I use in my role as a BI Consultant from their starting points. I’m accustomed to interpreting Oracle plans, statistics, etc., and now I’m able to relate that to the abstract concepts they represent.

I have to admit though, my interest does have a particular project in mind… I’m always wanting to build something that is more clever, and better than what’s out there now… This book might help me solidify some of those thoughts and add direction to my company’s R&D focus.

Grid/Distributed Computing