{"id":362,"date":"2009-04-21T16:43:05","date_gmt":"2009-04-21T23:43:05","guid":{"rendered":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/2009\/04\/21\/pdi-scale-out-whitepaper\/"},"modified":"2009-04-21T16:43:05","modified_gmt":"2009-04-21T23:43:05","slug":"pdi-scale-out-whitepaper","status":"publish","type":"post","link":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/2009\/04\/21\/pdi-scale-out-whitepaper\/","title":{"rendered":"PDI Scale Out Whitepaper"},"content":{"rendered":"<p>I&#8217;ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration.  These customers have some big challenges &#8211; one customer was expecting 1 billion rows \/ day to be processed on their ETL environment.  Some of these customers were rolling their own solutions; others had very expensive proprietary solutions (Ab Initio, I&#8217;m pretty sure, though they couldn&#8217;t say since Ab Initio contracts are bizarre).  One thing was common: they all had billions of records, a batch window that remained the same, and software costs that were out of control.<\/p>\n<p>None of these customer specifics are public, and they likely won&#8217;t be, which is difficult for Bayon \/ Pentaho because sharing these top-level metrics would be helpful for anyone using or evaluating PDI.  Key questions when evaluating a scale-out ETL tool:  Does it scale with more nodes?  Does it scale with more data?<\/p>\n<p>I figured it was time to share some of my research and findings on how PDI scales out, in the form of a whitepaper.  
Bayon is pleased to present this free whitepaper, <a href=\"http:\/\/www.bayontechnologies.com\/bt\/ourwork\/pdi_scale_out_whitepaper.php\">Pentaho Data Integration : Scaling Out Large Data Volume Processing in the Cloud or on Premise.<\/a>  In the paper we cover a wide range of topics, including results from running transformations with up to 40 nodes and 1.8 billion rows.<\/p>\n<p>Another interesting set of findings in the paper stems from a very pragmatic approach to my research &#8211; I don&#8217;t have a spare 200k to simply buy 40 servers to run these tests.  I have been using EC2 for quite a while now, and figured it was the perfect environment to see how PDI could scale on the cheapest of cheap servers ($0.10 \/ hour).  Another interesting metric relating to cloud ETL is the top-level benchmark: a utility compute cost for ETL processing of <strong>6 USD per billion rows processed, with zero long-term infrastructure commitments.<br \/>\n<\/strong><br \/>\nMatt Casters, Lance Walter, and I will also be presenting a free online webinar to go over the top-level results and discuss large data volume processing in the cloud:<\/p>\n<p><a href=\"https:\/\/pentaho.webex.com\/mw0305l\/mywebex\/default.do?nomenu=true&amp;siteurl=pentaho&amp;service=6&amp;main_url=https%3A%2F%2Fpentaho.webex.com%2Fec0600l%2Feventcenter%2Fevent%2FeventAction.do%3FtheAction%3Ddetail%26confViewID%3D562356573%26siteurl%3Dpentaho%26%26%26\">High Performance ETL using Cloud- and Cluster-based Deployment <\/a><br \/>\nTuesday, May 26, 2009 2:00 pm<br \/>\nEastern Daylight Time (GMT -04:00, New York)<\/p>\n<p>If you&#8217;re interested in processing lots of data with PDI, or want to deploy PDI to the cloud, please register for the webinar or contact me.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration. 
These customers have some big challenges &#8211; one customer was expecting 1 billion rows \/ day to be processed on their ETL environment. Some of these customers were rolling their own solutions; others had very [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[15,6,7,9,11],"tags":[],"_links":{"self":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts\/362"}],"collection":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/comments?post=362"}],"version-history":[{"count":0,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/posts\/362\/revisions"}],"wp:attachment":[{"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/media?parent=362"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/categories?post=362"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.nicholasgoodman.com\/bt\/blog\/wp-json\/wp\/v2\/tags?post=362"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}