I pre-ordered a copy of the new (first, only, best, and original) Pentaho book, “Pentaho Solutions” by Roland Bouman and Jos van Dongen, two weeks back. I saw from a tweet that the book was shipping from Amazon. Cool – I had a look at the page. Sure enough, they can ship today if I get my order in on time, so I know they can ship it.
How about my pre-order, which I would assume would go out before regular orders? It won’t ship until next week? Delivered by 9/11/2009? Lesson learned – don’t pre-order from Amazon. 🙂
Good Riddance, Ecto
I was using Ecto, a Mac OS X blog client, for the past year or so. Overall I really DIDN’T like it, except that it let me “Paste” images into the blog straight from the clipboard. You’d be surprised at how many blog clients still force you to save a file onto your computer, then browse to it, then upload it, then include it in your post. A real drag when you’re trying to do a few simple software screenshots.
I upgraded to a new Mac recently and the Ecto license key I had didn’t come over from the old Mac. According to the Ecto website, just email them and they’ll send it along. I emailed on 8/11/2009 and got no response. Not good customer service at all.
But I really wanted to blog about the encrypted variables, so I downloaded and gave ScribeFire a try, and I’m switching. It can paste images the same as Ecto, and it’s built right into Firefox. Excellent!
So… Ecto customer service really goofed on this one. You’ve got a customer who isn’t happy, you gave me an opportunity to find something I like better, and a chance to blog about the whole thing!
Encrypted Variables in PDI
Every once in a while, I get to sound like a royal arse in front of a customer by saying something “I know” to be true about Pentaho that isn’t. Usually, this is a REALLY good thing, because it’s usually some limitation or gotcha that existed in the product that has magically disappeared with the latest release. The danger of open source is that these things can change underneath you quickly, without any official fanfare, and leave you looking like a total dolt at a customer site. Bad for consultants like me who are constantly having to keep up with extraordinarily fast product development. Good for customers, because they get extraordinarily fast product development.
One of these experiences, which I was absolutely THRILLED to look like a dolt about, was
“If you use variables for database connection information, the password will be clear text in kettle.properties.”
A huge issue for many security-conscious institutions. Customers were faced with a choice: use variables, which centrally manage the connection information for a database (good thing), but then the password is clear text (bad thing). No longer!
Our good friend Sven quietly committed this little gem nearly 18 months ago. It’s been in the product since 3.0.2! It allows encrypted variables to be decrypted in the password field for database connections.
Let’s test it out… our goal here is to take a string like “Encrypted jasiodfjasodifjaosdifjaodfj” (a simple encrypted version of the password), set it as a regular ole variable, and then use that variable as the “password” of a database connection.
We have a transformation that will set the variable, and then we’ll use that variable in the next transformation.
The first one sets the variable ${ENCRYPTED_PASSWORD} from a text file. This string would be “lifted” from a .ktr that had been saved with the encrypted password in its database connection.
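For reference, a saved connection in a .ktr looks roughly like this (a sketch – the connection details here are made up; the “Encrypted” string is the bit you lift):
<connection>
  <name>test_db</name>
  <server>localhost</server>
  <type>MYSQL</type>
  <access>Native</access>
  <database>test</database>
  <port>3306</port>
  <username>testuser</username>
  <!-- PDI saves passwords obfuscated with an "Encrypted " prefix; this is the string to copy out -->
  <password>Encrypted jasiodfjasodifjaosdifjaodfj</password>
</connection>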
The next transformation uses that variable in its database connection, selects from the database, and outputs the list of tables in the database to a text file.
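The connection in that second transformation simply references the variable in the password field – something like this (again, a sketch with made-up connection details):
<connection>
  <name>test_db</name>
  <server>localhost</server>
  <type>MYSQL</type>
  <access>Native</access>
  <database>test</database>
  <port>3306</port>
  <username>testuser</username>
  <!-- PDI resolves the variable and, per the feature above, decrypts the "Encrypted ..." value before connecting -->
  <password>${ENCRYPTED_PASSWORD}</password>
</connection>
The same encrypted string could just as easily live in kettle.properties instead of being set from a text file.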
Output – works like a charm!
Customers can now have the best of both worlds: centralize their host/user/password connection information using variables (including kettle.properties) and keep those passwords away from casual hackers. I say casual because PDI is open source, so in order for someone to decrypt a password they only need to know Java and know where to find the PDI SVN. 🙂
As always, example attached: encrypted_variables.zip
CDF Tutorials
The folks at webdetails have posted their Pentaho Community Dashboard Framework tutorials, and they look great! They run you through building CDF dashboards, which is usually a crucial, user-facing part of any BI implementation. While much of the work is the ETL/OLAP configuration, tuning, etc. on the back end, most users think of Pentaho as the dashboards/reports they interact with, not the data munching for the data warehouse.
I’ve implemented more than 20 CDF dashboards at four customers already, but I still bought the tutorials to learn even more ins and outs. You should too! There’s no better way to learn something than from the source of the technology, which in this case is Pedro and team @ webdetails.
Off Topic: My First Airplane Solo
I’ve wanted to fly since I was 14. I started training in 2004, but cut it short when I moved from Boston to Seattle. The wife was kind enough to give me a swift kick in the butt, telling me to go get the license now; doing so later in life will be more difficult.
In May I logged 15 hours of dual instruction at Boeing Field (a short 15-minute drive from my house), and yesterday I soloed the 172 Skyhawk N52139 for the first time.
I was expecting the solo to be intense: thinking hard and hearing my instructor’s voice in my head with all the things that have to happen at the same time to get the plane safely on the ground. Instead, I found that after the training I knew how to simply “fly the plane,” and I was just doing what I knew the plane needed to land softly, without much thinking. After two touch-and-gos and a full-stop landing, both my instructor and I concluded I’ve acquired the “walking and chewing gum” skills when it comes to flying.
I’ll be continuing training hours over the next couple of months. I’m aiming for September to get my license so I can take my wife places with me in the plane. After all, her encouragement is what got me going on this old aspiration; the least I can do is fly her to dinner in the San Juans.
MDX Humor from Portugal
Pedro Alves, the very talented lead developer behind the Pentaho Community Dashboard Framework, gave me a good chuckle with his high opinion of MDX as a language:
MDX is God’s gift to business language; when God created Adam and Eve he just spoke [Humanity].[All Members].Children. That’s how powerful MDX is. And Julian Hyde allowed us to use it without being bound to Microsoft.
If you haven’t checked out Pedro’s blog, definitely get over there. It’s a recent start but he’s already getting some great stuff posted.
PDI Scale Out Whitepaper
I’ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration. These customers have some big challenges – one customer was expecting 1 billion rows / day to be processed in their ETL environment. Some of these customers were rolling their own solutions; others had very expensive proprietary solutions (Ab Initio, I’m pretty sure, though they couldn’t say, since Ab Initio contracts are bizarre). One thing was common: they all had billions of records, a batch window that remained the same, and software costs that were out of control.
None of these customer specifics are public, and they likely won’t be, which is difficult for Bayon / Pentaho, because sharing these top-level metrics would be helpful for anyone using or evaluating PDI. Key questions when evaluating a scale-out ETL tool: Does it scale with more nodes? Does it scale with more data?
I figured it was time to share some of my research and findings on how PDI scales out, and this takes the form of a whitepaper. Bayon is pleased to present this free whitepaper, Pentaho Data Integration: Scaling Out Large Data Volume Processing in the Cloud or on Premise. In the paper we cover a wide range of topics, including results from running transformations with up to 40 nodes and 1.8 billion rows.
Another interesting set of findings in the paper relates to a very pragmatic constraint on my research – I don’t have a spare $200k to simply buy 40 servers to run these tests. I have been using EC2 for quite a while now, and figured it was the perfect environment to see how PDI could scale on the cheapest of cheap servers ($0.10 / hour). Another interesting metric relating to cloud ETL is the top-level benchmark of a utility compute cost of 6 USD per billion rows processed, with zero long-term infrastructure commitments.
Matt Casters, Lance Walter, and I will also be presenting a free online webinar to go over the top-level results and have a discussion on large data volume processing in the cloud:
High Performance ETL using Cloud- and Cluster-based Deployment
Tuesday, May 26, 2009 2:00 pm
Eastern Daylight Time (GMT -04:00, New York)
If you’re interested in processing lots of data with PDI, or want to deploy PDI in the cloud, please register for the webinar or contact me.
NYC 4/9 and 4/10
I’ll be roaming around Manhattan this Thursday and Friday. I have some plans, but I actually have some bits of free time.
Anyone want to shoot the breeze about Pentaho / BI / Open Source? Email me: ngoodman@ignorethispart.com bayontechnologies.com
Pentaho Partner Summit
I’m at the Westin close to the event space for the summit…
I’m around tonight – meeting Bryan Senseman from OpenBI a bit later (7:30 or 8:00 pm). Anyone else around and want to meet up for dinner? Email me ngoodman@ignorethispart.com bayontechnologies.com.
Make Mondrian Dumb
I had a customer recently who had very hierarchical data, with some complicated measures that didn’t aggregate up according to regular ole aggregation rules (sum, min, max, avg, count, distinct count). Now, one can do weighted averages using SQL expressions in a Measure Expression, but these rules were complex and they were also dependent on the other dimension attributes. UGGGGH.
Come to that, their analysts had the pristine, blessed data sets calculated at different rollups (already aggregated to Company Regions). Mondrian, though, is often too smart for its own good. If it has data in cache and thinks it can roll up a measure to a higher level (Companies can be rolled up to Company Regions if the aggregator is SUM, for instance), Mondrian will do that. This is desirable in like 99.9% of cases. Unless you want to “solve” your cube and just tell Mondrian to read the data from your tables.
I started thinking – since their summary row counts are actually quite small:
- What if I could get Mondrian to ignore the cache and always ask the database for the result? I had never tried the “cache=” attribute of a Cube before (it defaults to true, and I work with that default, like 99.9% of the world). It seems like setting it to false does the trick: members are read and cached, but the cells aren’t.
- What if I could get Mondrian to look to my summary tables for the data instead of aggregating the base fact? That just seems like a standard aggregate table setup: configure an aggregate table so Mondrian will read the Company Regions set from the aggregate instead of the fact table.
Looks like I was getting close to what I wanted. Here’s the dataset I came up with to test:
mysql> select * from fact_base;
+----------+-----------+-----------+
| measure1 | dim_attr1 | dim_attr2 |
+----------+-----------+-----------+
|        1 | Parent    | Child1    |
|        1 | Parent    | Child2    |
+----------+-----------+-----------+
2 rows in set (0.00 sec)
mysql> select * from agg_fact_base;
+------------+----------+-----------+
| fact_count | measure1 | dim_attr1 |
+------------+----------+-----------+
|          2 |       10 | Parent    |
+------------+----------+-----------+
1 row in set (0.03 sec)
mysql>
Here’s the Mondrian schema I came up with:
<Schema name="Test">
  <Cube name="TestCube" cache="false" enabled="true">
    <!-- cache="false": cell data is never cached, so Mondrian always asks the database -->
    <Table name="fact_base">
      <!-- Map the summary table so the Attr1 level is read from agg_fact_base instead of rolled up from fact_base -->
      <AggName name="agg_fact_base">
        <AggFactCount column="fact_count"/>
        <AggMeasure name="[Measures].[Meas1]" column="measure1"/>
        <AggLevel name="[Dim1].[Attr1]" column="dim_attr1"/>
      </AggName>
    </Table>
    <Dimension name="Dim1">
      <Hierarchy hasAll="true">
        <Level name="Attr1" column="dim_attr1"/>
        <Level name="Attr2" column="dim_attr2"/>
      </Hierarchy>
    </Dimension>
    <Measure name="Meas1" column="measure1" aggregator="min"/>
  </Cube>
</Schema>
Notice that the aggregate for Parent in the agg table is “10,” while the value if the children were summed from the base table would be “2.” 2 means it agged the base table = BAD. 10 means it used the summarized data = GOOD.
The key piece I wanted to verify is this: if I start with an MDX query for the CHILDREN and THEN request the Parent, will I get the correct value? Run a cold-cache MDX query to get the children values:
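Something along these lines (a sketch – member names follow the test schema above):
SELECT
  {[Measures].[Meas1]} ON COLUMNS,
  {[Dim1].[Parent].Children} ON ROWS
FROM [TestCube]
This should come back with 1 for each of Child1 and Child2, read straight from fact_base.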
Those look good. Let’s grab the parent level now, and see what data we get:
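The parent-level query is roughly:
SELECT
  {[Measures].[Meas1]} ON COLUMNS,
  {[Dim1].[Parent]} ON ROWS
FROM [TestCube]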
The result is 10 = GOOD! I played around with access methods to see if I could get it messed up, and on my simple example I couldn’t. I’ll leave it to the comments to point out any potential issues with this approach, but it appears as if setting cache="false" and setting up your aggregate tables properly will cause Mondrian to be a dumb cell reader and simply select out the values you’ve already precomputed. Buyer beware – you’d have to get REALLY REALLY good agg coverage to handle all the permutations of levels in your Cube. This could be rough – but it does work. 🙂 And with no caching, every request issues SQL, so that might be an issue too.
Sample: cachetest.zip
Mondrian – you’ve been dumbed down! Take that!!!