Monthly Archives: June 2006

Donating to Open Source: Gratitude

A while back I blogged about gratitude and generosity, which was mostly about how it made ME feel when I was experiencing those feelings in times of change and growth.  What’s the flip side of that coin, or the other end of that stick, or whatever metaphor you want to use?  How does expressing gratitude to others for what they do feel?

Apparently pretty good; or at least good enough to respond with some very kind, personal notes of thanks.  A few weeks back I realized that I use two open source projects that provide exceptional products.  Truly, they’ve transcended the open source motto of "the code is the documentation and RTFM if there were one" and have created wonderful, easy to use products.  I realized that I had not given these people anything in return (I never encountered any bugs/etc to submit patches for!).

I donated, via their website instructions, to a CYGWIN developer and Gallery.  I received personal notes of thanks, expressing real gratitude.  It wasn’t for the money either (I donated 25 USD to each developer) but more of recongition of their contribution.  I get this.  If someone (me) is willing to pay someone whom they’ve never met before, willing to seek out the method (donation pages and paypal hoops), part with real money, while they’re under absolutely NO OBLIGATION or expectation to means that I think they did a great job.

Well they have! 

Have you ever considered donating to an open source project?  What open source projects do you get value from?  Consider dropping them $20 and see how good it makes them AND you feel!  I bet you’ll feel better giving $20 to the Apache foundation than paying your next enterprise software bill.

Kettle and Pentaho: 1+1=3

Like all great open source products, Pentaho Data Integration (Kettle) is a functional product in and of itself.  It has a very productive UI and delivers exceptional value as a tool in and of itself.  Most pieces of the Pentaho platform reflect a desire to keep the large communities around the original projects (Mondrian, JFree, etc) engaged; they are complete components in and of themselves.

When used together their value, as it relates to building solutions increases and exceeds their use independently.  I’ll be the first to admit that Pentaho is still fairly technical, but we’re rapidly building more and more graphical interfaces and usability features on top of the platform (many in the open source edition, but much is in the professional edition).  Much of this work involves making the "whole" (Pentaho)  work together to exceed the value of the pieces (Mondrian, Kettle, JFree, …).

A few things immediately come to mind of why Pentaho and Kettle together provide exceptional value as compared to used individually or with another open source reporting library:

  1. Pentaho abstracts data access (optionally) from report generation which gives report developers the full POWER of Kettle for building reports.

    There are some things that are tough, if not downright impossible to do in SQL.  Ever do an HTTP retrieval of an XML doc, slurp in a custom lookup from Excel, do a few database joins and analytical calculations in a SQL statement?  I bet not.  Report developers are smart data dudes; having access to a tool that allows them to sort/pivot/group/aggregate/lookup/iterate/list goes on and on/etc empowers report developers in a way that a simple "JDBC" or "CSV" or "XQuery" alone can accomplish. 
    How is this made possible?
    Pentaho abstracts (optionally, it isn’t forced on customers) the data retrievals to lookup components.  This allows BI developers to use either a SQL lookup (DB), XQuery lookup(XML), MDXLookup (OLAP), or Kettle lookup (EII) to populate a "ResultSet."  Here’s the beauty; reports are generated off a result set instead of directly accessing the sources.  This means that a user can use the same reporting templates, framework, designer, etc and feed/calculate data from wherever they desire.  Truly opens a world of possibiliy where before there was "just SQL" or "ETL into DB tables."

  2. Ability to manage the entire solution in one place

    Pentaho has invested greatly in the idea of the solution being a set of "things" that make up your BI, reporting, DW solution.  This means you don’t have ETL in one repository, reports managed somewhere else, scheduling managed by a third party, etc.  It’s open source so that’s obviously a choice, but we can add much value by ensuring that someone who has to transform data, schedule that, email and monitor, secure, build reports, administer email bursting, etc can do some from one "solution repository." Managing an entire BI solution from one CVS repository?  Now that’s COOL (merge diff/patch anyone?).

  3. Configuration Management

    Kettle is quite flexible; the 2.3.0 release extends the scope and locations where you can use variable substitution.  From a practical standpoint this means that an entire Chef job can be parameterized and called from a Pentaho action sequence.  For instance, because you can do your DW load from inside Pentaho action sequences that means you can secure it, schedule it, monitor it, initiate it from an outside workflow via web service, etc.  In one of my recent Kettle solutions ALL OF THE PHYSICAL database, file, and security information was managed by Pentaho so the Kettle mappings can literally be moved from place to place and work inside of Pentaho. 

  4. Metadata and Additional Integration

    Pentaho is investing in making the tools more seamless.  In practice (this is not a roadmap or product direction statement) this means being able to interact with tables, connections, business views inside of Kettle in an identical (at least similar way) in the report designer.  For example, if you’ve defined the business name for a column to be "Actual Sales" Kettle and the Report Designer can now key off that same metadata and present a "consistent" view to the report/ETL developer instead of knowing that "ACT_SL_STD_CURR" is actual sales. 
    Another example is the plans to do some additional Mondrian/Kettle integration to make the building of Dimensions, Cubes, and Aggregates easier.

Free, Valuable, DW Wisdom from man in lilac suit

For those that don’t know the reference between lilac suit and Pete-s, just google it.  :)  Doesn’t really matter though when compared with the great set of articles that my friend Pete-S is pumping out from the other side of the Atlantic.

Pete has in the trenches practical knowledge building BI and DW systems.  He’s both sharp, and practical (that’s rare you know!).  He’s running a series DW Wisdom and it’s some very useful content.

DW Wisdom

I like that Pete is comfortable enough with his own skills/abilities to question the "age old wisdoms" of DW.  Even if they are found to be true it’s good to see some real scientific "assumption breaking" to prove/disprove reality.

  • Use as many small disks as possible – a 1TB disk would be a bad idea for a system that inherently reads large volumes of data, everything would go through a single IO point.

    • Keep all the OLTP tables separate from the DW systems; OLTP has lots of small, fast transactions, DW has slower, big reads. DW loves bitmap indexes, OLTP hates them.

    • Use high degrees of parallel processing

But are these truths still valid?

DI Wisdom (2) – more of the physical

Comparison of OLTP vs DW(OLAP-esque):  Great reference table:

DI Wisdom (3) – departmental business

When people found that their transactional systems were unsuited for BI reporting (perhaps because of the performance impact of running BI on a transactional system, or the transactional system did not hold all of the data required for reporting) they started to look towards dedicated data warehouses.

DI Wisdom (4)

Enterprise DW moves away from the tactical departmental “point solutions” and into something that fits with strategic aspirations of the enterprise. On the face of it having a single solution across the enterprise as distinct advantages:

  • there is a single, consistent model of enterprise data

    • there is less duplication of data across the enterprise
    • it is possible to construct a security model such that the right people see all of the data that allows them to their jobs but not the information that is too sensitive for their job role
    • the origins of all of the business data can be traced back to source

In fact these aims are so laudable that they have been hijacked by other IT disciplines such as master data management, risk and compliance management, and business process reengineering.

DW Design (part 1)

I can’t agree with Pete more: A staging, 3NF warehouse, and then presentation layer (marts) I think is a very practical way to seperate concerns, and avoid tight coupling between source and reports.  A la Corporate Information Factory.

For a long while I have favoured a three section data warehouse design: a staging area where raw fact and reference data is validated for referential integrity, a third-normal form layer to hold the reference data and historical fact, and finally a presentation layer to hold denormalised reference data and aggregated fact. The staging layer is ‘private’ to the data warehouse but user query access (subject to business security rules) to other layers is permitted. In some cases it will not be possible to use a denormalised layer; but if you can use one, you should.

DW Design (2) – staging data

As mentioned yesterday, the staging area of the data warehouse has three functional uses:

  • It is the initial target for data loads from source systems

    • It validates the incoming data for integrity

    • It is the data source for information to be published the ‘user visible” layers of the data warehouse

Optionally, it may also be where the logic to transform incoming data is applied.

Great series Pete!  Now if only this were in a book that I could tell all my blog readers and colleagues to purchase!  :)

Use Open Source BI: Win a Mac Mini

Tomorrow MySQL and Pentaho are presenting on how MySQL and Pentaho can work together to deliver exceptional value when used in combination to solve Business Intelligence and Reporting business challenges.

I’ve been working more and more with MySQL over the past couple of months since joining Pentaho and I’m pleasantly surprised.  On the good side of the 80/20 rule (ie, 80% of users want 20% of the features) it’s exceptionally "good enough" for things that I want to do.

Back to the tagline.

Tomorrow, Pentaho is highlighting our desire to be as easy to MySQL users as MySQL is.  We want to understand how to make it increasingly easy to use Pentaho with MySQL.  In return for providing Pentaho with much needed feedback on ease of use and the user experience for installation/configuration Pentaho is giving away a Mac Mini.  It’s no iPod, thank heavens, as everyone is giving those away these days. 

10am PT, 1pm ET in the US.  Register and dial in here.  Read the press release here.

Roland Bouman on Pentaho Data Integration

I just got on to Roland Boumans blog.

He has an excellent write up on how to get started with Pentaho Data Integration (aka Kettle):

Kettle is a free, open source (LGPL) ETL (Extraction, Transformation and Loading) tool. The product name should actually be spelled as K.E.T.T.L.E, which is a recursive acronym for "Kettle Extraction, Transport, Transformation and Loading Environment".


An interesting feature of Kettle is that it is model-driven. Both Spoon and Chef offer a graphical user interface to define the ETL processes on a high level. Typically, this involves no actual programming at all – rather, it’s a purely declarative task which results in a model.

He wonders at the end about how Kettle is deployed inside of Pentaho.  That’s a great question and something Pentaho/Matt have been working at over the past few months.  Perhaps I can shed a bit of light on this.

In the latest build of Pentaho (I used 1.1.6 for the below screenshots)  we ship an example of a Kettle mapping returning a result set which demonstrates the great architectures of both Kettle and Pentaho.  Kettle provides an easy way for creating plugins and interfaces for steps allowing Pentaho to access data at the "end" of a transformation.  Pentaho has a multipurpose result set object which allows for reports to key off of "data" instead of a SQL Result or an MDX Result, or in this case, a real time data slurp (EII result?!?). 

The transformation in spoon looks like this:

It reads data from the Pentaho sample database (QUADRANT_ACTUALS) filters, and does some calculcations and places then in an operator XML Output.  This output operator is superflous, the KettleComponent (from Pentaho) sources the data DIRECTLY from the in memory XML Output object in Kettle.

The ETL example is basic; it doesn’t even place the data slurped from Kettle into a nicely formatted report.

Just to be clear on what you’re seeing…

Pentaho users gain the full power of the Pentaho Data Ingegration tool (proper ETL sorts, groups, XML, csv, xml, filters, calculations, database lookups, aggregations, etc) when used as a SOURCE for their reports.  A full ETL tool where you don’t HAVE to persist the data to tables or files or anything. 

If it suits your needs, you can simply run your ETL everytime your report is run and you never have to build a staging area, data warehouse, or OLAP cubes.  Your mileage may vary and I don’t think this is wholesale replacement for a data warehouse at all!  Just a great way to provide additional options for BI developers.

Going beyond a single transformation in Pentaho we are currently augmenting this component to allow you to execute Chef Jobs from inside Pentaho.  I have no idea when that code is included in a release but I’ll be sure and blog about it here.  That means that you can schedule your "DW load process" or "Data Integration Routines" to run from inside Pentaho.  I’ll post an example of that shortly…

Again, check out Rolands article!  It’s a great addition to the pentaho blogosphere.

Mondrian: OLAP power at your USERS fingertips

As promised, todays followup on how to limit the complexity of calculated MDX measures from your users using Mondrian.

If you haven’t already, start with yesterdays blog, entitled Mondrian: OLAP power at your fingertips. We leveraged the power of MDX to quickly build a calculated MDX member. While in and of itself not all that powerful (% of total is not that tough) it provides with a quick and easy way for users to start exploring the power of MDX without having to do any XML editing, etc. You can just type it into the web browser and “try before you buy.”

Ok, so let’s say you’re sold. You’ve got your calculated measure exactly how you want; great. There’s no WAY that you’d want your users to have to see that and copy and paste that in themselves. It also leaves room for people changing formulas, which in the day and age of SOX is just a bad idea. What you want to do is make this MDX part of your Cube so that a user can just reference it, and disregard the complexity of the expression.

  1. I’m assuming that we’ve already downloaded and started the Pentaho Demo. Refer to yesterdays blogs for instructions.
  2. Edit the file $PROOT/pentaho-demo/pentaho-solutions/samples/analysis/SampleData.mondrian.xml.
  3. Add the following XML fragment right before the ending Cube tag.
    <CalculatedMember name=”Position Percent of Total” dimension=”Measures” formula=”([Measures].[Actual] / ([Measures].[Actual], [Positions].CurrentMember.Parent))” />

    What we’re doing here is tell Mondrian that we want to add a new member named Position Percent of Total to the Measures dimension (a special dimension) using the forumla we worked out yesterday. In theory, this can be any calculated member you’ve sorted out in MDX which means the full power and expressiveness of MDX can be included here.
  4. You should stop and start the server so that mondrian can pick up the schema changes.
  5. Return to the analysis samples (refer to former blog to find it) and launch the slicer Slice and Dice example. Click on the Cube Navigator and then Measures to see if your new “CalculatedMember” is present:

    It should be there so that you can just uncheck the rest of the measures and only have Actual and Position Percent of Total. This is now the experience that your users would have, if they use JPivot here in the web or say the Pentaho Spreadsheet Services (a blog on that next week).
  6. If you expand the All Positions you should see your calculated measure displayed alongside the Actual measure

That’s it! This is really useful for providing your users a bunch of reusable measures (YTD, Versus Period Prior, etc) without having them have to hack it all day in Excel. Enjoy!

Mondrian: OLAP power at your fingertips

Pentaho Analysis (aka Mondrian) is a powerful OLAP tool that can wrap just about any JDBC datasource and present it multidimensionally.  There are many resources that cover how to get Mondrian connected to a database, some of the ins and outs of the the Mondrian.xml schema definition, etc.  Comment below if you’re also interested in seeing a tutorial here in that regard.

Connecting from MDX to usefullness isn’t always the easiest… How powerful is MDX?  Why would I do something in MDX versus SQL?  Why not build another measure in my fact table to support that query instead of MDX?  Well, I won’t belabor the benefits of MDX and the expressiveness of the language – there are plenty of great books and resources in that regard.

Suffice to say that it makes certain analytic queries (Year to Date, % contribution to total, this versus period prior) easy… sometimes embarrassingly easy compared to what we used to do in pure "star schema relational and SQL group by" land.

Let’s build a custom MDX measure and then in a followup I’ll show how provide this to your end users in Mondrian without this complexity.

  1. Download the latest pre configured installation (aka DEMO) of Pentaho here:
  2. Unzip it into a directory on your local machine (we’ll refer to this as $PROOT from here on out).
  3. Double click on "$PROOT/pentaho-demo/start-pentaho.bat."  When you see the message "Pentaho BI Platform Ready" the platform is ready, about 2 minutes.
  4. Open up firefox (or IE) and go to the following URL
    http://localhost:8080, and then click on Samples and Examples
  5. Find Analysis examples, and click on that

    and then click on
  6. You are now viewing the JPivot application that allows you to navigate and build OLAP reports in a web browser.
  7. (Optional Step) Feel free to explore the JPivot interface at your own leisure, a great place to start is the CUBE NAVIGATOR() which gives a pretty good graphical way of navigating OLAP cubes:  Make changes, and then click on OK to make the crosstab report below change.
  8. Click on MDX to open up the MDX Editor Window
  9. Paste the following MDX into the text area and then hit APPLY:

    select NON EMPTY {[Measures].[Actual]} ON COLUMNS,
    NON EMPTY Hierarchize(Union({[Positions].[All Positions]}, [Positions].[All Positions].Children)) ON ROWS
    from [Quadrant Analysis]

    You should see the report look like this:

  10. What we’re seeing here is the rollup of all of our head count (ie, salary expenditures) by position.  The aggregate figure is nice, for the bean counters, but of interest to the analyst and executives is the "that is that in proportion to my entire enterprise?"  In other words, 1,211,073.00 on HR Training personnel is not as interesting as what % I spend on HR Training personnel as part of the total.  Let’s open paste the following calculated member into the MDX window ABOVE the current MDX statement:
    with member [Measures].[Position Percent of Total] as ‘([Measures].[Actual] / ([Measures].[Actual], [Positions].CurrentMember.Parent))’, format_string = "|#.00%|"

    Click APPLY and then Close the MDX editor (red X or the MDX button again).

    What we’ve done here is built a calculated measure that using MDX.  MDX is very powerful so I suggest you check out some MDX books or resources on the net to explore its capabilities.

  11. Now we need to add our brand new calculated measure to our report.  Click on the cube navigator, then "Measures" then highlight the new Measure we just created:

    You have to click OK twice here to get back to the report.
  12. If all has gone well, you should see the following report with your custom MDX measure:

Well, that’s great for the techies and hard core business analysts, but we’d never want to have to have an end user to have to figure all that out.  How do we buffer the user from that complexity?  That will be tomorrows blog entry on how to make a CalculatedMember in Mondrian. 

Bonus Screenshot of a pie chart in two clicks now based on our percentage:

Google Spreadsheets, pretty cool

There’s been some buzz about googles launch of their first "office-esque" product.  So, I signed up to get a preview to see what all the fuss is about.  To be honest, I was pleasantly surprised with the quality of the application.

Google Spreadsheets is a browser based spreadsheet and collaboration tool.  You make, share, save, edit your spreadsheets in a web browser. 

I was pleasantly surprised to see that it wasn’t a super light grid based notepad; it’s a real spreadsheet with formulas to do real work:

You can highlight ranges of cells, just like good ole Excel:

Overall, I was quite happy with the experience and think it will definitely be useful for individuals and SMBs.  I ponder, like others, how useful it will be in big corporate environments BUT don’t really think of it as a competitor to star office/open office. 

So when are Pivot Tables/Pivot Charts arriving… now THAT would be awesome! :)

Off Topic: "OK to discriminate" referendum defeated

The great state of Washington passed a law adding  "sexual orientation" to the list of groups provided anti-discrimination protection  It’s a sad state of affairs when these measures, of any form, are needed to ensure that people are civil to other people; however there are clear needs for such measures.

There’s this polictician who thought it would be a grand idea to sponsor a referendum to put to ballot a measure that specifically excludes these protections for gay and lesbians citizens.  Sad to say more than 100,000 of my fellow Washingtonians signed the measure, but calm rationale heads prevailed: 

Referendum 65 will not appear on the ballot.

JBoss and Redhat officially wed

In a JBoss community email today:

I am writing to you today to announce that the Red Hat acquisition of JBoss has closed and we now are officially a part of the largest independent open source company. I am excited about this news and the great opportunity that it represents. We are entering a new era in the technology industry that puts customers back in charge of their destiny, where innovation and value replace lock-in and costly proprietary-vendor software licenses. Together, we believe we can change the economics of the industry, delivering unmatched value to our customers and partners by creating better software faster, systematically driving down costs and simplifying IT.

Some news on it here here here as well.

That makes Redhat the largest independent Open Source company in the world. Cool.