Category Archives: Pentaho

Kettle and Pentaho: 1+1=3

Like all great open source products, Pentaho Data Integration (Kettle) is a fully functional product in its own right.  It has a very productive UI and delivers exceptional value as a standalone tool.  Most pieces of the Pentaho platform reflect a desire to keep the large communities around the original projects (Mondrian, JFree, etc.) engaged; each remains a complete component on its own.

When used together, their value for building solutions exceeds what they deliver independently.  I’ll be the first to admit that Pentaho is still fairly technical, but we’re rapidly building more and more graphical interfaces and usability features on top of the platform (many in the open source edition, but much is in the professional edition).  Much of this work involves making the "whole" (Pentaho) work together to exceed the value of the pieces (Mondrian, Kettle, JFree, …).

A few things immediately come to mind as to why Pentaho and Kettle together provide exceptional value compared to using them individually or with another open source reporting library:

  1. Pentaho abstracts data access (optionally) from report generation, which gives report developers the full POWER of Kettle for building reports.

    There are some things that are tough, if not downright impossible, to do in SQL.  Ever do an HTTP retrieval of an XML doc, slurp in a custom lookup from Excel, and do a few database joins and analytical calculations, all in one SQL statement?  I bet not.  Report developers are smart data folks; having access to a tool that lets them sort/pivot/group/aggregate/lookup/iterate/the list goes on empowers report developers in a way that simple "JDBC" or "CSV" or "XQuery" access alone cannot. 
    How is this made possible?
    Pentaho abstracts (optionally, it isn’t forced on customers) the data retrievals to lookup components.  This allows BI developers to use either a SQL lookup (DB), XQuery lookup (XML), MDX lookup (OLAP), or Kettle lookup (EII) to populate a "ResultSet."  Here’s the beauty: reports are generated off a result set instead of directly accessing the sources.  This means that a user can use the same reporting templates, framework, designer, etc. and feed/calculate data from wherever they desire.  Truly opens a world of possibility where before there was "just SQL" or "ETL into DB tables."  (A rough sketch of what this looks like follows this list.)

  2. Ability to manage the entire solution in one place

    Pentaho has invested greatly in the idea of the solution being a set of "things" that make up your BI, reporting, and DW solution.  This means you don’t have ETL in one repository, reports managed somewhere else, scheduling managed by a third party, etc.  It’s open source so that’s obviously a choice, but we can add much value by ensuring that someone who has to transform data, schedule it, email and monitor, secure, build reports, administer email bursting, etc. can do so from one "solution repository."  Managing an entire BI solution from one CVS repository?  Now that’s COOL (merge/diff/patch anyone?).

  3. Configuration Management

    Kettle is quite flexible; the 2.3.0 release extends the scope and locations where you can use variable substitution.  From a practical standpoint this means that an entire Chef job can be parameterized and called from a Pentaho action sequence.  For instance, because you can run your DW load from inside Pentaho action sequences, you can secure it, schedule it, monitor it, initiate it from an outside workflow via web service, etc.  In one of my recent Kettle solutions ALL OF THE PHYSICAL database, file, and security information was managed by Pentaho, so the Kettle mappings can literally be moved from place to place and still work inside of Pentaho. 

  4. Metadata and Additional Integration

    Pentaho is investing in making the tools more seamless.  In practice (this is not a roadmap or product direction statement) this means being able to interact with tables, connections, and business views inside of Kettle in an identical (or at least similar) way in the report designer.  For example, if you’ve defined the business name for a column to be "Actual Sales," Kettle and the Report Designer can now key off that same metadata and present a "consistent" view to the report/ETL developer instead of each needing to know that "ACT_SL_STD_CURR" is actual sales. 
    Another example is the plan to do some additional Mondrian/Kettle integration to make building Dimensions, Cubes, and Aggregates easier.
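
Back to point 1: here is a rough, hypothetical sketch of two interchangeable lookups populating the same result set in an action sequence.  SQLLookupRule and the KettleComponent are real platform pieces, but the element names below are from memory and meant purely as illustration; check the samples that ship with the platform for the exact syntax.

<!-- Option A: a SQL lookup populates the result set -->
<action-definition>
  <component-name>SQLLookupRule</component-name>
  <component-definition>
    <jndi>SampleData</jndi>
    <query>select REGION, ACTUAL from QUADRANT_ACTUALS</query>
  </component-definition>
  <action-outputs>
    <query-result type="result-set" mapping="report_data"/>
  </action-outputs>
</action-definition>

<!-- Option B: a Kettle transformation populates the very same result set;
     the report component downstream consumes "report_data" either way -->
<action-definition>
  <component-name>KettleComponent</component-name>
  <component-definition>
    <transformation-file>etl/my_transformation.ktr</transformation-file>
  </component-definition>
  <action-outputs>
    <transformation-output type="result-set" mapping="report_data"/>
  </action-outputs>
</action-definition>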

Use Open Source BI: Win a Mac Mini

Tomorrow MySQL and Pentaho are presenting on how the two can work together to deliver exceptional value for Business Intelligence and Reporting challenges.

I’ve been working more and more with MySQL over the past couple of months since joining Pentaho and I’m pleasantly surprised.  On the good side of the 80/20 rule (i.e., 80% of users want 20% of the features), it’s exceptionally "good enough" for the things I want to do.

Back to the tagline.

Tomorrow, Pentaho is highlighting our desire to make Pentaho as easy for MySQL users as MySQL itself.  We want to understand how to make it increasingly easy to use Pentaho with MySQL.  In return for providing Pentaho with much-needed feedback on ease of use and the installation/configuration experience, Pentaho is giving away a Mac Mini.  It’s no iPod, thank heavens, as everyone is giving those away these days. 

10am PT, 1pm ET in the US.  Register and dial in here.  Read the press release here.

Roland Bouman on Pentaho Data Integration

I just came across Roland Bouman’s blog.

He has an excellent write up on how to get started with Pentaho Data Integration (aka Kettle):

Kettle is a free, open source (LGPL) ETL (Extraction, Transformation and Loading) tool. The product name should actually be spelled as K.E.T.T.L.E, which is a recursive acronym for "Kettle Extraction, Transport, Transformation and Loading Environment".

….

An interesting feature of Kettle is that it is model-driven. Both Spoon and Chef offer a graphical user interface to define the ETL processes on a high level. Typically, this involves no actual programming at all – rather, it’s a purely declarative task which results in a model.

He wonders at the end about how Kettle is deployed inside of Pentaho.  That’s a great question and something Pentaho/Matt have been working on over the past few months.  Perhaps I can shed a bit of light on this.

In the latest build of Pentaho (I used 1.1.6 for the screenshots below) we ship an example of a Kettle mapping returning a result set, which demonstrates the great architectures of both Kettle and Pentaho.  Kettle provides an easy way to create plugins and interfaces for steps, allowing Pentaho to access data at the "end" of a transformation.  Pentaho has a multipurpose result set object which allows reports to key off of "data" instead of a SQL result or an MDX result, or in this case, a real-time data slurp (an EII result?!?). 

The transformation in spoon looks like this:

It reads data from the Pentaho sample database (QUADRANT_ACTUALS), filters it, does some calculations, and places the results in an XML Output operator.  This output operator is superfluous; the KettleComponent (from Pentaho) sources the data DIRECTLY from the in-memory XML Output object in Kettle.

The ETL example is basic; it doesn’t even place the data slurped from Kettle into a nicely formatted report.

Just to be clear on what you’re seeing…

Pentaho users gain the full power of the Pentaho Data Integration tool (proper ETL sorts, groups, XML, CSV, filters, calculations, database lookups, aggregations, etc.) when it is used as a SOURCE for their reports.  A full ETL tool where you don’t HAVE to persist the data to tables or files or anything. 

If it suits your needs, you can simply run your ETL every time your report is run and you never have to build a staging area, data warehouse, or OLAP cubes.  Your mileage may vary and I don’t think this is a wholesale replacement for a data warehouse at all!  Just a great way to provide additional options for BI developers.

Going beyond a single transformation in Pentaho, we are currently augmenting this component to allow you to execute Chef jobs from inside Pentaho.  I have no idea when that code will be included in a release, but I’ll be sure to blog about it here.  That means you’ll be able to schedule your "DW load process" or "Data Integration Routines" to run from inside Pentaho.  I’ll post an example of that shortly…

Again, check out Roland’s article!  It’s a great addition to the Pentaho blogosphere.

Mondrian: OLAP power at your USERS fingertips

As promised, today’s followup on how to hide the complexity of calculated MDX measures from your users using Mondrian.

If you haven’t already, start with yesterday’s blog, entitled Mondrian: OLAP power at your fingertips. We leveraged the power of MDX to quickly build a calculated MDX member. While not all that powerful in and of itself (% of total is not that tough), it provides a quick and easy way for users to start exploring the power of MDX without having to do any XML editing, etc. You can just type it into the web browser and “try before you buy.”

Ok, so let’s say you’re sold. You’ve got your calculated measure exactly how you want it; great. There’s no WAY that you’d want your users to have to see that formula and copy and paste it in themselves. It also leaves room for people changing formulas, which in the day and age of SOX is just a bad idea. What you want to do is make this MDX part of your Cube so that a user can just reference it and disregard the complexity of the expression.

  1. I’m assuming that we’ve already downloaded and started the Pentaho Demo. Refer to yesterday’s blog for instructions.
  2. Edit the file $PROOT/pentaho-demo/pentaho-solutions/samples/analysis/SampleData.mondrian.xml.
  3. Add the following XML fragment right before the ending Cube tag.
    <CalculatedMember name="Position Percent of Total" dimension="Measures" formula="([Measures].[Actual] / ([Measures].[Actual], [Positions].CurrentMember.Parent))" />

    What we’re doing here is telling Mondrian that we want to add a new member named Position Percent of Total to the Measures dimension (a special dimension) using the formula we worked out yesterday. In theory, this can be any calculated member you’ve sorted out in MDX, which means the full power and expressiveness of MDX can be included here.
  4. You should stop and start the server so that Mondrian can pick up the schema changes.
    $PROOT/pentaho-demo/stop-pentaho.bat
    $PROOT/pentaho-demo/start-pentaho.bat
  5. Return to the analysis samples (refer to the previous post to find them) and launch the "Slice and Dice" example. Click on the Cube Navigator and then Measures to see if your new “CalculatedMember” is present:

    It should be there, so you can just uncheck the rest of the measures and keep only Actual and Position Percent of Total. This is now the experience your users would have, whether they use JPivot here in the web or, say, the Pentaho Spreadsheet Services (a blog on that next week).
  6. If you expand All Positions you should see your calculated measure displayed alongside the Actual measure.

That’s it! This is really useful for providing your users with a bunch of reusable measures (YTD, versus prior period, etc.) without making them hack it all day in Excel. Enjoy!
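
As a teaser for those reusable measures: assuming your cube has a [Time] dimension (the sample cube used here may not), a year-to-date flavor of Actual could be baked into the schema in exactly the same way. This is a sketch, untested against the demo data:

<CalculatedMember name="Actual YTD" dimension="Measures" formula="Sum(Ytd([Time].CurrentMember), [Measures].[Actual])" />

Same pattern as before: drop it in right before the closing Cube tag, restart the server, and the new measure shows up in the Cube Navigator.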

Mondrian: OLAP power at your fingertips

Pentaho Analysis (aka Mondrian) is a powerful OLAP tool that can wrap just about any JDBC datasource and present it multidimensionally.  There are many resources that cover how to get Mondrian connected to a database, some of the ins and outs of the Mondrian.xml schema definition, etc.  Comment below if you’re also interested in seeing a tutorial here in that regard.

Getting from MDX to usefulness isn’t always easy… How powerful is MDX?  Why would I do something in MDX versus SQL?  Why not build another measure in my fact table to support that query instead of using MDX?  Well, I won’t belabor the benefits of MDX and the expressiveness of the language – there are plenty of great books and resources in that regard.

Suffice to say that it makes certain analytic queries (Year to Date, % contribution to total, this versus period prior) easy… sometimes embarrassingly easy compared to what we used to do in pure "star schema relational and SQL group by" land.

Let’s build a custom MDX measure, and then in a followup I’ll show how to provide this to your end users in Mondrian without this complexity.

  1. Download the latest pre configured installation (aka DEMO) of Pentaho here:
    http://www.pentaho.org/download/latest.html
  2. Unzip it into a directory on your local machine (we’ll refer to this as $PROOT from here on out).
  3. Double click on "$PROOT/pentaho-demo/start-pentaho.bat."  When you see the message "Pentaho BI Platform Ready" the platform is ready; this takes about 2 minutes.
  4. Open up firefox (or IE) and go to the following URL
    http://localhost:8080, and then click on Samples and Examples
  5. Find Analysis examples, and click on that

    and then click on the "Slice and Dice" example.
  6. You are now viewing the JPivot application that allows you to navigate and build OLAP reports in a web browser.
  7. (Optional Step) Feel free to explore the JPivot interface at your leisure; a great place to start is the CUBE NAVIGATOR, which gives a pretty good graphical way of navigating OLAP cubes.  Make changes, and then click OK to make the crosstab report below change.
  8. Click on MDX to open up the MDX Editor Window
  9. Paste the following MDX into the text area and then hit APPLY:

    select NON EMPTY {[Measures].[Actual]} ON COLUMNS,
    NON EMPTY Hierarchize(Union({[Positions].[All Positions]}, [Positions].[All Positions].Children)) ON ROWS
    from [Quadrant Analysis]

    You should see the report look like this:

  10. What we’re seeing here is the rollup of all of our head count (i.e., salary expenditures) by position.  The aggregate figure is nice for the bean counters, but what interests analysts and executives is "what is that in proportion to my entire enterprise?"  In other words, 1,211,073.00 on HR Training personnel is not as interesting as what % I spend on HR Training personnel as part of the total.  Let’s paste the following calculated member into the MDX window ABOVE the current MDX statement:
    with member [Measures].[Position Percent of Total] as '([Measures].[Actual] / ([Measures].[Actual], [Positions].CurrentMember.Parent))', format_string = "|#.00%|"

    Click APPLY and then Close the MDX editor (red X or the MDX button again).

    What we’ve done here is built a calculated measure using MDX.  MDX is very powerful so I suggest you check out some MDX books or resources on the net to explore its capabilities.  (The combined statement, with the new measure added, is shown after this list.)

  11. Now we need to add our brand new calculated measure to our report.  Click on the Cube Navigator, then "Measures", then highlight the new measure we just created:

    You have to click OK twice here to get back to the report.
  12. If all has gone well, you should see the following report with your custom MDX measure:
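
For reference, here is roughly the full statement from steps 9 through 11 once the calculated member and both measures are combined into a single query (the same formula as above, just assembled in one place):

with member [Measures].[Position Percent of Total] as
  '([Measures].[Actual] / ([Measures].[Actual], [Positions].CurrentMember.Parent))',
  format_string = "|#.00%|"
select NON EMPTY {[Measures].[Actual], [Measures].[Position Percent of Total]} ON COLUMNS,
NON EMPTY Hierarchize(Union({[Positions].[All Positions]}, [Positions].[All Positions].Children)) ON ROWS
from [Quadrant Analysis]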

Well, that’s great for the techies and hard-core business analysts, but we’d never want an end user to have to figure all that out.  How do we buffer the user from that complexity?  That will be tomorrow’s blog entry, on how to make a CalculatedMember in Mondrian. 

Bonus screenshot: a pie chart, two clicks later, based on our new percentage measure:

Reporting in 60 seconds with Oracle XE

Ok, well maybe not 60 seconds but more like 5 minutes.

Some great news today from Pentaho.  Pentaho has purchased proprietary software for visually building reports and donated it to the open source community.  The full text of the announcement is here:

 More on this in a later post… Let’s get to the bits.

Much of what I’ve been hearing in the marketplace is the perception that open source has hidden costs in terms of usability and the technical savvy required of users. 

Let’s try to challenge this assumption and see how "productive" we can be building a simple database-driven report.

Download the Report Designer here and unzip it to a directory of your choosing.  There’s zero installation required (you’ll need Java on your computer, but that’s pretty prevalent these days).  Double click on the .bat file (or the .sh if you’re using Linux).

When the product launches you’ll notice a panel that allows you to drag and drop elements onto the canvas in an easy-to-use fashion.  I’m sure additional tutorials will come on this blog and Pentaho.org, but suffice to say, this is an interface which is easy enough to just "figure out" by playing with it.

The "RED" circle is our palette and the "BLUE" area is our canvas.

Let’s get going and start our report!

  1. Launch the Report Wizard
  2. Select the option that indicates that you will use your own dataset, a JDBC dataset
  3. Enter the connection information and driver location for your database.  I keep the latest Oracle Express Edition (a free version of the Oracle database) around.
    Note:  You don’t have to do anything special with the JDBC driver (put it in a lib directory or anything).  All you have to do is show the wizard where the .jar file is.

    I’m using the HR schema which comes with the Oracle database for testing/sample purposes.
  4. Navigate to find the "EMPLOYEES" and "DEPARTMENTS" tables.  Enter a simple SQL query that contains the data of interest and use the very handy Preview option to make sure you’re getting the results desired.

    Feel free to just copy and paste if you are using XE as well:
    select
    d.department_name "Department Name",
    e.first_name "First Name",
    e.last_name "Last Name",
    e.phone_number "Phone Number"
    from
    employees e, departments d
    where e.department_id = d.department_id
    order by d.department_name
  5. You can just click Next on the visible fields screen:
  6. Highlight "Department Name" and add it to the grouping and hit "Finish"
  7. It should return you to the original canvas where you can see the design of your report.  From here you can start to add images, lines, change fonts, etc.  Have fun exploring the richness of the options available here.
  8. Let’s see what it looks like!  Click Preview and page through your report!

That’s it!  You’re done!  You’ve created a report that can be parameterized, hosted, bursted, and distributed, all in just a few minutes.  In a follow-on blog I’ll show you how to publish this into the Pentaho server (it’s only about two more steps).

Happy Trails!

OLAP Survey

I noticed some other bloggers have posted this, but I’ll encourage people as well. The more people that respond, the more accurately the survey represents reality!

“We would very much welcome your participation in The OLAP Survey 6. This is the largest independent survey of business intelligence users worldwide. The Survey will obtain input from a large number of users to better understand their buying decisions, the implementation cycle and the business success achieved. Both business and technical respondents are welcome.

The OLAP Survey is strictly independent. While Oracle, Microsoft, Cognos and other vendors assist by inviting users to participate in the Survey, the vendors do not sponsor the survey, nor influence the questionnaire design or survey results. As a participant, you will not only have the opportunity to ensure your experiences are included in the analyses, but you will also receive a summary of the results from the full survey. You will also have a chance of winning one of ten $50 Amazon vouchers. Click here to complete the survey on-line.”

UPSERTS using Kettle/MySQL

Oracle Readers: Think MERGE Statement. 🙂

It’s common to need to generate surrogate keys as part of a Data Warehouse / Business Intelligence solution. Protecting your users from source system changes, or using time-variant data, are just a couple of the reasons why one needs to generate a surrogate key when loading data into a DW.

Here is a simple example. I’m loading Product data from an XML file into a MySQL table. The transformation in Kettle looks like this:

[screenshot: upsert_kettle_transform.png]

The output coming out of the XML file has been flattened using XQuery so that it looks like a flat table structure. Here’s a screen capture of a nifty feature of Kettle where you can preview the output of a step:

[screenshot: upsert_xml_data.png]

Ok, this is our source data. The ID you’re seeing is the OLTP key. We’ll refer to it as the Natural Key. This is the ORDER_ID, or CUSTOMER_ID, or well, you get the idea. This is the key that your source application uses in its relational structures.

What we want is to INSERT or UPDATE (aka UPSERT) this record into our DW. We’ll INSERT if we don’t already have this NATURAL key in the DW and generate a surrogate primary key. We’ll UPDATE if we already have this NATURAL key in the warehouse.
Our table looks like:
[screenshot: upsert_product_erd.png]

The DDL to create the TABLE is important:

CREATE TABLE INT_PRODUCT (
PRODUCT_ID INTEGER NOT NULL AUTO_INCREMENT,
PRODUCT_NAT_ID INTEGER NOT NULL,
TITLE TINYTEXT,
CONSTRAINT INT_PRODUCT_PK PRIMARY KEY (PRODUCT_ID),
UNIQUE INT_PRODUCT_NK (PRODUCT_NAT_ID)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Notice that our PRIMARY KEY is the surrogate we’ve set up as an AUTO_INCREMENT column. This means that MySQL will generate a surrogate when we INSERT a record. All we have to do is get Kettle to match our unique constraint (INT_PRODUCT_NK) and either INSERT or UPDATE.
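
As an aside for the SQL-inclined: if you were doing this by hand in MySQL (4.1 or later) instead of in Kettle, you could approximate the same "match the unique constraint, then insert or update" behavior with INSERT … ON DUPLICATE KEY UPDATE. The values here are purely illustrative:

INSERT INTO INT_PRODUCT (PRODUCT_NAT_ID, TITLE)
VALUES (1, 'Sample Product')
ON DUPLICATE KEY UPDATE TITLE = VALUES(TITLE);

Kettle’s step works differently under the covers (it does a lookup first, then issues the appropriate DML), but the end result on the table is the same.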

I’ve used the INSERT / UPDATE operator in Kettle. The configuration looks like this:
[screenshot: upsert_upsert_config.png]
What we’re doing is configuring Kettle to do a lookup on the table to see if the natural key already exists. If it does, we’ll perform DML (insert or update) on the fields in the lower window.

After running the transformation my INT_PRODUCT table looks like this.
[screenshot: upsert_surrogate.png]
Now, I realized I didn’t choose very illustrative natural keys (all are 1,2,3) but the PRODUCT_ID comes from MySQL as a surrogate.

Subsequent executions of this Kettle transformation will INSERT records as they arrive or UPDATE. MySQL and Kettle in combination are making quick work out of generating these surrogates!

I invite Kettle experts to chime in on any optimizations/quirks.

Connect to your own Database

I get asked frequently by people who evaluate Pentaho: How do I get started? Many people (tens of thousands on a monthly basis) download the Pre Configured application and immediately get a sense for the benefits of the Pentaho open source project.

What people almost always want to do next is get started with their OWN data. You, the reader of this blog, might find a pre-canned demo interesting, but you’ll want to reserve judgment on the leading open source BI project until you’ve seen your own data flowing into beautiful PDFs/Excel/OLAP pivot tables. I totally understand…

This is a very SIMPLE, BASIC, LIGHTWEIGHT guide on how to get the downloadable DEMO connected to YOUR DATABASE and executing YOUR SQL Query using the lynchpin of the Pentaho platform, an Action Sequence (refer to this blog for more on what Action Sequences are).

Disclaimer: Pentaho has development tools that, once installed, make the development of action sequences more productive than editing XML as below. In other words, this is just rough and dirty to get you going and connected… I highly suggest downloading the Development Workbench to actually build the Action Sequences.

Step 1:
Collect the following information for your database.

  • Vendor (Oracle/MySQL) provided JDBC Jar file (classes12.zip, mysql-connector-java-3.1.12-bin.jar)
  • JDBC Connection URL (jdbc:mysql://localhost/database)
  • JDBC User Name (scott)
  • JDBC Password (tiger)
  • A SQL query that you are certain works in your database (select id, title from status)

This is the information needed for ANY Java Application to connect to a database, and is not really Pentaho specific.

Step 2:
Create (using notepad or vi) a file named myfirst-ds.xml in $DEMO_BASE/jboss/server/default/deploy/

Put the following into that file to create a JNDI datasource with the JNDI-NAME “myfirst”

[screenshot: myfirst_001.png]
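
In essence, the file is a standard JBoss datasource descriptor; mine looked roughly like this, using the sample MySQL values from Step 1 (swap in your own driver class, URL, and credentials):

<?xml version="1.0" encoding="UTF-8"?>
<datasources>
  <local-tx-datasource>
    <jndi-name>myfirst</jndi-name>
    <connection-url>jdbc:mysql://localhost/database</connection-url>
    <driver-class>com.mysql.jdbc.Driver</driver-class>
    <user-name>scott</user-name>
    <password>tiger</password>
  </local-tx-datasource>
</datasources>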
Note: You need to replace the above items with the corresponding JDBC information you collected in the first step.

Step 3:
Edit (using notepad or vi) the file $DEMO_BASE/jboss/server/default/deploy/pentaho.war/WEB-INF/web.xml.
Add the following right below the Shark Connection entry

[screenshot: myfirst_002.png]
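
The entry is the standard J2EE resource-ref stanza; from memory, something along these lines:

<resource-ref>
  <description>myfirst datasource</description>
  <res-ref-name>jdbc/myfirst</res-ref-name>
  <res-type>javax.sql.DataSource</res-type>
  <res-auth>Container</res-auth>
</resource-ref>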

Step 4:
Edit (using notepad or vi) the file $DEMO_BASE/jboss/server/default/deploy/pentaho.war/WEB-INF/jboss-web.xml.
Add the following right below the Shark entry
[screenshot: myfirst_003.png]
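
And this one maps that resource-ref to the JBoss JNDI name; again from memory:

<resource-ref>
  <res-ref-name>jdbc/myfirst</res-ref-name>
  <res-type>javax.sql.DataSource</res-type>
  <jndi-name>java:/myfirst</jndi-name>
</resource-ref>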

Step 5:
Copy your Vendor provided jdbc driver library (classes12.zip) in the following directory:
$DEMO_BASE/jboss/server/default/lib

Congratulations, you’ve set up your JNDI datasource! Whew…. Almost there.

Step 6:
Let’s create your first action sequence!
Create (using notepad or vi) a file named MyFirst.xaction in $DEMO_BASE/pentaho-solutions/samples/datasources
[screenshot: myfirst_004.png]
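
The gist of the xaction is a single SQL lookup whose result set becomes the sequence output. The exact element names below are from memory, so treat this as a sketch and lean on the roughly 50 shipped sample .xactions (or the workbench) for the authoritative syntax:

<?xml version="1.0" encoding="UTF-8"?>
<action-sequence>
  <title>My First Action Sequence</title>
  <version>1</version>
  <logging-level>ERROR</logging-level>
  <outputs>
    <query_result type="result-set"/>
  </outputs>
  <actions>
    <action-definition>
      <component-name>SQLLookupRule</component-name>
      <action-outputs>
        <query-result type="result-set" mapping="query_result"/>
      </action-outputs>
      <component-definition>
        <jndi>myfirst</jndi>
        <query><![CDATA[select id, title from status]]></query>
      </component-definition>
    </action-definition>
  </actions>
</action-sequence>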
Make sure to replace the select above with a select statement that you KNOW works in your database.

Step 7:
Restart your server using the stop-pentaho/start-pentaho commands in $DEMO_BASE.

Step 8:
Open up Firefox (cough IE) and hit the following URL:
http://localhost:8080/pentaho/ViewAction?&solution=samples&path=datasources&action=MyFirst.xaction

You should see the basic output of the Action Sequence which, in this case, is the results of the SQL you provided.

Action Successful


id title
1 Open

That’s it! You’re connected to your own datasource and now you can start to explore the immense power of Action Sequences with your data… There are approximately 50 example .xactions in the demo, so use them as a reference when exploring the features of the platform. I HIGHLY SUGGEST downloading the workbench as well. It’s a visual environment for authoring .xaction files. Trust me, you’ll appreciate the GUI!