Archive

Archive for the ‘Data Integration (Kettle)’ Category

Command line ETL Job Execution

November 20th, 2006

I know this might seem pretty obivous to those that use Kettle frequently, but there’s a VERY easy way to execute Kettle jobs at the command line. Kitchen is the command line interface and is quite convenient for executing that ETL job you’ve built. Crontab anyone?

kitchen.sh -file=/mnt/pentaho-professional/pentaho-solutions/software-quality/data/etl/jira_do_everything.kjb
-OR-
kitchen.bat -file=c:\dir\jira_do_everything.kjb
kitchen.bat /file:”c:\dir\jira_do_everything.kjb” (from comments below, thanks!!!)

Does anyone use kitchen or pan and have any best practices or suggestions to offer?

Data Integration (Kettle), Pentaho

Kettle and Pentaho: 1+1=3

June 29th, 2006

Like all great open source products, Pentaho Data Integration (Kettle) is a functional product in and of itself.  It has a very productive UI and delivers exceptional value as a tool in and of itself.  Most pieces of the Pentaho platform reflect a desire to keep the large communities around the original projects (Mondrian, JFree, etc) engaged; they are complete components in and of themselves.

When used together their value, as it relates to building solutions increases and exceeds their use independently.  I’ll be the first to admit that Pentaho is still fairly technical, but we’re rapidly building more and more graphical interfaces and usability features on top of the platform (many in the open source edition, but much is in the professional edition).  Much of this work involves making the "whole" (Pentaho)  work together to exceed the value of the pieces (Mondrian, Kettle, JFree, …).

A few things immediately come to mind of why Pentaho and Kettle together provide exceptional value as compared to used individually or with another open source reporting library:

  1. Pentaho abstracts data access (optionally) from report generation which gives report developers the full POWER of Kettle for building reports.

    There are some things that are tough, if not downright impossible to do in SQL.  Ever do an HTTP retrieval of an XML doc, slurp in a custom lookup from Excel, do a few database joins and analytical calculations in a SQL statement?  I bet not.  Report developers are smart data dudes; having access to a tool that allows them to sort/pivot/group/aggregate/lookup/iterate/list goes on and on/etc empowers report developers in a way that a simple "JDBC" or "CSV" or "XQuery" alone can accomplish. 
    How is this made possible?
    Pentaho abstracts (optionally, it isn’t forced on customers) the data retrievals to lookup components.  This allows BI developers to use either a SQL lookup (DB), XQuery lookup(XML), MDXLookup (OLAP), or Kettle lookup (EII) to populate a "ResultSet."  Here’s the beauty; reports are generated off a result set instead of directly accessing the sources.  This means that a user can use the same reporting templates, framework, designer, etc and feed/calculate data from wherever they desire.  Truly opens a world of possibiliy where before there was "just SQL" or "ETL into DB tables."

  2. Ability to manage the entire solution in one place

    Pentaho has invested greatly in the idea of the solution being a set of "things" that make up your BI, reporting, DW solution.  This means you don’t have ETL in one repository, reports managed somewhere else, scheduling managed by a third party, etc.  It’s open source so that’s obviously a choice, but we can add much value by ensuring that someone who has to transform data, schedule that, email and monitor, secure, build reports, administer email bursting, etc can do some from one "solution repository." Managing an entire BI solution from one CVS repository?  Now that’s COOL (merge diff/patch anyone?).

  3. Configuration Management

    Kettle is quite flexible; the 2.3.0 release extends the scope and locations where you can use variable substitution.  From a practical standpoint this means that an entire Chef job can be parameterized and called from a Pentaho action sequence.  For instance, because you can do your DW load from inside Pentaho action sequences that means you can secure it, schedule it, monitor it, initiate it from an outside workflow via web service, etc.  In one of my recent Kettle solutions ALL OF THE PHYSICAL database, file, and security information was managed by Pentaho so the Kettle mappings can literally be moved from place to place and work inside of Pentaho. 

  4. Metadata and Additional Integration

    Pentaho is investing in making the tools more seamless.  In practice (this is not a roadmap or product direction statement) this means being able to interact with tables, connections, business views inside of Kettle in an identical (at least similar way) in the report designer.  For example, if you’ve defined the business name for a column to be "Actual Sales" Kettle and the Report Designer can now key off that same metadata and present a "consistent" view to the report/ETL developer instead of knowing that "ACT_SL_STD_CURR" is actual sales. 
    Another example is the plans to do some additional Mondrian/Kettle integration to make the building of Dimensions, Cubes, and Aggregates easier.

Data Integration (Kettle), General BI, Open Source, Pentaho

Roland Bouman on Pentaho Data Integration

June 12th, 2006

I just got on to Roland Boumans blog.

He has an excellent write up on how to get started with Pentaho Data Integration (aka Kettle):

Kettle is a free, open source (LGPL) ETL (Extraction, Transformation and Loading) tool. The product name should actually be spelled as K.E.T.T.L.E, which is a recursive acronym for "Kettle Extraction, Transport, Transformation and Loading Environment".

….

An interesting feature of Kettle is that it is model-driven. Both Spoon and Chef offer a graphical user interface to define the ETL processes on a high level. Typically, this involves no actual programming at all - rather, it’s a purely declarative task which results in a model.

He wonders at the end about how Kettle is deployed inside of Pentaho.  That’s a great question and something Pentaho/Matt have been working at over the past few months.  Perhaps I can shed a bit of light on this.

In the latest build of Pentaho (I used 1.1.6 for the below screenshots)  we ship an example of a Kettle mapping returning a result set which demonstrates the great architectures of both Kettle and Pentaho.  Kettle provides an easy way for creating plugins and interfaces for steps allowing Pentaho to access data at the "end" of a transformation.  Pentaho has a multipurpose result set object which allows for reports to key off of "data" instead of a SQL Result or an MDX Result, or in this case, a real time data slurp (EII result?!?). 

The transformation in spoon looks like this:

It reads data from the Pentaho sample database (QUADRANT_ACTUALS) filters, and does some calculcations and places then in an operator XML Output.  This output operator is superflous, the KettleComponent (from Pentaho) sources the data DIRECTLY from the in memory XML Output object in Kettle.

The ETL example is basic; it doesn’t even place the data slurped from Kettle into a nicely formatted report.

Just to be clear on what you’re seeing…

Pentaho users gain the full power of the Pentaho Data Ingegration tool (proper ETL sorts, groups, XML, csv, xml, filters, calculations, database lookups, aggregations, etc) when used as a SOURCE for their reports.  A full ETL tool where you don’t HAVE to persist the data to tables or files or anything. 

If it suits your needs, you can simply run your ETL everytime your report is run and you never have to build a staging area, data warehouse, or OLAP cubes.  Your mileage may vary and I don’t think this is wholesale replacement for a data warehouse at all!  Just a great way to provide additional options for BI developers.

Going beyond a single transformation in Pentaho we are currently augmenting this component to allow you to execute Chef Jobs from inside Pentaho.  I have no idea when that code is included in a release but I’ll be sure and blog about it here.  That means that you can schedule your "DW load process" or "Data Integration Routines" to run from inside Pentaho.  I’ll post an example of that shortly…

Again, check out Rolands article!  It’s a great addition to the pentaho blogosphere.

Data Integration (Kettle), Pentaho

UPSERTS using Kettle/MySQL

May 4th, 2006

Oracle Readers: Think MERGE Statement. :)

It’s common to need to generate surrogate keys as part of a Data Warehouse / Business Intelligence solution. Protecting your users from souce system changes, or using time variant data are just a couple of the reasons why one needs to generate a surrogate key when loading data into a DW.

Here is a simple example. I’m loading Product data from an XML file into a MySQL table. The transformation in Kettle looks like this:

upsert_kettle_transform.png

The output coming out of the XML file has been flattened using XQuery so that it looks like a flat table struture. Here’s a screen capture of a nifty feature of Kettle where you can preview the output of a step:

upsert_xml_data.png

Ok, this is our source data. The ID you’re seeing is the OLTP key. We’ll refer to it as the Natural Key. This is the ORDER_ID, or CUSTOMER_ID, or well, you get the idea. This is the key that your source application uses in it’s relational structures.

What we want is to INSERT or UPDATE (aka UPSERT) this record into our DW. We’ll INSERT if we don’t already have this NATURAL key in the DW and generate a surrogate primary key. We’ll UPDATE if we already have this NATURAL key in the warehouse.
Our table looks like:
upsert_product_erd.png

The DDL to create the TABLE is important:

CREATE TABLE INT_PRODUCT (
PRODUCT_ID INTEGER NOT NULL AUTO_INCREMENT,
PRODUCT_NAT_ID INTEGER NOT NULL,
TITLE TINYTEXT,
CONSTRAINT INT_PRODUCT_PK PRIMARY KEY (PRODUCT_ID),
UNIQUE INT_PRODUCT_NK (PRODUCT_NAT_ID)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Notice that our PRIMARY KEY is the Surrogate we’ve set up to be an AUTO_INCREMENTING column. This means that MySQL will generate a surrogate when we INSERT a record. All we have to do is get Kettle to match our unique contrainst (INT_PRODUCT_NK) and either INSERT/UPDATE.

I’ve used the INSERT / UPDATE operator in Kettle. The configuration looks like this:
upsert_upsert_config.png
What we’re doing is configuring Kettle to do a lookup on the table to see if the natural key already exists. If it does, we’ll perform DML (insert or update) on the fields in the lower window.

After running the transformation my INT_PRODUCT table looks like this.
upsert_surrogate.png
Now, I realized I didn’t choose very illustrative natural keys (all are 1,2,3) but the PRODUCT_ID comes from MySQL as a surrogate.

Subsequent executions of this Kettle transformation will INSERT records as they arrive or UPDATE. MySQL and Kettle in combination are making quick work out of generating these surrogates!

I invite Kettle experts to chime in on any optimizations/quirks.

Data Integration (Kettle), Pentaho