UPSERTS using Kettle/MySQL

Oracle Readers: Think MERGE Statement. :)

It’s common to need to generate surrogate keys as part of a Data Warehouse / Business Intelligence solution. Protecting your users from souce system changes, or using time variant data are just a couple of the reasons why one needs to generate a surrogate key when loading data into a DW.

Here is a simple example. I’m loading Product data from an XML file into a MySQL table. The transformation in Kettle looks like this:

upsert_kettle_transform.png

The output coming out of the XML file has been flattened using XQuery so that it looks like a flat table struture. Here’s a screen capture of a nifty feature of Kettle where you can preview the output of a step:

upsert_xml_data.png

Ok, this is our source data. The ID you’re seeing is the OLTP key. We’ll refer to it as the Natural Key. This is the ORDER_ID, or CUSTOMER_ID, or well, you get the idea. This is the key that your source application uses in it’s relational structures.

What we want is to INSERT or UPDATE (aka UPSERT) this record into our DW. We’ll INSERT if we don’t already have this NATURAL key in the DW and generate a surrogate primary key. We’ll UPDATE if we already have this NATURAL key in the warehouse.
Our table looks like:
upsert_product_erd.png

The DDL to create the TABLE is important:

CREATE TABLE INT_PRODUCT (
PRODUCT_ID INTEGER NOT NULL AUTO_INCREMENT,
PRODUCT_NAT_ID INTEGER NOT NULL,
TITLE TINYTEXT,
CONSTRAINT INT_PRODUCT_PK PRIMARY KEY (PRODUCT_ID),
UNIQUE INT_PRODUCT_NK (PRODUCT_NAT_ID)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Notice that our PRIMARY KEY is the Surrogate we’ve set up to be an AUTO_INCREMENTING column. This means that MySQL will generate a surrogate when we INSERT a record. All we have to do is get Kettle to match our unique contrainst (INT_PRODUCT_NK) and either INSERT/UPDATE.

I’ve used the INSERT / UPDATE operator in Kettle. The configuration looks like this:
upsert_upsert_config.png
What we’re doing is configuring Kettle to do a lookup on the table to see if the natural key already exists. If it does, we’ll perform DML (insert or update) on the fields in the lower window.

After running the transformation my INT_PRODUCT table looks like this.
upsert_surrogate.png
Now, I realized I didn’t choose very illustrative natural keys (all are 1,2,3) but the PRODUCT_ID comes from MySQL as a surrogate.

Subsequent executions of this Kettle transformation will INSERT records as they arrive or UPDATE. MySQL and Kettle in combination are making quick work out of generating these surrogates!

I invite Kettle experts to chime in on any optimizations/quirks.

7 thoughts on “UPSERTS using Kettle/MySQL

  1. Peter Quirk

    Nicholas,

    thanks for this tantalizing insight into Kettle. On the basis of this article I invested some time in installing the Pentaho stack and trying to learn the various parts of the solution. I would welcome a complete worked example of building something like the foodmart DW, a la the MSAS 2000 tutorial shipped with the Microsoft product. I offered this as a suggestion on the Pentaho forum for Kettle and was treated to an amazingly sarcastic response. For those of us familiar with other BI platforms, it would help to see a complete example rather than isolated examples of features, or minimal examples such as the Hello ETL sample.

    — Peter

    Reply
  2. ngoodman Post author

    Peter,

    I’m glad that you are taking the time to use Pentaho. There’s a wealth of functionality in the product and I’m usually amazed at how many features are present.

    I totally understand the desire for a real sample solution. ie, here’s your relational database and all the technologies (Chef and Spoon inside of Kettle) to move it from a relational to dimensional, the mondrian/pentaho configuration to go with it, etc. :) I’m working on these reference application as we speak… I just read the thread @ pentaho as well… I’m working on these “solutions” matt refers to.

    Feel free to use me/forums as a resource if you have specific feature questions. I’ll be sure and post here the samples which won’t be foodmart, but it will be illustrative in the same way. I bet you’ll find it very helpful moving from the other tools you mentioned.

    Kind regards!

    Reply
  3. Elljay

    This upsert step is very slow and inefficient. It does a query for every incoming row from the source, to find out if it exists on the target. Then based on the result it does an insert or update. This does not scale and it does not compare well with the Oracle merge function,

    Reply
  4. Naveen Arumugam

    Hi Nicholas

    My situation is like this. (1) whenever a record (already existing in dim table) with new info arrives, I need to insert it and date-invalidate the previous record and generate a new dim_key. How do I do it using Kettle?

    Reply
  5. vandana

    When we use the Update/Insert option, it successfully updates, but it gives an error of ‘Duplicate entry 2 for key Primary’.

    Please help to remove this error.
    Thanks

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>