KETL != KETTLE

I was having a discussion with the CEO of an Open Source company recently and we started discussing our opinion of the “kett uhl” Open Source ETL projects. It quickly became clear that we had different ideas about the project and its sponsorship. Until there was a clearly identifying difference… An “ah ha” moment.

Kinetic ETL (open source ETL project) is not the same as KETTLE (open source ETL project).

There’s only a handful of Open Source ETL projects currently. It seems silly that there should already be brand confusion for such a small group of players… Anyone from either of these projects care to comment on their choice of such similar names? Pure coincidence?

OWB Beta 4 Crosses 1GB requirement

Not earth shattering news, and who knows if there isn’t some “extra stuff” in it for the Beta releases, but I just realized that OWB is placing more than 1GB of it’s own bits on my virtual workstation. Beta4 would not fit on one CD (800 MB).

Of course the actual release might differ, and your mileage may vary, etc. That, and who really cares about disk space these days? Disk space is cheap! 🙂

Open Source BI getting real

Don’t want to get too far into it, because there has been some significant developments from when I last surveyed the scene about 6 months ago. However, I did want to point out that Pentaho has made significant progress in building their Open Source BI stack and are starting to build 1.x release candidates. They’ve sorted out $5 million in financing to really allow them to take this to the next level.

Congratulations on the progress, and I personally look forward to seeing more great work coming from Orlando, FL.

Paris Virtual User Group Meeting

I had organized an opportunity for Paris users to get together and swap stories, tips, etc. There were many different viewpoints represented and I’m glad that we had a chance to virtually meet up and discuss our experiences with the product. There is not another virtual user group meeting planned, but we are setting up an online forum to continue the peer to peer discussions.

If you are an OWB Beta program member (legal reasons) and would like to get on the forum, please email me.

PS – I’m happy to open it up to the community as soon as the developer preview release hits OTN.

Simple OWB Runtime Report

A slight variation on some SQL I posted a few months back:

For MAPPING executions list the Start TIme, clock time in seconds, number of records selected, and the throughput.
select
ae.created_on start_time,
ae.object_name map_name,
ae.elapse_time time_in_seconds,
mr.number_records_selected total_num_records,
trunc(mr.number_records_selected / ae.elapse_time) records_per_second
from
all_rt_audit_executions ae,
all_rt_audit_map_runs mr
where
1 = 1
and ae.execution_audit_id = mr.execution_audit_id
and ae.elapse_time != 0
order by 1

WARNING: While I think it poses little risk, use at your own risk and ensure it’s right for your environment… Same thing everyone says about code/SQL posted on the internet.

Best ETL for this in Oracle?

I’m usually posting pictures about fancy new features in OWB Paris, and doling out little tidbits about how to this and that… I thought I’d change it up a bit and ask a question! Now that I’ve got comments working I can do fun things like that! I don’t want to influence any ideas so I won’t mention how I would go about this, so, here ’tis.

What is the most EFFECIENT way using Oracle SQL to calculate effective/expiration for denormalized tables? ie, if you have two tables with the SAME natural key with their own set of effective/expiration dates what is the most EFFICIENT SQL to generate a new denormalized table with accurate effective/expiration dates?

Consider the following example ERD:

What’s everyones favorite solution?

OWB 10gR2 : Real Time Data Warehousing

There’s lots of talk about real time, right time, period batch, message based in the Data Warehousing and BI circles these days. I think this is driven by quite a few reasons. Need for fresh data, need for unified reporting interfaces for users, etc. Mostly, I think it comes down to a TCO for IT assets. As the EAI/EII/ETL tools start to converge along with increased SOA-ee-ness of databases and middleware products there becomes quite a bit of overlap between the different product sets. Managing “one product” that does this data integration, calcuation, and movement between systems costs less to maintain than “multiple products.” Truthfully, I see little strategic (ie, warehouse and marts) data that needs to be computed in real time. Those cases do exist, and OWB 10gR2 has some new features for those that do have some Real Time DW/BI needs.

There are two major flavors of mappings in support of Real Time Data Warehousing in OWB:

  • PERIODIC BATCH: This is basically a batch process that runs frequently (say every minute or so) that reads data from a QUEUE or STREAM. While the data is pushed into the DW (real time), the system only processes when run (batch). These are regular mappings that use a Stream or Queue as a source instead traditional Tables/Views/etc.
  • TRICKLE FEED: This is much closer to what most people think of when we refer to real time data warehouse. Trickle feeds involve processing each individual record as it arrives, instead of waiting for them to collect. These are a special kind of OWB mapping called Real Time Mappings that run continuously and process records as they arrive.

Truthfully I’ve only kicked the tires with both of these types of mappings limitedly. I tested some of the features back in OWB Beta2 and built a conceptual mockup of how it would work for a customer of mine. What I’m presenting is a conceptual partially working mock up built using an early beta release. In other words, do not use it as reference or consider it a blueprint for how you should proceed. If there is enough interest I might submit an article to OTN on the subject. Anyone like the idea? Better yet, if you’re not one of my customers please do consider contacting me! I’d love to help build a Real Time DW solution with OWB!

OWB now includes the ability to define, deploy, and setup Streams, Queues, Queue Tables, UserDefinedTypes, and propogations within the GUI. There’s a whole set of screens that you’ll see when the community preview hits the shelves. Unlike regular OWB deployments there are some additional requirements around streams administration locations, permissions, etc, but they are easily surmountable. Also, if you’re going to be doing real time DW you need to understand a bit about the underlying technology anyhow (not tons, but enough to know why you need to have Archive Logging turned on, etc).

Refer to the following PDF for some greater details on the conceptual, but here’s a not so good screenshot:

I’ve created a mockup of a BI solution that is fed by a CRM (Customer Data Hub perhaps) and a Subscription Management Application for this example. You can see that conceptually this involves both systems sending messages either from the APPLICATION LEVEL (JMS or some other messaging technology) or the DATABASE LEVEL (with DML Stream Captures running in Oracle). In other words, we have multiple places we can get different pieces of data and the application doesn’t necessarily have to be “REAL TIME ENABLED” to send real time data. Oracle can do that on it’s behalf using the Streams technology!

Overall what this looks like is we setup the various Streams, Capture Processes (DML), Queue Tables, and Types (based on our source tables) to support our real time system. Note that the screenshot does not include the Streams on the source system or the Capture Process definitions. This only includes the DW side Streams, Queues, Dimensions, etc.

I’ve built three real time mappings (TRICKLE FEED) which in concert receive messages to add Dimension records (SCD2) and insert new Cube records (transactions). Notice this is a greatly simplified example entirely ignoring what I consider a best practice of loading into a normalized warehouse, then updating marts based on the warehouse (a la CIF methodology). Also these are all assuming to changes (ie, record corrections) just straight clean data! We should all be so lucky!

One receives updates from the CRM application and performs SCD on the appropriate Dimension objects.

The others receives event messages from a transaction based system and inserts records into Cubes.

This isn’t quite as much detail as I would like to have gone into, and I’ll quickly repeat my warning… This is just some mockups and conceptual work so don’t expect it to be accurate come OWB 10gR2 production time! I have some more thoughts on how to use this with Partition Exchange Loading to get a days “Cubes” built realtime throughout the day, and then at the end of the day move them over to the full history but that’s a whole nother article.

This blog is part of the OWB Paris Early Review series which reviews and comments on several new Paris features.

Must have IE to evaluate SQL Server?

Microsoft is spending millions upon millions to launch and promote their new SQL Server 2005 release. I’m guessing they want every developer and nerdy IT type to check it out. They want to get into the VLDB and HA corporate data centers, and claim some of those vi using, I can write x86 assembly if I want to, firefox using, developers and DBAs.

The irony?

10% of the web surfing population won’t be able to evaluate it because the SQL Server 2005 homepage doesn’t load with Firefox.
http://www.microsoft.com/sql/default.mspx

Any other Firefox users able to load the page? Or is this another example of “Drink the MSFT koolaid or be gone with you!”?

OWB 10gR2 : Embedded OMBPlus

Not a long entry, but I did want to post some information on a useful little feature in the Paris release of OWB.

OMBPlus is the Swiss Army Knife for OWB developers. It allows you to do “OWB Stuff” without using the OWB GUI. If you’ve ever had to do many repeatable things in OWB you’ll be thankful for the hours that OMBPlus can save you! More details on OMBPlus can be found on OTN.

Currently you have to fire up OMBPlus seperate from the application and run it kind of like a text shell (interactive or fed a script). In the next release of OWB they’ve put this interactive shell environment as one of the panels in the design center. This is SOOOO very convenient when building OMB scripts.

Make a note though, some things won’t work in the context of the embedded OMBPlus window. Basically, things that require connecting to the OWB repository in SINGLE USER MODE you’ll still have to fire it up seperately (such as creating UDOs, etc).

This blog is part of the OWB Paris Early Review series which reviews and comments on several new Paris features.