Archive for the 'General BI' Category

DMREVIEW EXECUTIVE VANISHES

Thursday, June 2nd, 2005

The editors of DM Review (many of you probably subscribe) are making a public appeal to help track down an executive that went missing a week ago in British Columbia.

Here’s hoping they end up connecting up with him soon.

Update: Dave was found and it wasn’t the news everyone was hoping for. Best wishes and condolences to Dave’s family and friends.

2gig + 2gig = 50gig

Friday, July 9th, 2004

I was recently writing up a volume and performance specification for a customer project when a discussion arose with the current DBA staff about volume projections. The intuitive thinking was the volume requirements for the BI/Data Warehouse would be the sum of the systems from which it sourced data. The group was thinking this would be a good way to approximate the required space for the system. I asserted this method is flawed, and had to suggest why that BI volume is proportionate but not necessariliy directly proportionate.
BI systems required much greater storage than the sum of their sources because:

  • Data are denormalized. With denormalizing data to increase query performance one increases storage from 1-10000 times (it depends on the data, perhaps more).
  • New Data are created. There are many analytically significant items that occur that never even show up in source systems. For instance, a “Sales Fact” will be closely related in volume to “Order Line Items” in a source system. However, many BI solutions have business events like “Customer Acquired”, “Customer Lost”. These new business events are the result of the source system data but had not previously existed anywhere else (it was just created).
  • Summaries/Aggregates for query performance. Much of the data storage requirements is a factor of how much performance is required from commmon data access patterns (ie, user reports and ad-hoc analysis). If there are small data sets and end user performance requirements are minimal then summaries won’t require a great deal of space.

So, make a point! A good way to estimate the actual storage requirements is to run sample datsets. Load a month or two of data using the summary/aggregate parameters you think will be required in your depoloyment. Examine the database for it’s storage utilization and take measurements. Measure it on one day, one week, one month, one year (if possible). Graph it. See what the data looks like. You could even build a function to calculate out the future based on the curve of growth.
Most importantly, be ready for it to be different than what you expect! A few reports will require a summary that will make your predictions seem way off base. It’s to be expected; BI is a system not an application!

Database Operation Complexity Reference

Thursday, July 8th, 2004

I mentioned this previously, but I’ve been reading “Principles of Distributed Database Systems.” I’m enjoying it, and it’s helping me solidify many of the concepts that I apply daily in my capacity as a Principal BI Solutions consultant. Database theory, and specifically as it relates to tuning is part of any professionals work with Oracle. We’ve all deciphered the performance implications and clues for improvements from EXPLAIN PLAN. I’ve always been told you want to be able to give Oracle the clues/configurations to enable filter of results (selection) before joining. It always made sense but I had never understood fully the concepts behind these recommendations. Until now…

I don’t espouse to understand all of the complexities behind these issues but I ran across a great reference chart. I wanted to post it here along with some numbers for demonstration as a quick reference for any professionals with understanding of quadratic, logarithmic, and linear scales to match those up with the operations we use on a day to day basis.

This is from the book mentioned above, however I’m sure it’s rather common knowledge.

OPERATION COMPLEXITY
SELECT O(n)
PROJECT (without dedup)
PROJECT (with dedup) O(n*log n)
GROUP
JOIN
SEMIJOIN
DIVISION
SET OPERATORS
CARTESIAN PRODUCT O(n²)

A quick look at some numbers in the orders mentioned yield the following “costs” in n operation times.

n O(n) O(n*log n) O(n²)
5 5 3.49 25
10 10 10 100
100 100 200 10000
1000 1000 3000 1000000
1000000 1000000 6000000 1E+12

Hope this helps provide a useful way to match up the database operations we use on a daily basis with the theoretical cost of such operations. Cheers!

Web Analytics meets SimCity

Thursday, July 1st, 2004

I have a customer with a few web properties. I am helping them sort out some of their web reporting needs and integrating them into a BI Environment. A few weeks back I mentioned to them, mostly jokingly, they should take a look at VisitorVille. They took the advice (althought it twasn’t really advice, just water cooler talk) and gave it a go. They pulled it straight away for technical reasons but I think they are still planning on hooking it up at some point.

It’s a rather unique product/service, and I’m not sure it’s applicable for many businesses, but I must say it is impressive looking. I think it just goes to show, there are many different and appropriate ways to view data enriched by a Business Intelligence system. I think there will always be wonderful and impressive ways to view data based on the particular data set, metaphor, and audience. I’m betting that the in between (data consumption, extraction, transformation, derivation, application of real time business logic, etc) will remain much the same, but presentation is where the creative types can add immense value and impact.

Too bad the business users can’t use VisitorVille like SimCity. Build another skyscraper (web page), a bus terminal from out of town (referral from an outside site) as easy as the real SimCity game allows. :) Website owners would LOVE that… Manage your website from a game terminal.