2gig + 2gig = 50gig

I was recently writing up a volume and performance specification for a customer project when a discussion arose with the current DBA staff about volume projections. The intuitive thinking was the volume requirements for the BI/Data Warehouse would be the sum of the systems from which it sourced data. The group was thinking this would be a good way to approximate the required space for the system. I asserted this method is flawed, and had to suggest why that BI volume is proportionate but not necessariliy directly proportionate.
BI systems required much greater storage than the sum of their sources because:

  • Data are denormalized. With denormalizing data to increase query performance one increases storage from 1-10000 times (it depends on the data, perhaps more).
  • New Data are created. There are many analytically significant items that occur that never even show up in source systems. For instance, a “Sales Fact” will be closely related in volume to “Order Line Items” in a source system. However, many BI solutions have business events like “Customer Acquired”, “Customer Lost”. These new business events are the result of the source system data but had not previously existed anywhere else (it was just created).
  • Summaries/Aggregates for query performance. Much of the data storage requirements is a factor of how much performance is required from commmon data access patterns (ie, user reports and ad-hoc analysis). If there are small data sets and end user performance requirements are minimal then summaries won’t require a great deal of space.

So, make a point! A good way to estimate the actual storage requirements is to run sample datsets. Load a month or two of data using the summary/aggregate parameters you think will be required in your depoloyment. Examine the database for it’s storage utilization and take measurements. Measure it on one day, one week, one month, one year (if possible). Graph it. See what the data looks like. You could even build a function to calculate out the future based on the curve of growth.
Most importantly, be ready for it to be different than what you expect! A few reports will require a summary that will make your predictions seem way off base. It’s to be expected; BI is a system not an application!