Category Archives: General BI

Mark Madsen ETL evaluation slides

Mark Madsen, who has an excellent blog on various DW/BI topics, has posted slides from his Portland DAMA presentation on selecting ETL vendors. The slides are EXTREMELY informative, and I suggest that anyone considering ETL work browse through them. Mark, thanks for such an excellent contribution, and I’m sorry I missed your presentation! I believe Mark works with TDWI on ETL tool topics, so consider signing up for one of his sessions.

My favorite from the slides, which had me laughing out loud, was the image on page 12 illustrating the proclamation that EII can be a “virtual data warehouse”! Excellent!

Must have IE to evaluate SQL Server?

Microsoft is spending millions upon millions to launch and promote their new SQL Server 2005 release. I’m guessing they want every developer and nerdy IT type to check it out. They want to get into the VLDB and HA corporate data centers, and win over some of those vi-using, “I can write x86 assembly if I want to,” Firefox-using developers and DBAs.

The irony?

10% of the web surfing population won’t be able to evaluate it because the SQL Server 2005 homepage doesn’t load with Firefox.

Any other Firefox users able to load the page? Or is this another example of “Drink the MSFT koolaid or be gone with you!”?


Working daily with people who are trying to measure and understand their world through technology and BI methodologies, I often hear lots of “things” that are important to determine.

  • What are this year’s top 5 products, and what has their annual sales growth been over the last 5 years?
  • Which company division has the most profitable customers, and which division has the least profitable customers?
  • At what time of day, in a registered website visitor’s home time zone, are pages viewed on our website, split by category?

In other words, there are some very specific things people want to know and brag about both within the company and externally to investors, analysts, and the media.

This predisposes me to question numbers I hear anywhere. What’s the qualification? What little keyword allows this company to say they are the top in their category? Company XYZ is Number 1 in Sales (in Asia Pacific small to midsized healthcare providers not owned by government and groups exceeding 1 billion market cap for fiscal year 2003). We’ve all seen it…

One of my online music stations had a refreshingly simple claim to fame today that made me laugh out loud:

“Total Country. Rated #1 amongst people who really like us!”

A refreshingly honest figure!

Ingres sails from Computer Associates

I’ve just started playing with Ingres recently (last 12 months). It’s a powerful DBMS that was released under an Open Source license last year. From what I gather about the history of the database it has been kind of a “hot potato” being passed from university to company to company to company, etc.

Feature for feature, Ingres appears to be the most advanced Open Source database available. However, since its release under the CATOSL it has not resembled a community-driven OSS project. There is still no public access to the source code repository and, as far as I know, there have been no source contributions from anyone outside of CA. The CATOSL is a “funny” OSI-approved license that I think also hinders the uptake of Ingres.

However, all that could change, starting today.

A venture capital firm has purchased “Ingres” from CA and launched a company focusing entirely on the Open Source database. This company has an opportunity to capitalize on a starting point most OSS projects could only dream of: a product that is deployed with mission-critical applications at more than 5000 customer sites. That’s just where they start though… their future must include turning Ingres into a full-scale Open Source project and community. This means public discussion forums, public source code control, welcoming third-party contributors, peer-to-peer information sharing, user-based support, etc. I think Ingres (company and project) would also be VERY well served to trade the off-color CATOSL license for a commercial-friendly OSI-approved license.

Welcome Ingres, Inc. to the marketplace! It’s an interesting one with Oracle, Microsoft, and IBM all providing “free” versions of their DB now and passionate communities in the MySQL and PostGres projects.

As a die hard Oracle consultant I need much more information to draw conclusions about Ingres… I’ve been in touch with CA and Ingres, Inc. I hope to provide more information and a more detailed evaluation as time permits. Stay tuned for more!

Thoughts on "BI for the masses"

Oracle has a star database. As Charles Phillips refers to the database, it’s the 747 of databases. The products that the Oracle Data Warehousing/Business Intelligence teams pump out are quite capable and are feature rich. There is little that I can NOT provide for my customers using this stack of powerful tools (Oracle DB, Oracle Warehouse Builder, Oracle Discoverer/Portal/BI Beans).

That being said, I’ve realized how inaccessible these tools are for “the masses”: the qualified, smart, analytical masses who need easy-to-use tools to help them collect, analyze, and report on their organization’s information. The tools are complicated and require rather extensive knowledge of “Oracle-isms.” To date, there are very few BOOKs on any Business Intelligence specific Oracle product. Books reflect a large network of solution providers and consultants, i.e., other providers who have picked their preferred tool and are committing to learning, using, and teaching it to customers. Large communities of providers, training resources, and books reflect a support network and make uptake of a technology MUCH MUCH easier for customers. They don’t have to learn from the manual, which is VERY difficult… They can learn from the distilled knowledge of others, in a more participatory manner.

From an ideological perspective, I’m not exactly drawn to Microsoft products. However, they are having significant success in building this ecosystem around their BI/SQL Server offering. While there exists NO BOOK on Oracle Warehouse Builder and only ONE BOOK on Discoverer, here is a list of the books scheduled for the Analysis Services 2005 release, which is still in BETA.

That’s right, there are 11 books being written about similar Microsoft products while their product is still in BETA! What does the Oracle BI/DW community think about this? I REALLY REALLY have to get comments going on my site… 🙁

Pentaho Milestone 2 release

Since I probably piqued some interest with this blog, I figured I should post an update…

The folks at Pentaho have released some actual software. I’m head deep in an OWB Paris project so I’ve had ZERO time to have a look. I’d love for anyone who’s had a look to email me and let me know their impressions.

From their release briefing:

Using this release, you will be able to experience the streamlined install process and interact with a number of components and samples.

  • Reporting
    how to run reports, burst different content to different users, and parameterize reports.
  • Business rules
    how to include and use business rules in the creation and delivery of content.
  • Email
    how to send the results of a business rule or report creation to an email address, and how to do email bursting.
  • Printing
    how to print a report to a selected printer, how to do batch printing, and how to print bursting (applying different report parameters to individual printers).
  • Workflow
    how to initiate a workflow and pass parameters to it.
  • Bursting
    how to deliver customized versions of a generic report to different email addresses or printers
  • Scheduler
    how to schedule the actions of the Pentaho BI Platform
  • Web Services
    how to access the actions of the Pentaho BI Platform using web services
  • Navigation
    how to organize and describe content to users using Java Server Pages or portlets

Many of the visual features, such as wizards, that you may have heard discussed or seen demonstrated are not scheduled for delivery until the next milestone release. Please bear this in mind as you use the product.

Open Source BI – I like Pentaho

Business Intelligence software, databases, and their supporting hardware are expensive. I mean really, really expensive (hundreds of thousands to millions of dollars). Many people working in the Business Intelligence/Data Warehousing fields have seen their “operational application” colleagues adopting open source solutions (Linux, JBoss, Eclipse, Apache, etc.) but have seen little attention paid to the software required to build and deliver Business Intelligence. That is beginning to change.

I’ve blogged about this before, specifically my experiences with downloading and testing Mondrian, an open source ROLAP server written in Java. It appears that projects suitable for BI are gaining momentum and maturity in the Open Source (OS) world. I’ve felt for some time that the open source community had not embraced BI in quite the same way it has other applications of technology. BI is, in earnest, a technology stack to make bigger companies bigger and smart companies smarter. While these precepts aren’t in opposition to open source ideals, they aren’t what typically motivates communities of developers to band together to make software for free (i.e., change the world, provide a framework used by 10,000 websites, etc.).

The state of open source BI was relatively slim not too long ago. There were a variety of possible tool sets one could use for ETL (Clover, Enhydra Octopus), some initial OLAP components (Mondrian, JPivot), some portal frameworks for dashboards (JetSpeed, JBoss Portal), and some databases mature enough for DW situations with smaller volumes (MySQL, Postgres). Things have been heating up this past year, and we should review what’s going on in the Open Source BI realm. The lede is buried, so make sure you check out Pentaho at the bottom.

CA’s Open Source release of Ingres
Albeit under a funny OSI-approved license (there are many provisions which will scare away the OS purists, and make others at least think twice about including it in their products or services), Ingres is officially open source and free. Ingres has some pretty significant “enterprise” features including replication, partitioning, and “in the works” Linux clustering (a la RAC). This is great news because Ingres is a rather mature database and is better suited for large DW volumes than MySQL and Postgres. It is noticeably (and perhaps critically) lacking the vibrant community required to increase uptake. At this point it feels like CA is still the only one “interested” in Ingres. This might change, but I believe the funny CATOSL has hindered acceptance from open source communities.

Netezza/DATAllegro are using open source
These two providers of DW appliances are using open source databases as part of their solution. It’s a mixed technology stack, which means that unless you purchase the appliances you will benefit from none of the work that these two companies have put into their implementations. One uses Postgres, the other uses Ingres. There must be quite a bit of technology surrounding it to make it actually work for corporate DW environments. Netezza is actually doing rather well I believe, and some of the bigger vendors are starting to “see them on the radar” as a player in the space.

GreenPlum (aka Metapa) takes another shot
When Metapa wasn’t getting traction marketing their inexpensive proprietary clustered DB implementation, they figured they needed something more. Open Source is powerful enough that, even a few years into the hype, it still attracts attention. They relaunched themselves as an Open Source solution and are sponsoring the BizGres project (a few extensions to Postgres that are useful for BI environments), along with allowing the single-instance version of their product to be used for free. I don’t think they’ll get the OS community embrace they desire, because people are discerning these days; the only interesting work GreenPlum is doing is related to their MPP and shared-nothing clustering technology, which is very much NOT open source. They are only opening their kimono an inch, not even to a halfway mark.

Mondrian/JPivot releases
These two projects had new releases this year that gave the most visible part of an open source DW/BI system its legs. While not comparable to commercial OLAP interfaces, they are certainly suited for ISVs/developers to embed in their applications. These are great components for including in a project, and if your report consumers don’t really care to write their own reports (a la graphical report builder) and just want to pivot and page, this could be an excellent, inexpensive solution.

BIRT and JasperReports are actually pretty good
Two commercially backed (one by Actuate, the other by JasperSoft) projects that are building the basis for business quality reports. Don’t turn off your Crystal installation yet because these both have a way to go, but they’re improving at a steady pace.

Pentaho Nation
This is truly the most exciting thing I’ve found in the Open Source BI space, and they’ve just begun their work so I’m running on faith at this point. Industry veterans who are passionate about BI and open source have pooled their minds and money (they’ve made $$ from previous entrepreneurial activities) to build a pure, 100% open source distribution for BI. They are collecting various open source projects, building their own components, and releasing the whole thing as open source. A partial list of the projects they are planning to include (no official distro yet): Mondrian OLAP server, JPivot, Firebird RDBMS, Enhydra ETL, Shark and JaWE, JBoss, Hibernate, JBoss Portal, Weka Data Mining, Eclipse, BIRT, JOSSO, Mozilla Rhino.
The company will follow in Red Hat’s footsteps and make money on support, training, and consulting. Their plans are ambitious, but they are focused on assembling and configuring all these disparate projects into a comprehensive platform that will be at least comparable to the “big boys” at Hyperion, Cognos, Microstrategy, etc.

They are engaging the community, clearly understand the need in the space, and are committed to the ideals of getting paid for solutions instead of software. They are certainly strong in the presentation, dashboard, BPM/workflow, and OLAP end of the spectrum but don’t appear to be including much in the ETL/DW end (there is some, but it appears to be for data movement and loading as opposed to building a DW). I’m not sure if it’s strategic or not, but it might make sense. Most people adopting an open source BI platform for their reporting users will feel comfortable rolling their own ETL/DW for the backroom. It should also be noted that they haven’t made any releases yet, so what we’re seeing is all conceptual for now, but they’ll be rolling something out sometime in 2005. It appears that the founders have a track record of “doing what they say they’ll do.”

What does this all mean?
There are three things that will happen as the Open Source and BI worlds start dating.

  1. Hardly anything for your current BI project and technologies. It is still emerging and is just now being utilized by early adopters.
  2. Cost pressure on the “big boys” will occur as the maturity of these components provides at least comparable options. The small number of vendors, along with their constantly increasing prices, will show up as an area to be trimmed (ironically enough, probably in a financial report provided inside the software in question). I don’t believe that it will have a significant impact, but it will have a small impact over the next 3-5 years. It will also affect prices of BI OEM and inclusion of BI capabilities in vertical applications (more BI in existing products).
  3. Increased adoption of BI at small and mid-sized businesses that can now afford to enter the BI space. Previously inhibited by exorbitant software costs, these businesses can now spend a few thousand dollars to start their foray into BI.

BI Data goes Interactive

Netflix has been using the old Amazon trick for a while now. You know, people who liked book X also like this rubber spatula. Infamous and lucrative…

Netflix has taken this to a whole new level. They’ve mined the suggestion data, cross-referenced it with my friends’ data, and provided me recommendations based on what my friends liked and what I like. As if that wasn’t hip enough, they’ve turned this mined information into an interactive experience:

Two things that keep people engaged in your site: relevant content (mined) and interactive media (participation). Netflix hit this one out of the park, in my opinion!


The editors of DM Review (many of you probably subscribe) are making a public appeal to help track down an executive that went missing a week ago in British Columbia.

Here’s hoping they end up connecting up with him soon.

Update: Dave was found and it wasn’t the news everyone was hoping for. Best wishes and condolences to Dave’s family and friends.

2gig + 2gig = 50gig

I was recently writing up a volume and performance specification for a customer project when a discussion arose with the current DBA staff about volume projections. The intuitive thinking was that the volume requirements for the BI/Data Warehouse would be the sum of the systems from which it sourced data. The group thought this would be a good way to approximate the required space for the system. I asserted that this method is flawed, and had to explain why BI volume is related to source volume but not necessarily directly proportional to it.
BI systems require much greater storage than the sum of their sources because:

  • Data are denormalized. Denormalizing data to increase query performance increases storage by a factor of anywhere from 1 to 10,000 (it depends on the data, perhaps more).
  • New data are created. There are many analytically significant items that never even show up in source systems. For instance, a “Sales Fact” will be closely related in volume to “Order Line Items” in a source system. However, many BI solutions have business events like “Customer Acquired” and “Customer Lost”. These new business events are derived from the source system data but did not previously exist anywhere else (they are newly created).
  • Summaries/aggregates for query performance. Much of the data storage requirement is a function of how much performance is required from common data access patterns (i.e., user reports and ad-hoc analysis). If the data sets are small and end user performance requirements are minimal, then summaries won’t require a great deal of space.
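
To make the three factors above concrete, here is a rough back-of-the-envelope sketch in Python. The function name, its parameters, and every number in the example are hypothetical, chosen only to show how two 2 GB sources could plausibly turn into a 50 GB warehouse; real multipliers depend entirely on your data and reporting requirements.

```python
def estimate_warehouse_gb(source_gb, denorm_factor, new_event_gb, aggregate_ratio):
    """Very rough warehouse size estimate (all inputs are assumptions).

    source_gb       -- combined size of the source systems
    denorm_factor   -- how much denormalizing inflates the source data
    new_event_gb    -- space for derived events that exist only in the warehouse
    aggregate_ratio -- summary/aggregate storage as a fraction of detail storage
    """
    detail_gb = source_gb * denorm_factor + new_event_gb
    aggregate_gb = detail_gb * aggregate_ratio
    return detail_gb + aggregate_gb


# Two 2 GB sources, a 6x denormalization blow-up, 1 GB of derived events,
# and aggregates as large as the detail data (all made-up figures).
print(estimate_warehouse_gb(2 + 2, 6, 1, 1.0))  # 50.0
```

The point is not the specific result but that the multipliers compound, which is why simply adding up the source systems undershoots.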

So, make a point! A good way to estimate the actual storage requirements is to run sample datasets. Load a month or two of data using the summary/aggregate parameters you think will be required in your deployment. Examine the database for its storage utilization and take measurements. Measure it at one day, one week, one month, one year (if possible). Graph it. See what the data looks like. You could even build a function to calculate out the future based on the curve of growth.
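
As one possible shape for such a function, here is a minimal Python sketch that fits a curve to measured sizes and extrapolates. The function names, the choice of a least-squares straight line (or an exponential fitted on the log of the sizes), and the sample measurements are all assumptions for illustration; pick whichever model actually matches your graph.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need measurements taken on at least two different days")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def project_size(days, sizes_gb, future_day, model="linear"):
    """Extrapolate storage to future_day from (day, size) measurements."""
    if model == "linear":
        intercept, slope = fit_line(days, sizes_gb)
        return intercept + slope * future_day
    if model == "exponential":
        # Fit a line to log(size), then transform back.
        intercept, slope = fit_line(days, [math.log(s) for s in sizes_gb])
        return math.exp(intercept + slope * future_day)
    raise ValueError("model must be 'linear' or 'exponential'")


# Hypothetical measurements: size in GB after 1, 7, 30 and 60 days of loads.
days = [1, 7, 30, 60]
sizes_gb = [2.1, 2.8, 5.2, 8.4]
print(round(project_size(days, sizes_gb, 365), 1), "GB projected at one year")
```

A straight line is the simplest assumption; if the graph bends upward as summaries accumulate, compare it with the exponential model before trusting either.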
Most importantly, be ready for it to be different from what you expect! A few reports will require a summary that will make your predictions seem way off base. It’s to be expected; BI is a system, not an application!