I listened to Stephen A. Brobst, CTO at NCR Teradata, speak about “Best Practices in Meta Data Management and Enterprise Data Warehouse Deployment” this morning. I was hoping to grok some details about new metadata management techniques, but the presentation was much more focused on the “deployment/architecture” side. That being said, I think it was MUCH more useful for the audience as a whole to cover deployment in that much detail; I personally didn’t find it all that groundbreaking.
Summary: Build an EDW based on a 20,000 ft blueprint, integrate your data into a single location (say, perhaps, a massively scalable single-system-image, shared-nothing database) using a relational schema, build one subject area at a time, and add star schemas only when performance is an issue. Clearly the architecture advocates the Oracle/Teradata/etc. view of the world: ONE GIGUNDOUS RELATIONAL warehouse with various semantic (dimensional) and performance (materialized, denormalized structures) views into that world. I’m not being sarcastic; I model most of my customer implementations off the CIF and think it’s a good approach from a TCO perspective.
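To make that layering concrete, here’s a minimal sketch of the pattern, using sqlite3 purely for illustration; the table and column names are my own hypothetical examples, not anything from the talk. The normalized core is the single integrated location, the view is the semantic/dimensional layer, and the materialized table is the “only when performance demands it” star-style structure.

```python
# Sketch only: normalized core + dimensional view + optional physical star structure.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF-ish) core EDW: each entity and relationship lives once, integrated.
cur.executescript("""
CREATE TABLE country  (country_id INTEGER PRIMARY KEY, country_name TEXT);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT,
                       country_id INTEGER REFERENCES country(country_id));
CREATE TABLE sale     (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                       sale_date TEXT, amount REAL);

-- Semantic/dimensional layer: a logical view over the core, no data copied.
CREATE VIEW dim_customer AS
SELECT c.customer_id, c.customer_name, co.country_name
FROM customer c JOIN country co ON co.country_id = c.country_id;
""")

# Physical star schema only when the logical view can't meet performance needs:
# materialize the same join as a denormalized table, refreshed by a managed ETL job.
cur.execute("""
CREATE TABLE dim_customer_mat AS
SELECT c.customer_id, c.customer_name, co.country_name
FROM customer c JOIN country co ON co.country_id = c.country_id
""")
conn.commit()
```

The design choice is that the denormalized structure is derived and disposable; the normalized core remains the single version of the truth.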
The key takeaway remains: if the data isn’t integrated, it won’t be useful. An EDW promotes this, but it’s not the only way. You start to realize more value as the richness of relationships and entities in the integrated view increases.
One of the things I have a beef with is the suggestion that the “Semantic Layer” (metadata for dimensional modeling, data freshness, etc.) can be used instead of ETL and building physical star schemas. I don’t dispute that the reporting tools, database, and platform can adequately do this; my issue is how it’s managed. For example, if I define my dimension LOGICALLY and let the REPORTING tool build a cross tab based on that dimension, that should work. BUT how is that managed as part of the information lifecycle? I’ve seen VERY few tools that can tell you: on day X I was generating this dimension this way, and provided it to these 20 reports using this particular method. ETL tools building PHYSICAL structures are usually managed (think source code, or some sort of build-and-release system). In other words, if you see a report based on “CUSTOMER COUNTRY” and a date, one can say PRECISELY how that was generated, because there’s a managed solution (ETL, database structures) in a repository somewhere that tells you what your logical and physical ETL were at that point in time. Good luck doing that when someone can change the definition on the fly in a clever “report writing” tool. A simple audit sketch of what I mean follows below.
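Here is a minimal sketch (all names hypothetical, not a real tool’s API) of why the physical ETL route is auditable: every time the dimension is built, record which versioned piece of logic produced it. Answering “how was CUSTOMER COUNTRY derived on day X?” then becomes a lookup rather than guesswork.

```python
# Sketch: tie each build of a physical structure to the exact logic that produced it.
import hashlib
import json
from datetime import datetime, timezone

# The transformation lives in source control; its content hash identifies the version.
CUSTOMER_COUNTRY_SQL = """
SELECT c.customer_id, co.country_name AS customer_country
FROM customer c JOIN country co ON co.country_id = c.country_id
"""

def record_build(target: str, sql: str, log_path: str = "etl_build_log.jsonl") -> dict:
    """Append an audit record tying a build of `target` to the logic used."""
    entry = {
        "target": target,
        "built_at": datetime.now(timezone.utc).isoformat(),
        "logic_sha1": hashlib.sha1(sql.encode()).hexdigest(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Each scheduled load appends one line; on day X you can say precisely which
# version of the logic populated the structure those 20 reports read from.
record_build("dim_customer_country", CUSTOMER_COUNTRY_SQL)
```

A semantic layer defined inside a reporting tool can, in principle, keep the same kind of history; in practice I rarely see it managed that way.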
Sorry, Discoverer gurus… I’ve never been a fan of “faking” real dimensional structures with clever SQL generation. Not because it doesn’t work or won’t perform, but because the management life cycle of reporting tools and their configuration is ages behind the ETL tools. Not saying the ETL tools are great either, but… you get the point.
Overall I enjoyed Stephen’s presentation. GREAT SPEAKER, actually! My favorite line from the day: “Data Marts are kind of like rabbits in the data center. They just start multiplying!” 🙂