Monthly Archives: July 2004

Competitive Advantage for Open Source?

I was reading on Slashdot this morning that Mozilla has been officially recognized as a 501(c)(3) by the United States federal government. Qualifying as a charitable non-profit that gives software to the world can be a significant competitive advantage for Mozilla directly and for Open Source in general.

Being a non-profit provides significant advantages to Mozilla and its aims. There will be opportunities both to decrease outlays on goods (hardware, servers, etc.) and on services (professionals donating time can reasonably deduct the hourly rate for that pro-bono work). Mozilla could, depending on how far they wish to stretch the limits of non-profit status, provide tax breaks to open source developers in the US who contribute at a reasonable rate. I have no idea if they plan on doing this, but it’s an interesting premise all the same, and I think it would be just brilliant. There are also advantages from a revenue perspective.

Companies wishing to support Open Source initiatives previously had to fund that work internally through developer time and the like. While that time is deductible as a business expense, the deduction appears to count directly against the business unit/department/project making the contribution. Companies now have the ability to make a greater contribution and have that contribution to science and humanity reflected in their tax bill. In theory, if Mozilla manages its fundraising efforts properly, it may be able to significantly increase the amount of money it can spend on a central development team, adding clarity and continuity to projects that are full of heart but sometimes lack focus.

I’m not saying that Open Source is just as worthy a cause as many of the other humanitarian and charitable organizations. At the end of the year, I’d still likely spend a few hundred dollars where it has a direct effect on saving and improving lives. Open Source does that too, but in different and proportionately smaller ways. However, providing this logistical benefit to companies wishing to support Open Source is a move in the right direction.

UML from vi

Picked this up from orablogs.com this morning.

The many UML editors have a lot of whizbang features. Useful for sure, especially for environments building intricate applications using methodologies that benefit from the full range of UML diagrams (Activity, Collaboration, Component, Deployment, Model, Sequence, Statechart, Static Structure, Use Case).

In practice, most of us use just a few and use them in a repetitive fashion: build a sequence diagram that documents a use case, print it, include it in the documentation; repeat for all 10 use cases. In my humble opinion, doing repetitive tasks in a GUI is time wasted.

Consider using this tool rather than messing with a GUI for building simple sequence diagrams. Since it takes text as input, you can even run your current .java files through it to render HTML.
It generates the image for documentation, and since the “graph” is actually just a text file, you can check it into CVS. I’ve not worked in Java for some time now, but if I ever need to get back on that bicycle, I’d strongly consider using a “UNIX shell programmer’s” UML tool.

2gig + 2gig = 50gig

I was recently writing up a volume and performance specification for a customer project when a discussion arose with the current DBA staff about volume projections. The intuitive thinking was that the volume requirements for the BI/Data Warehouse would be the sum of the systems from which it sourced data. The group was thinking this would be a good way to approximate the required space for the system. I asserted that this method is flawed and had to explain why: BI volume is proportionate to source volume, but not necessarily directly proportionate.
BI systems require much greater storage than the sum of their sources because:

  • Data are denormalized. Denormalizing data to increase query performance can increase storage anywhere from 1 to 10000 times (it depends on the data, perhaps more).
  • New data are created. Many analytically significant events occur that never even show up in source systems. For instance, a “Sales Fact” will be closely related in volume to “Order Line Items” in a source system. However, many BI solutions also capture business events like “Customer Acquired” and “Customer Lost.” These new business events are derived from source system data but had not previously existed anywhere else.
  • Summaries/aggregates are built for query performance. Much of the data storage requirement is a factor of how much performance is required from common data access patterns (i.e., user reports and ad-hoc analysis). If the data sets are small and end-user performance requirements are minimal, then summaries won’t require a great deal of space.

So, to make a point: a good way to estimate the actual storage requirements is to run sample datasets. Load a month or two of data using the summary/aggregate parameters you think will be required in your deployment. Examine the database’s storage utilization and take measurements, as in the sketch below. Measure it at one day, one week, one month, one year (if possible). Graph it. See what the data looks like. You could even fit a function to project future growth from the curve.
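
As a minimal sketch of what those measurements might look like in Oracle (the DW_OWNER schema name here is just a placeholder for your warehouse schema), a query against DBA_SEGMENTS gives a point-in-time storage figure to record after each sample load:

    -- Storage snapshot for the warehouse schema; run after each sample
    -- load (one day, one week, one month) and record the totals.
    -- Use USER_SEGMENTS instead if you lack access to the DBA views.
    SELECT segment_type,
           COUNT(*)                        AS segments,
           ROUND(SUM(bytes) / 1024 / 1024) AS total_mb
      FROM dba_segments
     WHERE owner = 'DW_OWNER'
     GROUP BY segment_type
     ORDER BY total_mb DESC;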
Most importantly, be ready for it to be different than what you expect! A few reports will require a summary that makes your predictions seem way off base. It’s to be expected; BI is a system, not an application!

Database Operation Complexity Reference

I mentioned this previously, but I’ve been reading “Principles of Distributed Database Systems.” I’m enjoying it, and it’s helping me solidify many of the concepts I apply daily in my capacity as a Principal BI Solutions consultant. Database theory, specifically as it relates to tuning, is part of any professional’s work with Oracle. We’ve all deciphered the performance implications and clues for improvement from EXPLAIN PLAN. I’ve always been told you want to give Oracle the clues/configuration that let it filter results (selection) before joining. It always made sense, but I had never fully understood the concepts behind these recommendations. Until now…

I don’t claim to understand all of the complexity behind these issues, but I ran across a great reference chart. I wanted to post it here, along with some numbers for demonstration, as a quick reference for any professional who understands quadratic, logarithmic, and linear scales, to match those up with the operations we use on a day-to-day basis.

This is from the book mentioned above; however, I’m sure it’s rather common knowledge.

OPERATION                   COMPLEXITY
SELECT                      O(n)
PROJECT (without dedup)     O(n)
PROJECT (with dedup)        O(n*log n)
GROUP                       O(n*log n)
JOIN                        O(n*log n)
SEMIJOIN                    O(n*log n)
DIVISION                    O(n*log n)
SET OPERATORS               O(n*log n)
CARTESIAN PRODUCT           O(n²)

A quick look at some numbers in the orders mentioned yields the following “costs,” in numbers of operations (logarithms base 10):

n        O(n)     O(n*log n)  O(n²)
5        5        3.49        25
10       10       10          100
100      100      200         10000
1000     1000     3000        1000000
1000000  1000000  6000000     1E+12
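
If you want to reproduce these figures yourself, a quick Oracle query will do it (LOG(10, n) matches the base-10 logarithms used above):

    -- Generate the cost table above for a few sample values of n.
    SELECT n,
           n                        AS cost_o_n,
           ROUND(n * LOG(10, n), 2) AS cost_o_nlogn,
           n * n                    AS cost_o_n2
      FROM (SELECT 5 AS n FROM dual UNION ALL
            SELECT 10 FROM dual UNION ALL
            SELECT 100 FROM dual UNION ALL
            SELECT 1000 FROM dual UNION ALL
            SELECT 1000000 FROM dual)
     ORDER BY n;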

Hope this helps provide a useful way to match up the database operations we use on a daily basis with the theoretical cost of such operations. Cheers!

Quick Search Solution in Oracle

Found a nice post on rittman.net in reference to a simple search solution posted on Eric Mortensen’s blog.

It allows for a simple interface and leverages the basic Oracle SQL wildcard (%) to implement a search on a full-table basis. It’s a nice, straightforward solution that takes all fields and concatenates them into one delimited field. Using this format, one can quickly write simple SQL that does simple searches based on a single field substitution.
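
The original post has the details, but the pattern is roughly this (a sketch with hypothetical table and column names, not the exact solution from the post):

    -- Concatenate the searchable columns into one delimited field...
    CREATE OR REPLACE VIEW customer_search AS
    SELECT customer_id,
           first_name || '|' || last_name || '|' || city || '|' || email
             AS search_text
      FROM customers;

    -- ...so a search becomes a single substitution into one predicate.
    SELECT customer_id
      FROM customer_search
     WHERE UPPER(search_text) LIKE UPPER('%' || :search_term || '%');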

I wonder if this couldn’t be extended to use regular expressions, and if so, which would provide better performance. I’m certain that Oracle is tuned for text scanning with %, so I’m not sure that regular expressions would beat out the wildcard searching.

Perhaps this could even be implemented as a view, though I’m sure there are trade-offs in having to access the records in the underlying table at query time. Perhaps a fast-refresh materialized view, or an ordinary materialized view, could solve that problem; a sketch follows.
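
For instance, something like this (again with hypothetical names; a fast refresh would additionally require a materialized view log on the base table):

    -- Materialize the concatenated search field so searches avoid
    -- touching the base table at query time.
    CREATE MATERIALIZED VIEW customer_search_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
    AS
    SELECT customer_id,
           first_name || '|' || last_name || '|' || city || '|' || email
             AS search_text
      FROM customers;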

10g, Productivity with Choice

During a recent conversation about Open Source/industry standards/vendor solutions, I ended up spouting out that Oracle’s philosophy on its offering is “productivity with choice.” I then realized I had no idea what that actually meant, so I looked it up… I am by no means proficient with JDeveloper (I’ve fired it up a time or two to look at some BI Beans components). I don’t fully know the practicalities of the JDeveloper package, but I can say I understand what Oracle is doing: for the 80% of what you do day to day, make it wizard-based and leverage RAD environments; for the rest, roll your own, and Oracle has a laundry list of acronyms to suit.

Web Analytics meets SimCity

I have a customer with a few web properties. I am helping them sort out some of their web reporting needs and integrate them into a BI environment. A few weeks back I mentioned to them, mostly jokingly, that they should take a look at VisitorVille. They took the advice (although it wasn’t really advice, just water-cooler talk) and gave it a go. They pulled it straight away for technical reasons, but I think they are still planning on hooking it up at some point.

It’s a rather unique product/service, and I’m not sure it’s applicable for many businesses, but I must say it is impressive looking. I think it goes to show that there are many different and appropriate ways to view data enriched by a Business Intelligence system. There will always be wonderful and impressive ways to view data based on the particular data set, metaphor, and audience. I’m betting that the in-between (data consumption, extraction, transformation, derivation, application of real-time business logic, etc.) will remain much the same, but presentation is where the creative types can add immense value and impact.

Too bad the business users can’t use VisitorVille like SimCity: build another skyscraper (web page) or a bus terminal from out of town (a referral from an outside site) as easily as the real SimCity game allows. 🙂 Website owners would LOVE that… manage your website from a game terminal.

CVS is pretty cool

OK, this is by no means timely or newsworthy… but all the same, I wanted to put some thoughts together about release management using CVS. I had a recent conversation with one of the application developers for a customer of mine in Boston, MA. This client had been using some pieces of CVS for their website release management. With the recent departure of their BUILD-MEISTER for greener pastures, the group was left without a real CVS guru. It never hurts to write about a few of the great features of CVS that are powerful when embraced by development teams.

CVS is Concurrent, as its name suggests. OK, so this isn’t really a surprise, but everything about CVS is built around the fact that multiple people are going to need to work on the contents of the repository at the same time.

CVS builds on some great UNIX tools. Diff/merge and the like are leveraged as part of the CVS system, so you can use some pretty cool packages and viewing mechanisms to resolve conflicts. Because it leverages these tools, you have some flexibility for extensions. You can also use it remotely over ssh to further enhance the availability and security of the system.

CVS is widely used, so it is easy to pair with other products. Developers/vendors will likely provide support for CVS if they are building packages meant to integrate with a source control system. For this particular client, that means easy integration with Ant for building, jarring, and deploying their applications; I’ve built some rather robust Ant build scripts that centralize most of a QA/build process. There are plenty of web/Windows/*nix clients for interfacing with a CVS repository, which means you’re not going to get locked into just one vendor’s particular client/repository.

CVS can help you manage releases. Do development on 2.0 while still being able to fix critical bugs in the 1.x branch, then fold those fixes in using some of the merging capabilities (a sketch of that workflow follows). The CVS tool is immensely useful for the administrator who is aware of branching, merging, and tagging.
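
A minimal sketch of that branch-and-merge workflow at the command line (module, tag, and branch names here are made up for illustration):

    # Cut a 1.x maintenance branch from the 1.0 release tag
    cvs rtag -b -r RELEASE_1_0 BRANCH_1_X myproject

    # Check out the branch, fix the critical bug, commit to 1.x only
    cvs checkout -r BRANCH_1_X myproject
    cvs commit -m "Critical fix for the 1.x release"

    # Later, fold the 1.x fixes back into the 2.0 trunk
    cvs checkout myproject
    cvs update -j BRANCH_1_X
    cvs commit -m "Merged 1.x fixes onto the trunk"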

Now, if only Oracle Development could be “managed in text files” so I could put it into CVS… 🙂

Jini, the silent coming of age

I have always been fascinated by Jini. I’ve attended two of the Jini community meetings and kicked the tires on many of the research projects at jini.org. In many ways it was a technology ahead of its time, and since it didn’t make a huge SPLASH during the dot-com boom, it hasn’t been adopted en masse.
NOTE: many of these links may require registration with the jini.org community and acceptance of the Sun SCSL (which is also a deterrent to the growth of this wonderful technology)

There are many community members much more familiar than I am with the state of Jini adoption. However, I do continue to hope for something to happen with regard to its uptake. It still seems to be very much on the fringe. Surprisingly, there are Fortune 100 companies using Jini.

There are more airline reservations made on Jini-based systems (Orbitz, aa.com, nwa.com) than on any other electronic system (according to some information from Orbitz). They even won a Duke’s Choice Award:

Duke’s Choice Awards — Orbitz has been selected as a winner of the 2004 Duke’s Choice Award, recognizing the “best of the best” from among all the cool projects going on in the world of Java technology. Orbitz team members are presenting TS-2614 at JavaOne. See why they won this award.

It’s a good technology that didn’t originally come with the whizbang set of installation wizards that the frenzy of the dot-com era required. It originally took the skill and aptitude of distributed computing engineers to recognize its benefits, which left it firmly on the fringe. Some of the projects being built on top of Jini offer great additions for problems that Jini has a competitive advantage in solving.

My personal interest, if I ever have one of those things that consultants refer to as “Extended Research Periods,” would be how Jini could address some of the data warehousing issues I face on a day-to-day basis. As someone with domain knowledge of the problem (OLAP, huge fact tables, distributed query processing), could I use the current Jini technology and enhancements (such as Rio, ComputeFarm, etc.) to build a wonderful distributed BI infrastructure? 🙂 Stay tuned… perhaps I’ll have one of those periods coming up!