DynamoDB: Built-in Time Dimension support!

DynamoDB (aka LucidDB) is not just another column-store database. Our goal is to be the best database for actually doing Business Intelligence; while that means being fast and handling large amounts of data, there are plenty of other things BI consultants/developers need. I’ll continue to post about some of the great BI features DynamoDB has for the modern datasmith.

The first feature to cover, and it’s a dead easy one, is the built-in ability to generate a time dimension, including Fiscal Calendar attributes. If you’re using Mondrian (or, come to that, your own custom SQL on a star schema) you need a time dimension. Time is the most important dimension! Every OLAP model I’ve ever built uses one! It’s something that you, as a datasmith, will need on every project; that’s why we’ve built it right into our database.

Here’s a dead simple way to create a fully baked, ready-to-use Time Dimension for Mondrian.

-- Create a view that is our time dimension for 10 years, with our
-- Fiscal calendar starting in March (3)
create view dim_time as select * from
table(applib.fiscal_time_dimension (2000, 1, 1, 2009, 12, 31, 3));

OK, that’s it. You’ve created a Time Dimension! (See the NOTE at the end of this post.)
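
A quick sanity check: ten years spanning 2000-01-01 through 2009-12-31, including three leap years, should yield 3,653 rows.

-- Expecting 3653 rows: 365 * 10 + 3 leap days
select count(*) from dim_time;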

So, we’ve created our time dimension, complete with a Fiscal calendar for 10 years in a single statement! Awesome – but what does it contain?

-- Structure of new time dimension
select "TABLE_NAME", "COLUMN_NAME", "DATATYPE" from sys_root.dba_columns
where table_name = 'DIM_TIME';
+-------------+---------------------------------+-----------+
| TABLE_NAME  |           COLUMN_NAME           | DATATYPE  |
+-------------+---------------------------------+-----------+
| DIM_TIME    | FISCAL_YEAR_END_DATE            | DATE      |
| DIM_TIME    | FISCAL_YEAR_START_DATE          | DATE      |
| DIM_TIME    | FISCAL_QUARTER_NUMBER_IN_YEAR   | INTEGER   |
| DIM_TIME    | FISCAL_QUARTER_END_DATE         | DATE      |
| DIM_TIME    | FISCAL_QUARTER_START_DATE       | DATE      |
| DIM_TIME    | FISCAL_MONTH_NUMBER_IN_YEAR     | INTEGER   |
| DIM_TIME    | FISCAL_MONTH_NUMBER_IN_QUARTER  | INTEGER   |
| DIM_TIME    | FISCAL_MONTH_END_DATE           | DATE      |
| DIM_TIME    | FISCAL_MONTH_START_DATE         | DATE      |
| DIM_TIME    | FISCAL_WEEK_NUMBER_IN_YEAR      | INTEGER   |
| DIM_TIME    | FISCAL_WEEK_NUMBER_IN_QUARTER   | INTEGER   |
| DIM_TIME    | FISCAL_WEEK_NUMBER_IN_MONTH     | INTEGER   |
| DIM_TIME    | FISCAL_WEEK_END_DATE            | DATE      |
| DIM_TIME    | FISCAL_WEEK_START_DATE          | DATE      |
| DIM_TIME    | FISCAL_DAY_NUMBER_IN_YEAR       | INTEGER   |
| DIM_TIME    | FISCAL_DAY_NUMBER_IN_QUARTER    | INTEGER   |
| DIM_TIME    | FISCAL_YEAR                     | INTEGER   |
| DIM_TIME    | YEAR_END_DATE                   | DATE      |
| DIM_TIME    | YEAR_START_DATE                 | DATE      |
| DIM_TIME    | QUARTER_END_DATE                | DATE      |
| DIM_TIME    | QUARTER_START_DATE              | DATE      |
| DIM_TIME    | MONTH_END_DATE                  | DATE      |
| DIM_TIME    | MONTH_START_DATE                | DATE      |
| DIM_TIME    | WEEK_END_DATE                   | DATE      |
| DIM_TIME    | WEEK_START_DATE                 | DATE      |
| DIM_TIME    | CALENDAR_QUARTER                | VARCHAR   |
| DIM_TIME    | YR                              | INTEGER   |
| DIM_TIME    | QUARTER                         | INTEGER   |
| DIM_TIME    | MONTH_NUMBER_OVERALL            | INTEGER   |
| DIM_TIME    | MONTH_NUMBER_IN_YEAR            | INTEGER   |
| DIM_TIME    | MONTH_NUMBER_IN_QUARTER         | INTEGER   |
| DIM_TIME    | MONTH_NAME                      | VARCHAR   |
| DIM_TIME    | WEEK_NUMBER_OVERALL             | INTEGER   |
| DIM_TIME    | WEEK_NUMBER_IN_YEAR             | INTEGER   |
| DIM_TIME    | WEEK_NUMBER_IN_QUARTER          | INTEGER   |
| DIM_TIME    | WEEK_NUMBER_IN_MONTH            | INTEGER   |
| DIM_TIME    | DAY_FROM_JULIAN                 | INTEGER   |
| DIM_TIME    | DAY_NUMBER_OVERALL              | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_YEAR              | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_QUARTER           | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_MONTH             | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_WEEK              | INTEGER   |
| DIM_TIME    | WEEKEND                         | VARCHAR   |
| DIM_TIME    | DAY_OF_WEEK                     | VARCHAR   |
| DIM_TIME    | TIME_KEY                        | DATE      |
| DIM_TIME    | TIME_KEY_SEQ                    | INTEGER   |
+-------------+---------------------------------+-----------+

-- Let's look at a few rows (only the first 10 shown here)
select time_key_seq, time_key, yr, month_number_in_year, fiscal_year
, fiscal_month_number_in_year from dim_time;
+---------------+-------------+-------+-----------------------+--------------+------------------------------+
| TIME_KEY_SEQ  |  TIME_KEY   |  YR   | MONTH_NUMBER_IN_YEAR  | FISCAL_YEAR  | FISCAL_MONTH_NUMBER_IN_YEAR  |
+---------------+-------------+-------+-----------------------+--------------+------------------------------+
| 1             | 2000-01-01  | 2000  | 1                     | 2000         | 11                           |
| 2             | 2000-01-02  | 2000  | 1                     | 2000         | 11                           |
| 3             | 2000-01-03  | 2000  | 1                     | 2000         | 11                           |
| 4             | 2000-01-04  | 2000  | 1                     | 2000         | 11                           |
| 5             | 2000-01-05  | 2000  | 1                     | 2000         | 11                           |
| 6             | 2000-01-06  | 2000  | 1                     | 2000         | 11                           |
| 7             | 2000-01-07  | 2000  | 1                     | 2000         | 11                           |
| 8             | 2000-01-08  | 2000  | 1                     | 2000         | 11                           |
| 9             | 2000-01-09  | 2000  | 1                     | 2000         | 11                           |
| 10            | 2000-01-10  | 2000  | 1                     | 2000         | 11                           |
+---------------+-------------+-------+-----------------------+--------------+------------------------------+

Generating the Time Dimension is accomplished using DynamoDB's ability to run Java-based UDX table functions. Table functions are really powerful – they allow a BI developer to write custom functions whose output is a “table” that can be queried like ANY OTHER TABLE (mostly). Check out the wiki page FarragoUdx if you're interested.
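
As a quick illustration (a sketch using the same applib function and the column names shown above), you can even skip the view and query the table function directly, filtering it like any other table:

-- Query the table function directly, like any other table
select time_key, fiscal_year, fiscal_month_number_in_year
from table(applib.fiscal_time_dimension(2000, 1, 1, 2000, 12, 31, 3))
where month_number_in_year = 1;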

And of course: download LucidDB and give it a whirl!

NOTE: To be candid, doing it as a view isn’t the best approach. For anything beyond tiny (5 million rows +) we should actually create the table and do an INSERT INTO ... SELECT * FROM TABLE(applib.fiscal_time_dimension(...)).
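
Here's a sketch of that approach, using a hypothetical dim_time_tbl name so it doesn't collide with the view created earlier (column list elided; it matches the structure shown above):

-- Materialize the time dimension into a real table instead of a view
create table dim_time_tbl (
fiscal_year_end_date date,
fiscal_year_start_date date,
-- ... remaining columns, in the order listed above ...
time_key date,
time_key_seq integer
);

insert into dim_time_tbl
select * from table(applib.fiscal_time_dimension(2000, 1, 1, 2009, 12, 31, 3));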

Book Review: Pentaho Reporting 3.5 for Java Developers

I have two customers who, if they'd had access to Will Gorman's book, Pentaho Reporting 3.5 for Java Developers, would not have needed me for their projects! That’s how good the book is for those who need to embed Pentaho Reporting in their Java application.

The book is certainly geared towards Java developers – specifically, developers who are trying to use the Pentaho reporting library directly. I’d venture to say that MOST customers should simply be using Pentaho itself; for them, the book is useful as a reference, but the HOWTO material past Chapter 3 would probably be lost on many users – except for Chapter 11 (see below).

However, for people trying to embed Pentaho Reporting, WOW: THIS IS THE DEFINITIVE RESOURCE. Buy it, RIGHT NOW! The information it contains was locked in just a few people's minds (Thomas, the bunch of people sitting at the “citadel” in Orlando aka Pentaho employees, and a handful of consultants). Will has unlocked it, and I’m glad he did.

Will taught me something new in this book. In fact, I hope this is “new” in 3.5, which was released just a few weeks back. If it’s been around longer, then I’m a total dolt. Chapter 11 covers how to add your own custom Expressions/Formulas to Pentaho (including the PRD).

At customer engagements, or when I put on my Pentaho hat and teach their public courses or custom onsite training, I’m asked all the time: can I make my own reporting functions and plug them into Pentaho Report Designer? Up until Will showed me how to do it on page 281, I thought this was only possible for Pentaho (the company). Will gives us a step-by-step guide to adding our own “DoMyCustomThing” to the Pentaho Report Designer. Customers can now create their own corporate expressions/functions and leverage them across hundreds of reports.

I’ll keep several copies on my shelf and give them away to any current/future “embedded Pentaho Reporting” customers. Thanks, Will, for such a great book!

Dreamhost Uptime Numbers are TERRIBLE!

I don’t care what their marketing stats say; I have my own independent verification. I’ve been using Wormly for quite a while to monitor some of my demo sites and other services that are part of Bayon and part of Dynamo. Since I was already paying for it, I figured I’d turn it loose on this blog (nicholasgoodman.com) and see what the uptime was like.

I always thought Dreamhost was a little skittish, and my inbox finds approximately one email per day reporting a failure, but I figured those were small, single-request failures. Nope. Independent measurement of this blog's uptime comes out at a CRUDDY, CRAPPY 97.6% – which works out to roughly 17 hours of downtime per month.

[Screenshot: Wormly uptime report showing 97.6%]
That’s pathetic! My blog is nothing special: an out-of-the-box WordPress installation backed by their MySQL. I haven’t done any of my own installations or customizations (excepting a theme), and yet my blog uptime is awful. I’ve liked the Dreamhost panel; it gives the “technical but uninterested in actually administering their own server” user a lot of power, and I’d be willing to tolerate a little downtime (truthfully, anything above 99.5% is OK with me). But 97% uptime? Shyeah… Time to start looking.

Anyone have any suggestions for good WordPress / PHP / MySQL hosts? Willing to pay top dollar, and I’ll bring with me registrations for about 25 domains.

Instant Relief from MySQL Reporting Queries: Incremental Updates

Yesterday, I covered how you can do an initial “replication” of data from MySQL to DynamoDB, and how this can improve performance and save storage space. The follow-on question becomes:

That’s great, Nick. But how do I keep my data up to date?

We’ve got data in our Airline Performance dataset through 31-DEC-2007. I loaded one year, all of 2007, for the previous example. What happens when the FAA publishes their January 2008 results and we’ve loaded the new month’s worth of data into MySQL?

MySQL:

select count(*) from otp.ontime; 8061223
select count(*) from ontime where FlightDate > '2007-12-31'; 605765
select count(*) from ontime where FlightDate <= '2007-12-31'; 7455458

DynamoDB:

select count(*) from FASTER."ontime"; 7455458

So, we’ve added approximately 600k new records to our source system that we don’t have in our reporting system. How do we incrementally insert these records and get just the 600k new rows into our DynamoDB reporting instance?

Easy Easy Easy.

We’ve already done all the setup work; all we have to do now is grab the records we haven’t processed yet! It should take just a few minutes to get our current table “up to date” with the one over in MySQL.

DynamoDB:

select max("FlightDate") from FASTER."ontime"; 2007-12-31
insert into FASTER."ontime" select * from MYSQL_SOURCE."ontime" where "FlightDate" > DATE '2007-12-31'; 605765

In other words, let’s select from MySQL any records whose date is beyond what we have currently (2007-12-31).
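
If you'd rather not hard-code the date, the same idea works as a single statement with the max-date lookup pushed into a subquery. A sketch (the hard-coded version above is what was actually run):

-- One-statement incremental load: pull only rows newer than what we already have
insert into FASTER."ontime"
select * from MYSQL_SOURCE."ontime"
where "FlightDate" > (select max("FlightDate") from FASTER."ontime");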

select count(*) from FASTER."ontime"; 8061223
select count(*) from FASTER."ontime" where "FlightDate" > DATE '2007-12-31'; 605765

MySQL:
While the DynamoDB INSERT statement was running, the following SQL was being run on MySQL.

show processlist reveals a SQL session running the following query:
SELECT * FROM `ontime` WHERE `FlightDate` > DATE '2007-12-31';

A single SQL statement (insert into ... select * from table where date > last time) has you up to date for reporting! Long term we may look to work with Tungsten to keep our data up to date using replication binlog records, but for now this simple pull-based approach does the job.

Instant Relief from Slow MySQL Reporting Queries using DynamoDB

Here’s the scenario. You’ve got a denormalized reporting table in MySQL with a few million rows. You’ve got a Pentaho Report that queries this MySQL table. You have two problems with the current report.

  1. Your users are complaining that the query is slow, and they have to wait around for longer than they’d like to see their report. (approx 40s)
  2. Your DBAs are cranky because they see the size of this table is getting bigger. (approx 1.8GB)

MySQL is fundamentally designed to be an OLTP database, and while it does a fantastic job at that, its data warehouse features were built as “bolt-on” additions. Can it be used for BI? Absolutely; I’ve used it at many customer sites. Does DynamoDB provide a better set of features/capabilities for doing BI? We think so! Are they both 100% open source? You bet; so why not choose the right tool for the right job?

DynamoDB (aka LucidDB) is a “purpose built for BI” database. What does that mean? Well, I’ll be blogging about a lot of features that speak to our philosophy of a complete “BI Database,” not just a fast one. One of the features that makes LucidDB complete, and not just a drag racer, is its ability to connect to remote data sources via JDBC and retrieve data. If you’re doing simple table replications, you don’t have to use an ETL tool, do exports and imports, run LOAD DATA INFILEs, etc. Our ability to connect to remote databases and access them as “remote tables” makes retrieving data into DynamoDB as easy as “insert into mytable select * from remote_table.”

Back to our original issue with our current MySQL setup.

Our report is slow, and our database is big. How slow? Well, not really that bad, but at about 40s per query run it’s enough to tempt your business analyst to go fetch a coffee instead of continuing his work. How big? Well, not really that big, but at about 1.8GB it’s starting to get non-trivial in terms of tuning the I/O, etc.

Our goal is to improve both using DynamoDB; we’ll leave MySQL as our main OLTP application. We’re not trying to replace it – in fact, we’ll embrace MySQL as the system of record and simply “slurp/report” off this table in a separate reporting environment.

It’s a two step process.

  1. Connect from DynamoDB to MySQL using a JDBC connector, access the remote table, and pull over the data using a simple INSERT statement.
  2. Change our Pentaho Report to use the DynamoDB JDBC connector instead of MySQL.

Our Pentaho Report is based on the following SQL:

SELECT t.Carrier as "CARRIER",
c as "C", c2 as "C2", c*1000/c2 as "C3" FROM
(SELECT Carrier, count(Carrier) AS c FROM ontime
WHERE DepDelay>10 GROUP BY Carrier) t JOIN
(SELECT Carrier, count(Carrier) AS c2 FROM ontime
GROUP BY Carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;

[Screenshot: report query results]
This takes approximately 40s to run on a MySQL database running on the same machine.

Step 1: Connect, and load the data into our DynamoDB table.

-- Create DynamoDB reporting table first
create schema faster;

create table faster."ontime" (
"Year" int,
"Quarter" tinyint,
"Month" tinyint,
-- ... abbreviated for brevity ...
"Div5TailNum" varchar(10)
);

-- Get access to the MySQL table ontime in the otp schema on localhost
create schema MYSQL_SOURCE;
set schema 'MYSQL_SOURCE';

CREATE SERVER MYSQL_REMOTE_SOURCE FOREIGN DATA WRAPPER
sys_jdbc OPTIONS (
driver_class 'com.mysql.jdbc.Driver',
url 'jdbc:mysql://localhost/otp?useCursorFetch=true',
user_name 'root',
password 'easy',
fetch_size '1000',
table_types 'TABLE',
schema_name 'otp');

import foreign schema OTP from server MYSQL_REMOTE_SOURCE into MYSQL_SOURCE;

-- Load DynamoDB table from MySQL database directly
insert into FASTER."ontime" select * from MYSQL_SOURCE."ontime";

Notice that last statement. You don’t have to export to intermediate files or use an ETL tool (not that that’s bad – I’m a big fan of ETL tools!). You can use good old-fashioned SQL to get data from a remote database into DynamoDB.
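
A side benefit worth noting: once the foreign schema is imported, the remote table can also be queried in place through the JDBC wrapper, which makes sanity-checking a load easy. A quick sketch:

-- Peek at the remote MySQL table without loading anything
select count(*) from MYSQL_SOURCE."ontime";

-- Compare against what landed locally
select count(*) from FASTER."ontime";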

Step 2: Change the Pentaho Report to use the new connection.

We open up our report and change our connection from MySQL

[Screenshot: MySQL connection settings]
to DynamoDB

[Screenshot: DynamoDB (LucidDB) connection settings]
NOTE: Until we finish our QA’ed builds we’re using the LucidDB driver instead of a DynamoDB one, but they are one and the same.

We make some minor adjustments to the SQL (quoting some tables/columns, etc – a sketch follows below), rerun our query, and voila: our report runs in 10s, down from 40s, a 4x improvement.
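
For the curious, here's a sketch of what those adjustments might look like, assuming the case-sensitive column names carried over from the MySQL DDL above:

SELECT t."Carrier" as "CARRIER",
c as "C", c2 as "C2", c*1000/c2 as "C3" FROM
(SELECT "Carrier", count("Carrier") AS c FROM FASTER."ontime"
WHERE "DepDelay" > 10 GROUP BY "Carrier") t JOIN
(SELECT "Carrier", count("Carrier") AS c2 FROM FASTER."ontime"
GROUP BY "Carrier") t2 ON (t."Carrier" = t2."Carrier") ORDER BY c3 DESC;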

How about storage? Our storage report shows that DynamoDB is using only 0.3GB to store the same 7 million records, compared to MySQL at 1.8GB – 1/6 of the storage.

Not a bad investment of a few minutes of time, I’d say. DynamoDB (LucidDB) takes just a few minutes to install, and because of its focus on BI you should find things like retrieving data from remote data sources easy and effective. Let’s be truthful here as well: once you speed up a report by 4x and cut its storage by 6x, your boss will be calling you a dynamo.

Notes: Full set of scripts posted here: mysql_relief.zip. Original queries and dataset from Vadim at MySQLPerformanceBlog.

DynamoBI: website? bits?

Well, what a soft launch it has been. 🙂

Some people have asked:

When are you going to get a website? Errr… soon! We soft-launched a bit early, due to some “leaking information,” but figured heck, it’s open source – let’s let it all out. Soon enough, I swear!

Where can I download DynamoDB? Errr… you can’t yet, because we haven’t finished our build/QA/certification process.

However, since DynamoDB is the alter-ego, business-suit-wearing brother of LucidDB, just download the 0.9.2 release if you want to get a sense of what DynamoDB is.

There are 3 built binaries (Linux 32, Linux 64, and Windows 32): http://sourceforge.net/projects/luciddb/files/luciddb/luciddb-0.9.2/ and you can find installation instructions here.

DynamoDB will have the same core database, etc. So, from a raw feature/function perspective, what you download and see with LucidDB will be what you get in DynamoDB. DynamoDB will also have an administration UI to make things like setting up foreign servers and managing users easier, plus lots of other cool new features on the longer-term roadmap – which, if/when we get a website, would be a great place for all that to go!

Until then, use the open source project, LucidDB. I think you’ll like it!

LucidDB: DynamoBI is running with it

I can think of no better analogy than a multi-leg relay race. You know, the races where one sprinter runs as fast as they can before passing the baton to the next sprinter.

[Image: relay race baton pass]

First it was Broadbase.
Second it was LucidEra.
Third it was Eigenbase / LucidEra / SQLstream (joint development w/ Eigenbase).

Having purchased the commercial rights from LucidEra, it’s ours to run with now, alongside Eigenbase and SQLstream.

LucidDB has been described as the “best database no one ever told you about.” That stops today (the telling part, not the best part). Dynamo Business Intelligence Corp will take this great technology to a wider audience and we’ll be telling EVERYONE about it!

Over time, the exceptional features of this open source project will come to light (column store, bitmap indexes, drop-in Java-based user plugins, transparent remote JDBC data access, etc). I think it is important to acknowledge how LucidDB arrived where it is today.

LucidDB is built by smart, smart people (people wayyyy smarter than me!). People who’ve written parallel execution engines in Oracle. People who’ve developed bitmap index implementations and helped file those patents. The heritage of LucidDB starts at Broadbase; LucidEra purchased it and brought it to Eigenbase. Eigenbase, and its sponsoring companies, have the most claim to its current state. Their stewardship and ongoing evolution of the project is a testament to their talents and commitment to open source development. When you pick up LucidDB/DynamoDB and get your first “Ahhhh, cool! 10x faster than my current database” moment, you have the LucidEra/SQLstream/Eigenbase devs to thank. John V. Sichi (lead and main project sponsor), Tai Tran, Julian Hyde, Rushan Chen, Zelaine Fong, Sunny Choi, Steve, Marc, Richard, Hunter, Edan, Damian, Boris, Benny, Stephan, Oscar, …. and the list goes on and on and on. Some of these people will be helping (in small and big ways) with the new company, which is great for customers: the people who wrote this stuff will be helping them be successful!

What’s the plan?

  • Open Source.
    Lots of it. Readers of this blog, or anyone who knows me, will know I’m a “burn the boats” open source kind of guy. We’ll be creating some new projects to make using the features/functions already in LucidDB easier. We’ll also be adding new features, which will make their way back into the LucidDB mainline.
  • Commercial in Name Only.
    Mainline DBMS enhancements and development happen, and will continue to happen, in LucidDB (Eigenbase). New projects will be available under an OSI-approved license. DynamoDB is the prepackaged, assembled, UI-included distribution built for customers/evaluators that we’ll offer support on. It should be as easy as we can possibly make it to evaluate, purchase, and use.
  • In Progress.
    We’ve let the announcement out ahead of having our website built, or having completed our own QA’ed DynamoDB build. Our open source roots guide us to an “early and often” approach, and we’re taking that approach here. Be patient with us as we roll out the business bit by bit over the next few months. Our #1 priority: establish our support/build/QA infrastructure and get an already great piece of software into the hands of people who can benefit from it. Hint: if you’ve ever done a star schema on MySQL, you need to talk to us!

One thing I am personally looking forward to is getting to work even more extensively with everyone involved at Eigenbase, including the very talented devs at SQLstream (who produce the best real time analytics/integration engine available).

Feel free to join up in taking LucidDB to a whole new level: download LucidDB and give it a go yourself – we just released a new version (0.9.2) yesterday! I believe, as others have already mentioned, that adding a bit of commercial support behind an already great piece of software is a winning combination!

Drop me a line if you’re interested in getting involved early on (as a charter customer, developer, user, etc): ngoodman at bayontechnologies (with the .COM).

What if Hot Dog vendors sold you Power Tools?

Well, since it’s been topical, and I can never resist an urge to discuss open source licensing…

What do you buy from a Hot Dog Vendor? Hot Dogs, duh!

What do you buy from an Office Supplies Vendor? Office Supplies, duh!

What do you buy from (most) Open Source Vendors? Proprietary Software, … huh?

Let’s do this another way…

What do you get if you buy Yellow Products? Products that are Yellow, duh.

What do you get if you buy Enterprise Software Products? Products that are fit for use by Enterprises, duh!

What do you get if you buy (most) Open Source Products? A proprietary product built on an open source project, … huh?

If you want to be called an “insert term here” vendor, you should sell “insert term here.” Otherwise you aren’t really “vending” it; you’re just using it as part of a strategy, marketing, development method, etc. Which, in my opinion, is what open source is: a way to develop and distribute software, not what you are selling. Very few open source companies actually sell an “Open Source Product.” They sell a proprietary one, plus services, built on top of a great open source project – aka Open Core.

Most “Open Core” companies should simply be described as “Software Companies with exceptional Open Source development models.” You cannot purchase an “open source product” from an Open Core company. You can purchase their proprietary product built on top of the open source project, but there is no product you can buy from most Open Core companies that is “open source.”

See the difference? Product and Project are not interchangeable. Vendor and “Model/Company” are not interchangeable.

Don’t get me wrong – I’m no enemy of Pentaho by any means; quite the opposite. Just last week I wholeheartedly recommended that a customer renew their EE subscription based ONLY on the new Pentaho Analyzer (which is GREAT, btw)! None of this negates the value Open Core companies sell to customers; they still deliver exceptional value. I don’t call into question the validity of the Open Core model and its mutual benefit for those involved (as James points out consistently in his BeeKeeper).

Open Core Companies just shouldn’t be surprised when people experience cognitive dissonance when they buy a proprietary product without an open source license from an “open source vendor.”

Pentaho Solutions Book

I wonder how many people we can get looking like this?

In all seriousness, the book looks great. I'll bring it on my next extended plane ride – I already saw a brilliant diagram (Figure 9-2 on page 234) that perfectly shows PDI's “streaming” architecture. Amazon apparently was trying to exceed expectations (“you’ll get it by 9/11”), yet here I have it in hand already.

I'll put together a more thoughtful review and post it here for those who might be on the fence about buying it. Based on my initial look, I’m sure the book is worth well beyond its price as a practical guide to using Pentaho.