Monthly Archives: November 2009

Asking this question means you don't get BI market

In almost every technology company, if you’re explaining your business model savvy technology executives ask the question:

Who are you selling this too?  What’s his/her title, where does he work?  What’s the size of their company?

It’s a question that helps the questioner understand, and the responder clarify exactly who is buying the product.  This is critical for a business!  Is it the System Administrator manager who is looking for his DBAs to coordinate their efforts (groupware for DBAs)?  Is it a CRM system that the business users are primarily evaluating for use (salesforce/sugarcrm) but requires huge IT investment for configuration/integration??  For a software or IT services provider, deciding WHO you sell to (Business Users or IT) is hugely important!

Asking this question when that technology or services is in Business Intelligence is just plain useless though. Business Intelligence is always a mix of the two.  IT?  Sometimes they’re the ones buying, but never without HUGE amounts of time spent with the business side (casual report developers and business analysts).  Business Users buying tableau, and coordinating parts of the purchase or data access with IT?  Yup.   Analysts embedded with business teams buying SAS/SPSS, through an IT purchasing process?  Sure thing.

Business Intelligence is always sold to two groups at once which makes it a tricky thing to sell.  Anyone reading this, consider how much tension you’ve observed between your IT/Business groups.  Trying to get dead in the middle on this is a tricky proposition.

Business Intelligence sales guy earn their money for sure!

DynamoDB: Time Dimension table with MERGE

So, even with my disclaimer note on the last blog, the DynamoDB developers slapped me around a bit for suggesting using a view for a Time Dimension. The Time Dimension is the most important dimension table and should be an actual table, not a view. Creating the table allows us to perform all kinds of optimizations like star joins, bitmap indexes on attributes, etc. Probably wouldn’t be that big of a deal for a tiny fact table (< 5million records) but you’ll want CREATE TABLE if you want good performance.

Good news is, that we can use our exact same table function (with fiscal year offset)

select * from table(applib.fiscal_time_dimension (2000, 1, 1, 2009, 12, 31, 3))

to populate and keep our Time Dimension TABLE up to date.

If you use a TABLE, it’s 2 steps:

  1. CREATE TABLE : “dim_time”
  2. POPULATE TABLE : “merge into dim_time”

We’ll be using another great tool in the DynamoDB / LucidDB toolkit, the MERGE statement. The MERGE statement is a logical UPSERT. It checks to see if key is already present. If it is, we UPDATE the table. If it isn’t, we INSERT it into the table. I’ll go into more detail at some point in the future as MERGE is crucial for keeping dimensions up to date.

Let’s create our Time Dimension table:

create table dim_time (
FISCAL_YEAR_END_DATE DATE
, FISCAL_YEAR_START_DATE DATE
... ABBREVIATED ...
, TIME_KEY DATE
, TIME_KEY_SEQ INTEGER
,constraint dim_time_pk primary key (day_from_julian));

NOTE: We’ve abbreviated the statements, but all the columns are used in the actual scripts. We also should add bitmap indexes on YR, MONTH, etc columns.

We’ve now got a TABLE that matches the VIEW we created in the previous blog. We’ve made day_from_julian as our PK, and we’ll use this date as our key for the MERGE statement. We can run this query as many times as we like and it will always just keep our “dim_time” table up to date.

merge into dim_time using (select * from
      table(applib.fiscal_time_dimension (2000, 1, 1, 2010, 12, 31, 3))) src
on dim_time.day_from_julian = src.day_from_julian
when matched then UPDATE set
FISCAL_YEAR_END_DATE=src.FISCAL_YEAR_END_DATE
,FISCAL_YEAR_START_DATE=src.FISCAL_YEAR_START_DATE
... ABBREVIATED ...
,TIME_KEY=src.TIME_KEY
,TIME_KEY_SEQ=src.TIME_KEY_SEQ
when not matched then INSERT
(FISCAL_YEAR_END_DATE
 , FISCAL_YEAR_START_DATE
... ABBREVIATED ...
 , TIME_KEY
 , TIME_KEY_SEQ)
values(
src.FISCAL_YEAR_END_DATE
 , src.FISCAL_YEAR_START_DATE
... ABBREVIATED ...
 , src.TIME_KEY
 , src.TIME_KEY_SEQ);

DynamoDB: Built in Time Dimension support!

DynamoDB (aka LucidDB) is not just another column store database. Our goal is being the best database for actually doing Business Intelligence; while that means being fast and handling large amounts of data there’s a lot of other things BI consultant/developers need. I’ll continue to post about some of the great BI features that DynamoDB has for the modern datasmiths.

First feature to cover that’s dead easy, is the built in ability to generate a time dimension, including a Fiscal Calendar attributes. If you’re using Mondrian (or come to that, your own custom SQL on a star schema) you need to have a time dimension. Time is the most important dimension! Every OLAP model I’ve ever built uses one! It something that you, as a datasmith will need to do with every project; that’s why we’ve built it right into our database.

Here’s a dead simple way to create a fully baked, ready to use Time Dimension to use with Mondrian.

-- Create a view that is our time dimension for 10 years, with our
-- Fiscal calendar starting in March (3)
create view dim_time as select * from
table(applib.fiscal_time_dimension (2000, 1, 1, 2009, 12, 31, 3));

OK, that’s it. You’ve created a Time Dimension! * see NOTE at end of post.

So, we’ve created our time dimension, complete with a Fiscal calendar for 10 years in a single statement! Awesome – but what does it contain?

-- Structure of new time dimension
select "TABLE_NAME", "COLUMN_NAME", "DATATYPE" from sys_root.dba_columns
where table_name = 'DIM_TIME';
+-------------+---------------------------------+-----------+
| TABLE_NAME  |           COLUMN_NAME           | DATATYPE  |
+-------------+---------------------------------+-----------+
| DIM_TIME    | FISCAL_YEAR_END_DATE            | DATE      |
| DIM_TIME    | FISCAL_YEAR_START_DATE          | DATE      |
| DIM_TIME    | FISCAL_QUARTER_NUMBER_IN_YEAR   | INTEGER   |
| DIM_TIME    | FISCAL_QUARTER_END_DATE         | DATE      |
| DIM_TIME    | FISCAL_QUARTER_START_DATE       | DATE      |
| DIM_TIME    | FISCAL_MONTH_NUMBER_IN_YEAR     | INTEGER   |
| DIM_TIME    | FISCAL_MONTH_NUMBER_IN_QUARTER  | INTEGER   |
| DIM_TIME    | FISCAL_MONTH_END_DATE           | DATE      |
| DIM_TIME    | FISCAL_MONTH_START_DATE         | DATE      |
| DIM_TIME    | FISCAL_WEEK_NUMBER_IN_YEAR      | INTEGER   |
| DIM_TIME    | FISCAL_WEEK_NUMBER_IN_QUARTER   | INTEGER   |
| DIM_TIME    | FISCAL_WEEK_NUMBER_IN_MONTH     | INTEGER   |
| DIM_TIME    | FISCAL_WEEK_END_DATE            | DATE      |
| DIM_TIME    | FISCAL_WEEK_START_DATE          | DATE      |
| DIM_TIME    | FISCAL_DAY_NUMBER_IN_YEAR       | INTEGER   |
| DIM_TIME    | FISCAL_DAY_NUMBER_IN_QUARTER    | INTEGER   |
| DIM_TIME    | FISCAL_YEAR                     | INTEGER   |
| DIM_TIME    | YEAR_END_DATE                   | DATE      |
| DIM_TIME    | YEAR_START_DATE                 | DATE      |
| DIM_TIME    | QUARTER_END_DATE                | DATE      |
| DIM_TIME    | QUARTER_START_DATE              | DATE      |
| DIM_TIME    | MONTH_END_DATE                  | DATE      |
| DIM_TIME    | MONTH_START_DATE                | DATE      |
| DIM_TIME    | WEEK_END_DATE                   | DATE      |
| DIM_TIME    | WEEK_START_DATE                 | DATE      |
| DIM_TIME    | CALENDAR_QUARTER                | VARCHAR   |
| DIM_TIME    | YR                              | INTEGER   |
| DIM_TIME    | QUARTER                         | INTEGER   |
| DIM_TIME    | MONTH_NUMBER_OVERALL            | INTEGER   |
| DIM_TIME    | MONTH_NUMBER_IN_YEAR            | INTEGER   |
| DIM_TIME    | MONTH_NUMBER_IN_QUARTER         | INTEGER   |
| DIM_TIME    | MONTH_NAME                      | VARCHAR   |
| DIM_TIME    | WEEK_NUMBER_OVERALL             | INTEGER   |
| DIM_TIME    | WEEK_NUMBER_IN_YEAR             | INTEGER   |
| DIM_TIME    | WEEK_NUMBER_IN_QUARTER          | INTEGER   |
| DIM_TIME    | WEEK_NUMBER_IN_MONTH            | INTEGER   |
| DIM_TIME    | DAY_FROM_JULIAN                 | INTEGER   |
| DIM_TIME    | DAY_NUMBER_OVERALL              | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_YEAR              | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_QUARTER           | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_MONTH             | INTEGER   |
| DIM_TIME    | DAY_NUMBER_IN_WEEK              | INTEGER   |
| DIM_TIME    | WEEKEND                         | VARCHAR   |
| DIM_TIME    | DAY_OF_WEEK                     | VARCHAR   |
| DIM_TIME    | TIME_KEY                        | DATE      |
| DIM_TIME    | TIME_KEY_SEQ                    | INTEGER   |
+-------------+---------------------------------+-----------+

-- Let's look at a few rows
select time_key_seq, time_key, yr, month_number_in_year, fiscal_year
, fiscal_month_number_in_year from dim_time;
+---------------+-------------+-------+-----------------------+--------------+------------------------------+
| TIME_KEY_SEQ  |  TIME_KEY   |  YR   | MONTH_NUMBER_IN_YEAR  | FISCAL_YEAR  | FISCAL_MONTH_NUMBER_IN_YEAR  |
+---------------+-------------+-------+-----------------------+--------------+------------------------------+
| 1             | 2000-01-01  | 2000  | 1                     | 2000         | 11                           |
| 2             | 2000-01-02  | 2000  | 1                     | 2000         | 11                           |
| 3             | 2000-01-03  | 2000  | 1                     | 2000         | 11                           |
| 4             | 2000-01-04  | 2000  | 1                     | 2000         | 11                           |
| 5             | 2000-01-05  | 2000  | 1                     | 2000         | 11                           |
| 6             | 2000-01-06  | 2000  | 1                     | 2000         | 11                           |
| 7             | 2000-01-07  | 2000  | 1                     | 2000         | 11                           |
| 8             | 2000-01-08  | 2000  | 1                     | 2000         | 11                           |
| 9             | 2000-01-09  | 2000  | 1                     | 2000         | 11                           |
| 10            | 2000-01-10  | 2000  | 1                     | 2000         | 11                           |
+---------------+-------------+-------+-----------------------+--------------+------------------------------+

Generating the Time Dimension is accomplished using DynamoDBs ability to include Java based UDF Table Functions. Table functions are really powerful – they allow a BI developer to write custom functions that output a “table” that can be queried like ANY OTHER TABLE (mostly). Check out the wiki page FarragoUdx if your interested.

And of course: download LucidDB and give it a whirl!

NOTE: To be candid, doing it as a view isn’t the best approach. For anything beyond tiny (5 million +) we should actually create the table, and do an INSERT INTO SELECT * FROM TABLE(fiscal_time_dimension).

Book Review: Pentaho Reporting 3.5 for Java Developers

I have two customers that if they had access to Will Gormans book, Pentaho Reporting 3.5 for Java Developers, they would not have needed me for their project! That’s how good the book is for those who need to embed Pentaho Reporting into their Java application.

The book is certainly geared towards Java developers, and specifically, developers you are trying to simply use the Pentaho reporting library. I’d venture to say that MOST customers should be using Pentaho; in this case, the book is useful as a reference, but the HOWTO past Chapter 3 would probably be lost on many users; except for Chapter 11 (see below).

However, for people trying to embed Pentaho reporting, WOW: THIS IS THE DEFINITIVE RESOURCE. Buy it, RIGHT NOW! The information it contains was locked in just a few peoples minds (Thomas, Bunch of People sitting at the “citadel” in Orlando aka Pentaho Employees, a handful of consultants). Will has unlocked it and I’m glad he did.

Will taught me something new in this book. In fact, I hope this is “new” in 3.5 which was release just a few weeks back. If it’s been around longer than I’m a total dolt. Chapter 11 covers how to add your own custom Expressions/Formulas to Pentaho (including the PRD).

At customer engagements, or when I put on my Pentaho hat and teach their public courses, or custom onsite training, I’m asked all the time: Can I make my own Reporting Functions and plug them into Pentaho Report Designer? Up until WIll showed me how to do it on page 281, I thought this was only possible for Pentaho (the company). Will gives us a step by step guide to add our own “DoMyCustomThing” to the Pentaho Report Designer. Customers can now create their own corporate expressions/functions they can leverage across hundreds of reports.

I’ll keep several copies on my shelf, and give it away to any current/future “embedded Pentaho Reporting” customers. Thanks Will for such a great book!

Dreamhost Uptime Numbers are TERRIBLE!

I don’t care what their marketing stats say, I have my own indepedent verification. I’ve been using Wormly for quite a while monitoring some of my demo sites, and other services that are part of Bayon and part of Dynamo. Since I was already paying for it, I figured I’d turn it loose on this blog (nicholasgoodman.com) and see what the uptime was like.

I always thought Dreamhost was a little skiddish, and my email box finds approximately one email per day with a failure, but i figured they were small, single request failures. Nope. The independent measuring of the uptime of this blog is a CRUDDY, CRAPPY, 97.6%.

200911042116
That’s pathetic! My blog is nothing special, an out of the box WordPress installation backed by their MySQL. I haven’t done any of my own installations, customizations (excepting a theme) and yet my blog uptime is awful. I’ve liked the dreamhost panel; it gives the “technical but uninterested in actually administering their own server” user a lot of power and I’d be willing to tolerate a little downtime (truthfully, anything above 99.5% is OK with me). But 97% uptime? Shyeah… Time to start looking.

Anyone have any suggestions for good WordPress / PHP / MySQL hosts? WIlling to pay top dollar and I’ll bring with me registrations for about 25 domains.

Instant Relief from MySQL Reporting Queries: Incremental Updates

Yesterday, I covered how you can do an initial “replication” of data from MySQL to DynamoDB and how this can improve performance, and save storage space. The follow on question becomes:

That’s Great Nick. But how do I do keep my data up to date?

We’ve got data in our Airline Performance dataset through 31-DEC-2007. I loaded 1 year, all of 2007, for the previous example. What happens when the FAA publishes their 2008 January results, and we’ve loaded the new months worth of data into MySQL?

MySQL:

select count(*) from otp.ontime; 8061223
select count(*) from ontime where FlightDate > ‘2007-12-31′; 605765
select count(*) from ontime where FlightDate <= ‘2007-12-31′; 7455458

DynamoDB:

select count(*) from FASTER.”ontime”; 7455458

So, we’ve added approximately 600k new records to our source system that we don’t have in our reporting system. How do we incrementally insert these records and get just the 600k new rows into our DynamoDB reporting instance?

Easy Easy Easy.

We’ve already done all the work, all we have to do is simply get records we haven’t processed yet! Should take just a few minutes to get our current table “up to date” with the one over in MySQL.

DynamoDB:

select max(“FlightDate”) from FASTER.”ontime”; 2007-12-31
insert into FASTER.”ontime” select * from MYSQL_SOURCE.”ontime” where “FlightDate” > DATE ‘2007-12-31′; 605765

In other words, let’s select from MySQL any records whose date is beyond what we have currently (2007-12-31).

select count(*) from FASTER.”ontime”; 8061223
select count(*) from FASTER.”ontime” where “FlightDate” > DATE ‘2007-12-31′; 605765

MySQL:
While the DynamoDB INSERT statement was running, the following SQL was being run on MySQL.

show processlist shows a SQL session with the following SQL:
SELECT * FROM `ontime` WHERE `FlightDate` > DATE ‘2007-12-31′;

A single SQL statement (insert into select * from table where date > last time) has you up to date for reporting! Long term we may look to work with Tungsten to be able to keep our data up to date using replication bin log records but for now, this simple pull based approach.

Instant Relief from Slow MySQL Reporting Queries using DynamoDB

Here’s the scenario. You’ve got a table in MySQL for reporting that has a few million rows, and is denormalized for reporting. You’ve got a Pentaho Report that is querying this MySQL table. You have two problems with the current report.

  1. Your users are complaining that the query is slow, and they have to wait around for longer than they’d like to see their report. (approx 40s)
  2. Your DBAs are cranky because they see the size of this table is getting bigger. (approx 1.8GB)

MySQL is fundamentally designed to be an OLTP database and while it does a fantastic job at that, its data warehouse features were built as “bolt on” additions. Can it be used for BI? Absolutely, I’ve used it a many customer sites. Does DynamoDB provide a better set of features/capabilities for doing BI? We think so! Are they both 100% open source? You bet;why not choose the right tool for the right job then?

DynamoDB (aka LucidDB) is a “purpose built for BI” database. What does that mean? Well, I’ll be blogging about a lot of features that speak to our philosophy of a complete “BI Database” not just a fast one. One of the features that makes LucidDB complete, and not just a drag racer, is its ability to connect to remote data sources via JDBC and retrieve data. If you’re doing simple table replications, you don’t have to use an ETL tool, or do export or imports, or LOAD DATA INFILEs, etc. Our ability to connect to remote databases and access them as “remote tables” makes retrieving data into DynamoDB as easy as “insert into mytable select * from remote_table.”

Back to our original issue with our current MySQL

Our report is slow, and our database is big. How slow? Well, not really that bad, but at about 40s per query run that’s enough to tempt your business analyst to go fetch a coffee instead of continuing his work. How big? Well, not really that big, but at about 1.8GB it’s starting to get non trivial in terms of tuning the I/O etc.

Our goal is to improve both using DynamoDB; we’ll leave MySQL as our main OLTP application. We’re not trying to replace it – in fact, we’ll embrace MySQL as the system of record and simply “slurp/report” off this table in a separate reporting environment.

It’s a two step process.

  1. Connect from DynamoDB to MySQL using a JDBC connector, access the remote table, and draw over the data using a simple INSERT statement.
  2. Change our Pentaho Report to use the DynamoDB JDBC connector instead of MySQL.

Our Pentaho Report is based on the following SQL

SELECT t.Carrier as “CARRIER”,
c as “C”, c2 as “C2″, c*1000/c2 as “C3″ FROM
(SELECT Carrier, count(Carrier) AS c FROM ontime
WHERE DepDelay>10 GROUP BY Carrier) t JOIN
(SELECT Carrier, count(Carrier) AS c2 FROM ontime
GROUP BY Carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;

200911022216
This takes approximately 40s to run on MySQL database running the same machine.

Step 1: Connect, and load the data into our DynamoDB table.

– Create DynamoDB reporting table first
create schema faster;

create table faster.”ontime” (
“Year” int,
“Quarter” tinyint ,
“Month” tinyint ,
….. Abbreviated for Brevity ….
“Div5TailNum” varchar(10)
);

— Get access the MySQL table OnTime in the OTP schema on host localhost
create schema MYSQL_SOURCE;
set schema ‘MYSQL_SOURCE';

CREATE SERVER MYSQL_REMOTE_SOURCE FOREIGN DATA WRAPPER
sys_jdbc OPTIONS (
driver_class ‘com.mysql.jdbc.Driver’,
url ‘jdbc:mysql://localhost/otp?useCursorFetch=true’,
user_name ‘root’,
password ‘easy’,
fetch_size ‘1000’,
table_types ‘TABLE’,
schema_name ‘otp’);

import foreign schema OTP from server MYSQL_REMOTE_SOURCE into MYSQL_SOURCE;

— Load DynamoDB table from MySQL database directly
insert into FASTER.”ontime” select * from MYSQL_SOURCE.”ontime”;

Notice that last statement. You don’t have to export to intermediate files, or use an ETL tool (not that that’s bad, I’m a big fan of ETL tools!). You can use good old fashioned SQL to get data from a remote database into DynamoDB.

Step 2: Change the Pentaho Report to use the new connection.

We open up our report and change our connection from MySQL

200911022234
to DynamoDB

200911022236
NOTE: Until we finish our QA’ed builds we’re using LucidDB driver instead of DynamoDB but they are, one and the same.

We make some minor adjustment to the SQL (quoting some tables/etc) and rerun our query and Voila, our report runs in 10s down from 40s, an improvement of 400%.

How about storage? Our storage report shows that DynamoDB is using only .3 GB to store the same 7 Million records as compared to MySQL at 1.8GB, or 1/6 of the storage.

Not a bad investment of a few minutes of time, I’d say. DynamoDB (LucidDB) takes just a few minutes to install, and because of its focus on BI you should find things like retrieving data from remote data sources easy, and effective. Let’s be truthful here as well; once you speed up a report by 400% and reduce its storage by 6x your boss will be calling you a dynamo.

Notes: Full set of scripts posted here: mysql_relief.zip. Original queries and dataset from Vadim at MySQLPerformanceBlog.