Well, I noticed

June 26th, 2007

My workday goes much smoother because I listen to a variety of Online Radio stations. Today they all went silent or played public awareness campaigns from SaveNetRadio.org. People have been wondering if anyone will notice. Well, like the title says: I noticed!

I don’t know all the mechanics, but it comes to this. A lot of these small, boutique-ish online radio stations will shut down because the cost structure of the compensation will be, in their opinion, unsuitable.

I, for one, being a proponent of open content, software, and standards think there must be some underlying disconnect between the “Copyright Royalty Board” and broadcasters.

These small, hobbyist online broadcasters are part of larger shift in broadcasting/economies. Web 2.0-ish user generated content and participatory systems of consumer and producer.

The CRB probably needs to take another look at what it’s doing to see if it’s just trying to hang on to old ideologies in a new world.

I support Net Radio. :) You should too.
SaveNetRadio.org

Pet Peeve: EST != EDT

June 22nd, 2007

I work with people all over the country and the world. What that means is that we often schedule meetings, calls, webex meetings, remote consulting sessions, etc. Lacking some great shared calendar in the cloud that we can use to do this adhoc (I’m sure there’s some web 2.0 startup who does this so please comment if you know of something GOOD) this means that people email and put suggested and adjusted times in emails.

For instance, just yesterday, I received the following email:

The regular 10am EST XYZ meeting tomorrow is cancelled until further notice.

What’s the issue with this email? Well, we don’t have a 10am EST meeting. We have a meeting scheduled at 10am Eastern (ie, when the clock in the eastern time zone hits 10am during the summer months).

EDT and EST have VERY SPECIFIC timezone offsets.

EDT = UTC - 4
EST = UTC - 5

I use generally, and think many others also use “Eastern” to refer ONLY to local time. Ie, what the clock on the wall says in New York regardless of EDT/EST.

Let’s take the above example:

  • 10:00 EST on June 22 (someone sends an email requesting a meeting)
  • 10:00 EST = 13:00 UTC (given the definition of EST, with an offset of -5 hours)
  • 13:00 UTC = 11:00 EDT (ie, makes sense right, 10:00 EST = 11:00 EDT)
  • 11:00 Eastern = 10:00 EST (on June 22 when New York is in EDT the actual meeting time)

Obviously you assume that someone requesting a meeting for 10am EST on day that falls on EDT was ACTUALLY requesting a meeting at 10am EDT. However, why bother doing that?

My suggestion to people that can’t keep it all straight:

Use Eastern/Pacific instead of EDT/PSTs. Eastern/Pacific is clear that it’s local time but you haven’t confused it by requesting an incorrect time.

Kettles secret in-memory database

June 20th, 2007

Kettles secret in-memory database is

  1. Not actually secret
  2. Not actually Kettles

There. I said it, and I feel much better. :)
In most circumstances, Kettle is used in conjunction with a database. You are typically doing something with a database: INSERTs, UPDATEs, DELETEs, UPSERTs, DIMENSION UPDATEs, etc. While I do know of some people that are using Kettle without a database (think log munching and summarization) a database is something that a Kettle developer almost always has at their disposal.

Sometimes there isn’t a database. Sometimes you don’t want the slowdown of persistence in a database. Sometimes you just want Kettle to just have an in memory blackboard across transformations. Sometimes you want to ship an example to a customer using database operations but don’t want to fuss with database install, dump files, etc.

Kettle ships with a Hypersonic driver, and therefore, has the ability to create an in memory database that does (most) everything you need database wise.

For instance, I’ve created two sample transformations that use this in-memory database.

The first one, kettle_inprocess_database.ktr, loads data into a simple table:
200706202230

The second one, kettle_inprocess_database_read.ktr, reads the data back from that simple table:
200706202235

To setup the database used in both of these transformations, which has no files, and is only valid for the length of the JVM I’ve used the following Kettle database connection setup.

I setup a connection named example_db using the Generic option. This is so that I have full control over the JDBC URL.
200706202227

I then head to the Generic tab and input by URL and Driver. Nothing special with the driver class, org.hsqldb.jdbcDriver that is just the regular HSQLDB driver name. The URL is a little different then usual. The URL provided tells hypersonic to use a database in-memory with no persistence, and no data fil.e”
200706202225

Ok, that means the database “example_db” should be setup for the transformations.

Remember, there is NOTHING persistent about this database. That means, every time I start Kettle the database will have no tables, no rows, no nothing. Some steps to run through this example.

  1. Open kettle_inprocess_database. “Test” the example_db connection to ensure that I / you have setup the in-memory database correctly.
  2. Remember, nothing in the database so we have to create our table. In the testing table operator, hit the SQL Button at the bottom of the editor to generate the DDL for this smple table.
  3. Run kettle_inprocess_database and verify that it loaded 10 rows into testingtable.
  4. Run kettle_inprocess_database_read and verify that it is reading 10 rows from the in-memory table testingtable.

I should note that using this approach isn’t always a good idea. In particular there’s issues with memory management, thread safety, it definitely won’t work with Kettles clustering features. However, it’s a simple easy solution for some circumstances. Your mileage may vary but ENJOY!

Kettle ETL Jobs over the Internet

May 28th, 2007

Kettle recently switched over from accessing the file system via standard Java libraries to using the Apache VFS libraries. So what does that mean? Why does that matter? Well, it really opens up a whole slew of deployment options, and provides even more options for managing Kettle code. You can now manage your actual ETL jobs and transforms in Zipfiles, on Web Servers, FTP servers, WebDav locations, etc. Basically, you can making Kettle even thinner than it is right now.

For instance, you can now run Kettle ETL jobs with ONLY the base Kettle installation and a remote URL. Consider the following example.

I have two Kettle jobs:

http://www.nicholasgoodman.com/kettle/parentjob.kjb
200705281727

-AND -

http://www.nicholasgoodman.com/kettle/subjob.kjb
200705281729

The parent job, does something very simple: It executes the subjob.kjb using relative addressing. You can use the well known and supported ${Internal.Job.Filename.Directory} variable so that you don’t have to hard code the physical location of another jobs/transform/datafile/etc. In this case, it doesn’t matter if the subjob is on the local file system or on a webserver. The dialog to setup the relative addressing looks like this:

200705281733

Now, using nothing but the standard Kettle 2.5.0 download I can execute these two jobs without any client side ETL Jobs or Transforms.

./kitchen.sh -file=http://www.nicholasgoodman.com/kettle/parentjob.kjb

17:42:30,075 INFO [Kitchen] Kitchen - Start of run.
17:42:30,644 INFO [Kettle] Kettle - Reading repositories XML file: /Users/ngoodman/.kettle/repositories.xml
17:42:30,647 ERROR [Kettle] Kettle - Error opening file: /Users/ngoodman/.kettle/repositories.xml : java.io.FileNotFoundException: /Users/ngoodman/.kettle/repositories.xml (No such file or directory)
ERROR: No repositories defined on this system.
2007/05/28 17:42:30:695 PDT [INFO] DefaultFileReplicator - Using “/tmp/vfs_cache” as temporary files store.
17:42:31,991 INFO [Thread[parentjob (parentjob (Thread-2)),5,main]] Thread[parentjob (parentjob (Thread-2)),5,main] - Sleeping: 0 minutes
17:42:31,992 INFO [parentjob] parentjob - Starting entry [subjob]
17:42:32,133 INFO [Thread[subjob (subjob (Thread-3)),5,main]] Thread[subjob (subjob (Thread-3)),5,main] - Sleeping: 0 minutes
17:42:32,134 INFO [subjob] subjob - Starting entry [Dummy]
17:42:32,135 INFO [subjob] subjob - Finished jobentry [Dummy] (result=true)
17:42:32,233 INFO [parentjob] parentjob - Starting entry [Dummy]
17:42:32,234 INFO [parentjob] parentjob - Finished jobentry [Dummy] (result=true)
17:42:32,234 INFO [parentjob] parentjob - Finished jobentry [subjob] (result=true)
17:42:32,235 INFO [Kitchen] Kitchen - Finished!
17:42:32,235 INFO [Kitchen] Kitchen - Start=2007/05/28 17:42:30.630, Stop=2007/05/28 17:42:32.235
17:42:32,235 INFO [Kitchen] Kitchen - Processing ended after 1 seconds.

Just another nice feature that allows even more interesting ways to manage a deployment of ETL jobs and Transforms. Great work Kettle team!

Roland Bouman on BI4DBA

April 25th, 2007

Real time. Roland has built a cool Pentaho application that collects data from MySQL performance tables and metadata views and shows reports on this performance, errors, etc. Cool stuff for DBAs.

Picture, total crap from the mobile
04252007

MySQL Archive Tablespace for FACTs

April 18th, 2007

I’m visiting a Pentaho customer right now whose current “transaction” volume is 200 million rows per day.  Relatively speaking, this puts their planned warehouse in the top quintile of size.  They will face significant issues with load times, data storage, processing reliability, etc.  Kettle is the tool they selected and it is working really well.  Distributed record processing using Kettle and a FOSS database is a classic case study for Martens scale out manifesto

This organization doesn’t have unlimited budget.  Specifically, they don’t have a telecom type budget for their telecom like volume of data.  One of the issues that has come up with their implementation has been the tradeoff between space, and keeping the base level fact records.  For example, at 200 million / day and X bytes per fact you start to get into terabytes of storage quickly.  It was assumed, from the start of the project, only summary level data could be stored for any window of time exceeding 10 days or so. 

The overall math is sound. 

Size per record (S) x number of records per day (N) = size per day of data growth (D)

In this equation, there’s really not much we can do, if we want to keep the actual base level transaction, about the number of records per day.  N becomes a fixed parameter in this equation.  We do have some control over the S value, which is mostly about what this blog is about.

Can we reduce the size of S by such an amount that D becomes more realistic?  The answer is the ARCHIVE engine in MySQL.

I created the DDL for a typical fact record.  Few Dimension IDs and a few of numeric values.

CREATE TABLE test_archive
(
  DIM1_ID INT
, DIM2_ID INT
, DIM3_ID INT
, DIM4_ID INT
, DIM5_ID INT
, NUMBER1 FLOAT
, NUMBER2 FLOAT
, NUMBER3 FLOAT
, NUMBER4 FLOAT
) engine=ARCHIVE
;

I did this in MyISAM as well

CREATE TABLE test_myisam
(
  DIM1_ID INT
, DIM2_ID INT
, DIM3_ID INT
, DIM4_ID INT
, DIM5_ID INT
, NUMBER1 FLOAT
, NUMBER2 FLOAT
, NUMBER3 FLOAT
, NUMBER4 FLOAT
) engine=MyISAM
;

I used a simple Kettle transformation to populate both the these tables with 1 million rows of data. It wasn’t scientific but the INSERT performance of the Archive and MyISAM tables were very similar (within 10% of the throughput of rows).

So now we have a million rows, with a reasonable FACT record format.  How much space do these 1 million rows require to store?

+————–+———+—————+————+
| table_name | engine | total_size_mb | table_rows |
+————–+———+—————+————+
| test_archive | ARCHIVE | 2.63 | 1000000 |
| test_myisam | MyISAM | 36.24 | 1000000 |
+————–+———+—————+————+

The Archive table uses an order of magnitude LESS space to store the same set of FACTS.  What about query performance?  I’ll have to do another blog on a more scientific approach but the anecdotal query performance on typical OLAP queries (select sum() from fact group by dim1_id) seemed related (less than a 15% difference). 

Let’s be clear here, one MUST have a good aggregate table strategy so the fact records are RARELY accessed because the performance will not rock your world.  However, this is the case with these volumes anyhow.  Summaries and Aggregates have to be built for the end user to have a good experience on the frond end.

Archive engine is strange. INSERT/SELECT only.  You can’t do ANYTHING with it, except drop and recreate it.  For historical fact loads on “settled” DW data segmented on a daily basis this is usually not an issue.  No transactions need, no DML needed.  I’d also like to see how this compares to “packed” MyISAM tables at some point.  I’m guessing I wouldn’t expect to see it beat the compression in terms of storage space, but there’s some nice things you can do with MyISAM tables (for instance, MERGE).

Back of the napkin calculation to handle BIG data volume:

2.63 MB / million * 200 per day = 526 MB / day
526 MB / day * 365 days = 187.5 GB

Anyone else have fun experiences to share with the Archive engine?

Side Note:
  I did all this testing on a VNC Server enabled EC2 instance I use for testing.  Fire it up, VNC in, do my testing, shut it down.  I LOVE EC2!

Why I don’t have a .sig on email

April 11th, 2007

One of my pet peeves is an email thread that grows 100 lines with every “Sounds good to me” reply. You know what I’m talking about.

10 screens of text, with about 1 screen of actual content/conversation.

All these logos and titles, fax numbers, clever logans and sayings, etc. AHHHH….

It’s a networked world, it doesn’t have to be on EVERY SINGLE EMAIL RESPONSE. If you want to get in touch with me, you can google me and immediately find my site, etc.

I’m Twitter’ed, LinkedIn, YahooMessenger, AIM, MSN, skype, etc. I’m easy to get a hold of, you don’t need to have 10 copies of ALL MY CONTACT INFO in an email.

Why “web 2.0″ works

April 2nd, 2007

There’s an intersection of value at
a) products that are web service and plugin enabled
b) companies that provide interesting “net effect” services

For instance, tonight my hosting provider, Dreamhost, emailed and said that my feed was being hit so extensively that I was causing service interruptions on the other accounts on the machine. 

First, to the other sites, and I have no idea who you are but I’m sorry!

Second, I was able to leverage a web 2.0 - ish service and plugin to instantly alleviate the pain (with a cached redirected version of my feed) and now I’m getting some cool extra stats on my RSS (who/how/what).  All in < 30 minutes.

Because my blogging software works in the networked world, things happen easier and more naturally than me having to hack around a bunch of code special scripting/patching on my website.  Plugins/Service/etc.  It’s a grand new world.

PS - I guess I’m over that whole “bloggers block” thing.  Spouting out useless crap on my blog again.

Meet me in San Francisco: Pentaho Training

March 26th, 2007

I am going to be the instructor for the Operational Business Intelligence course in April.  This particular course digs into the reporting tools, scheduling, processes, design, etc.  Basically, everything you need to build a reporting-centric solution using Pentaho.

Operational Business Intelligence - San Francisco, CA

303 Twin Dolphins Drive
Suite 600
Redwood City, CA  94065
USA
Monday, April 23, 2007 - Thursday, April 26, 2007

I’m curious: if you signup for the course having learnt about it here, put in the notes that was the case.  I’ll bring a special Pentaho shwag gift to anyone who signs up from this blog readership in the next two weeks!

PS - I know I know.  Redwood City isn’t technically SFO but … whatever!

Honest Reflection: Am I done blogging?

March 20th, 2007

Has anyone ever experienced “Bloggers Block?”  You know, where you have plenty of things to write about, but are unable to select a topic and put some metaphorical pen to paper?  I’ve been experiencing this since my return from extended holiday.

I arrived back and started work again at Pentaho.  There’s been a bunch of developments at Pentaho (more open source features, key customer successes, etc), there’s been some interesting open source moves (licenses, alliances, etc), some interesting BI moves (vertica, hyperion acquisition, etc).  Basically, I have a Blog Backlog of probably 25 or so juicy, page plusers.  Things I would dig into, opine about, and hopefully help disseminate some useful information.

So… What’s the problem? 

I just can’t.  I’ve sat down and I look at my blog client and I just can’t seem to pick one, and write it.  None of it seems worthy enough to be the first, or most interesting, or … Pick any reason, I’ve come up with it for why not to write on a subject. 

Now that I’ve committed the cardinal sin of Apologetic Blogging (I aboslutely HATE reading blog posts entitled:  Sorry I haven’t posted in a while) perhaps I should just turn it in.  In fact, with so many people blogging these days it’s almost refreshing to hear “I don’t blog anymore.” 

So I ask myself the question honestly: Am I done blogging?  Is it worth the time?  Is it worthwhile?  Do I receive enough enjoyment from the writing to continue?

I don’t know.  However, implicitly my next (possible) blog will answer that question, yes?