For those that weren’t able to attend the fantastic NoSQL Now Conference in San Jose last week, but are still interested in the slides about how people are doing Ad Hoc analytics on top of NoSQL data systems, here’s my slides from my presentation:
Category Archives: Open Source
NoSQL Now 2011: Review of AdHoc Analytic Architectures
PDI Loading into LucidDB
By far, the most popular way for PDI users to load data into LucidDB is to use the PDI Streaming Loader. The streaming loader is a native PDI step that:
- Enables high performance loading, directly over the network without the need for intermediate IO and shipping of data files.
- Lets users choose more interesting (from a DW perspective) loading type into tables. In particular, in addition to simple INSERTs it allows for MERGE (aka UPSERT) and also UPDATE. All done, in the same, bulk loader.
- Enables the metadata for the load to be managed, scheduled, and run in PDI.
However, we’ve had some known issues. In fact, until PDI 4.2 GA and LucidDB 0.9.4 GA it’s pretty problematic unless you run through the process of patching LucidDB outlined on this page: Known Issues.
In some ways, we have to admit, that we released this piece of software too soon. Early and often comes with some risk, and many have felt the pain of some of the issues that have been discovered with the streaming loader.
In some ways, we’ve built an unnatural approach to loading for PDI: PDI wants to PUSH data into a database. LucidDB wants to PULL data from remote sources, with it’s integrated ELT and DML based approach (with connectors to databases, salesforce, etc). Our streaming loader “fakes” a pull data source, and allows PDI to “push” into it.
There’s mutliple threads involved, when exceptions happen users have received cruddy error messages such as “Broken Pipe” that are unhelpful at best, frustrating at worse. Most all of these contortions will have sorted themselves out and by the time 4.2 GA PDI and 0.9.4 GA of LucidDB are released the streaming loader should be working A-OK. Some users would just assume avoid the patch instructions above and have posed the question: In a general sense, if not the streaming loader how would I load data into LucidDB?
Again, LucidDB likes to “pull” data from remote sources. One of those is CSV files. Here’s a nice, easy, quick (30k r/s on my MacBook) method to load a million rows using PDI and LucidDB:
This transformation outputs to a Text File 1 million rows, waits for that to complete then proceeds to the load that data into a new table in LucidDB. Step by Step the LucidDB statements
— Points LucidDB to the directory with the just generated flat file
— LucidDB has some defaults, and we can “guess” the datatypes by scanning the file
CREATE or replace SERVER csv_file_server FOREIGN DATA WRAPPER SYS_FILE_WRAPPER OPTIONS ( DIRECTORY ‘?’ );
— Let’s create a foreign table for the data file (“DATA.txt”) that was output by PDI
>create foreign table applib.data server csv_file_server;
— Create a staging, and load the data from the flat file (select * from applib.data)
CALL APPLIB.CREATE_TABLE_AS (‘APPLIB’, ‘STAGING_TABLE’, ‘select * from applib.data’, true);
We hope to have the streaming loader ready to go in 0.9.4 (LucidDB) and 4.2 (PDI). Until then, consider this easy, straight forward method of loading data that’s high performance, proven, and stable for loading data from PDI into LucidDB.
Example file: csv_luciddb_load.ktr
Pushdown Query access to Hive/Hadoop data
Just in time for the Hadoop Summit, we’ve updated some pieces of LucidDB to be able to do “more” with the Hive driver. We’ve had the ability to connect and query data inside of Hive for about a year already. The initial implementation allowed people to:
- Use Hive for it’s massive scalability, distributed data processing capabilities.
Hive is great at processing huge amounts of data. It’s scales to hundreds of servers, and has a bunch of fantastic features for structured and semi structured data processing, fault tolerance, etc. Hive is a great way to do heavy lifting sorting through petabytes of data to arrive at some interesting, pre-aggregated datasets. - Cache the output of Hive views into LucidDB.
Now when we’re talking about taking the output of Hive views into LucidDB, we’re not talking about SMALL datasets (10k rows) we’re talking about 50 or 100, or 500 million rows. Some might think that number is small (by Hive standards it often is) and others might think that’s big (our entire DW is only 200 million rows). However, LucidDB has provided the ability to draw in data from Hive via easy MERGE/INSERT statements.
You can see some background on this integration that has been functional since August 2010 on Eigenpedia: http://pub.eigenbase.org/wiki/LucidDbHive
Also, a video I recorded last year showing the basic integration: (THIS IS NOT NEW!!!!).
Why the blog now? We’ve done a couple of things over the past while.
- We’ve done some work on LucidDB (yet to be committed and will be a POST 0.9.4 commit) that allows the use of Hives, well, somewhat unique driver. Hive’s driver has a bunch of quirks in terms of metadata, etc that we’re now recognizing and using properly over in LucidDB.
- We’ve updated to the 0.7.0 release. We’re now ready to go with the latest and great Hive features.
- We’ve enabled some pushdowns to work to allow for easier day to day loading of LucidDB tables from Hive, along with a limited workload of Ad Hoc SQL access.
Our vision for Big Data dictates the need for:
- Live, real time, per query access to the Big Data system that is useful and practical (ie, filters, etc).
This means that you need to be able to allow the user, via simple Parameter or simply by hitting a different schema or table access to the live data.
- Easy, full ANSI SQL access to high performance, low latency, aggregated data.
Dashboards need results that come back in seconds, not minutes. LucidDB and the data cached there provide a great “front end” for easily doing real BI work on top of data that sits inside Hive.
We’ve updated our connectors to allow some filtering/projection pushdowns to work with Hive.
Here’s a simple example. We have a report or dashboard which is looking for only a subset of data in Hive. We want to allow the filtering of data to occur and for Hive to receive the filtering from our OLAP/Dashboard.
By default LucidDB will read the entire table and do all SQL functions over in our
However, pulling over the entire table is really not going to work well for us. This would really be the worst of both worlds; you’d be better off just querying Hive directly. However, luckily we’ve enabled some pushdowns be pushed down to Hive.
Notice that the condition IN( values ) is being pushed down to the remote Hive server.
Let’s try something a bit more complex!
Currently, we’re able to push down most filters and projections.
Let’s take now take the use case where we’re trying to ONLY UPDATE records that have been updated since the last time we checked (ID > 97). More likely the key that we’d use to do this push down filter would be a date, but you can simply use your imagination.
Consider the following SQL:
merge into hive_test.”local_POKES” l
using (select * from hive_test.pokes where “foo” > 97)
ON l.”foo” = p.”foo”
when matched then update set “bar” = p.”bar”
when not matched then insert (“foo”, “bar”) values (p.”foo”, p.”bar”);
This SQL is a typical “incremental” load from a remote system. Syntactically a little dense, but it’s actually a VERY high performance method to load directly into LucidDB often eliminating the need entirely to draw the data through an intermediate server and process (ETL Tool).
Our enhancements allow the Hive portion to be pushed down. Hive will ONLY return values greater than 97 and we’ll simply intelligently keep any changed records “up to date” in LucidDB for reporting.
Many of these changes will be in a patched version of LucidDB; we’ll make this patched release available to any customers who want these optimizations available, immediately for use with Hive. Let us know what you think by joining this conversation at the LucidDB forums: Hive Connector Pushdown
In a subsequent blog we’ll cover how to now match up data coming from Hive (or CouchDB) with data in other systems for reporting.
0.9.4 did not hit the 1 year mark!
Our last LucidDB release was now, just more than 12 months ago on June 16, 2010. We were really really trying to beat the 1 year mark for our 0.9.4 release but we just couldn’t. A tenet of good, open source development is early and often and we need to do better. Since the 0.9.3 release we’ve:
- Built out an entire Web Services infrastructure
- Developed a wicked cool Admin user interface
- Developed cool connectors to Hive, CouchDB
- Built a whole ton of extensions (auto indexing, DDL generation, improved load routines)
- Scriptable functions, and procedures
- Updated our connectors (JDBC, Salesforce, etc)
All in this is a VERY exciting release… I apologize it’s taken this long, but please bear with us. We’ll be release in the next couple of weeks!
Why OLAP4J 1.0 matters
Julian Hyde and his cohorts on the Mondrian project have been busy at work for nearly 5 years (spec 0.5 done in 2006!) working on the difficult, but worthwhile effort of standardizing client side access to OLAP in Java.
They just released version 1.0! This is a big deal; bigger players have attempted and failed at this before (ahem JOLAP). Kudos to Julian, Luc and the others involved to get such a *real* standard in place!
There’s a few reasons why this matters to everyone in Business Intelligence. Not just Java devs and Open Source BI fans.
- Only existing “de facto” standards are owned by MSFT
XML/A was touted as **the** industry standard for OLAP client server communications. You can think of XML/A as the SOAP equivalent of OLAP client libraries. There are a few problems with this.First is that MSFT always treated this like they do all “open” standards; just open enough to get what they need out of it (SQUASH JOLAP) but never really open. For instance, reading the spec, notice that the companies involved specifically note that they all absolutely reserve the right to enforce their patent rights on their technology, EVEN IF it’s part of the spec. ie, it’s open, but if you actually IMPLEMENT it you might have to pay MSFT for it.
Second is that XML/A is now a fragmented standard. Similar to SQL, MDX support and other line protocol extensions (ahem, Binary Secured XML/A) means that there’s no one really making any sort of technology toolkit, collection of drivers, etc. Simba does much of this in their lab in Vancouver, but they’re the exact opposite of open. In fact, when the XML/A council vanished they pounced and picked up the site which is now a simple shill for their products. A couple of guys at a single company without any open publication on variations in MDX/implementations is counterproductive to real interoperability.
Third is that SOAP is soooooo 1999. SOAP is fundamental in XML/A and there are many interesting (Saiku) ways of serving client server. REST, direct sockets, in memory, etc.
- Helps keep Mondrian from being fused to Pentaho Analyzer
Mondrian is a very successful open source project and serves as the basis (server part) of Pentaho’s Analyzer (acquired from LucidEra). Pentaho has clearly signaled their (lack of) commitment to upkeep of their open source frontends; Analyzer is proprietary software that Pentaho has committed all their OLAP UI efforts behind, leaving the community with an aging JPivot front end to Mondrian. Clearly underestimating what the community has to offer, the community has delivered a replacement project Saiku to address this.
OFF TOPIC: I’ve made several Open Source BI predictions and with the exception of Pentaho Sreadsheet Services (which technically wasn’t OSS) I’ve been right every time. Here’s one for ya: Saiku will outshine Analyzer in the next 18mos and both technologies will be worse off because Pentaho, ironically and increasingly, chose proprietary instead of community. Ahhh… I feel better having said it.
Keeping Mondrians primary exterior API as a standard helps ensure that Mondrian can not be subsumed (entirely) by Pentaho and that innovation can continue with multiple community projects doing shiny UI work on top of Mondrian.
- A single, pragmatically useful API enables binding to other languages as well (ie, non Java)
Saiku, basing their open source RESTful server on top of OLAP4J has now enabled cool mashable OLAP access to not JUST Mondrian (which was already available via SOAP/.xactions) but anyone else who creates a driver (SAP, SSAS, etc). By actually having a real project that can collect up a real open driver implementation with a few implementations means that projects like Saiku (which actually has client APIs for C, Obj-C, Ruby, ActionScript, etc).
I wouldn’t be surprised if there are others layers (ADOMD?) that leverage OLAP4J as well.
Java OLAP nerds unite! Non Java nerds checkout Saiku. Pentaho users/customers know that OLAP4J is good for keeping Mondrian open and innovative. OLAP innovation is alive and well, led by Mondrian, Saiku, Pentaho, etc. Happy MDX’ing.
LucidDB has a new Logo/Mascot
At yesterdays Eigenbase Developer Meetup at SQLstream‘s offices in San Francisco we arrived at a new logo for LucidDB. DynamoBI is thrilled to have supported and funded the design contest to arrive at our new mascot. Over the coming months you’ll see the logo make it’s way out to the existing luciddb.org sites, wiki sites, etc. I’m really happy to have a logo that matches the nature of our database - BAD ASS!
DynamoDB: Built in Time Dimension support!
DynamoDB (aka LucidDB) is not just another column store database. Our goal is being the best database for actually doing Business Intelligence; while that means being fast and handling large amounts of data there’s a lot of other things BI consultant/developers need. I’ll continue to post about some of the great BI features that DynamoDB has for the modern datasmiths.
First feature to cover that’s dead easy, is the built in ability to generate a time dimension, including a Fiscal Calendar attributes. If you’re using Mondrian (or come to that, your own custom SQL on a star schema) you need to have a time dimension. Time is the most important dimension! Every OLAP model I’ve ever built uses one! It something that you, as a datasmith will need to do with every project; that’s why we’ve built it right into our database.
Here’s a dead simple way to create a fully baked, ready to use Time Dimension to use with Mondrian.
-- Create a view that is our time dimension for 10 years, with our -- Fiscal calendar starting in March (3) create view dim_time as select * from table(applib.fiscal_time_dimension (2000, 1, 1, 2009, 12, 31, 3));
OK, that’s it. You’ve created a Time Dimension! * see NOTE at end of post.
So, we’ve created our time dimension, complete with a Fiscal calendar for 10 years in a single statement! Awesome – but what does it contain?
-- Structure of new time dimension select "TABLE_NAME", "COLUMN_NAME", "DATATYPE" from sys_root.dba_columns where table_name = 'DIM_TIME'; +-------------+---------------------------------+-----------+ | TABLE_NAME | COLUMN_NAME | DATATYPE | +-------------+---------------------------------+-----------+ | DIM_TIME | FISCAL_YEAR_END_DATE | DATE | | DIM_TIME | FISCAL_YEAR_START_DATE | DATE | | DIM_TIME | FISCAL_QUARTER_NUMBER_IN_YEAR | INTEGER | | DIM_TIME | FISCAL_QUARTER_END_DATE | DATE | | DIM_TIME | FISCAL_QUARTER_START_DATE | DATE | | DIM_TIME | FISCAL_MONTH_NUMBER_IN_YEAR | INTEGER | | DIM_TIME | FISCAL_MONTH_NUMBER_IN_QUARTER | INTEGER | | DIM_TIME | FISCAL_MONTH_END_DATE | DATE | | DIM_TIME | FISCAL_MONTH_START_DATE | DATE | | DIM_TIME | FISCAL_WEEK_NUMBER_IN_YEAR | INTEGER | | DIM_TIME | FISCAL_WEEK_NUMBER_IN_QUARTER | INTEGER | | DIM_TIME | FISCAL_WEEK_NUMBER_IN_MONTH | INTEGER | | DIM_TIME | FISCAL_WEEK_END_DATE | DATE | | DIM_TIME | FISCAL_WEEK_START_DATE | DATE | | DIM_TIME | FISCAL_DAY_NUMBER_IN_YEAR | INTEGER | | DIM_TIME | FISCAL_DAY_NUMBER_IN_QUARTER | INTEGER | | DIM_TIME | FISCAL_YEAR | INTEGER | | DIM_TIME | YEAR_END_DATE | DATE | | DIM_TIME | YEAR_START_DATE | DATE | | DIM_TIME | QUARTER_END_DATE | DATE | | DIM_TIME | QUARTER_START_DATE | DATE | | DIM_TIME | MONTH_END_DATE | DATE | | DIM_TIME | MONTH_START_DATE | DATE | | DIM_TIME | WEEK_END_DATE | DATE | | DIM_TIME | WEEK_START_DATE | DATE | | DIM_TIME | CALENDAR_QUARTER | VARCHAR | | DIM_TIME | YR | INTEGER | | DIM_TIME | QUARTER | INTEGER | | DIM_TIME | MONTH_NUMBER_OVERALL | INTEGER | | DIM_TIME | MONTH_NUMBER_IN_YEAR | INTEGER | | DIM_TIME | MONTH_NUMBER_IN_QUARTER | INTEGER | | DIM_TIME | MONTH_NAME | VARCHAR | | DIM_TIME | WEEK_NUMBER_OVERALL | INTEGER | | DIM_TIME | WEEK_NUMBER_IN_YEAR | INTEGER | | DIM_TIME | WEEK_NUMBER_IN_QUARTER | INTEGER | | DIM_TIME | WEEK_NUMBER_IN_MONTH | INTEGER | | DIM_TIME | DAY_FROM_JULIAN | INTEGER | | DIM_TIME | DAY_NUMBER_OVERALL | INTEGER | | DIM_TIME | DAY_NUMBER_IN_YEAR | INTEGER | | DIM_TIME | DAY_NUMBER_IN_QUARTER | INTEGER | | DIM_TIME | DAY_NUMBER_IN_MONTH | INTEGER | | DIM_TIME | DAY_NUMBER_IN_WEEK | INTEGER | | DIM_TIME | WEEKEND | VARCHAR | | DIM_TIME | DAY_OF_WEEK | VARCHAR | | DIM_TIME | TIME_KEY | DATE | | DIM_TIME | TIME_KEY_SEQ | INTEGER | +-------------+---------------------------------+-----------+ -- Let's look at a few rows select time_key_seq, time_key, yr, month_number_in_year, fiscal_year , fiscal_month_number_in_year from dim_time; +---------------+-------------+-------+-----------------------+--------------+------------------------------+ | TIME_KEY_SEQ | TIME_KEY | YR | MONTH_NUMBER_IN_YEAR | FISCAL_YEAR | FISCAL_MONTH_NUMBER_IN_YEAR | +---------------+-------------+-------+-----------------------+--------------+------------------------------+ | 1 | 2000-01-01 | 2000 | 1 | 2000 | 11 | | 2 | 2000-01-02 | 2000 | 1 | 2000 | 11 | | 3 | 2000-01-03 | 2000 | 1 | 2000 | 11 | | 4 | 2000-01-04 | 2000 | 1 | 2000 | 11 | | 5 | 2000-01-05 | 2000 | 1 | 2000 | 11 | | 6 | 2000-01-06 | 2000 | 1 | 2000 | 11 | | 7 | 2000-01-07 | 2000 | 1 | 2000 | 11 | | 8 | 2000-01-08 | 2000 | 1 | 2000 | 11 | | 9 | 2000-01-09 | 2000 | 1 | 2000 | 11 | | 10 | 2000-01-10 | 2000 | 1 | 2000 | 11 | +---------------+-------------+-------+-----------------------+--------------+------------------------------+
Generating the Time Dimension is accomplished using DynamoDBs ability to include Java based UDF Table Functions. Table functions are really powerful – they allow a BI developer to write custom functions that output a “table” that can be queried like ANY OTHER TABLE (mostly). Check out the wiki page FarragoUdx if your interested.
And of course: download LucidDB and give it a whirl!
NOTE: To be candid, doing it as a view isn’t the best approach. For anything beyond tiny (5 million +) we should actually create the table, and do an INSERT INTO SELECT * FROM TABLE(fiscal_time_dimension).
Book Review: Pentaho Reporting 3.5 for Java Developers
I have two customers that if they had access to Will Gormans book, Pentaho Reporting 3.5 for Java Developers, they would not have needed me for their project! That’s how good the book is for those who need to embed Pentaho Reporting into their Java application.
The book is certainly geared towards Java developers, and specifically, developers you are trying to simply use the Pentaho reporting library. I’d venture to say that MOST customers should be using Pentaho; in this case, the book is useful as a reference, but the HOWTO past Chapter 3 would probably be lost on many users; except for Chapter 11 (see below).
However, for people trying to embed Pentaho reporting, WOW: THIS IS THE DEFINITIVE RESOURCE. Buy it, RIGHT NOW! The information it contains was locked in just a few peoples minds (Thomas, Bunch of People sitting at the “citadel” in Orlando aka Pentaho Employees, a handful of consultants). Will has unlocked it and I’m glad he did.
Will taught me something new in this book. In fact, I hope this is “new” in 3.5 which was release just a few weeks back. If it’s been around longer than I’m a total dolt. Chapter 11 covers how to add your own custom Expressions/Formulas to Pentaho (including the PRD).
At customer engagements, or when I put on my Pentaho hat and teach their public courses, or custom onsite training, I’m asked all the time: Can I make my own Reporting Functions and plug them into Pentaho Report Designer? Up until WIll showed me how to do it on page 281, I thought this was only possible for Pentaho (the company). Will gives us a step by step guide to add our own “DoMyCustomThing” to the Pentaho Report Designer. Customers can now create their own corporate expressions/functions they can leverage across hundreds of reports.
I’ll keep several copies on my shelf, and give it away to any current/future “embedded Pentaho Reporting” customers. Thanks Will for such a great book!
DynamoBI: website? bits?
Well, what a soft launch it has been.
Some people have asked:
When are you going to get a website? Errr…. Soon! We soft launched a bit early, due to some “leaking information” but figured heck, it’s open source let’s let it all out. Soon enough, I swear!
Where can I download DynamoDB? Errr… you can’t yet cause we haven’t finished our build/QA/certification process.
However, since DynamoDB is the alter ego business suit wearing brother of LucidDB, just download the 0.9.2 release if you want to get a sense of what DynamoDB is.
There are 3 built binaries (Linux 32, Linux 64, and Windows 32): http://sourceforge.net/projects/luciddb/files/luciddb/luciddb-0.9.2/ and you can find installation instructions here.
DynamoDB will have the same core database, etc. So, from a raw feature/function perspective what you download and see with LucidDB will be what you get in DynamoDB. DynamoDB will have an administration UI to make things like setting up foreign servers, managing users, etc easier. And lots of other cool new features on the longer term roadmap, which if when we get a website would be a great place for that to go!
Until then, use the open source project, LucidDB. I think you’ll like it!
OpenSQLCamp 2009 with LucidDB
LucidDB is now a sponsor OpenSQLCamp 2009, courtesy of DynamoBI which doesn’t have its own logo yet!

I’ll be attending representing the LucidDB camp, since Mr. John Sichi will be running a marathon (1/2?) that weekend. Let me know if you’ll be there!





