Category Archives: Pentaho

Splunk is NoSQL-eee and queryable via SQL

Last week at the Splunk user conference in San Francisco, Paul Sanford from the Splunk SDK team demoed a solution we helped assemble showing easy SQL access to data in Splunk. It was very well received and we look forward to helping Splunk users access their data via SQL and commodity BI tools!

While you won’t find it in their marketing materials, Splunk is a battle hardened, production grade NoSQL system (at least for analytics). They have a massively scalable, distributed compute engine (ie, Map-Reduce-eee) along with free form schema-less type data processing. So, while they don’t necessarily throw their ring into new NoSQL type projects (and let’s be honest, if you’re not Open Source it’d be hard to get selected for a new project) their thousands of customers have been very happy with their ingestion, alerting, and reporting on IT data and is very NoSQL-eee.

The SDK team has been working on opening up the underlying engine, framework so that additional developers can innovate and do some cool stuff on top of Splunk. Splunk developers from California (some of whom worked at LucidEra prior) kick started a project that gives LucidDB the ability to talk to Splunk hereby enabling SQL access to commodity BI tools. We’ve adopted the project, and built out some examples using Pentaho to show the power of SQL access to Splunk.


First off, our overall approach is similar to our existing ability to talk to CouchDB, Hive/Hadoop, etc remains the same.

  • We learn how to speak the remote language (in this case, Spunk search queries) generally. This means we can simply stream data across the wire in it’s entirety and then do all the SQL
  • We enable some rewrite rules so that if the remote system (Splunk) knows how to do things (such as simple filtering in the WHERE clause, or GROUP BY stats) we’ll rewrite our query and push more to the work down to the remote system.

Once we’ve done that, we can enable any BI tool (that can do things such as SQL, read catalogs/tables, enable metadata/etc) connect up and do cool things like drag and drop reports. Here’s some examples created using Pentaho’s Enterprise Edition 4.0 BI Suite (which looks great, btw!):



These dashboards were created, entirely within a Web Browser using Drag and Drop (via Pentaho’s modeling/report building capabilities). Total time to build these dashboards was less than 45 minutes including model definition and report building (caveat: I know Pentaho dashboards inside and out).

Splunk users can now access data in their Splunk system, including matching/mashing it with others simply and easily in everyday, inexpensive BI tools.

In fact, this project came about initially as a Splunk customer wanted to do some advanced visualization in Tableau. Using our experimental ODBC Connectivity the user was able to visualize their Splunk data in Tableau using some of their fantastic visualizations.

PDI Loading into LucidDB

By far, the most popular way for PDI users to load data into LucidDB is to use the PDI Streaming Loader. The streaming loader is a native PDI step that:

  • Enables high performance loading, directly over the network without the need for intermediate IO and shipping of data files.
  • Lets users choose more interesting (from a DW perspective) loading type into tables. In particular, in addition to simple INSERTs it allows for MERGE (aka UPSERT) and also UPDATE. All done, in the same, bulk loader.
  • Enables the metadata for the load to be managed, scheduled, and run in PDI.


However, we’ve had some known issues. In fact, until PDI 4.2 GA and LucidDB 0.9.4 GA it’s pretty problematic unless you run through the process of patching LucidDB outlined on this page: Known Issues.

In some ways, we have to admit, that we released this piece of software too soon. Early and often comes with some risk, and many have felt the pain of some of the issues that have been discovered with the streaming loader.

In some ways, we’ve built an unnatural approach to loading for PDI: PDI wants to PUSH data into a database. LucidDB wants to PULL data from remote sources, with it’s integrated ELT and DML based approach (with connectors to databases, salesforce, etc).   Our streaming loader “fakes” a pull data source, and allows PDI to “push” into it.

There’s mutliple threads involved, when exceptions happen users have received cruddy error messages such as “Broken Pipe” that are unhelpful at best, frustrating at worse. Most all of these contortions will have sorted themselves out and by the time 4.2 GA PDI and 0.9.4 GA of LucidDB are released the streaming loader should be working A-OK. Some users would just assume avoid the patch instructions above and have posed the question: In a general sense, if not the streaming loader how would I load data into LucidDB?

Again, LucidDB likes to “pull” data from remote sources. One of those is CSV files. Here’s a nice, easy, quick (30k r/s on my MacBook) method to load a million rows using PDI and LucidDB:


This transformation outputs to a Text File 1 million rows, waits for that to complete then proceeds to the load that data into a new table in LucidDB. Step by Step the LucidDB statements

— Points LucidDB to the directory with the just generated flat file
— LucidDB has some defaults, and we can “guess” the datatypes by scanning the file
— Let’s create a foreign table for the data file (“DATA.txt”) that was output by PDI
>create foreign table server csv_file_server;
— Create a staging, and load the data from the flat file (select * from

We hope to have the streaming loader ready to go in 0.9.4 (LucidDB) and 4.2 (PDI). Until then, consider this easy, straight forward method of loading data that’s high performance, proven, and stable for loading data from PDI into LucidDB.

Example file: csv_luciddb_load.ktr

SQL access to CouchDB views

Following up on my first post on an alternative, more SQL-eee metadata driven approach to doing BI on Big Data, I’d like to share an example on how we can enable easy reporting on top of BIg Data immediately for CouchDB users. We’re very keen on discussing with CouchDB/Hive/other Big Data users about their Ad Hoc and BI needs; please visit the forum thread about the connector.

We’ve been working with some new potential customers on how to leverage their investment in Big Data (specifically Big Couch provided by the fine folks at Cloudant. In particular, this prospects users are thrilled with their dashboards and analytics that are baked into their application, but they need to be able to enable Ad Hoc, easy analysis on top of their data via a traditional reporting tool. Like any organization, IT is overworked and reports always seem to find their way to close to the back of the priority. Our initiative at this customer is about enabling easy access to their high performance and well tuned/architected CouchDB instance.

I figured I could generalize what we’re doing at this prospect to a more general post, and hope that this post (and the software behind it, namely LucidDB) finds there way into the hands of others that are trying to do Ad Hoc and easy SQL based analytics on top of CouchDB.

First, the overview:

We’ve created a solution that allows HTTP views in CouchDB to be made accessible via SQL inside of LucidDB. This means two things:

  • Direct access to LIVE, real time data in CouchDB is available to any tool that can talk to LucidDB (SQL + JDBC/ODBC).
  • We can use this access to very easily, keep a cached up to date “cache” of this data in LucidDB, an optimized column storage database. In fact, the periodic refresh is a single MERGE statement (TO BE COVERED IN A SUBSEQUENT BLOG)

Our vision of Big Data BI requires that the user be able to “choose” which version of the data they want, easily. Cached and summarized inside LucidDB -OR- up to date, real time results from Big Data (aka CouchDB). This example focuses on the first part of the “hot or cold” level of aggregation ability; namely the live “HOT” piece.

First, let’s look at our source data. We’ve used the Federal Purchasing Dataset (a sample of 1000 records, since this runs on my laptop). Here’s a couple of screenshots from Futon that show the source data docs, and then the CouchDB View.


and also the “view” that we’ll access


NOTE: This view is definitely atypical. It is just emitting docs/flattening. it is NOT doing any aggregation with is what is almost always done with HTTP views. Our connector initially supports streaming data from the values, but we’re already working on getting proper support for the KEYs (and ability to push down group_level, etc).

There’s another key piece of our philosophy on Big Data BI that is also present here. Notice that we’re not trying to “dumb down” the source, Big Data system? We’re not trying to strangely Map SQL onto some remote language so that you’ve lost some richness. CouchDB’ers take note: You can do all that you normally can and want to do in that view. We believe in effeciently accessing (group_level optimizations) not about reducing your abilities to write very interesting, complex map / reduce that returns your base level results.

Let’s make these available, via ANSI SQL, in LucidDB.

I’ll start with a vanilla installation of LucidDB (NOTE: I used a pre release version of 0.9.4 for this).

First, let’s give LucidDB the ability, in a general way, to talk to CouchDB as a foreign data source:


Now LucidDB has the ability to talk to CouchDB, in a general sense. Let’s now configure and access our test dataset.

First step is to create a server connection from LucidDB to our CouchDB server:


Now, we’ve established a connection from our LucidDB installation to CouchDB; let’s now make that “all” view available via SQL.


That statement looks complicated. It is, unfortunately, a little more involved than this process usually is. If you were talking to say, Oracle database, we can import and “discover” all this information (datatypes, column names, etc) from the driver. We have not done all this additional metadata work for the CouchDB connector yet; or actually, how to get ALL this information from the remote view as well.

So, we’ve told LucidDB about the server, the view, the datatypes/fields. We’re all set to now do ANY SQL we like on top of it.


How about doing some aggregations?


We’ll cover creating the “local” cached copy of the CouchDB dataset in the next section.

Why OLAP4J 1.0 matters

Julian Hyde and his cohorts on the Mondrian project have been busy at work for nearly 5 years (spec 0.5 done in 2006!) working on the difficult, but worthwhile effort of standardizing client side access to OLAP in Java.

They just released version 1.0! This is a big deal; bigger players have attempted and failed at this before (ahem JOLAP).  Kudos to Julian, Luc and the others involved to get such a *real* standard in place!

There’s a few reasons why this matters to everyone in Business Intelligence.  Not just Java devs and Open Source BI fans.

  • Only existing “de facto” standards are owned by MSFT

    XML/A was touted as **the** industry standard for OLAP client server communications.  You can think of XML/A as the SOAP equivalent of OLAP client libraries.  There are a few problems with this.First is that MSFT always treated this like they do all “open” standards; just open enough to get what they need out of it (SQUASH JOLAP) but never really open.     For instance, reading the spec, notice that the companies involved specifically note that they all absolutely reserve the right to enforce their patent rights on their technology, EVEN IF it’s part of the spec.  ie, it’s open, but if you actually IMPLEMENT it you might have to pay MSFT for it.

    Second is that XML/A is now a fragmented standard.  Similar to SQL, MDX support and other line protocol extensions (ahem, Binary Secured XML/A) means that there’s no one really making any sort of technology toolkit, collection of drivers, etc.  Simba does much of this in their lab in Vancouver, but they’re the exact opposite of open.  In fact, when the XML/A council vanished they pounced and picked up the site which is now a simple shill for their products.  A couple of guys at a single company without any open publication on variations in MDX/implementations is counterproductive to real interoperability.

    Third is that SOAP is soooooo 1999.  SOAP is fundamental in XML/A and there are many interesting (Saiku) ways of serving client server.  REST, direct sockets, in memory, etc.

  • Helps keep Mondrian from being fused to Pentaho Analyzer

    Mondrian is a very successful open source project and serves as the basis (server part) of Pentaho’s Analyzer (acquired from LucidEra).  Pentaho has clearly signaled their (lack of) commitment to upkeep of their open source frontends; Analyzer is proprietary software that Pentaho has committed all their OLAP UI efforts behind, leaving the community with an aging JPivot front end to Mondrian.  Clearly underestimating what the community has to offer, the community has delivered a replacement project Saiku to address this.

    OFF TOPIC: I’ve made several Open Source BI predictions and with the exception of Pentaho Sreadsheet Services (which technically wasn’t OSS) I’ve been right every time.  Here’s one for ya: Saiku will outshine Analyzer in the next 18mos and both technologies will be worse off because Pentaho, ironically and increasingly, chose proprietary instead of community. Ahhh… I feel better having said it.

    Keeping Mondrians primary exterior API as a standard helps ensure that Mondrian can not be subsumed (entirely) by Pentaho and that innovation can continue with multiple community projects doing shiny UI work on top of Mondrian.

  • A single, pragmatically useful API enables binding to other languages as well (ie, non Java)

    Saiku, basing their open source RESTful server on top of OLAP4J has now enabled cool mashable OLAP access to not JUST Mondrian (which was already available via SOAP/.xactions) but anyone else who creates a driver (SAP, SSAS, etc).  By actually having a real project that can collect up a real open driver implementation with a few implementations means that projects like Saiku (which actually has client APIs for C, Obj-C, Ruby, ActionScript, etc).

    I wouldn’t be surprised if there are others layers (ADOMD?) that leverage OLAP4J as well.

Java OLAP nerds unite!  Non Java nerds checkout Saiku.  Pentaho users/customers know that OLAP4J is good for keeping Mondrian open and innovative.  OLAP innovation is alive and well, led by Mondrian, Saiku, Pentaho, etc.  Happy MDX’ing.

Encrypt PDI passwords

PDI has a basic obfuscation method for making it difficult for casual people to lift passwords for DB connections. I have customers that maintain different versions of a “shared.xml” file that maintain different physical connections to databases (think development, QA/testing, and production).

In order to generate the different shared.xml, a user has to usually (per Matt Casters comment below there is a utility that allows user to do this outside of Spoon) open up PDI, created the connections, save them, and then sometimes copy and paste the sections needed to create their “dev” version of shared.xml or their “production” version of shared.xml. Many times this just to generate the password, as they can hand edit the other pieces (hostname, schema, etc).

I just committed a quick little PDI transformation that gives you the PDI encrypted form of a password.


Happy Password Encrypting!

Instant Relief from Slow MySQL Reporting Queries using DynamoDB

Here’s the scenario. You’ve got a table in MySQL for reporting that has a few million rows, and is denormalized for reporting. You’ve got a Pentaho Report that is querying this MySQL table. You have two problems with the current report.

  1. Your users are complaining that the query is slow, and they have to wait around for longer than they’d like to see their report. (approx 40s)
  2. Your DBAs are cranky because they see the size of this table is getting bigger. (approx 1.8GB)

MySQL is fundamentally designed to be an OLTP database and while it does a fantastic job at that, its data warehouse features were built as “bolt on” additions. Can it be used for BI? Absolutely, I’ve used it a many customer sites. Does DynamoDB provide a better set of features/capabilities for doing BI? We think so! Are they both 100% open source? You bet;why not choose the right tool for the right job then?

DynamoDB (aka LucidDB) is a “purpose built for BI” database. What does that mean? Well, I’ll be blogging about a lot of features that speak to our philosophy of a complete “BI Database” not just a fast one. One of the features that makes LucidDB complete, and not just a drag racer, is its ability to connect to remote data sources via JDBC and retrieve data. If you’re doing simple table replications, you don’t have to use an ETL tool, or do export or imports, or LOAD DATA INFILEs, etc. Our ability to connect to remote databases and access them as “remote tables” makes retrieving data into DynamoDB as easy as “insert into mytable select * from remote_table.”

Back to our original issue with our current MySQL

Our report is slow, and our database is big. How slow? Well, not really that bad, but at about 40s per query run that’s enough to tempt your business analyst to go fetch a coffee instead of continuing his work. How big? Well, not really that big, but at about 1.8GB it’s starting to get non trivial in terms of tuning the I/O etc.

Our goal is to improve both using DynamoDB; we’ll leave MySQL as our main OLTP application. We’re not trying to replace it – in fact, we’ll embrace MySQL as the system of record and simply “slurp/report” off this table in a separate reporting environment.

It’s a two step process.

  1. Connect from DynamoDB to MySQL using a JDBC connector, access the remote table, and draw over the data using a simple INSERT statement.
  2. Change our Pentaho Report to use the DynamoDB JDBC connector instead of MySQL.

Our Pentaho Report is based on the following SQL

SELECT t.Carrier as “CARRIER”,
c as “C”, c2 as “C2”, c*1000/c2 as “C3” FROM
(SELECT Carrier, count(Carrier) AS c FROM ontime
WHERE DepDelay>10 GROUP BY Carrier) t JOIN
(SELECT Carrier, count(Carrier) AS c2 FROM ontime
GROUP BY Carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;

This takes approximately 40s to run on MySQL database running the same machine.

Step 1: Connect, and load the data into our DynamoDB table.

— Create DynamoDB reporting table first
create schema faster;

create table faster.”ontime” (
“Year” int,
“Quarter” tinyint ,
“Month” tinyint ,
….. Abbreviated for Brevity ….
“Div5TailNum” varchar(10)

— Get access the MySQL table OnTime in the OTP schema on host localhost
create schema MYSQL_SOURCE;
set schema ‘MYSQL_SOURCE’;

sys_jdbc OPTIONS (
driver_class ‘com.mysql.jdbc.Driver’,
url ‘jdbc:mysql://localhost/otp?useCursorFetch=true’,
user_name ‘root’,
password ‘easy’,
fetch_size ‘1000’,
table_types ‘TABLE’,
schema_name ‘otp’);

import foreign schema OTP from server MYSQL_REMOTE_SOURCE into MYSQL_SOURCE;

— Load DynamoDB table from MySQL database directly
insert into FASTER.”ontime” select * from MYSQL_SOURCE.”ontime”;

Notice that last statement. You don’t have to export to intermediate files, or use an ETL tool (not that that’s bad, I’m a big fan of ETL tools!). You can use good old fashioned SQL to get data from a remote database into DynamoDB.

Step 2: Change the Pentaho Report to use the new connection.

We open up our report and change our connection from MySQL

to DynamoDB

NOTE: Until we finish our QA’ed builds we’re using LucidDB driver instead of DynamoDB but they are, one and the same.

We make some minor adjustment to the SQL (quoting some tables/etc) and rerun our query and Voila, our report runs in 10s down from 40s, an improvement of 400%.

How about storage? Our storage report shows that DynamoDB is using only .3 GB to store the same 7 Million records as compared to MySQL at 1.8GB, or 1/6 of the storage.

Not a bad investment of a few minutes of time, I’d say. DynamoDB (LucidDB) takes just a few minutes to install, and because of its focus on BI you should find things like retrieving data from remote data sources easy, and effective. Let’s be truthful here as well; once you speed up a report by 400% and reduce its storage by 6x your boss will be calling you a dynamo.

Notes: Full set of scripts posted here: Original queries and dataset from Vadim at MySQLPerformanceBlog.

Amazon's Pre Ordering of books sucks!

I pre-ordered a copy of the new, (first, only, best, and original) Pentaho book “Pentaho Solutions” by Roland Bouman and Jos van Dongen two weeks back.  Saw from a tweet that the book was shipping from Amazon.  Cool – had a look at the page.  Sure, they can ship today if I get my order in on time so I know they can ship it.

How about my pre order, which I would assume would go out before regular orders?  Won’t ship until next week?  Delivered by 9/11/2009?  Lesson learned – don’t pre order from Amazon.  🙂

CDF Tutorials

The folks at webdetails have posted their Pentaho Community Dashboard Framework tutorials that look great!  They run you through building CDF dashboards which is usually a crucial, user facing part of any BI implementations.  While much of the work is the ETL/OLAP configuration, tuning, etc on the backend most users think of Pentaho as the dashboard/reports they interact with not the data munching for the Data Warehouse.

These tutorials look great; I’ve implemented more than 20 CDF dashboards at four customers already but I still bought them to learn even more ins and outs.  You should too! No better way to learn something than from the source of the technology which in this case is Pedro and team @ webdetails.

MDX Humor from Portugal

Pedro Alves, the very talented lead developer behind the Pentaho Community Dashboard Framework gave me a good chuckle with his high opinion of MDX as a language:

MDX is God’s gift to business language; When God created Adam and Eve he just spoke [Humanity].[All Members].Children . That’s how powerful MDX is. And Julian Hyde allowed to use it without being bound to microsoft.

If you haven’t checked out Pedro’s blog, definitely get over there. It’s a recent start but he’s already getting some great stuff posted.

PDI Scale Out Whitepaper

I’ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration. These customers have some big challenges – one customer was expecting 1 billion rows / day to be processed on their ETL environment. Some of these customers were rolling their own solutions; others had very expensive proprietary solutions (Ab Initio I’m pretty sure however they couldn’t say since Ab Initio contracts are bizarre). One thing was common: they all had billions of records, a batch window that remained the same, and software costs that were out of control.

None of these customer specifics are public; they likely won’t be which is difficult for Bayon / Pentaho because sharing these top level metrics would be helpful for anyone using or evaluating PDI. Key questions when evaluating a scale out ETL tool: Does it scale with more nodes? Does it scale with more data?

I figured it was time to share some of my research, and findings on how PDI scales out and this takes the form of a whitepaper. Bayon is please to present this free whitepaper, Pentaho Data Integration : Scaling Out Large Data Volume Processing in the Cloud or on Premise. In the paper we cover a wide range of topics, including results from running transformations with up to 40 nodes and 1.8 billion rows.

Another interesting set of findings in the paper also relates to a very pragmatic approach in my research – I don’t have a spare 200k to simply buy 40 servers to run these tests. I have been using EC2 for quite a while now, and figured it was the perfect environment to see how PDI could scale on the cheapest of cheap servers ($0.10 / hour). Some other interesting metrics, relating to Cloud ETL is the top level benchmark of a utility compute cost of ETL processing of 6 USD per Billion Rows processed with zero long term infrastructure commitments.

Matt Casters, myself, and Lance Walter will also be presenting a free online webinar to go over the top level results, and have a discussion on large data volume processing in the cloud:

High Performance ETL using Cloud- and Cluster-based Deployment
Tuesday, May 26, 2009 2:00 pm
Eastern Daylight Time (GMT -04:00, New York)

If you’re interested in processing lots of data with PDI, or wanting to deploy PDI to the cloud, please register for the webinar or contact me.