Archive for the 'Pentaho' Category

Encrypt PDI passwords

Friday, January 29th, 2010

PDI has a basic obfuscation method for making it difficult for casual people to lift passwords for DB connections. I have customers that maintain different versions of a “shared.xml” file that maintain different physical connections to databases (think development, QA/testing, and production).

In order to generate the different shared.xml, a user has to usually (per Matt Casters comment below there is a utility that allows user to do this outside of Spoon) open up PDI, created the connections, save them, and then sometimes copy and paste the sections needed to create their “dev” version of shared.xml or their “production” version of shared.xml. Many times this just to generate the password, as they can hand edit the other pieces (hostname, schema, etc).

I just committed a quick little PDI transformation that gives you the PDI encrypted form of a password.

201001291332

Happy Password Encrypting!

Instant Relief from Slow MySQL Reporting Queries using DynamoDB

Monday, November 2nd, 2009

Here’s the scenario. You’ve got a table in MySQL for reporting that has a few million rows, and is denormalized for reporting. You’ve got a Pentaho Report that is querying this MySQL table. You have two problems with the current report.

  1. Your users are complaining that the query is slow, and they have to wait around for longer than they’d like to see their report. (approx 40s)
  2. Your DBAs are cranky because they see the size of this table is getting bigger. (approx 1.8GB)

MySQL is fundamentally designed to be an OLTP database and while it does a fantastic job at that, its data warehouse features were built as “bolt on” additions. Can it be used for BI? Absolutely, I’ve used it a many customer sites. Does DynamoDB provide a better set of features/capabilities for doing BI? We think so! Are they both 100% open source? You bet;why not choose the right tool for the right job then?

DynamoDB (aka LucidDB) is a “purpose built for BI” database. What does that mean? Well, I’ll be blogging about a lot of features that speak to our philosophy of a complete “BI Database” not just a fast one. One of the features that makes LucidDB complete, and not just a drag racer, is its ability to connect to remote data sources via JDBC and retrieve data. If you’re doing simple table replications, you don’t have to use an ETL tool, or do export or imports, or LOAD DATA INFILEs, etc. Our ability to connect to remote databases and access them as “remote tables” makes retrieving data into DynamoDB as easy as “insert into mytable select * from remote_table.”

Back to our original issue with our current MySQL

Our report is slow, and our database is big. How slow? Well, not really that bad, but at about 40s per query run that’s enough to tempt your business analyst to go fetch a coffee instead of continuing his work. How big? Well, not really that big, but at about 1.8GB it’s starting to get non trivial in terms of tuning the I/O etc.

Our goal is to improve both using DynamoDB; we’ll leave MySQL as our main OLTP application. We’re not trying to replace it - in fact, we’ll embrace MySQL as the system of record and simply “slurp/report” off this table in a separate reporting environment.

It’s a two step process.

  1. Connect from DynamoDB to MySQL using a JDBC connector, access the remote table, and draw over the data using a simple INSERT statement.
  2. Change our Pentaho Report to use the DynamoDB JDBC connector instead of MySQL.

Our Pentaho Report is based on the following SQL

SELECT t.Carrier as “CARRIER”,
c as “C”, c2 as “C2″, c*1000/c2 as “C3″ FROM
(SELECT Carrier, count(Carrier) AS c FROM ontime
WHERE DepDelay>10 GROUP BY Carrier) t JOIN
(SELECT Carrier, count(Carrier) AS c2 FROM ontime
GROUP BY Carrier) t2 ON (t.Carrier=t2.Carrier) ORDER BY c3 DESC;

200911022216
This takes approximately 40s to run on MySQL database running the same machine.

Step 1: Connect, and load the data into our DynamoDB table.

– Create DynamoDB reporting table first
create schema faster;

create table faster.”ontime” (
“Year” int,
“Quarter” tinyint ,
“Month” tinyint ,
….. Abbreviated for Brevity ….
“Div5TailNum” varchar(10)
);

– Get access the MySQL table OnTime in the OTP schema on host localhost
create schema MYSQL_SOURCE;
set schema ‘MYSQL_SOURCE’;

CREATE SERVER MYSQL_REMOTE_SOURCE FOREIGN DATA WRAPPER
sys_jdbc OPTIONS (
driver_class ‘com.mysql.jdbc.Driver’,
url ‘jdbc:mysql://localhost/otp?useCursorFetch=true’,
user_name ‘root’,
password ‘easy’,
fetch_size ‘1000′,
table_types ‘TABLE’,
schema_name ‘otp’);

import foreign schema OTP from server MYSQL_REMOTE_SOURCE into MYSQL_SOURCE;

– Load DynamoDB table from MySQL database directly
insert into FASTER.”ontime” select * from MYSQL_SOURCE.”ontime”;

Notice that last statement. You don’t have to export to intermediate files, or use an ETL tool (not that that’s bad, I’m a big fan of ETL tools!). You can use good old fashioned SQL to get data from a remote database into DynamoDB.

Step 2: Change the Pentaho Report to use the new connection.

We open up our report and change our connection from MySQL

200911022234
to DynamoDB

200911022236
NOTE: Until we finish our QA’ed builds we’re using LucidDB driver instead of DynamoDB but they are, one and the same.

We make some minor adjustment to the SQL (quoting some tables/etc) and rerun our query and Voila, our report runs in 10s down from 40s, an improvement of 400%.

How about storage? Our storage report shows that DynamoDB is using only .3 GB to store the same 7 Million records as compared to MySQL at 1.8GB, or 1/6 of the storage.

Not a bad investment of a few minutes of time, I’d say. DynamoDB (LucidDB) takes just a few minutes to install, and because of its focus on BI you should find things like retrieving data from remote data sources easy, and effective. Let’s be truthful here as well; once you speed up a report by 400% and reduce its storage by 6x your boss will be calling you a dynamo.

Notes: Full set of scripts posted here: mysql_relief.zip. Original queries and dataset from Vadim at MySQLPerformanceBlog.

Amazon’s Pre Ordering of books sucks!

Wednesday, August 26th, 2009

I pre-ordered a copy of the new, (first, only, best, and original) Pentaho book “Pentaho Solutions” by Roland Bouman and Jos van Dongen two weeks back.  Saw from a tweet that the book was shipping from Amazon.  Cool - had a look at the page.  Sure, they can ship today if I get my order in on time so I know they can ship it.

How about my pre order, which I would assume would go out before regular orders?  Won’t ship until next week?  Delivered by 9/11/2009?  Lesson learned - don’t pre order from Amazon.  :)

CDF Tutorials

Wednesday, July 22nd, 2009

The folks at webdetails have posted their Pentaho Community Dashboard Framework tutorials that look great!  They run you through building CDF dashboards which is usually a crucial, user facing part of any BI implementations.  While much of the work is the ETL/OLAP configuration, tuning, etc on the backend most users think of Pentaho as the dashboard/reports they interact with not the data munching for the Data Warehouse.

These tutorials look great; I’ve implemented more than 20 CDF dashboards at four customers already but I still bought them to learn even more ins and outs.  You should too! No better way to learn something than from the source of the technology which in this case is Pedro and team @ webdetails.

MDX Humor from Portugal

Tuesday, April 21st, 2009

Pedro Alves, the very talented lead developer behind the Pentaho Community Dashboard Framework gave me a good chuckle with his high opinion of MDX as a language:

MDX is God’s gift to business language; When God created Adam and Eve he just spoke [Humanity].[All Members].Children . That’s how powerful MDX is. And Julian Hyde allowed to use it without being bound to microsoft.

If you haven’t checked out Pedro’s blog, definitely get over there. It’s a recent start but he’s already getting some great stuff posted.

PDI Scale Out Whitepaper

Tuesday, April 21st, 2009

I’ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration. These customers have some big challenges - one customer was expecting 1 billion rows / day to be processed on their ETL environment. Some of these customers were rolling their own solutions; others had very expensive proprietary solutions (Ab Initio I’m pretty sure however they couldn’t say since Ab Initio contracts are bizarre). One thing was common: they all had billions of records, a batch window that remained the same, and software costs that were out of control.

None of these customer specifics are public; they likely won’t be which is difficult for Bayon / Pentaho because sharing these top level metrics would be helpful for anyone using or evaluating PDI. Key questions when evaluating a scale out ETL tool: Does it scale with more nodes? Does it scale with more data?

I figured it was time to share some of my research, and findings on how PDI scales out and this takes the form of a whitepaper. Bayon is please to present this free whitepaper, Pentaho Data Integration : Scaling Out Large Data Volume Processing in the Cloud or on Premise. In the paper we cover a wide range of topics, including results from running transformations with up to 40 nodes and 1.8 billion rows.

Another interesting set of findings in the paper also relates to a very pragmatic approach in my research - I don’t have a spare 200k to simply buy 40 servers to run these tests. I have been using EC2 for quite a while now, and figured it was the perfect environment to see how PDI could scale on the cheapest of cheap servers ($0.10 / hour). Some other interesting metrics, relating to Cloud ETL is the top level benchmark of a utility compute cost of ETL processing of 6 USD per Billion Rows processed with zero long term infrastructure commitments.

Matt Casters, myself, and Lance Walter will also be presenting a free online webinar to go over the top level results, and have a discussion on large data volume processing in the cloud:

High Performance ETL using Cloud- and Cluster-based Deployment
Tuesday, May 26, 2009 2:00 pm
Eastern Daylight Time (GMT -04:00, New York)

If you’re interested in processing lots of data with PDI, or wanting to deploy PDI to the cloud, please register for the webinar or contact me.

Pentaho Partner Summit

Wednesday, April 1st, 2009

I’m at the Westin close to the event space for the summit…

I’m around tonight - meeting Bryan Senseman from OpenBI a bit later (730 or 800pm).  Anyone else around and want to meet up for dinner?  Email me ngoodman@ignorethispart.com bayontechnologies.com.

Make Mondrian Dumb

Wednesday, February 18th, 2009

I had a customer recently who had very hierarchical data, with some complicated measures that didn’t aggregate up according to regular ole aggregation rules (sum, min, max, avg, count, distinct count). Now, one can do weighted averages using sql expressions in a Measure Expression these rules were complex and they also were dependent on the other dimension attributes. UGGGGH.

Come to that: their analysts had the pristine, blessed data sets calculated at different rollups (already aggregated to Company Regions). Mondrian though, is often too smart for it’s own good. If it has data in cache, and things it can roll up a measure to a higher level (Company Companies can be rolled up to Regions if it’s a SUM for instance) Mondrian will do that. This is desirable in like 99.9% of cases. Unless, you want to “solve” your cube and just tell Mondrian to read the data from your tables.

I started thinking - since their summary row counts are actually quite small.

  1. What if I could get Mondrian to ignore the cache and always ask the database for the result? I had never tried the “cache=” attribute of a Cube before (it defaults to true and I work with that 99.9% of the world). Seems like setting it to false does the trick. Members are read and cached but the cells aren’t.
  2. What if I could get Mondrian to look to my summary tables for the data instead of aggregating the base fact? That just seems like a standard aggregate table calculation. Configure an aggregate table so Mondrian will read the Company Regions set from the aggregate instead of the fact

Looks like I was getting close to what I wanted. Here’s the dataset I came up with to test:

mysql> select * from fact_base;
+----------+-----------+-----------+
| measure1 | dim_attr1 | dim_attr2 |
+----------+-----------+-----------+
| 1 | Parent | Child1 |
| 1 | Parent | Child2 |
+----------+-----------+-----------+
2 rows in set (0.00 sec)

mysql> select * from agg_fact_base;
+------------+----------+-----------+
| fact_count | measure1 | dim_attr1 |
+------------+----------+-----------+
| 2 | 10 | Parent |
+------------+----------+-----------+
1 row in set (0.03 sec)

mysql>
Here’s the Mondrian schema I came up with:

<Schema name=”Test”>
<Cube name=”TestCube” cache=”false” enabled=”true”>
<Table name=”fact_base”>
<AggName name=”agg_fact_base”>
<AggFactCount column=”fact_count”/>
<AggMeasure name=”[Measures].[Meas1]” column=”measure1″ />
<AggLevel name=”[Dim1].[Attr1]” column=”dim_attr1″ />
</AggName>
</Table>
<Dimension name=”Dim1″>
<Hierarchy hasAll=”true”>
<Level name=”Attr1″ column=”dim_attr1″/>
<Level name=”Attr2″ column=”dim_attr2″/>
</Hierarchy>
</Dimension>
<Measure name=”Meas1″ column=”measure1″ aggregator=”min”>
</Measure>
</Cube>
</Schema>

Notice that the aggregate for Parent in the agg table is “10″ and the value if the children are summed would be “2.” 2 means it agged the base table = BAD. 10 means it used the summarized data = GOOD.

The key piece I wanted to very is that if I start with an MDX for the CHILDREN and THEN request the Parent will I get the correct value. Run a cold cache MDX to get the children values:

200902181235

Those look good. Let’s grab the parent level now, and see what data we get:
200902181235-1

The result is 10 = GOOD! I played around with access methods to see if I could get if messed up and on my simple example it didn’t. I‘ll leave it to the comments to point out any potential issues with this approach but it appears as if setting cache=”false” and setting up your aggregate tables properly will cause Mondrian to be a dumb cell reader and simply select out the values you’ve already precomputed. Buyer Beware - you’d have to get REALLY REALLY good agg coverage to handle all the permutations of levels in your Cube. This could be rough - but it does work. :) And caching - it always issues SQL so that might be an issue too.

Sample: cachetest.zip

Mondrian - you’ve been dumbed down! Take that!!!

Self Service Data Export using Pentaho

Monday, February 9th, 2009

Every BI installation has power users that just want “data dumps.” They may need the dumps for a variety of reasons:

  • You’ve built crappy reports. They can’t get the information they need in *YOUR* reports.
  • They need to feed the data into another system. They want to select all customers who bought product X in time period Y to send them a recall notice. Need a dump of email / addresses to send them the notice.
  • They are addicted to Excel; they feel like a super hero whizzing through the data making fancy graphs and doing a few of their own ratios/calculations.
  • They want to munge the numbers. They will export it to Excel, throw out the data that makes them look bad, and then present it to their boss with shiny positive results.

I had a customer who needed something to “feed the data to another system.” Their original approach was to write a Pentaho Report that formatted to CSV well, write the parameterized query, and then simply generate the report and return it in the browser. This seems like a sound approach and would have been my first as well. They found that it did work well, to a point. It looked as if the Pentaho Report layer tends to use a bunch of memory for report generation - this is understandable. The report object is being rendered but is only “returned” to Pentaho when it’s complete. The entire dataset must be in memory. Well, needless to say, with this customers heap configuration they found a row threshold (30,000) that caused their Pentaho 1.6 installation to croak.

However, they didn’t really need to be using Pentaho Reporting. Kettle, which is included in Pentaho BI Suite has an straightforward performant way to export to CSV. If we could generate that file, and then simply return the file that we just generated to the browser we’d have an elegant solution for data export.

The first piece of the puzzle is the data export KTR. This KTR takes two arguments: Country and Filename. The Country is the value that will limit the data set that we are outputing (output customers in Italy). The Filename is the location to put the file. This isn’t necessary, but it allows the FILENAME to be set by the caller (.xaction) instead of callee (ktr). It’s for convenience.

200902090955

I’ve created a directory in tomcat/webapps/ named “lz”, short for landing zone. This /lz/ directory is accessible via the web browser. By placing this in this location we can use the same tomcat server that is hosting Pentaho to serve up our data export file as well.

Now, let’s get to a little bit of the magic of the Action Sequence, data_export.xaction.
200902091000

The first thing this action sequence does is to create a list of countries, and then prompt the user to select one. This is pretty standard stuff, done all the time with Pentaho reports so we won’t cover the specifics here.

Once we’ve got our “country” defined, we call our “Pentaho Data Integration” KTR component with two arguments. The first is the country the user has just selected and the second is the filename that we’ve hard coded as an input to our action sequence. The filename is the location on the local filesystem you would like kettle to generate the file at (ie, /apps/pentaho/tomcat/webapps/lz/data_export_file.csv).

Once we’ve generated the file in that location, we’ll send a redirect to the user as the “output” of this action sequence. The user doesn’t really “see” this; the user will just see the .csv arrive in their browser. The way to get the redirect to work is to add the output to “response.redirect” like so:
200902091005
The redirect URI is another hard coded value: /lz/data_export_file.csv which should reference the path of the file on the web server.

The user experience is indistinguishable from standard reports. User is prompted:

200902091010

they click “OK” and are prompted what do do with their export.
200902091011

The performance of this solution far surpasses using Pentaho Reporting. Exports of 10,000 rows that were taking 30-60 seconds were taking 10-15 seconds. However, be warned. The export via Kettle will only have as many formatting options as are present in the “Text File Output” step which are many, but limited. If you need fine control over the format of your data export, you may have to stick with Pentaho Reporting since it does provide a superior set of layout/formatting controls.
It should also be noted that this works with zipped files (to zip up the .csv), and also .XLS exports. I’ve provided this sample (data_export.zip) that works with Pentaho 2.0 BI (Needs hypersonic sample database). You’ll have to adjust the “filename” variable to your filesystem before running it for it to work properly (it has the location of my installation).

The death of prevRow = row.clone()

Friday, January 30th, 2009

UPDATE: This step is available in Kettle 3.2 M1.

For those that have done more involved Kettle projects you’ll know how valuable the Javascript step is. It’s the Swiss Army knife of Kettle development. The calculator step is a nice thought, but the limited set of functions and the constriction of having to enter it in pulldowns can make more complex calculations more difficult.

Those that have done “observed metric” type calculations in Kettle will know this bit of Javascript well:

var prevRow;
var PREV_ORDER_DATE;

if ( prevRow != null && prevRow.getInteger(”customernumber”, -1) == customernumber.getInteger() )
PREV_ORDER_DATE = prevRow.getDate(”orderdate”, null);
else
PREV_ORDER_DATE = null;

prevRow = row.Clone();

This little bit of Javascript allowed you to “look forward” (or back depending on your sorting) and calculate the difference between items:

  • Watching a set of “balances” fly by and calculate the transactions (this balance - prev balance) = transaction amount
    Web Page duration (next click time - this click time) = time spent viewing this web page
    Order Status time (next order status time - this order status time) = Amount of time spent in this order status (warehouse waiting)

In other words, lining data up and peaking ahead and backwards is a common analytic calculation. In Oracle/ANSI SQL, there’s a whole set of functions that perform these type of functions.

This week I committed to the Kettle 3.2x source code a step to perform the LEAD/LAG functions that I’ve had to hand write several times in Javascript. It’s been long overdue as I told Matt I designed the step in my head two years ago and he’s been patiently waiting for me to get off my *ss and do something about it.

You can find more information about the step on its Wiki page, along with a few examples in the samples/transformations/ directory.

The step allows you peek N rows forward, and N rows backward over a group and grab the value and include it in the current row. The step allows you to set the group (at which to reset the LEAD/LAG), and setup each function (Name, Subject, Type, N rows)
200901301239
Using a group field (groupseq) and LEADing/LAGing ONE row (N = 1) we can get the following dataset:
200901301238
Any additional calculations (such as the difference, etc) can be calculated like any other fields.

This was my first commit to the Kettle project, and a very cool thing happened. I checked in the base step and in true open source fashion, Samatar (another dev) noticed, and created an icon for my step which was great since I had no idea what to make as the icon. Additionally, hours after my first commit he had included a French translation for the step. He and I didn’t discuss it ahead of time, or even know each other. That’s the way open source works… well. :)
RIP prevRow = row.clone(). You are dead to me now. Long live the Analytic Query step