Archive

Archive for the ‘Open Source’ Category

MDX Humor from Portugal

April 21st, 2009

Pedro Alves, the very talented lead developer behind the Pentaho Community Dashboard Framework gave me a good chuckle with his high opinion of MDX as a language:

MDX is God’s gift to business language; When God created Adam and Eve he just spoke [Humanity].[All Members].Children . That’s how powerful MDX is. And Julian Hyde allowed to use it without being bound to microsoft.

If you haven’t checked out Pedro’s blog, definitely get over there. It’s a recent start but he’s already getting some great stuff posted.

General BI, Open Source, Pentaho

PDI Scale Out Whitepaper

April 21st, 2009

I’ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration. These customers have some big challenges - one customer was expecting 1 billion rows / day to be processed on their ETL environment. Some of these customers were rolling their own solutions; others had very expensive proprietary solutions (Ab Initio I’m pretty sure however they couldn’t say since Ab Initio contracts are bizarre). One thing was common: they all had billions of records, a batch window that remained the same, and software costs that were out of control.

None of these customer specifics are public; they likely won’t be which is difficult for Bayon / Pentaho because sharing these top level metrics would be helpful for anyone using or evaluating PDI. Key questions when evaluating a scale out ETL tool: Does it scale with more nodes? Does it scale with more data?

I figured it was time to share some of my research, and findings on how PDI scales out and this takes the form of a whitepaper. Bayon is please to present this free whitepaper, Pentaho Data Integration : Scaling Out Large Data Volume Processing in the Cloud or on Premise. In the paper we cover a wide range of topics, including results from running transformations with up to 40 nodes and 1.8 billion rows.

Another interesting set of findings in the paper also relates to a very pragmatic approach in my research - I don’t have a spare 200k to simply buy 40 servers to run these tests. I have been using EC2 for quite a while now, and figured it was the perfect environment to see how PDI could scale on the cheapest of cheap servers ($0.10 / hour). Some other interesting metrics, relating to Cloud ETL is the top level benchmark of a utility compute cost of ETL processing of 6 USD per Billion Rows processed with zero long term infrastructure commitments.

Matt Casters, myself, and Lance Walter will also be presenting a free online webinar to go over the top level results, and have a discussion on large data volume processing in the cloud:

High Performance ETL using Cloud- and Cluster-based Deployment
Tuesday, May 26, 2009 2:00 pm
Eastern Daylight Time (GMT -04:00, New York)

If you’re interested in processing lots of data with PDI, or wanting to deploy PDI to the cloud, please register for the webinar or contact me.

Data Integration (Kettle), General BI, Grid/Distributed Computing, Open Source, Pentaho

Pentaho Partner Summit

April 1st, 2009

I’m at the Westin close to the event space for the summit…

I’m around tonight - meeting Bryan Senseman from OpenBI a bit later (730 or 800pm).  Anyone else around and want to meet up for dinner?  Email me ngoodman@ignorethispart.com bayontechnologies.com.

Open Source, Pentaho

Make Mondrian Dumb

February 18th, 2009

I had a customer recently who had very hierarchical data, with some complicated measures that didn’t aggregate up according to regular ole aggregation rules (sum, min, max, avg, count, distinct count). Now, one can do weighted averages using sql expressions in a Measure Expression these rules were complex and they also were dependent on the other dimension attributes. UGGGGH.

Come to that: their analysts had the pristine, blessed data sets calculated at different rollups (already aggregated to Company Regions). Mondrian though, is often too smart for it’s own good. If it has data in cache, and things it can roll up a measure to a higher level (Company Companies can be rolled up to Regions if it’s a SUM for instance) Mondrian will do that. This is desirable in like 99.9% of cases. Unless, you want to “solve” your cube and just tell Mondrian to read the data from your tables.

I started thinking - since their summary row counts are actually quite small.

  1. What if I could get Mondrian to ignore the cache and always ask the database for the result? I had never tried the “cache=” attribute of a Cube before (it defaults to true and I work with that 99.9% of the world). Seems like setting it to false does the trick. Members are read and cached but the cells aren’t.
  2. What if I could get Mondrian to look to my summary tables for the data instead of aggregating the base fact? That just seems like a standard aggregate table calculation. Configure an aggregate table so Mondrian will read the Company Regions set from the aggregate instead of the fact

Looks like I was getting close to what I wanted. Here’s the dataset I came up with to test:

mysql> select * from fact_base;
+----------+-----------+-----------+
| measure1 | dim_attr1 | dim_attr2 |
+----------+-----------+-----------+
| 1 | Parent | Child1 |
| 1 | Parent | Child2 |
+----------+-----------+-----------+
2 rows in set (0.00 sec)

mysql> select * from agg_fact_base;
+------------+----------+-----------+
| fact_count | measure1 | dim_attr1 |
+------------+----------+-----------+
| 2 | 10 | Parent |
+------------+----------+-----------+
1 row in set (0.03 sec)

mysql>
Here’s the Mondrian schema I came up with:

<Schema name=”Test”>
<Cube name=”TestCube” cache=”false” enabled=”true”>
<Table name=”fact_base”>
<AggName name=”agg_fact_base”>
<AggFactCount column=”fact_count”/>
<AggMeasure name=”[Measures].[Meas1]” column=”measure1″ />
<AggLevel name=”[Dim1].[Attr1]” column=”dim_attr1″ />
</AggName>
</Table>
<Dimension name=”Dim1″>
<Hierarchy hasAll=”true”>
<Level name=”Attr1″ column=”dim_attr1″/>
<Level name=”Attr2″ column=”dim_attr2″/>
</Hierarchy>
</Dimension>
<Measure name=”Meas1″ column=”measure1″ aggregator=”min”>
</Measure>
</Cube>
</Schema>

Notice that the aggregate for Parent in the agg table is “10″ and the value if the children are summed would be “2.” 2 means it agged the base table = BAD. 10 means it used the summarized data = GOOD.

The key piece I wanted to very is that if I start with an MDX for the CHILDREN and THEN request the Parent will I get the correct value. Run a cold cache MDX to get the children values:

200902181235

Those look good. Let’s grab the parent level now, and see what data we get:
200902181235-1

The result is 10 = GOOD! I played around with access methods to see if I could get if messed up and on my simple example it didn’t. I‘ll leave it to the comments to point out any potential issues with this approach but it appears as if setting cache=”false” and setting up your aggregate tables properly will cause Mondrian to be a dumb cell reader and simply select out the values you’ve already precomputed. Buyer Beware - you’d have to get REALLY REALLY good agg coverage to handle all the permutations of levels in your Cube. This could be rough - but it does work. :) And caching - it always issues SQL so that might be an issue too.

Sample: cachetest.zip

Mondrian - you’ve been dumbed down! Take that!!!

Open Source, Pentaho

The death of prevRow = row.clone()

January 30th, 2009

UPDATE: This step is available in Kettle 3.2 M1.

For those that have done more involved Kettle projects you’ll know how valuable the Javascript step is. It’s the Swiss Army knife of Kettle development. The calculator step is a nice thought, but the limited set of functions and the constriction of having to enter it in pulldowns can make more complex calculations more difficult.

Those that have done “observed metric” type calculations in Kettle will know this bit of Javascript well:

var prevRow;
var PREV_ORDER_DATE;

if ( prevRow != null && prevRow.getInteger(”customernumber”, -1) == customernumber.getInteger() )
PREV_ORDER_DATE = prevRow.getDate(”orderdate”, null);
else
PREV_ORDER_DATE = null;

prevRow = row.Clone();

This little bit of Javascript allowed you to “look forward” (or back depending on your sorting) and calculate the difference between items:

  • Watching a set of “balances” fly by and calculate the transactions (this balance - prev balance) = transaction amount
    Web Page duration (next click time - this click time) = time spent viewing this web page
    Order Status time (next order status time - this order status time) = Amount of time spent in this order status (warehouse waiting)

In other words, lining data up and peaking ahead and backwards is a common analytic calculation. In Oracle/ANSI SQL, there’s a whole set of functions that perform these type of functions.

This week I committed to the Kettle 3.2x source code a step to perform the LEAD/LAG functions that I’ve had to hand write several times in Javascript. It’s been long overdue as I told Matt I designed the step in my head two years ago and he’s been patiently waiting for me to get off my *ss and do something about it.

You can find more information about the step on its Wiki page, along with a few examples in the samples/transformations/ directory.

The step allows you peek N rows forward, and N rows backward over a group and grab the value and include it in the current row. The step allows you to set the group (at which to reset the LEAD/LAG), and setup each function (Name, Subject, Type, N rows)
200901301239
Using a group field (groupseq) and LEADing/LAGing ONE row (N = 1) we can get the following dataset:
200901301238
Any additional calculations (such as the difference, etc) can be calculated like any other fields.

This was my first commit to the Kettle project, and a very cool thing happened. I checked in the base step and in true open source fashion, Samatar (another dev) noticed, and created an icon for my step which was great since I had no idea what to make as the icon. Additionally, hours after my first commit he had included a French translation for the step. He and I didn’t discuss it ahead of time, or even know each other. That’s the way open source works… well. :)
RIP prevRow = row.clone(). You are dead to me now. Long live the Analytic Query step

Data Integration (Kettle), Open Source, Pentaho

An arms race my customers don’t care about

November 26th, 2008

Perfect is the enemy of good enough. This is fertile soil for why people choose to use the simpler, functional, cheaper open source cousins of proprietary feature function behemoths. Don’t get me wrong - too few features / crappy performance you lose customers because you’re not helping people solve problems if you lack too many features.

Recently, I observed a thread at the blog of Goban Saor entitled “Open Source Metrics.”

It basically has turned into a discussion which keeps creeping up about which tool is faster: Talend or Kettle. Which leads me to ask the question: Who Friggin’ Cares?

I’m a Kettle Expert so I think Kettle is Wicked Fast.
If I were a Talend Expert I’d think Talend is Wicked Fast.

Performance for customers who are focused on results, and aren’t technophiles boils down to these two requirements

  1. It has to meet my performance requirements for my project. If I have to load 1 million records per day and I have 10 minutes to do that then the tool either does or does not meet that performance requirement.
  2. It has to allow me to grow beyond my current performance requirements. I am loading 1 million records now, but in 3 years I may be loading 100 million records. Given the right investment in tuning and scaling I don’t want to have to change to a different tool when I go much bigger.

For Kettle the answer is pretty simple:

  1. I do a few simple mappings, hit run, do very little tuning/database optimization. Wham-o. 20k records / second throughput. Look and notice Kettle is simply sitting idle waiting for a database lookup. Add an index. Wham-o 35k records / second throughput. Have extra CPUs, fire up a few extra threads of a calculation step. Wham-o 40k / second. Surpasses customer batch window needs sufficiently; enough said. Requirement met - whether 35k records per second is slower or faster than someone else is irrelevant. Requirement met.
  2. This usually involves outside validations. What are other people doing - what are the proof points about the performance. I personally have worked on a Kettle scale out cluster with 5 nodes that reads, sorts, aggregates, and summarizes a billion FAT (wide character) records in an HOUR and scales almost perfectly linearly (* no tool grows at perfect linear). Telling a customer using the exact same binary you have there, you can scale out and process hundreds of millions into billions of records per hour. Requirement met - you can grow with your tool.

I think Kettle performance is superb. I’d welcome Talend folks to comment here and blog about their proof points for how Talend performance is superb. I believe that it is. Let’s just all consider the most important thing: open source ETL is about solving the ETL need well, not necessarily incremental performance differences.

It’s a debate with no winner. I don’t care if your tool is 2.5% faster at reading character text files than mine. I do care if it can scale out (requirement 2) and solves customer problems (requirement 1).

Open Source

Hidden little trend arrows

November 11th, 2008

Many readers of this blog use JPivot. The solidly average web based Pivot Viewer that I’ve heard described as a “relic” of the cold war - no frills utility software. However, as maligned as JPivot is, it does have some great features and has been production quality software for years now. One of these hidden little features that is in JPivot (and also in Pentaho) is the quick and easy way to add trend lines to a JPivot screen by simply using MDX.

Consider, for instance, this little bit of MDX:

with member [Measures].[Variance Percent] as ‘([Measures].[Variance] / [Measures].[Budget])’, format_string = IIf(((([Measures].[Variance] / [Measures].[Budget]) * 100.0) > 2.0), “|#.00%|arrow=’up’“, IIf(((([Measures].[Variance] / [Measures].[Budget]) * 100.0) < 0.0), “|#.00%|arrow=’down’“, “#.00%”))
select NON EMPTY {[Measures].[Actual], [Measures].[Budget], [Measures].[Variance], [Measures].[Variance Percent]} ON COLUMNS,
NON EMPTY Hierarchize(Union({[Positions].[All Positions]}, [Positions].[All Positions].Children)) ON ROWS
from [Quadrant Analysis]

which produces this lovely set of arrows letting the user know how their individual variance value rates in terms of KPI thresholds.

200811111457

The secret of course, is the arrow= tag in the format string. Easy enough. “up” is a green up arrow. “down” is a little red arrow. “none” is no arrow.

Happy Visual Cue Indicator day to you all.

Open Source, Pentaho

Oracle ACE: In Absentia

September 26th, 2008

So… A few years back I spent a LOT of time with Oracle ETL and BI products. I learned them inside and out, gave some user conference presentations, wrote a bunch of blogs, even Alpha tested a version of Oracle Warehouse Builder. Then I found “Open Source BI” and I’ve been heading breakneck into the world of MySQL, Pentaho, … A choice I do NOT regret - my consultancy is busier than ever and I love the Open Source BI play.

However - I miss seeing some of the old Oracle peeps at Open World. This year, I even registered for my free ACE pass to OOW but didn’t make it because I started two new projects this week. What I realized this year, was that I’m WAY out of touch with what’s going on in the land of Big Red O. The words and products for BI whiz past me - they don’t even look anything like they did just a couple of years back.

I hope everyone had a good time at OOW this year! I don’t see a path back to the land of Oracle anytime soon for me. :(

Open Source, Oracle, Professional

Business Intelligence: Experience vs Sexy

July 24th, 2008

A couple of postings over the past few days that prompted me to put some digital pen to paper so to speak. The first was a post by L. Wayne Johnson who works for Pentaho who I had the pleasure to meet last week in Orlando entitled “Is it just sexy?” The second was by a Ted Cuzzillo over at datadoodle.com entitled “Tableau is the new Mac” Both share important perspectives that deserve some more light.

First, we have to start with a premise that leads you to see why there are two somewhat divergent paths that products/people/companies are taking. BI is now a commodity. The base technology components for doing BI (reports, dashboards, OLAP, ETL, scheduling, etc) is commodotized. Someone once told me that once Microsoft enters and nails a market, you know it’s been commodotized and based on the success of MSAS/DTS/etc you can tell that MSFT entered long ago and nailed it. So, if you don’t believe that the raw technology for turnings data into information is essentially commodotized then you should stop reading now. The rest will be useless to you.

What happens when software becomes a commodity? There’s usually a mid market but you start to see players emerge at two ends of a spectrum.

Commodity End (Windows, Open Office, linux, Crystal Reports):

  • Hit the good side of the features curve. Definitely stay on the good side of the 80/20 rule.
  • Focus on lots and lots of basic features. You’re trying to appeal to lots and lots of people. If you’re pipe isn’t 1000x bigger than the other market you are toast.
  • Provide a “reasonable” quality product. To use a car metaphor, you build an automatic transmission car with manual windows. The lever to open and close the window doesn’t usually fall off and if it does, you’ve already put 100,000 miles on the car.
  • Treat the user experience as one category in “Features.” Usability is something you build so that customers don’t choose the other guy over you - it’s not core to your business, you just have to provide enough for them to be successful and not hate your product.
  • Sell a LOT of software. Commodity End of a market is about HIGH VOLUME (you should sell at least one or two orders of magnitude more than the experience end) - however, people looking for “reasonable commodity” products are cheap. They want low prices so this also means your MARGINs are lower. Commodity selling is about HIGH VOLUME, LOW MARGIN business. (Caveat: not always true).

Experienced Based (Mac, iPhone, Crystal XCelcius):

  • The good side of the 80/20 rule still applies. Experience based doesn’t always mean 100% high end, every bell and whistle.
  • Focus on features that matter to the user doing a job. If a feature is needed to help a customer nail a part of their using your product it, add it and make it better than they expect. Lacking features isn’t a bad thing if you keep adding them - for instance the iPhone was LAME feature for feature initially (no GPS, battery was a pain, etc) but users were patient.
  • Provide a high quality product that is as much about using as doing. The experienced based product says that it’s not enough to have a product that does what you want, but it has to be something you ENJOY using.
  • User and Experience is KING. Usability is not something that is a feature to implement, it’s the thing that informs, prioritizes and determines what features are implemented.
  • Sell some software. In order to get the driving experience a user wants (BMW 700x series) they are willing to pay for it. It’s a higher margin business and there’s no secret that if someone is looking for something that both works, and they LOVE to use then it’s worth more to them. It’s a LOWER VOLUME, HIGHER MARGIN business. (Caveat: not always true - things are relative. iPod is higher margin but also high volume).

So… Let’s get back to the point on BI. I’ve built some sexy BI dashboards for customers that look great, including some recent ones based on the Open Flash Chart library. However, I come more from the Data Warehouse side of the house so more of my time is spent on ETL, incremental fact table loads, etc. I understand that you have to have a base of function/feature to have a fighting chance on the experience side.

Sexy isn’t “just sexy” if done right. When done right, Sexy is called “Great Experience.”

Experience is about creating something that people want to use. People are happier with a software product when they enjoy using it. For instance, Ted refers to Tableau as “a radically new product.” I’ve seen it and it’s a GREAT experience, with some GREAT visualization but there’s nothing REVOLUTIONARY about it except for the experience. It’s not in the cloud, it’s not scaling beyond the petabytes, it’s not even a web product (it’s a windows desktop APP). Not revolutionary, just GREAT to use.

Tableau is an up and comer for taking something commoditized (software to turn data into insight) and making it fun to use and leaving users with a desire for more. Kudos to Tableau.

What about on the commodity side - that’s where players like Pentaho come in. They’ve built something that meets a TON of needs for a TON of customers and does so at a VERY VERY compelling price (free on open source side, or subscription for companies). Recall, Pentaho is the software that I use day in and day out to help customers be successful - and they are consistently. Pentaho is earnestly improving their usability that matches up with the philosophy of Usability is a category of features. Sexy is just Sexy for the kind of business and market they are trying to build. They want to make things look nice to be usable and help people do their job well but they’re not going to spend man years on whizbang flash charts. The commodity end is a great business model - Amazon.com is pointed about their business model of “pursuing opportunities with high volume and low margins and succeeding on operational excellence.” I consider Pentaho a bit more revolutionary than Tableau - it’s 100% platform independent and the rate at which open source development clips IS REVOLUTIONARY.

Pentaho is an up and comer for taking something commoditized (software to turn data into insight) and making it easy to obtain, inexpensive to purchase, and feature rich. Kudos to Pentaho.

Both sides of the market are valid. There’s a Dell and an Apple. There’s BMW and Hyundai - both are equally important to the markets they serve and the same is true for BI as a market.

PS - I do agree with L. Wayne Johnson that there can be sexy that is “just sexy.” A whizbang flash dial behind questionable data is pretty lame, or an animation that adds nothing to the data (see this Flash pie chart for an example of a useless sexy animation) The point being that if you consider the “antee” for the BI game at “good data” then the experience/feature sets/approach is what separates the market.

General BI, Open Source, Pentaho, Technology Industry

Ordered Rows in Kettle

June 25th, 2008

There was a question posed the other day on the Pentaho forums about how to get Kettle to process “all the rows” at one step before beginning execution on the others. Sven suggested to use the “execute once for every row” as a solution which I think is probably overall, a cleaner way to accomplish a multistep process. However, it is possible to do this in Kettle now.

The solution is to add “Blocking Step”s in your transformation where you need the whole thing to have completed before continuing processing.

Consider the following example:

200806251534

The step “block1″ does not pass rows to Step2 until all rows have finished at Step1. This accomplishes the desired outcome of ensuring that all records have completed processing on step1 before step2 processes. The example transformation outputs to the debug log and it’s clear that they are output in the correct order.

2008/06/25 15:25:04 - step1.0 - Step1:1
2008/06/25 15:25:04 - step1.0 - Step1:2
2008/06/25 15:25:04 - step1.0 - Step1:3
2008/06/25 15:25:04 - step1.0 - Step1:4
2008/06/25 15:25:04 - step1.0 - Step1:5
...
2008/06/25 15:25:05 - step1.0 - Step1:499
2008/06/25 15:25:05 - step1.0 - Step1:500
...
2008/06/25 15:25:05 - step2.0 - Step2:1
2008/06/25 15:25:05 - step2.0 - Step2:2
2008/06/25 15:25:05 - step2.0 - Step2:3
2008/06/25 15:25:05 - step2.0 - Step2:4
2008/06/25 15:25:05 - step2.0 - Step2:5
...
2008/06/25 15:25:05 - step2.0 - Step2:499
2008/06/25 15:25:05 - step2.0 - Step2:500
...
2008/06/25 15:25:05 - step3.0 - Step3:1
2008/06/25 15:25:05 - step3.0 - Step3:2
2008/06/25 15:25:05 - step3.0 - Step3:3
2008/06/25 15:25:05 - step3.0 - Step3:4
2008/06/25 15:25:05 - step3.0 - Step3:5
2008/06/25 15:25:05 - step3.0 - Step3:6
2008/06/25 15:25:05 - step3.0 - Step3:7
2008/06/25 15:25:05 - step4.0 - Step4:1
2008/06/25 15:25:05 - step3.0 - Step3:8
2008/06/25 15:25:05 - step4.0 - Step4:2
2008/06/25 15:25:05 - step3.0 - Step3:9
2008/06/25 15:25:05 - step4.0 - Step4:3
2008/06/25 15:25:05 - step4.0 - Step4:4

Example here: ordered_rows_example.ktr

Data Integration (Kettle), How To, Open Source, Pentaho