Category Archives: Grid/Distributed Computing

Pushdown Query access to Hive/Hadoop data

Just in time for the Hadoop Summit, we’ve updated some pieces of LucidDB to be able to do “more” with the Hive driver. We’ve had the ability to connect and query data inside of Hive for about a year already. The initial implementation allowed people to:

  • Use Hive for it’s massive scalability, distributed data processing capabilities.
    Hive is great at processing huge amounts of data. It’s scales to hundreds of servers, and has a bunch of fantastic features for structured and semi structured data processing, fault tolerance, etc. Hive is a great way to do heavy lifting sorting through petabytes of data to arrive at some interesting, pre-aggregated datasets.
  • Cache the output of Hive views into LucidDB.
    Now when we’re talking about taking the output of Hive views into LucidDB, we’re not talking about SMALL datasets (10k rows) we’re talking about 50 or 100, or 500 million rows. Some might think that number is small (by Hive standards it often is) and others might think that’s big (our entire DW is only 200 million rows). However, LucidDB has provided the ability to draw in data from Hive via easy MERGE/INSERT statements.

You can see some background on this integration that has been functional since August 2010 on Eigenpedia: http://pub.eigenbase.org/wiki/LucidDbHive

Also, a video I recorded last year showing the basic integration: (THIS IS NOT NEW!!!!).

Why the blog now? We’ve done a couple of things over the past while.

  • We’ve done some work on LucidDB (yet to be committed and will be a POST 0.9.4 commit) that allows the use of Hives, well, somewhat unique driver. Hive’s driver has a bunch of quirks in terms of metadata, etc that we’re now recognizing and using properly over in LucidDB.
  • We’ve updated to the 0.7.0 release. We’re now ready to go with the latest and great Hive features.
  • We’ve enabled some pushdowns to work to allow for easier day to day loading of LucidDB tables from Hive, along with a limited workload of Ad Hoc SQL access.

Our vision for Big Data dictates the need for:

  • Live, real time, per query access to the Big Data system that is useful and practical (ie, filters, etc).
    This means that you need to be able to allow the user, via simple Parameter or simply by hitting a different schema or table access to the live data.
    201106242121.jpg  
  • Easy, full ANSI SQL access to high performance, low latency, aggregated data.
    Dashboards need results that come back in seconds, not minutes. LucidDB and the data cached there provide a great “front end” for easily doing real BI work on top of data that sits inside Hive.

We’ve updated our connectors to allow some filtering/projection pushdowns to work with Hive.

Here’s a simple example. We have a report or dashboard which is looking for only a subset of data in Hive. We want to allow the filtering of data to occur and for Hive to receive the filtering from our OLAP/Dashboard.

By default LucidDB will read the entire table and do all SQL functions over in our

201106241757.jpg

However, pulling over the entire table is really not going to work well for us. This would really be the worst of both worlds; you’d be better off just querying Hive directly. However, luckily we’ve enabled some pushdowns be pushed down to Hive.

201106241800.jpg

Notice that the condition IN( values ) is being pushed down to the remote Hive server.

Let’s try something a bit more complex!

201106241803.jpg

Currently, we’re able to push down most filters and projections.

Let’s take now take the use case where we’re trying to ONLY UPDATE records that have been updated since the last time we checked (ID > 97). More likely the key that we’d use to do this push down filter would be a date, but you can simply use your imagination.

Consider the following SQL:

merge into hive_test.”local_POKES” l
using (select * from hive_test.pokes where “foo” > 97)
ON l.”foo” = p.”foo”
when matched then update set “bar” = p.”bar”
when not matched then insert (“foo”, “bar”) values (p.”foo”, p.”bar”);

This SQL is a typical “incremental” load from a remote system. Syntactically a little dense, but it’s actually a VERY high performance method to load directly into LucidDB often eliminating the need entirely to draw the data through an intermediate server and process (ETL Tool).

201106241815.jpg

Our enhancements allow the Hive portion to be pushed down. Hive will ONLY return values greater than 97 and we’ll simply intelligently keep any changed records “up to date” in LucidDB for reporting.

Many of these changes will be in a patched version of LucidDB; we’ll make this patched release available to any customers who want these optimizations available, immediately for use with Hive. Let us know what you think by joining this conversation at the LucidDB forums: Hive Connector Pushdown

In a subsequent blog we’ll cover how to now match up data coming from Hive (or CouchDB) with data in other systems for reporting.

PDI Scale Out Whitepaper

I’ve worked with several customers over the past year helping them scale out their data processing using Pentaho Data Integration. These customers have some big challenges – one customer was expecting 1 billion rows / day to be processed on their ETL environment. Some of these customers were rolling their own solutions; others had very expensive proprietary solutions (Ab Initio I’m pretty sure however they couldn’t say since Ab Initio contracts are bizarre). One thing was common: they all had billions of records, a batch window that remained the same, and software costs that were out of control.

None of these customer specifics are public; they likely won’t be which is difficult for Bayon / Pentaho because sharing these top level metrics would be helpful for anyone using or evaluating PDI. Key questions when evaluating a scale out ETL tool: Does it scale with more nodes? Does it scale with more data?

I figured it was time to share some of my research, and findings on how PDI scales out and this takes the form of a whitepaper. Bayon is please to present this free whitepaper, Pentaho Data Integration : Scaling Out Large Data Volume Processing in the Cloud or on Premise. In the paper we cover a wide range of topics, including results from running transformations with up to 40 nodes and 1.8 billion rows.

Another interesting set of findings in the paper also relates to a very pragmatic approach in my research – I don’t have a spare 200k to simply buy 40 servers to run these tests. I have been using EC2 for quite a while now, and figured it was the perfect environment to see how PDI could scale on the cheapest of cheap servers ($0.10 / hour). Some other interesting metrics, relating to Cloud ETL is the top level benchmark of a utility compute cost of ETL processing of 6 USD per Billion Rows processed with zero long term infrastructure commitments.

Matt Casters, myself, and Lance Walter will also be presenting a free online webinar to go over the top level results, and have a discussion on large data volume processing in the cloud:

High Performance ETL using Cloud- and Cluster-based Deployment
Tuesday, May 26, 2009 2:00 pm
Eastern Daylight Time (GMT -04:00, New York)

If you’re interested in processing lots of data with PDI, or wanting to deploy PDI to the cloud, please register for the webinar or contact me.

First 100 Million Rows done in the "cloud"

My good friend, Matt Casters, posted his results from what we believe to be the first 100 Million Rows of data processed by an ETL tool in the new cloud computing paradigm.  Matt Casters ran a simple 100 Million rows through Kettle on Amazon EC2.

I should really do a write up or review of EC2.  I’m LOVIN’ it and others I’ve introduced to it are LOVIN’ it too!  I just need some spare time (ha ha ha) to write it up.

Jini, the silent coming of age

I have always been fascinated by Jini. I’ve attended two of the Jini community meetings and kicked the tires on many of the research projects at jini.org. In many ways it was a technology ahead of it’s time and since it didn’t make a huge SPLASH during the dot com boom, it hasn’t been adopted en masse.
NOTE: many of these links may require registration for the jini.org community and acceptance of the Sun SCSL (which is also a deterrent to the growth of this wonderful technology)

There are many community members much more familiar than I am on the state of Jini adoption. However I do continue to hope for something to happen with regards to it’s uptake. It still seems to be very much on the fringe. Surprising, there are fortune 100 companies using Jini.

There are more airline reservations made on Jini based systems (orbitz, aa.com, nwa.com) than any other electronic system (according to some information from Orbitz). They even won the

Duke’s Choice Awards — Orbitz has been selected as a winner of the 2004 Duke’s Choice Award, recognizing the “best of the best” from among all the cool projects going on in the world of Java technology. Orbitz team members are presenting TS-2614 at JavaOne. See why they won this award.

It’s a good technology that didn’t originally come with the whizbang set of installation wizards that the current frenzy of the dot com era required. It originally required the skill and aptitude of distributed computing engineers to recognize it’s benefits which were firmly placed on the fringe. Some of the projects that are being built on top of Jini offer some great additions to problems that Jini has a competitive advantage in solving.

My personal interest, and if I ever have one of those things that consultants refer to as “Extended Research Periods” would be of how it could address some of the data warehouseing issues I face on a day to day basis. As someone knowledgable with the domain knowledge of a problem (OLAP, huge fact tables, distributed query processing) could I use the current Jini technology and enhancements (such as rio, computefarm, etc) to build a wonderful distributed BI infrastructure? :) Stay tuned… perhaps I’ll have one of those periods of time coming up!

Distributed Database Systems

dds_coverI am currently making my way through the book, “Principles of Distributed Database Systems.” It reads like an academic course textbook, as I imagine was the authors intent.

I find it fascinating… It is also a bit challenging to try and remember what it’s like to be learning new notations and abstracted academic concepts. My day to day is so grounded in building customer solutions (very practical, with good applied technique and concepts) that I have to be deliberate to keep the mind sharp.

One thing that I’m particularly enjoying about this book is that I am seeing some of the concepts that I use my role as a BI Consultant from their starting points. I’m accustomed to interpreting Oracle plans, statistics, etc and now I’m able to relate that to the abstract concepts they represent.

I have to admit though, my interest does have a particular project in mind… I’m always wanting to build something that is more clever, and better than what’s out there now… This book might help me solidify some of those thoughts and add direction to my company R&D focus.