Is this thing still on? Bueller… Bueller…

My last post was early 2013.  That’s more than FIVE years ago.  What, you may ask, has been keeping me away from technical musings here?  Two things.

  1. I haven’t felt like it.  I started blogging in 2004.  When I took a break, I had been blogging for almost 10 years.  That’s a LONG time.
  2. I’ve focused any additional time outside of my direct work on being a dad… Yeah, my daughter turned 5 earlier this year.  It’s been a busy 5 years.  😉

Now I’m getting the itch to publish some stuff.  I have been DOING a ton.  There’s some Machine Learning work, some ETL-as-a-service stuff, some IoT things, etc.  I am considering moving to Medium though…

At first, it’s A then B testing not A/B testing

Had a great conversation today with Aydin Ghajar, a guy who really knows his stuff when it comes to building viral loops for consumer services.  In viral systems, small differences in rates/conversions yield big differences in ultimate outcomes.  One of the things I’ve been grappling with, thinking about A/B testing in MVPs, is the complexity it adds early on.  After all, Lean is about doing things efficiently to learn what you need to match a customer/market need.

What happens when the basic work to install the ability to learn how well A or B (or C or D) performs is non-trivial?  A/B testing that isn’t trivial website button color or text takes some effort (days, but still).  My friend made an astute observation (paraphrased):

At first, while you’re still validating early assumptions, your A/B testing is NOT parallel A/B testing (50 visitors this week split into TWO 25-visitor buckets) but rather sequential: A is done one week, evaluated.  B is done the week after, evaluated.

Why?  First off, A/B testing is complicated and best deferred until a bit later in the MVP.  Why else?  At first, when you’re testing your early hypotheses, your volumes are small enough that splitting them decreases the accuracy of, and increases the time to determine, whether A or B is true.

In our example, if we get 50 users per week and we split them, we’ll get 25 users in A and 25 users in B per week.  Week one, we get 25 of each, and get an action or two in each.  The statistical relevance (confidence) of a 25-user sample is LOW; it’s probably not enough to validate or invalidate the effectiveness or value of A or B (whatever you’re testing).  So, next week you get another 25 in each.  At 50 users, your confidence increases and now you’re ready to make a real determination on whether A or B worked.
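To put rough numbers on that intuition, here’s a quick back-of-the-envelope sketch (the 20% conversion rate is hypothetical, and this uses a simple normal approximation rather than a proper significance test):

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """95% confidence interval half-width (normal approximation)
    for an observed conversion rate p measured over n users."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical 20% observed conversion rate
for n in (25, 50):
    print(f"n={n}: 20% +/- {ci_half_width(0.20, n):.1%}")
# → n=25: 20% +/- 15.7%
# → n=50: 20% +/- 11.1%
```

Even at 50 users the interval is wide; splitting one week’s 50 visitors into two 25-user buckets multiplies the noise in each bucket by roughly √2, which is exactly why waiting a second week (or running A and B sequentially at full volume) is attractive.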

What’s the problem?  When you’re early on, getting things figured out FAST is important.  While not right for everything, I’d much rather have tried A in one week (50 sample size), found the results, and moved on to B.  Same learning (faster in some cases) and more efficient (aka cheaper) initially.

So, I’m in perfect agreement with my friend.  A/B testing happens sequentially at first, then in parallel once practical (implementation/sample size).

Lean Startup Method improves my showers!

I’ve been doing a little bit of reading, and opted for a quick weekend in LA to give myself a crash course in the Lean Startup method.  Why?  I wanted to develop some of my own abilities and instincts to move faster to product/market fit.  In my review of the last 3 years starting an open source DB company, I realized I was WILDLY under-informed on what customers really wanted and needed.  I am determined that, whatever I do next, I will spend MUCH more time working on customer development and product/market fit than on feature and solution development.  The Lean Startup method lets you work through, in a rough-and-tumble way, QUICKLY disqualifying your current business model/solution as viable.

That’s a pretty stock answer/elevator pitch for the Lean Startup method.  What in the world does it have to do with my showers?

There are these people: inventors, entrepreneurs, creative types that get ideas ALL the time.  They get an idea, and they think about its place in the world.  They think about how they’d build this product.  They think about what it would look like.  How awesome it would be, etc.  Many, myself included, typically see problems/solutions all the time.  At the moment, I have no fewer than 15 ideas for new businesses that vary from food preparation, to easy re-ordering of groceries, to population/search analytics, to improving pace of play on golf courses, to utility trading, etc.

These ideas percolate; I think about them in the shower.  I think about them in the car.  They pick at me; I think about how great they could be.  I figure out a piece of the puzzle, put that into a mental cabinet, and then go about my day.  Days later, it pops back into my head, and I start working on another aspect of the idea.  I call this my own Entrepreneurial Porn.  It’s fluffy and stimulating, but it’s a lazy, time-consuming alternative to actually building a business.  I only stop thinking about an idea after a year or so has passed since I last thought about it, or I find some critical problem with the SOLUTION that means it won’t work.  This comes only after a significant amount of mental power/effort has been expended.

Here’s the real curse: I invent new ideas (into the queue) faster than they head out (solution won’t work).  I churn through and spend many of my waking hours (probably sleeping ones too) thinking about, and creatively dwelling on, ideas that will go nowhere.  Emotionally, I feel like I’ve wasted this time and feel guilty about it, and my internal frustration grows knowing I’m not building a great business because I’m spending too much time on Entrepreneurial Porn.

My hope now is that the Lean Startup method will give me the ability to banish these ideas from my shower time, driving time, and other moments, so that I can think about viable businesses and my loved ones, and be more present during the day.  I now have the ability to spend a few hours (or a day or two) figuring out why an idea won’t work; then I don’t have to spend the thoughtless and wasted hours mulling over the solution.

Case in point on the curse: while I write this blog from 38,000 ft, I just got a new idea inspired by observing a woman using her Windows Tablet.  One more idea to try and exorcise!  Entrepreneurial Devils be Damned!

DynamoBI is dead, long live LucidDB!

To our Partners, Employees, Customers, Friends, and Community:

It is my unfortunate duty to inform you that DynamoBI is ceasing commercial operations October 31, 2012.  We are immensely grateful for all the support that you have shown our company, in so many different ways, over the past 3 years, and we hope to make this shutdown as painless as possible for all involved.  We know that we are not the only people invested in LucidDB, so we wanted to explain our rationale for shutting down along with the implications for the entire LucidDB community (not just our customers).

We started DynamoBI 3 years ago when we saw our most favorite open source project, LucidDB, finding limited prospects for adoption without growth to full, commercial support, which many (most!) companies need before they can adopt open source software.  We had been actively working with LucidDB for a long while, and knew that it is a fantastic piece of database/analytic software; to say that it’s a gem and provides some amazing capabilities in an open source package is an understatement.

However, markets and businesses are not quite as simple as having a great open source project and community.  I may blog separately about the lessons learned from this startup (the entrepreneurial badge of honor #fail blog), but the community deserves to know that, for the most part, the failure to achieve success was more about the market and selling environment (and our lack of success there) than about any innate defects in LucidDB.

In short, we were not successful in the marketplace for two primary reasons:

1) In a crowded, loud market of more than 40 analytic data storage solutions, raw single-query speed remains the singular priority.  LucidDB was often a big improvement over MySQL/Oracle, but it was not as fast as our analytic peers.  All of our other very interesting and compelling features (versioning of data, EII-type connectors, pluggable/extensible systems) were often not even evaluated, as we were frequently eliminated from consideration based on raw query speed alone.  LucidDB performs as advertised (a great BI database, much faster than what you’re currently using), but that wound up not being enough.

2) Open source price points are compelling for customers, but they work only if you can build a high volume business.  It became clear earlier this year (even having built enough cash flow to pay full time staff, etc.!) that the size of our “funnel” was not large enough to support a high growth, interesting business.  We determined that X downloads yielded Y prospects that converted to Z customers at price A.  We experimented with price, offering, prospect development, etc.  We improved our conversion rates over time, but ultimately found that unless we could find some way to increase the mouth of the funnel by more than 100x, we wouldn’t have a growing business that would allow us to continue/further our investment in LucidDB.

There are other reasons as well, many of which are missteps or mistakes by me personally.  That could fill an entire other blog (and likely will at some point).

We’ve been working with our customers over the past few months to help them prepare for the future with us no longer providing the customer support.  We’ve been communicating this message to them, and now we’re bringing it to the greater community about our future participation in LucidDB.

DynamoBI will:
1) Host the git repositories and continue to provide a legal contribution framework so that the IP for the project remains clean for all.  The Apache license means that LucidDB remains free and accessible for anyone/everyone wishing to use it (or parts of it).
2) Contribute any “interesting” pieces of the amazing framework to projects that can use it.  In particular, we’re thrilled to see the Optiq project leveraging, as a starting point, some of the LucidDB components.
3) Host the forums, wiki, and issue tracking for the LucidDB community as we have been for the past few years.
4) Continue to participate as active users in the community; we are still fond of LucidDB and hope to see the community/project be successful.

However, DynamoBI will no longer:
1) Provide releases or builds.  We’ve shut down our continuous integration server and do not plan on making any release after 0.9.4.
2) Offer any commercial services for LucidDB (consulting, services, sponsored development, etc).
3) Provide active development on the core project, or ancillary projects.

Once again, thank you for your support over the past few years and we encourage you to continue to look at LucidDB, even though we were unable to make it a commercial success.  It has some very unique features that are a perfect fit for some use cases (Big Data access via BI tools, etc) that make it a great open source project.

Kind Regards,
Former CEO of DynamoBI Corporation

LucidDB has left Eigenbase moved to Apache License

This has been a long time in the making, but the LucidDB project is leaving the Eigenbase foundation to continue our development outside that organization’s IP sharing framework and governance.

Community members will notice (or have already):

  • We are no longer using Perforce (YAAY!) and are now doing our primary LucidDB, Farrago, Fennel, and relevant extensions/test/build development work at github:
  • The Wiki is now hosted at  We will, over time, remove references to Eigenbase in that project documentation/etc.
  • Issue tracking is now ALSO over at github, and we have migrated all issues (historical and outstanding) over to the github project.

Part of the impetus for leaving Eigenbase was our desire for a more inclusive license, to permit additional use/collaboration by other companies in the spirit of open source.  We initiated this process, in the good company of like-minded individuals, early last year.  Long story short, this push and the ensuing political battles cost Eigenbase the resignations of its two most critical participants: Julian Hyde and John V. Sichi.  I join them now, as I resigned from the Eigenbase Board March 26.

Today I’m announcing that DynamoBI has released the entirety of the codebase under the Apache Software License 2.0.  We welcome our community members’ ongoing contributions, and hope that companies looking to leverage such a great framework and technology take a look.  We welcome, wholeheartedly, your participation in the project under its new permissive license.

We continue to serve our existing customers with annual subscriptions to DynamoDB, our QA’ed and prepackaged commercial version of LucidDB.

Happy LucidDB-ing!

NoSQL Now 2011: Review of AdHoc Analytic Architectures

For those who weren’t able to attend the fantastic NoSQL Now Conference in San Jose last week, but are still interested in how people are doing Ad Hoc analytics on top of NoSQL data systems, here are the slides from my presentation:

[Slides: “NoSQL Now 2011: Review of AdHoc Architectures” (slideshare, ngoodman)]
We obviously continue to hear from our community that LucidDB is a great solution sitting in front of a Big Data/NoSQL system. Allowing easy SQL access (including super fast, analytic database cached views) is a big win for reducing load *AND* increasing usability of data in NoSQL systems.

Splunk is NoSQL-eee and queryable via SQL

Last week at the Splunk user conference in San Francisco, Paul Sanford from the Splunk SDK team demoed a solution we helped assemble showing easy SQL access to data in Splunk. It was very well received and we look forward to helping Splunk users access their data via SQL and commodity BI tools!

While you won’t find it in their marketing materials, Splunk is a battle-hardened, production-grade NoSQL system (at least for analytics).  They have a massively scalable, distributed compute engine (ie, Map-Reduce-eee) along with free-form, schema-less data processing.  So, while they don’t necessarily throw their hat into the ring for new NoSQL-type projects (and let’s be honest, if you’re not Open Source it’d be hard to get selected for a new project), their thousands of customers have been very happy with their ingestion, alerting, and reporting on IT data, and it is all very NoSQL-eee.

The SDK team has been working on opening up the underlying engine and framework so that additional developers can innovate and do some cool stuff on top of Splunk.  Splunk developers from California (some of whom worked at LucidEra prior) kick-started a project that gives LucidDB the ability to talk to Splunk, thereby enabling SQL access from commodity BI tools.  We’ve adopted the project, and built out some examples using Pentaho to show the power of SQL access to Splunk.


First off, our overall approach is the same as with our existing connectors (CouchDB, Hive/Hadoop, etc.):

  • We learn how to speak the remote language (in this case, Splunk search queries) generally. This means we can simply stream data across the wire in its entirety and then do all the SQL processing locally in LucidDB.
  • We enable some rewrite rules so that if the remote system (Splunk) knows how to do things (such as simple filtering in the WHERE clause, or GROUP BY stats) we’ll rewrite our query and push more of the work down to the remote system.
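As a toy illustration of that rewrite idea (purely conceptual; this is not LucidDB’s actual rule engine, and the function name is made up), a planner can split a SQL WHERE clause into the predicates the remote system understands and the ones to evaluate locally:

```python
# Toy illustration of filter pushdown (NOT LucidDB's real rewrite engine):
# translate the simple predicates we know Splunk search can handle into a
# Splunk search string, and keep the rest for local SQL post-processing.

def push_down(filters):
    """filters: list of (column, op, value) tuples from a SQL WHERE clause."""
    pushed, local = [], []
    for col, op, val in filters:
        if op == "=":                      # equality filters push down natively
            pushed.append(f"{col}={val}")
        else:                              # anything else is evaluated locally
            local.append((col, op, val))
    search = "search " + " ".join(pushed) if pushed else "search *"
    return search, local

search, local = push_down([("host", "=", "web01"), ("bytes", ">", "1024")])
print(search)   # → search host=web01
print(local)    # → [('bytes', '>', '1024')]
```

The more predicate shapes the rewrite rules recognize, the less raw data has to cross the wire.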

Once we’ve done that, we can let any BI tool (that can do things such as SQL, read catalogs/tables, use metadata, etc.) connect up and do cool things like drag-and-drop reports. Here are some examples created using Pentaho’s Enterprise Edition 4.0 BI Suite (which looks great, btw!):



These dashboards were created entirely within a web browser using drag and drop (via Pentaho’s modeling/report-building capabilities). Total time to build these dashboards was less than 45 minutes, including model definition and report building (caveat: I know Pentaho dashboards inside and out).

Splunk users can now access data in their Splunk system, including matching/mashing it with others simply and easily in everyday, inexpensive BI tools.

In fact, this project came about initially as a Splunk customer wanted to do some advanced visualization in Tableau. Using our experimental ODBC Connectivity the user was able to visualize their Splunk data in Tableau using some of their fantastic visualizations.

PDI Loading into LucidDB

By far, the most popular way for PDI users to load data into LucidDB is to use the PDI Streaming Loader. The streaming loader is a native PDI step that:

  • Enables high performance loading, directly over the network without the need for intermediate IO and shipping of data files.
  • Lets users choose more interesting (from a DW perspective) load types into tables. In particular, in addition to simple INSERTs it allows for MERGE (aka UPSERT) and also UPDATE. All done in the same bulk loader.
  • Enables the metadata for the load to be managed, scheduled, and run in PDI.


However, we’ve had some known issues. In fact, until PDI 4.2 GA and LucidDB 0.9.4 GA it’s pretty problematic unless you run through the process of patching LucidDB outlined on this page: Known Issues.

In some ways, we have to admit, that we released this piece of software too soon. Early and often comes with some risk, and many have felt the pain of some of the issues that have been discovered with the streaming loader.

In some ways, we’ve built an unnatural approach to loading for PDI: PDI wants to PUSH data into a database. LucidDB wants to PULL data from remote sources, with its integrated ELT and DML-based approach (with connectors to databases, Salesforce, etc.).   Our streaming loader “fakes” a pull data source, and allows PDI to “push” into it.

There are multiple threads involved, and when exceptions happen users have received cruddy error messages such as “Broken Pipe” that are unhelpful at best, frustrating at worst. Most of these contortions will have sorted themselves out by the time PDI 4.2 GA and LucidDB 0.9.4 GA are released, and the streaming loader should then be working A-OK. Some users would just as soon avoid the patch instructions above, and have posed the question: in a general sense, if not the streaming loader, how would I load data into LucidDB?

Again, LucidDB likes to “pull” data from remote sources. One of those is CSV files. Here’s a nice, easy, quick (30k r/s on my MacBook) method to load a million rows using PDI and LucidDB:


This transformation outputs 1 million rows to a Text File, waits for that to complete, then proceeds to load that data into a new table in LucidDB. Step by step, the LucidDB statements:

-- Point LucidDB at the directory holding the just-generated flat file
-- (the directory path is an example; use wherever PDI wrote DATA.txt)
create server csv_file_server foreign data wrapper sys_file_wrapper
options (directory '/tmp/pdi_output/', file_extension 'txt', with_header 'yes');
-- LucidDB has some defaults, and can "guess" the datatypes by scanning the file,
-- so the foreign table for the data file ("DATA.txt") output by PDI needs no column list
create foreign table stg."DATA" server csv_file_server options (filename 'DATA');
-- Create a staging table (columns matching the file), then load from the flat file
insert into stg."STAGING" select * from stg."DATA";

We hope to have the streaming loader ready to go in 0.9.4 (LucidDB) and 4.2 (PDI). Until then, consider this easy, straightforward method of loading data that’s high performance, proven, and stable for loading data from PDI into LucidDB.

Example file: csv_luciddb_load.ktr

Pushdown Query access to Hive/Hadoop data

Just in time for the Hadoop Summit, we’ve updated some pieces of LucidDB to be able to do “more” with the Hive driver. We’ve had the ability to connect and query data inside of Hive for about a year already. The initial implementation allowed people to:

  • Use Hive for its massive scalability and distributed data processing capabilities.
    Hive is great at processing huge amounts of data. It scales to hundreds of servers, and has a bunch of fantastic features for structured and semi-structured data processing, fault tolerance, etc. Hive is a great way to do the heavy lifting of sorting through petabytes of data to arrive at some interesting, pre-aggregated datasets.
  • Cache the output of Hive views into LucidDB.
    Now, when we’re talking about taking the output of Hive views into LucidDB, we’re not talking about SMALL datasets (10k rows); we’re talking about 50, or 100, or 500 million rows. Some might think that number is small (by Hive standards it often is) and others might think that’s big (our entire DW is only 200 million rows). Either way, LucidDB has provided the ability to draw in data from Hive via easy MERGE/INSERT statements.

You can see some background on this integration that has been functional since August 2010 on Eigenpedia:

Also, a video I recorded last year showing the basic integration: (THIS IS NOT NEW!!!!).

Why the blog now? We’ve done a couple of things over the past while.

  • We’ve done some work on LucidDB (yet to be committed; it will be a POST-0.9.4 commit) that allows the use of Hive’s, well, somewhat unique driver. Hive’s driver has a bunch of quirks in terms of metadata, etc. that we’re now recognizing and handling properly over in LucidDB.
  • We’ve updated to the Hive 0.7.0 release. We’re now ready to go with the latest and greatest Hive features.
  • We’ve enabled some pushdowns to work to allow for easier day to day loading of LucidDB tables from Hive, along with a limited workload of Ad Hoc SQL access.

Our vision for Big Data dictates the need for:

  • Live, real time, per-query access to the Big Data system that is useful and practical (ie, filters, etc).
    This means you need to be able to give the user access to the live data, via a simple parameter or simply by hitting a different schema or table.
  • Easy, full ANSI SQL access to high performance, low latency, aggregated data.
    Dashboards need results that come back in seconds, not minutes. LucidDB and the data cached there provide a great “front end” for easily doing real BI work on top of data that sits inside Hive.

We’ve updated our connectors to allow some filtering/projection pushdowns to work with Hive.

Here’s a simple example. We have a report or dashboard which is looking for only a subset of data in Hive. We want to allow the filtering of data to occur and for Hive to receive the filtering from our OLAP/Dashboard.

By default LucidDB will read the entire table and do all SQL processing over in our own engine.


However, pulling over the entire table is really not going to work well for us. This would really be the worst of both worlds; you’d be better off just querying Hive directly. Luckily, however, we’ve enabled some operations to be pushed down to Hive.


Notice that the condition IN( values ) is being pushed down to the remote Hive server.

Let’s try something a bit more complex!


Currently, we’re able to push down most filters and projections.

Now let’s take the use case where we’re trying to ONLY UPDATE records that have been updated since the last time we checked (ID > 97). More likely the key we’d use for this pushdown filter would be a date, but you can simply use your imagination.

Consider the following SQL:

merge into hive_test."local_POKES" l
using (select * from hive_test.pokes where "foo" > 97) p
on l."foo" = p."foo"
when matched then update set "bar" = p."bar"
when not matched then insert ("foo", "bar") values (p."foo", p."bar");

This SQL is a typical “incremental” load from a remote system. Syntactically a little dense, but it’s actually a VERY high performance method to load directly into LucidDB, often entirely eliminating the need to draw the data through an intermediate server and process (ETL tool).
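For readers less familiar with MERGE, the semantics boil down to a keyed upsert: update rows whose key matches, insert the rest. A toy sketch in plain Python (nothing LucidDB-specific here, just the logic):

```python
def upsert(local, remote_rows, key="foo", value="bar"):
    """MERGE semantics: update matched rows by key, insert the rest."""
    for row in remote_rows:  # only rows with "foo" > 97 arrive, thanks to pushdown
        local[row[key]] = row[value]
    return local

local = {1: "a", 98: "old"}
upsert(local, [{"foo": 98, "bar": "new"}, {"foo": 99, "bar": "b"}])
print(local)  # → {1: 'a', 98: 'new', 99: 'b'}
```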


Our enhancements allow the Hive portion to be pushed down. Hive will ONLY return values greater than 97 and we’ll simply intelligently keep any changed records “up to date” in LucidDB for reporting.

Many of these changes will be in a patched version of LucidDB; we’ll make this patched release available to any customers who want these optimizations available, immediately for use with Hive. Let us know what you think by joining this conversation at the LucidDB forums: Hive Connector Pushdown

In a subsequent blog we’ll cover how to now match up data coming from Hive (or CouchDB) with data in other systems for reporting.

SQL access to CouchDB views : Easy Reporting

Following up on my previous blog about enabling SQL access to CouchDB views, I thought I’d share what I think the single biggest advantage is: the ability to connect run-of-the-mill, commodity BI tools to your big data system.

While the video below doesn’t show a PRPT it does show Pentaho doing Ad Hoc, drag and drop reporting on top of CouchDB with LucidDB in the middle, providing the connectivity and FULL SQL access to CouchDB. Once again, the overview:


BI tools are commoditized; consider all the great alternatives available inexpensively (either open source for free, open core, or even simply proprietary). Regardless of which solution you choose, these tools have fantastic, easy-to-use capabilities that make it simple for business users to build their own reports. After all, shouldn’t your developers be extending/creating new applications instead of fiddling with which filters your analysts/executives want to see on their dashboard?

Driving the developer out (as much as possible) is one of the best reasons to try and enable your cool, CouchDB views via SQL.

Here I’ll demonstrate, once we’ve connected LucidDB to our CouchDB view, how a BI Tool can:

  • Easily see the data, and understand its datatypes. Metadata is well understood between SQL databases and BI tools.
  • We can easily use a familiar query language, SQL, that allows for aggregation, filtering, and limiting. This gives a huge swath of BI tools the ability to talk to CouchDB.
  • We translate the SQL we receive into optimized* RESTful HTTP view requests.
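To make that translation concrete, here’s a toy sketch of what a SQL query shape could map onto in CouchDB’s view API (illustrative only; this is not the connector’s actual code, and the helper name is made up):

```python
import json
from urllib.parse import urlencode

def view_request(db, design, view, key=None, group_level=None):
    """Toy sketch (NOT the actual connector): build the RESTful
    CouchDB view request that a SQL query could translate into."""
    params = {}
    if key is not None:
        params["key"] = json.dumps(key)      # WHERE key = 'XYZ' -> ?key="XYZ"
    if group_level is not None:
        params["group_level"] = group_level  # GROUP BY first N key parts
    url = f"/{db}/_design/{design}/_view/{view}"
    qs = urlencode(params)
    return url + ("?" + qs if qs else "")

print(view_request("sales", "reports", "by_region", key="west", group_level=1))
# → /sales/_design/reports/_view/by_region?key=%22west%22&group_level=1
```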

Per a reader suggestion, here’s a video showing the solution, as opposed to the screenshots (PS – let us know what you think about the CouchDB SQL access, and also about the video vs. screenshot approach).

It picks up right after the previous section. Once we have that CouchDB view in LucidDB, then Pentaho (or other BI tools) can connect, access it, and do ad hoc reporting like they always have. As a certified database for Pentaho, you can be quite comfortable that Pentaho will work very, very well with LucidDB.

PENTAHO does not even KNOW it’s talking to CouchDB -> It has NO idea; Pentaho thinks it’s a database just like any other

Without further delay:

*optimized = we have a few optimizations available to us that we’ve not yet put into the connector. For instance, the ability to push down a filter to a particular key (where key = XYZ), or group_level=*. This will come over time as we enhance the connector. For now, we’re doing very little in terms of pushing SQL filters/aggregations down into the HTTP view request. However, your view itself is almost CERTAINLY aggregated and doing this anyhow.

We’re keen on discussing with CouchDB/Hive/other Big Data users about their Ad Hoc and BI needs; please visit the forum thread about the connector.