OWB10g Paris : Data Profiling

The new Data Profiling features of the upcoming OWB release will knock your socks off. Other new features are useful and genuinely improve the product; Data Profiling, though, opens up a whole new world for OWB because of its wide applicability outside of Oracle Business Intelligence. In other words, this feature is not just for OWB developers; it could merit firing up OWB purely to use this one feature. OWB developers might be a bit disappointed with how OWB manages corrections based on the output of Data Profiling, but that shouldn’t dissuade anyone from using the feature.

Most BI and Data Warehousing professionals fancy themselves experts on their organizations’ data: they believe they understand its structures and volumes and know how it lives and breathes. In other words, if you are curious which data in your enterprise is least clean, consult your DW operations staff and they’ll provide anecdotal tales of why the POS system is a POS (different acronym) and how they “often get erroneous transactions with special adjustment code 99.” What is typical, and ironic given the DW mandate of providing insight and information, is that the DW team cannot quantify and measure the quality of its data. What kinds of questions could you use to drive data quality initiatives in your enterprise? Knowing the cost of scrubbing and manually restating data (Sarbanes-Oxley consultants cringe here), and the overall cost of decisions made on delayed or improper data, wouldn’t you want to know where to start working?

Data Profiling addresses many needs by providing a set of features to examine, profile, and measure the quality of data. It provides a set of screens that let you point the application at a data source and immediately get valuable information about your data. The Data Profiling features of OWB fall into three broad categories:

  • Attribute Analysis : This type of profiling revolves around the content and structure of particular attributes in your data sets. It covers numerical precision, distinct values, outlier identification, min/max/avg, declared types versus actual types, and pattern typing. (A hand-rolled sketch of a couple of these checks appears after this list.)
  • Referential Analysis : Understanding how data relates to other data is a very important component of data profiling. They wouldn’t call it a relational database if the relationships amongst data weren’t among the fundamental and most important concepts of data organization. Identifying orphans, joins, childless entries, and redundancy all falls under the umbrella of Referential Analysis.
  • Functional Analysis : Beyond pure join relationships, there are also ways of surmising how data functionally relates to other data. One can infer determinants and dependencies between attributes, and leverage this analysis to identify normalization candidates (attributes that might be prime candidates for factoring out into a separate table).
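
These checks are all point-and-click in OWB, but to make them concrete, here is roughly what two of them amount to in plain SQL. The table and column names (ORDER_STATUS_HISTORY, ORDERS, STATUS_CODE, ORDER_ID) are hypothetical stand-ins, not anything OWB generates.

-- Attribute analysis: basic statistics for one column.
select count(*)                      total_rows,
       count(distinct status_code)   distinct_values,
       count(*) - count(status_code) null_values,
       min(status_code)              min_value,
       max(status_code)              max_value
from   order_status_history;

-- Referential analysis: child rows whose parent order is missing (orphans).
select count(*) orphan_rows
from   order_status_history h
where  not exists (select 1 from orders o where o.order_id = h.order_id);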

While it might not be as simple as a Fax Wizard, the Data Profiling feature of OWB requires a sort of linear progression, especially early in the process. This makes sense: you have to connect to your data, identify which data is of interest, retrieve and analyze it, explore and understand it, and optionally deploy code to monitor and correct it.

I’ve included a few screenshots, which speak quite clearly to the types of analysis that OWB can perform on your data. It should be noted that, since Oracle retrieves and analyzes the data in a temporary workspace, one can perform this analysis on data contained in nearly any supported heterogeneous connection, including plain old flat files.


Figure 1. Screenshot of some of the aggregation statistics available. Note the ability to actually see the detail rows here


Figure 2. If you don’t feel like consuming the analysis in numerical form OWB provides a way to visualize characteristics of your data


Figure 3. Domain stats are very useful for determining status codes, valid product categories, etc. It also shows details and percentages

Probably the feature Oracle hopes you’ll use most, and the reason Data Profiling is included in OWB rather than shipped as a standalone product, is OWB’s ability to monitor and make corrections to data based on the work one does in the Data Profiling section of the product. OWB endeavors to let you quickly and programmatically turn the results of an analysis into working monitors and data cleansing routines. It’s a noble start at easy, rule-based, automated ETL generation and data cleansing, but it has some difficulties in ongoing management. See Note 1 below.


Figure 4. Based on a rule you’ve defined as part of your profiling, launch the Create Correction wizard


Figure 5. One can define several actions that are implemented to “correct” the data. There are some excellent capabilities for the correction action (Match Merge, Remove, etc)

One can create rules and corrections from inside the Data Profiler that can be deployed to correct and audit your data. Consider a real-life example of a data source that contains order information, including order status history records. An order contains information about the customer, order amount, etc. Order history records contain information about what happened to the order and when (received by warehouse, shipped, etc.). You’ve analyzed your current data set using the Data Profiler and found that order history records arrive with two-letter status codes, but occasionally, when orders are forced to a different status, an operator uses mixed case (“Sh” instead of “SH”). You’ve got 100k order statuses conforming to your dictionary of status codes, and a couple hundred that are off by one lowercase character. Your company is entitled to a 10k discount from your fulfillment partner every month the error rate against the agreed-upon data formats exceeds 0.5%.
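
OWB measures this for you, but just to make the arithmetic concrete, here is a plain-SQL sketch of the audit. The ORDER_STATUS_HISTORY and STATUS_CODES tables are hypothetical stand-ins; a monthly error_pct above 0.5 is what triggers the discount clause.

-- Count rows whose status code has no exact match in the agreed-upon
-- (uppercase) dictionary; mixed-case codes fail the exact match.
select count(*)                                        total_rows,
       sum(case when s.code is null then 1 else 0 end) bad_rows,
       round(100 * sum(case when s.code is null then 1 else 0 end)
                 / count(*), 3)                        error_pct
from   order_status_history h
left   join status_codes s on s.code = h.status_code;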

How can you leverage the new OWB feature to audit the error rates of the received data and to correct the simple case of mixed-case status codes?

  1. Connect to and identify the data of interest. Depending on the arriving format of the data, you’ll need to decide how to get it imported into the land of OWB and the Data Profiler. If it’s in a SQL Server or Oracle database you can enter some connection information, connect directly, and browse the data dictionary. If it’s in a flat file, you’ll need to sample and describe the data file.
  2. Run the profiler. Review the information you’re seeing about the particular attribute(s) in question. In our example we’re interested in the STATUS CODE column of the order history records. The profiler will suggest a domain (the valid uppercase status codes) and then indicate how many records (and what percentage) do not conform to it. You can drill into these records (a VERY USEFUL feature) to see that nearly all are still valid two-letter codes, but in mixed case.
  3. Derive a Rule based on the domain of the records. You select the attribute, and use the Derive Rule wizard to create a rule that says the STATUS CODE must be one of the valid upper case codes.
  4. Create a correction based on the rule. You’ve now created a rule that specifies valid values for that column. How does one correct these records so that they can be processed along with their compliant peers? Using the Create Correction wizard you can specify how to react to the rule violation. In this case one would use some of the provided logic (the match-merge capabilities of OWB) to identify the mixed-case codes and correct them by applying uppercase-conversion logic; a plain-SQL sketch of the equivalent correction appears after this list. This will create a mapping and the other requisite structures for executing the logic. Note: The fact that the corrections are themselves OWB objects is a blessing and a curse. More on that toward the end of this article.
  5. Add the mapping to your process flows. This will execute at whatever point you deem necessary, where the data will be evaluated and possibly corrected.
  6. Schedule an auditing process. Outside of the correction execution, you might just want to evaluate the state of the data so that you have a record of data received, its error rates, why records failed, etc. You can run an audit ad hoc, on a schedule, or within a process flow. Naturally, many shops will want to retrieve data from their source systems, perform the audit as part of the process flow, and then proceed with any correction and further warehouse processing.
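
To be clear, OWB generates its own mapping objects for the correction; the following is only a sketch of the equivalent set-based logic, reusing the hypothetical ORDER_STATUS_HISTORY and STATUS_CODES tables from above.

-- Uppercase any status code that is valid except for its case.
update order_status_history h
set    h.status_code = upper(h.status_code)
where  h.status_code <> upper(h.status_code)
and    upper(h.status_code) in (select code from status_codes);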

Data Profiling is a very worthwhile feature in the upcoming release of OWB. It is certainly useful to a BI consultant who needs to quickly learn and understand a customer’s data. I hope readers find it as useful as I have.

This blog is part of the OWB Paris Early Review series which reviews and comments on several new Paris features.

NOTE 1: Corrections are generated in a one-time process that creates all sorts of OWB objects (mappings, tables, etc.) to accomplish a correction. While this is handy, I’m guessing the ongoing management will be difficult. OWB creates one of its new “nested” mappings to implement the cleansing rule, and the OWB developer then includes this pluggable mapping in the ETL (you can run it standalone if you so choose). It’s a one-time generation and you can’t manage the rule afterward. If you change your correction (to include a new particular code or type, say) you have to regenerate the correction and possibly re-add it to your ETL.
NOTE 2: Thanks to Paul and Nikolai for their contribution to this blog.

OWB performance, start here

There are literally millions of bits of data and pages in books on how to tune Oracle and PL/SQL. Most important to BI professionals is the need to beat the clock and have their data ready by start of business (8 a.m.) at a minimum. Of course, if you’re in a right-time environment you have even more stringent requirements. It’s likely your data processing takes a while, and OWB can give you some great information on where it’s spending its time. This is useful if you’re trying to pick a few mappings to “tune” and get the most bang for the buck.

Connect to the OWB_RUNTIME schema, or run the query with SELECT access from another user such as OWB_USER.

select
OBJECT_NAME name,
trunc(min(elapse_time)) min,
trunc(max(elapse_time)) max,
trunc(sum(elapse_time)) sum,
trunc(avg(elapse_time)) avg,
count(elapse_time) count
from all_rt_audit_executions
where 1=1
AND task_type = 'PLSQL'
AND created_on >= to_date('12/10/2004', 'MM/DD/YYYY')
group by OBJECT_NAME
order by avg desc;

Expressed in seconds, this yields the following results (map names changed to protect the innocent):

NAME	MIN	MAX	SUM	AVG	COUNT
map1	0	31988	165372	20671	8
map2	5494	68905	135111	19301	7
map3	1672	3509	20542	2567	8
map4	316	3511	14502	1812	8
map5	1018	2170	13089	1636	8
map6	436	1353	6784	848	8
map7	478	1272	6476	809	8
map8	309	2243	5351	668	8
 ..................

You’ll see this identifies several candidates for tuning. You also have an idea of what kind of performance gain you can hope for. If you improve the average time of map1 by 10% you’ll save yourself approximately 20671 * 0.1 / 60, or roughly 34 minutes, per execution. This is not hard and fast, seeing as some tasks run in parallel (if map4 is always waiting for map3 in a process flow, tuning map4 alone won’t cut down your total time).

50K SQL Statements for 100 rows of data

Unlike a great many Oracle professionals, I started my career building software using OO methodologies and Java. I approach the Oracle world with an application developer’s paradigm, and I’ve had to unlearn certain OO precepts. Mostly I’ve had to concede that in practical development of technology solutions (greatest benefit, least cost, highest ROI) the persistence layer cannot be exclusively a big dumb chalkboard. Perhaps some of my Java colleagues will disagree, but there are significant cost, implementation, and maintenance advantages to leveraging the power of a full-featured RDBMS.

As I was reading Mark Rittman’s article on his performance tuning excursion, I was reminded of a SWAT task I was given while working with the DBA group at an online retailer. The data access patterns of OO applications and their object-relational mapping components can sometimes be downright counterproductive to building a sound technology solution. I was working with an early implementation of this kind and witnessed a seldom-used feature nearly bring the entire site to its knees. The page received a brand name and was meant to display the products and their respective primary categories. The result sets would range from 1 to about 100 products. I witnessed behavior similar to what Mark describes, in that the persistence layer was querying thousands of times for small atomic units in order to build the composite objects on the application server. I’ll leave it to the reader to delve into Mark’s thorough explanation.
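
For contrast, here is the sort of single set-based statement that replaces those thousands of per-object lookups. The schema (PRODUCTS, CATEGORIES, and their columns) is a hypothetical stand-in, but the point stands: one round trip instead of one query per product plus one per category.

-- Fetch all products for a brand with their primary categories in one shot.
select p.product_id,
       p.product_name,
       c.category_name
from   products p
join   categories c on c.category_id = p.primary_category_id
where  p.brand_name = :brand_name
order  by p.product_name;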

Consider using the extended SQL trace instructions from Mark, or consider p6spy if you don’t have that kind of access to Oracle (and it’s a Java application, of course).
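
For reference, the classic way to switch on extended SQL trace for your own session is event 10046; level 12 captures both wait events and bind values.

alter session set timed_statistics = true;
alter session set events '10046 trace name context forever, level 12';

-- ... run the suspect workload, then:

alter session set events '10046 trace name context off';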

OWB10g Paris : Experts

In my opinion the name does not adequately capture the functionality, but I’m not sure I can suggest anything better. The Expert feature is a combination of macro language, best-practices scripting, OWB extension SDK, and labor saver. While it might be tough to name a feature that does so much, it is not tough to see the direct applicability of Experts for OWB customers.

OWB Experts is a new feature that allows for building guided experiences within OWB. Experts allow one to expose limited pieces of OWB functionality in a guided manner. Many programs have this type of functionality, with Microsoft Office being famous for its effective use of Fax Wizards, Mail Merge Wizards, etc. NOTE: The OWB product team assures me they have no plans to add a dancing paper clip to OWB, ever, not in a million years.

Why would Oracle provide Experts functionality that throttles the product back and provides views of it that make it appear to do less? While OWB is an immensely powerful tool, its learning curve is not insignificant, and I’ve witnessed many competent engineers and DBAs overwhelmed by the complexity of the product. Beyond that, as part of a modest Developer Suite seat, it’s an inexpensive way to involve many hands throughout an enterprise and make light work.

There are two things you can do with Experts, and there are a variety of options for how to do both: you can design experts, and you can run them. Designing an expert is typically done by power OWB users: your data warehouse architects, ETL engineers, and DBAs. Experts can be run by anyone, from domain experts who happen to be novices in OWB to ETL engineers adhering to certain best practices. Obviously experts have to be designed before they can be run, but it’s helpful to understand where they’ll be run and used before discussing how they’re built.

RUNNING EXPERTS:
Experts can be run either by executing them inside OWB or by embedding them in third-party applications. Embedding them in third-party applications opens up a whole slew of possibilities for advanced customer implementations and Oracle partners, but most OWB experts will be run inside of OWB. There is a nice feature of the Expert implementation that allows a specific expert to execute on OWB client startup. This would be useful if you wanted to provide a limited view, so that users would only see the expert and not really see OWB at all. Of course, you can also run an expert the good old-fashioned way by executing it from within the OWB tree. Suffice it to say there are plenty of options, and you’ll be able to deliver your experts in whatever manner suits your users.

DESIGNING EXPERTS:
Experts are designed inside of OWB using a graphical interface similar to the editors for other OWB objects. An expert is composed of tasks (things to do) and transitions (process and flow control). In my opinion it most closely resembles the Process Flow editor, in that you are defining a sequence, combination, and processing logic. There is a palette for adding tasks to the expert, a method for creating transitions between tasks, and a properties box for setting specific values for each task. There are forks and dispatching transitions, which allow the wizard to take a different path based on user input; this allows screens 2 and 3 to request different information depending on what the user entered on the first screen. The transitions are instrumental in defining order and flow control; however, the real usefulness comes from the palette of possible tasks.

There are numerous tasks on the Beta version palette, and I’m not sure the list has been entirely finalized or publicized (there were approximately 20 in the build I used). These are the actual “steps” of the Expert and usually surface as a particular screen in the application. A few to help you mull over the possibilities of Experts in your projects:

  • Custom Dialog : This is immensely useful, as it allows for a quick way to collect responses from the user. It basically pops up a screen that is configured by the Expert designer and receives information from the user (perhaps a table name, or some Oracle connection information).
  • Data Viewer : The OWB team has exposed its “data viewer” component on the palette. In some use cases, one needs to display the data from a table to help the user either confirm they’ve got the correct table or understand a bit more about its contents.
  • Impact Analysis : This helps the user understand the impact of modifying an object. It would be useful to guide a user through the modification of a dimension, ensuring they think about and note any impacts before actually making the change.
  • Java : This is really a “blank check” in the Expert functionality. This task allows you to embed Java application code directly at a given point within the Expert. The possibilities become quite endless when you consider the full power of Java available in your Expert creation. There will likely be some limitations on how you interact with OWB (or more accurately, how easily one interacts with OWB), but this feature demonstrates the OWB team’s understanding that OWB is never going to be your only tool or system.

Perhaps it’s appropriate to wrap up with an example use of an expert:

Let us consider a situation where you are using a third-party ROLAP tool that has a metadata repository. This is where the OLAP tool goes to understand the structure, hierarchies, cubes, and dimensions that are stored as tables inside Oracle. It has its own specific metadata format/API, and one can make Java calls to a web service to define and describe the metadata of your marts. When your BI team adds a dimension in OWB they must also run through a series of steps to add it to the ROLAP tool metadata so that end users can see it in their reporting tool. Every time, they follow the same process: add the dimension in OWB, then go to the tool and add it there.

Consider creating an “Add Dimension” expert in the upcoming release of OWB. The expert would simply:

  1. Launch the Add Dimension Wizard
  2. Allow the user to fill in the OWB Dimension, hierarchies, levels, attributes, etc
  3. Use Java to call some web service and describe the new Dimension to the ROLAP tool

Steps 1 and 2 are easy to just drop into the Expert and configure. Step 3 would require authoring custom Java code to be executed. The key, though, is that it runs in the context of OWB and so has immediate knowledge of which dimension was just added, at what location, with what underlying table name, etc. The benefits of automating these types of tasks have always been clear: minimize the time spent on repetitive tasks and minimize errors by programmatically enforcing standards and best practices.

Experts hold great potential for power customers, consulting firms, and Oracle Partners to add best practices and efficiencies to an already efficient and powerful tool.

This blog is part of the OWB Paris Early Review series which reviews and comments on several new Paris features.

NEED A LOT OF ROWS?

OK, this is not earth shattering, but perhaps Google might surface this blog for someone in need of a quick and dirty method for generating a bunch of “dummy” rows in Oracle without heading into the land of PL/SQL.

Use the view ALL_SOURCE, which typically has several hundred thousand rows (I just checked one database and it has 435608).

insert into dummy_table (key, value)
select rownum, type from all_source
where rownum <= 100000; -- substitute however many rows you need
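
If you’d rather not depend on ALL_SOURCE at all, a row generator against DUAL works too. This is a sketch assuming Oracle 10g; some older releases need a subquery wrapper around the CONNECT BY.

insert into dummy_table (key, value)
select rownum, 'filler'
from dual
connect by level <= 100000; -- again, however many rows you need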

BAD DEVELOPER, SIT IN THE DARK

I ran across a post by Andrej Koelewijn via orablogs.com that made reference to an open source project named CruiseControl.

CruiseControl is a framework for a continuous build process. It includes, but is not limited to, plugins for email notification, Ant, and various source control tools. A web interface is provided to view the details of the current and previous builds.

I’ve worked in environments with automated build processes and think they are absolutely wonderful. There are some significant advantages to an automated build process:

  • Less time spent tagging and building code to servers.
  • Predictable process for build and deploy (you do it with technology, rather than admins typing commands) so that your deploys are also “managed” across environments.
  • Mitigates big unknowns during integration. I’m not saying this will decrease the integration time spent on a project, but rather it will increase the likelihood of finding a show-stopping issue early.

Imagine a progressive work environment (a la Dot Com) where engineers are kindred spirits. They are working late hours, ordering pizza, playing with Nerf guns. This is the type of environment where the following extension would be useful:

If you break the build, you have to sit in the dark all day.

Oracle on VMWare Linux, Part 2

I’ve written about Oracle on VMWare before, but thought I would share an alternative perspective with readers. Howard Rogers has published a nice article on how to install VMWare and White Box Linux, and he suggests the combination works perfectly well. While I still wouldn’t suggest it for production environments, I’m glad that someone has had a better experience with this product than I have.
Note: There’s also an interesting discussion about his favorite font, for those looking for a digression.

What's that Java Application doing?

An engineer at one of my customer projects was having some difficulties with a third party program and Oracle. The application was referencing objects that didn’t exist and throwing ORA- errors. There were difficulties in getting the attention of the Oracle DBA group to help troubleshoot the issue, so he found a clever way to get his information: he plopped in p6spy.

P6Spy is an open source framework to support applications that intercept and optionally modify database statements. The P6Spy distribution includes the following modules:

  • P6Log. P6Log intercepts and logs the database statements of any application that uses JDBC. This application is particularly useful for developers to monitor the SQL statements produced by EJB servers, enabling the developer to write code that achieves maximum efficiency on the server. P6Spy is designed to be installed in minutes and requires no code changes.
  • P6Outage. P6Outage detects long-running statements that may be indicative of a database outage problem and will log any statement that surpasses the configurable time boundary during its execution. P6Outage was designed to minimize any logging performance penalty by logging only long-running statements.

He indicated it took only a few minutes to plop it into the application and troubleshoot which object was missing. A helpful little utility if you’re in a bind. Thanks for the tip, Russ!