The data warehouse market breaks into a trot

The latest figures from IDC (who, by the way, are by far the most reliable of the analyst firms when it comes to quantitative estimates) are that the data warehouse market will grow at a 9% compound rate from now through to 2009, reaching USD 13.5 billion in size (up from USD 10 billion today), as reported in an article on the 17th of March. Gartner also reckon that this market is growing at twice the pace of the overall IT market (their estimates are slightly lower, but I would trust IDC more when it comes to figures). It would be interesting to see the proportion of this that is packaged data warehouse software (see the recent report by Bloor), but unfortunately IDC do not split out the data in this way. These figures do not include services, but based on other analyst estimates that market is at least three times this size; there never seems to be any shortage of need for systems integrators.

Given all the billions spent on ERP systems in the last ten years or so, it is about time that more attention was paid to actually trying to make sense of the data captured in these and other transaction processing systems, which for a long time have consumed the lion’s share of IT development budgets. After all, there is likely to be more value in spotting trends and anomalies in the business than in merely automating processes that were previously manual, or in just shifting from one transaction processing system to another.

Should ETL really be ELT?

Traditionally ETL (extract/transform/load) products such as Informatica, Ascential and others have fulfilled the role of getting data out of source systems, dealing with inconsistencies between these source systems (transform) and then loading the resultant transformed data into a set of database tables (perhaps an operational data store, data marts or directly to a data warehouse).
However in the process of doing the “transform” a number of issues crop up. Firstly, you are embedding what is essentially a set of business rules (how different business hierarchies, like product classifications, actually relate) directly into the transformation rules. This is a dark place to keep them should you want to make sense of them in other contexts. If the rules are complex, which they may well be, then you can create a Frankenstein’s monster of transform rules that becomes difficult to maintain, held in a set of metadata that may be hard to share with other applications.

Moreover this is a one-way process. Once you have taken your various product hierarchies (say) and reduced them to a lowest-common-denominator form, you can certainly start to analyze the data in this new form, but you have lost the component elements to all intents and purposes. These different product hierarchies did not end up different without reason; they may reflect genuine market differences between countries, for example, and they may contain a level of richness that is lost when you strip everything down to a simpler form.

Ideally in a data warehouse you would like to be able to take an enterprise view, but also retain the individual perspectives of different business units or countries. For example it may be interesting to see the overall figures in the format of a particular business line or country. Now of course there are limitations here, since data from other businesses may not have sufficient granularity to support the views required, but in some cases this can be fixed (for example by providing additional allocation rules), and at least you have a sporting chance of doing something useful with the data if you have retained its original richness. You have no chance if it is gone.

Hence there is a strong argument to be made for an “ELT” approach, whereby data is copied from source systems pretty much untouched into a staging area, and then only from there is transformation work done on it to produce cross-enterprise views. If this staging area is controlled by the data warehouse then it is possible to provide other, alternate views and perspectives, possibly involving additional business metadata at this stage. The only real cost in this approach is some extra storage, which is hardly a major issue these days. Crucially, the transformation logic is held within the data warehouse, which is open to interrogation by other applications, and not buried away in the depths of a proprietary ETL format. Moreover, the DBMS vendors themselves have added more capability over the last few years to deal with certain transformations; let’s face it, a SQL SELECT statement can do a lot of things. Since the DBMS processing is likely to be pretty efficient compared to a transformation engine, there may be performance benefits also.
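To make the ELT idea concrete, here is a minimal sketch (in Python with SQLite, purely for illustration; the table names, column names, mapping data and figures are all invented) of landing source rows untouched into staging tables and then expressing the cross-enterprise transformation as SQL held inside the database, where it stays open to inspection, rather than inside a separate transformation engine.

import sqlite3

# Illustrative ELT sketch: land source data untouched, then transform inside the database.
# All table names, column names and figures are invented for the example.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "E" and "L": copy rows from each source system into its own staging table, untransformed.
cur.execute("CREATE TABLE stg_sales_uk (product_code TEXT, local_category TEXT, amount REAL)")
cur.execute("CREATE TABLE stg_sales_de (produkt_nr TEXT, warengruppe TEXT, betrag REAL)")
cur.executemany("INSERT INTO stg_sales_uk VALUES (?, ?, ?)",
                [("P100", "Beverages", 120.0), ("P200", "Snacks", 80.0)])
cur.executemany("INSERT INTO stg_sales_de VALUES (?, ?, ?)",
                [("P100", "Getraenke", 200.0)])

# "T": the cross-enterprise mapping lives in the warehouse as ordinary tables and SQL,
# so other applications can interrogate it rather than it being buried in ETL metadata.
cur.execute("CREATE TABLE map_category (local_category TEXT, enterprise_category TEXT)")
cur.executemany("INSERT INTO map_category VALUES (?, ?)",
                [("Beverages", "Drinks"), ("Getraenke", "Drinks"), ("Snacks", "Food")])
cur.execute("""
    CREATE VIEW enterprise_sales AS
    SELECT m.enterprise_category, SUM(s.amount) AS amount
    FROM (SELECT local_category, amount FROM stg_sales_uk
          UNION ALL
          SELECT warengruppe, betrag FROM stg_sales_de) s
    JOIN map_category m ON m.local_category = s.local_category
    GROUP BY m.enterprise_category
""")

print(cur.execute("SELECT * FROM enterprise_sales ORDER BY 1").fetchall())
# [('Drinks', 320.0), ('Food', 80.0)] -- and the untouched staging tables are still
# there, so local, country-specific views remain available alongside the enterprise one.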

This approach has been taken by more modern ETL tools like Sunopsis, which is explicitly ELT in nature. Intriguingly, Informatica added an ELT option in PowerCenter 8 (called “PowerCenter 8 pushdown optimization”), which suggests that this approach is indeed gaining traction. So far, good on Sunopsis for taking the ELT approach, which I believe is inherently superior to ETL in most cases. It will be interesting to see whether Ascential also respond in a future release.

Broaden your horizons

In a talk at the recent TDWI show, Joshua Greenbaum, an analyst with Enterprise Applications Consulting (who?), managed to bemoan the cost of data warehouses, but then demonstrated a seeming lack of understanding of exactly what one is by claiming that the alternative is to do “simple analyses of transactional data”. Well Joshua, that is called an operational data store, and indeed it has a perfectly respectable role if all you want to do is look at a single operational system for operational purposes. However a data warehouse fulfils quite a different role: it takes data from many different sources, allows analysis across these inconsistent sources and should also provide historical context, e.g. allowing comparisons of trends over time. You can’t do these things with an operational data store.

Hence it is not a case of “ODS good, data warehouse bad” – instead both structures have their uses. Of course Joshua is right in saying that the data warehouse success rate is not great, but as I have written elsewhere, it is not clear whether data warehouse projects are really any worse than IT projects in general (admittedly, that is not setting the bar very high). Perhaps Joshua was misquoted, but I would have expected something more thoughtful from someone who was an analyst at Hurwitz. Admittedly he was an ERP (specifically SAP) analyst, so perhaps has a tendency to think of operational things rather than things wider than ERP. Perhaps he is suffering from the same disease that seems to affect people who spend too much time on SAP.

Interest in MDM grows

Last week I was a speaker at the first CDI (customer data integration) conference, held in San Francisco. Although the CDI Institute (set up by Aaron Zornes, ex META Group) started off with customer data integration, looking at products like Siperian and DWL, the general movement towards MDM as a more generic subject has overtaken it, and indeed Aaron mused in his introductory speech whether it might change its title to the MDM Institute. For a first conference it was well attended, with 400 people there and supposedly 80 turned away due to unexpectedly high demand. There was the usual crowd of consultants happy to advise expertly on a topic they had never heard of a year ago. Most of the main MDM vendors put in an appearance, e.g. IBM, Oracle and i2 (but no SAP), as well as specialists like Siperian and Purisma, plus those like HP who just have too big a marketing budget and so have a booth everywhere, whether or not they have a product (those printer cartridges generate an awful lot of profit).

The conference had a rather coin-operated feel, as sponsoring vendors duly got speaker slots in proportion to the money they put in (IBM got two plenary slots), but there were at least a few customer case studies tucked away amongst the six concurrent conference tracks. My overall impression was that MDM is a bit like teenage sex: everyone is talking about it, people are eager to know all about it, but not that many are actually doing it. As time passes and MDM moves into adolescence there will presumably be less foreplay and more consummation.
Further conferences are planned in London, Sydney and Amsterdam, demonstrating if nothing else that plenty of vendors are willing to pay Aaron to speak at the shows.

It’s all in the timing

Database guru Colin White raises a point in a BI network article that is often overlooked by commentators. To quote:

“…another difficulty is that customer reference data and relationships vary over time. This issue has important implications for business intelligence applications that may analyze customer data across various time periods, comparing revenue this month to this time last year, for example. If, during the last 12 months, the customer hierarchies have changed or the sales organization has been restructured, then this will affect the validity of the comparison. This means that metadata and metamodel changes may have to be tracked and recorded in MDM applications.”

Spot on, Colin! Except that “may have to be tracked” should be “must”. Organizations do not make significant changes to their business structures every day, but these changes do happen every few weeks or months: reorganizations occur, marketing reclassifies its product hierarchy or customer segmentations, finance changes allocation rules. Yet most MDM products concentrate on point-in-time synchronisation, e.g. of customer data, and frequently retain no history whatsoever. Hence when you want to make comparisons over time, or go back and reconstruct something in the past to deal with a compliance request, the task is difficult to impossible: the old transactions may be archived, but the master data associated with them typically is not.

A well-designed system to handle master data should be able to reconstruct past hierarchies for comparison. Just as within a data warehouse, where you often want to go back and look at data “as is”, “as was” and even “as it would have been”, the need to go back in time and understand the changes in master data is very important. However the big vendors don’t understand this issue properly, and most customers have only just started dabbling in MDM, so haven’t yet thought about this thorny issue, which will come back to bite them in due course.
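As a rough sketch of what this means in practice (Python with SQLite, just for illustration; the table, columns and dates are invented), master data records can carry validity periods, so that a past hierarchy can be reconstructed for an “as was” comparison:

import sqlite3
from datetime import date

# Sketch of effective-dated master data: each product-to-category mapping carries a
# validity period, so past hierarchies can be reconstructed. Names and dates are invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product_category "
            "(product_code TEXT, category TEXT, valid_from TEXT, valid_to TEXT)")
# Marketing reclassified P100 from "Snacks" to "Health Foods" on 2006-01-01.
cur.executemany("INSERT INTO product_category VALUES (?, ?, ?, ?)", [
    ("P100", "Snacks",       "2004-01-01", "2005-12-31"),
    ("P100", "Health Foods", "2006-01-01", "9999-12-31"),
])

def category_as_at(product_code: str, as_at: date) -> str:
    """Return the category that applied to the product on the given date ("as was")."""
    row = cur.execute(
        "SELECT category FROM product_category "
        "WHERE product_code = ? AND valid_from <= ? AND valid_to >= ?",
        (product_code, as_at.isoformat(), as_at.isoformat())).fetchone()
    return row[0] if row else "unknown"

print(category_as_at("P100", date(2005, 6, 30)))  # Snacks       ("as was")
print(category_as_at("P100", date(2006, 6, 30)))  # Health Foods ("as is")
# Without the validity columns only the current mapping survives, and a compliance
# request for last year's figures cannot be reconstructed against last year's hierarchy.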

Packaged Data Warehouse Market

I was impressed with the depth of a recent Bloor report, “Packaged Data Warehouses”, which examines Decision Point, Kalido, IBM, SAS, SAP BI, Showcase and Teradata in detail. In these days when some analyst firms’ presentations barely mention vendors at all it is great to see an old-fashioned, detailed set of product evaluations. Also, unlike many, this report was not paid for by a particular vendor, so the companies covered within it have had no opportunity to influence the results.

The report evaluates and scores the products by:

  • stability and risk
  • support
  • performance
  • ease of use
  • fit for purpose i.e. product quality
  • architecture
  • value for money

Having looked carefully through the Kalido evaluation, I found it fair and accurate, and clearly the product of a great deal of work. I commend Philip Howard, the analyst who led this, for producing such a comprehensive piece of analysis in an age of lightweight analyst sound-bites.

The report is not cheap, but it does contain genuinely good in-depth analysis, and anyone considering a data warehouse project should buy it before contemplating building a warehouse themselves.

One more time

An article from Wayne Eckerson, research director at The Data Warehousing Institute (TDWI), has some sound advice about how to revitalize a data warehouse, based on a case study by Greg Jones of Sprint Nextel Corp. As the article says:

“Many data warehouses are launched with much fanfare and promise but quickly fail to live up to expectations”

and indeed multiple studies have shown high failure rates for data warehouse projects. I was at a Gartner conference earlier this week at which an analyst stated that “the vast majority” of business intelligence initiatives fail to deliver tangible value. Yet, as a wise colleague of mine often says:

“There is never time to do it right, but always time to do it again”

By this he means that data warehouse projects cut corners and make simplifying assumptions in their design about how the business works. It is much harder to make the design truly robust to business change, and yet this inability to deal properly with major business change is what eventually leads to problems for most data warehouses. A reorganization occurs, and it takes three months to redesign the star schema, fix up the load routines, modify the data mart production process, test all this and so on. In the meantime the business is getting no up-to-date information. What do they do? They knock up a few spreadsheets, or perhaps something quick in MS Access, “just for now”. Then another change happens two months later: the company buys another company, which of course has different product codes, customer segmentation, cost allocation rules etc. to the parent. Putting this new data into the warehouse is added to the task list of the data warehouse team, who have yet to finish adapting to the earlier reorganisation. The business users need to see the whole business picture right now, so they extend their “temporary” spreadsheet or MS Access systems a little bit more. Since they have control of these, they start to do more with them, and after a time it hardly seems as if the data warehouse is really necessary any more. Of course they let the IT people get on with it (it is not their budget after all), but usage declines, and they give up telling the data warehouse team about the next major new requirement, as they never seem to see results in time anyway. Eventually the data warehouse falls into disuse. Then a new manager comes in, finds the spreadsheet and MS Access mess unmanageable, and a new budget is found to have another go, either from scratch or by rewriting the old warehouse. And so the cycle begins again.

Sound familiar? The overriding issue is the need to reflect business change in the warehouse quickly, in time for the business customers to make use of it, and before they start reverting to skunkworks spreadsheets and side solutions that they can get a contractor to knock up quickly.
Until the industry starts adopting more robust, high-quality modeling and design approaches, such as those based on generic modeling, this tale will repeat itself time and time again. The average data warehouse has maintenance costs of 72% of build costs annually, i.e. if it costs USD 3M to build, it will cost over USD 2M to maintain, every year. This is an unsustainable figure. Still, there will always be a new financial year, and new project budgets to start again from scratch…

A data warehouse is not just for Christmas

A brief article by Bill Inmon addresses a key point that is often overlooked – when is a data warehouse finished? The answer is never, since the warehouse must be constantly updated to reflect changes in the business, e.g. reorganizations, new product lines, acquisitions and so on.

Yet this is a problem, because today’s main data warehouse design approaches result in extremely high maintenance costs – 72% of build costs per year, according to TDWI. If a data warehouse costs USD 3M to build and USD 2.1M to maintain annually, then over five years you are looking at costs well over USD 11M (let’s generously allow a year to build plus four years of maintenance), i.e. many times the original project cost. These levels of cost are what the industry has got used to, but they are very high compared to maintenance costs for OLTP systems, which typically run at 15% of build costs annually. This high cost level, and the delays in responding to business change when the warehouse schema needs to be updated, contribute to the poor perception of data warehouses in the business community, and to high perceived failure rates. As noted elsewhere, data warehouses built on generic design principles are far more robust to business change, and have maintenance levels of around 15%.
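Spelling out the arithmetic (a quick sketch using the USD 3M build cost and the 72% and 15% maintenance levels quoted above, and assuming one year to build plus four years of maintenance):

# Five-year cost of ownership at the two maintenance levels discussed above,
# assuming a USD 3M build, one year to build and four years of maintenance.
build = 3.0  # USD millions
for rate in (0.72, 0.15):
    annual = rate * build
    five_year = build + 4 * annual
    print(f"{rate:.0%} maintenance: {annual:.2f}M per year, {five_year:.2f}M over five years")
# 72% maintenance: 2.16M per year, 11.64M over five years
# 15% maintenance: 0.45M per year, 4.80M over five years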

If the data warehouse industry (and the business intelligence industry which feeds on it) is to continue to grow then it needs to grow up also, and address the issue of better data warehouse design paradigms. 72% annual maintenance costs are not acceptable.

Desperate Data Warehouses

A Gartner Group report mentions that at least 50% of data warehouse projects fail. Of course on its own this sounds bad, but just how bad is it, and what is meant by failure? Is being one month late a failure, or does it mean complete failure to deliver? How do IT projects in general do? Standish Group run a fairly exacting survey which in 2003 covered 13,522 IT projects, a very large sample indeed. Of these just 34% were an “unqualified success”. Complete failures to deliver were just 15%. The rest are in the middle, i.e. they delivered but were not perceived to be complete successes in some way. To be precise: 51% had “cost overruns, time overruns, and projects not delivered with the right functionality to support the business”. Unfortunately the Gartner note does not define “failure” as precisely as Standish; they describe the over 50% as having “limited acceptance” or being “outright failures”. It is also unclear whether the Gartner figure was a prediction based on hard data, or the opinion of one or more of their analysts.

The Standish study usefully splits the success rate by project size, with a miserable 2% of projects larger than USD 10M being complete successes, compared with 46% of projects below USD 750K, 32% of those up to USD 3M, 23% at USD 3-6M and 11% at USD 6-10M. The average data warehouse project is somewhere around the USD 2-5M range, with USD 3M often quoted, so on this basis it would seem we should only expect around 25% or so to be “unqualified successes”. Unfortunately I don’t have data available for the failure rate split by size, which presumably follows a similar pattern, and the rather loose definition that Gartner use makes it hard to compare like with like.

Even if it turns out that data warehouse projects aren’t any worse (or at least not much worse) than other IT projects, this is not a great advert for the IT industry. The Standish data most certainly gives a clear message that if you can possibly reduce the scope of a project to smaller, bite-sized projects, then you greatly enhance your chances of success. It has long been known that IT productivity drops as projects get larger. This is due to human nature – the more people you have to work with, the more communication is needed, the more complex things become, and the more chance there is of things being misunderstood or overlooked.

It is interesting that even very large data warehouse projects can be effectively managed in bite-sized chunks, at least if you use a federated approach rather than trying to stuff the entire enterprise’s data into a single warehouse. Projects at BP, Unilever, Philips, Shell and others have taken a country-by-country approach, or a business-line-by-business-line approach, with individual warehouses feeding up to either regional ones or a global one, or indeed both. In this case each project becomes a fairly modest implementation, but there may be many of them. The Shell OP MIS project involved 66 separate country implementations, three regional ones and one global one. Overall a USD 50M project, but broken down into lots of manageable, repeatable pieces.
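A trivial sketch of the federated roll-up idea (all figures and groupings invented): each country implementation produces its own summary, which feeds a regional view, which in turn feeds the global one.

from collections import defaultdict

# Illustrative federated roll-up: country-level warehouse summaries feed regional views,
# and the regions feed a global view. All figures and groupings are invented.
country_sales = {"UK": 120.0, "Germany": 200.0, "France": 150.0,
                 "Japan": 90.0, "Singapore": 60.0}
region_of = {"UK": "Europe", "Germany": "Europe", "France": "Europe",
             "Japan": "Asia-Pacific", "Singapore": "Asia-Pacific"}

regional = defaultdict(float)
for country, amount in country_sales.items():
    regional[region_of[country]] += amount        # each country project feeds its region

global_total = sum(regional.values())             # the regions feed the global view
print(dict(regional))   # {'Europe': 470.0, 'Asia-Pacific': 150.0}
print(global_total)     # 620.0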

So, if your data warehouse project is not to become desperate, think carefully about a federated architecture rather than a big bang. This may not always be possible, but you will have a greater chance of success.

The next generation data warehouse has a name

Bill Inmon, the “father of the data warehouse”, has come up with a new definition of what he believes a next-generation data warehouse architecture should look like. Labeled “DW 2.0” (and trademarked by Bill), its salient points, as noted in an article in DM Review, are:

– the lifecycle of data
– unstructured data as well as structured data
– local and global metadata i.e. master data management
– integrity of integrated data.

These seem eminently sensible points to me, and ones that are indeed often overlooked in first-generation custom-built warehouses. Too often these projects concentrated on the initial implementation at the expense of considering the impact of business change, with the consequence that the average data warehouse costs 72% of implementation costs to support every year, e.g. a USD 3M warehouse would cost over USD 2M to support; not a pretty figure. This is a critical point that seems remarkably rarely discussed. A data warehouse that is designed on generic principles will reduce this figure to around 15%.

The very real issue of having to deal with local and global metadata, including master data management, is another critical aspect that has only recently come to the attention of most analysts and media. Managing this, i.e. the process of dealing with master data, is a primary feature of large-scale data warehouse implementations, yet the industry has barely woken up to this fact. Perhaps the only thing I would differ with Bill on here is his rather narrow definition of master data. He classifies it as a subset of business metadata, which is fair enough, but I would argue that it is actually the “business vocabulary” or context of business transactions, whereas he has a separate “context” category. Anyway, this is perhaps splitting hairs. At least master data gets attention in DW 2.0, and hopefully he will expand further on it as DW 2.0 gets more attention.

The integrity of “integrated” data addresses the difference between truly integrated data that can be accessed in a repeatable way, and the “interactive” data that needs to be accessed in real time, e.g. “what is the credit rating of customer X?”, which will not be the same from one minute to the next. This distinction is a useful one, as there has been much confusion whereby EII vendors have claimed that their way is the true path, when it patently cannot be in isolation.

I am pleased that DW 2.0 also points out the importance of time variance. This is something that is often disregarded in data warehouse designs, mainly because it is hard. Bill Inmon’s rival Ralph Kimball calls it the “slowly changing dimension” problem and offers some technical mechanisms for dealing with it, but at an enterprise level these lessons are often lost. Time variance or “effective dating” (no, this is not like speed dating) is indeed critical in many business applications, and indeed is a key feature of Kalido.
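As a rough illustration of Kimball’s “Type 2” treatment of a slowly changing dimension (a sketch only, in Python with SQLite; the table and column names are invented): when an attribute such as a customer’s segment changes, the current dimension row is closed off and a new version is inserted, so facts loaded before and after the change join to the classification that was in force at the time.

import sqlite3
from datetime import date

# Sketch of Kimball-style "Type 2" slowly changing dimension handling.
# Table and column names are invented for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE dim_customer (
        customer_key   INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id    TEXT,
        segment        TEXT,
        effective_from TEXT,
        effective_to   TEXT,
        is_current     INTEGER)
""")
cur.execute("INSERT INTO dim_customer (customer_id, segment, effective_from, effective_to, is_current) "
            "VALUES ('C001', 'SME', '2004-01-01', '9999-12-31', 1)")

def apply_scd2_change(customer_id: str, new_segment: str, change_date: date) -> None:
    """Close off the current dimension row and insert a new version (Type 2 style)."""
    cur.execute("UPDATE dim_customer SET effective_to = ?, is_current = 0 "
                "WHERE customer_id = ? AND is_current = 1",
                (change_date.isoformat(), customer_id))
    cur.execute("INSERT INTO dim_customer (customer_id, segment, effective_from, effective_to, is_current) "
                "VALUES (?, ?, ?, '9999-12-31', 1)",
                (customer_id, new_segment, change_date.isoformat()))

# Marketing re-segments the customer; the old classification is retained, not overwritten.
apply_scd2_change("C001", "Corporate", date(2006, 3, 1))
for row in cur.execute("SELECT * FROM dim_customer ORDER BY customer_key"):
    print(row)
# Row 1 ('SME') is closed off at 2006-03-01; row 2 ('Corporate') is now the current row.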

It would indeed be nice if unstructured data mapped neatly onto structured data, but here we are rather at the mercy of the database technologies. In principle Oracle and other databases can store images as “blobs” (binary large objects), but in practice very few people really do this, due to the difficulty of accessing them and the inefficiency of storage. Storing XML directly in the DBMS can be done, but brings its own issues, as we can testify at Kalido. Hence I think that the worlds of structured and unstructured data will remain rather separate for the foreseeable future.

The DW 2.0 material also has an excellent section on “the global data warehouse”, in which he lays out the issues and approaches to deploying a warehouse on a global scale. This is what I term “federation”, and examples of this kind of deployment can be found at Unilever, BP and Shell, amongst others. Again this is a topic that seems to have entirely eluded most analysts, and yet it is key to getting a truly global view of the corporation.

Overall it is good to see Bill taking a view and recognizing that data warehouse language and architecture badly need an update from the 1990s and before. Many serious issues are not well addressed by current data warehouse approaches, and I welcome this overdue airing of them. His initiative is quite ambitious, and presumably he is aiming for the same kind of impact on data warehouse architecture as Ted Codd’s rules had on relational database theory (the latter’s rules were based on mathematical theory and were quite rigorous in definition). It is to be hoped that any “certification” process for particular designs or products that Bill develops will be an objective one rather than one based on sponsorship.

More detail on DW 2.0 can be found on Bill’s web site.