Philosophy and data warehouses

Database expert Colin White wrote a provocative article the other day:

in which he ponders whether a data warehouse is really needed for business intelligence. This is an interesting question; after all, why did we end up with data warehouses in the first place rather than just querying the data at source, which is surely the simpler idea? There seem to me to be a few reasons:

(a) Technical performance. Early relational databases handled mixed workloads of transaction updates and queries poorly, as locking strategies caused performance headaches.

(b) Storage issues. Data warehouses typically need several years of data, whereas transaction systems do not, so archiving transactions after a few months has a performance benefit and may allow the use of cheaper storage media.

(c) Inconsistent master data between transaction systems (who owns “product” and “customer” and “asset” and the like) means that it is semantically difficult to query systems across departments or subsidiaries. Pulling the data together into a data warehouse and somehow mashing it into a consistent structure fixes this.

(d) You may want to store certain BI-related data, e.g. aggregates or “what-if” information, that is useful purely for analysis and is not relevant to transaction systems. A data warehouse may be a good place to do this.

(e) People have trouble finding existing pre-built reports, so having a single place where these live makes re-use easier.

(f) Data quality problems in operational systems mean that a data warehouse is needed to allow data cleansing before analysis.
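Problem (c) is worth making concrete. Below is a minimal sketch, in Python, of the kind of key conformance a warehouse load step performs so that cross-system queries become possible; the system names, identifiers and amounts are all invented for illustration:

```python
# Hypothetical illustration of problem (c): two source systems that each
# identify the same customer differently, reconciled via a cross-reference
# table of the kind a data warehouse load process maintains.

crm_orders = [
    {"cust_id": "C-1001", "amount": 250},      # CRM's identifier scheme
    {"cust_id": "C-1002", "amount": 120},
]
billing_invoices = [
    {"account_no": "ACME-01", "amount": 250},  # billing's identifier scheme
    {"account_no": "GLOBEX-07", "amount": 400},
]

# The cross-reference maps each system-local key to one master customer key.
xref = {
    ("crm", "C-1001"): "CUST-1",
    ("crm", "C-1002"): "CUST-2",
    ("billing", "ACME-01"): "CUST-1",   # same customer as CRM's C-1001
    ("billing", "GLOBEX-07"): "CUST-3",
}

def conform(system, record, key_field):
    """Rewrite a source record onto the shared master customer key."""
    master = xref[(system, record[key_field])]
    return {"customer": master, "amount": record["amount"]}

facts = [conform("crm", r, "cust_id") for r in crm_orders] + \
        [conform("billing", r, "account_no") for r in billing_invoices]

# Once the keys are conformed, an enterprise-wide query is a simple rollup.
totals = {}
for f in facts:
    totals[f["customer"]] = totals.get(f["customer"], 0) + f["amount"]
# totals == {"CUST-1": 500, "CUST-2": 120, "CUST-3": 400}
```

Without the cross-reference, “total spend by customer” across the two systems is simply unanswerable, which is the semantic difficulty the list above describes.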

I think that you can make a case that technology is making some strides to address certain of these areas. The application of Google and similar search mechanisms (e.g. FAST) to the world of BI may reduce or eliminate problem (e) altogether. Databases have become a lot more tolerant of mixed workloads, addressing problem (a), and storage keeps getting cheaper, attacking problem (b). It doesn’t seem to me that you necessarily have to store what-if type data in a data warehouse, so maybe (d) can be tackled in other ways. Even problem (f), while a long way from being fixed, at least has some potential now that some data quality tools are allowing SOA-style embedding within operational systems, thus holding out the possibility of fixing many data quality issues at source.

If we then take all the master data out of the data warehouse and put it into a master data repository, would this not also fix (c)? Well, it might, but regrettably this discipline is still in its infancy, and it seems to me that plucking data out of transaction systems into specific hubs like a “customer hub” or a “product hub” may not be improving the situation at all, as indeed Colin acknowledges.

Where I differ from Colin is on his view that a series of data marts combined with a master data store may be the answer. Since data marts are subject-specific by definition, they may address a certain subset of needs very well, but they cannot address enterprise-wide analysis. That type of analysis can only be done by something with access to potentially all the data in an enterprise, and capable of resolving master data issues across the source systems. Here a data warehouse in conjunction with a master data store makes more sense to me than a series of marts plus a master data store – why perpetuate the issues? I have no problem if the data marts are dependent, i.e. generated from the warehouse, e.g. for convenience or performance. But as soon as they are maintained outside a controlled environment you come back to problem (c) again.
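The distinction between dependent and independent marts can be sketched in a few lines. In this illustrative Python fragment (table contents invented), the mart is purely a derivation from warehouse rows, so it inherits the warehouse’s conformed keys and can be regenerated at will rather than drifting on its own:

```python
# A minimal sketch of a "dependent" data mart: derived from the warehouse
# rather than loaded independently, so it cannot drift from the warehouse's
# conformed master data. Rows and regions are invented for illustration.

warehouse_sales = [
    {"customer": "CUST-1", "region": "EU", "amount": 500},
    {"customer": "CUST-2", "region": "EU", "amount": 120},
    {"customer": "CUST-3", "region": "US", "amount": 400},
]

def build_regional_mart(rows):
    """Derive a subject-specific mart (sales by region) from warehouse rows."""
    mart = {}
    for r in rows:
        mart[r["region"]] = mart.get(r["region"], 0) + r["amount"]
    return mart

mart = build_regional_mart(warehouse_sales)
# mart == {"EU": 620, "US": 400}; rebuilt whenever the warehouse changes
```

An independent mart, by contrast, would load straight from the source systems with its own keys, which is exactly how problem (c) re-enters by the back door.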

Sadly, though some of the recent technical improvements point the way to the solution of problems (a) through (f), the reality on the ground is a long way from allowing this. For example, data quality tools could be embedded via SOA into operational systems and linked up to a master data store to fix master data quality issues, but how many companies have done this at all, let alone across more than a pilot system? Master data repositories are typically still stuck in a “hub mentality” that means they are, at best, as Colin puts it, “papering over the cracks of master data problems”. Moreover, most data warehouses are still poorly designed to deal with historical data and cope with business change.

Hence I can’t see data warehouses going away any time soon. Still, it is a useful exercise to remind ourselves why we built them in the first place. Questioning the meaning of existence is called ontology, which ironically has now been adopted as a term by computer science to mean a data model that represents concepts within a domain and the relationships between those concepts. We seem to have come full circle, a suitable state for the end of the week.
Have a good weekend.

3 thoughts on “Philosophy and data warehouses”

  1. Another reason a data warehouse may be necessary over the source system is related to your item (b). Transactional systems often don’t contain all historical information, whether in archived files or otherwise. Often, a transactional system simply overwrites a data field when new data arrives. This may be in accordance with the business rules, and the “lost” data may be meaningless in the context of the transactional system. For example, a CRM system is concerned mostly with the current status of trouble tickets in the aggregate, and less so with what the status was three weeks ago. But a data warehouse is often designed to allow for historical trending. In that case, it’s important to know there were 400 tickets on hold on January 1 and 350 tickets on hold on February 1.
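The commenter’s point about overwritten fields versus historical trending can be sketched as follows; this is a hypothetical Python illustration with invented ticket IDs, statuses and dates, contrasting an operational store that keeps only current state with a warehouse that takes periodic snapshots:

```python
# Operational store: only current state survives; updates overwrite history.
tickets = {}  # ticket_id -> current status

def update_ticket(ticket_id, status):
    tickets[ticket_id] = status  # the previous status is lost at source

# Warehouse: periodic snapshot rows, taken on a schedule, preserve history.
snapshots = []

def take_snapshot(as_of_date):
    on_hold = sum(1 for s in tickets.values() if s == "on hold")
    snapshots.append({"date": as_of_date, "on_hold": on_hold})

# January 1: two tickets on hold.
update_ticket(1, "on hold")
update_ticket(2, "on hold")
take_snapshot("2006-01-01")

# By February 1 one ticket is resolved; the source forgets it was ever on hold.
update_ticket(2, "resolved")
take_snapshot("2006-02-01")

# The source can only answer "what is on hold now?"; the warehouse can trend:
# snapshots == [{"date": "2006-01-01", "on_hold": 2},
#               {"date": "2006-02-01", "on_hold": 1}]
```

This is essentially a periodic snapshot fact table in miniature: the trend across dates exists nowhere in the operational system once the overwrite has happened.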

  2. Even if the data warehouse is, in part at least, ‘virtual’ (in the sense that it may access data from operational systems), there is still potentially a significant requirement to build a data warehouse infrastructure that models the business from the viewpoint of the business user, in order to be able to query across systems. (One may choose, of course, to use a term other than ‘data warehouse’.)

  3. “Data warehousing” is really about logistics. Just as in the world of “real” logistics, it is about designing a process in which transport, storage (de-coupling points between different types of transport) and manipulation have to be arranged in such a way that customer needs are satisfied at the desired level of quality, relevancy and – above all – timeliness. These demands need to be satisfied not only in the initial project but also in the ever-changing world of the average multinational. That could mean that in a simple process no de-coupling at all is needed, while – as is the case in most more complex situations – a much more complex combination of data warehouse(s), mind the plural, and data marts is necessary. I tend to believe that even if and when the master data problem is solved (and as Andy rightfully points out, that will take some time yet), there are still good reasons to see this type of de-coupling (and integration points) for some time to come (never say never). And yes, it is very interesting to speculate on a “brave new world” for BI where there is no need for this.

Comments are closed.