Andy on Enterprise Software

Iteration is the key

March 30, 2006


Ken Pohl writes a thoughtful article on the issues of project management of a data warehouse project, and how this can differ from other IT projects. As he points out, a data warehouse project is unusual in that it is essentially never finished – there are always new sources to add, new types of analysis the customers want, and so on (at least there are if the project is a success; if it failed, then at least you won't have too many of those pesky customer enhancement requests).

As the article points out, a data warehouse project is ideal for an iterative approach to development. The traditional "waterfall" approach, whereby the requirements are documented at ever greater levels of detail – from feasibility through to requirements through to functional specification and so on – is an awkward fit here. I have observed that in some companies the IT department takes a rigid approach to project management, demanding that all types of projects follow a waterfall structure. This is unfortunate in the case of data warehouse projects, where end-users are often hazy on requirements until they see the data, and where changing requirements will inevitably derail even the neatest functional specification document.
Given a 16-month average elapsed time for a data warehouse project (TDWI), it is almost certain that at least one, and possibly several, major changes will come along that have a significant impact on the project, which in a waterfall approach will at the very least cause delays and may put the entire project at risk.

By contrast, a data warehouse project that bites off scope in limited chunks, while retaining a broad and robust enterprise business model, can deliver incremental value to its customers, fixing things as needed before the end users become cynical, and gradually building political credibility for the warehouse project. Of course, the more responsive to change your data warehouse is, the better, but even for a traditional custom build it should be possible to segment the project delivery into manageable chunks and deliver incrementally. The data warehouse projects which I have seen go wrong are very often those which stuck to a rigid waterfall approach – an approach that makes perfect sense for a transaction processing system (where requirements are much more stable) but is asking for trouble in a data warehouse project. Ken Pohl's article contains some useful tips, and is well worth reading.

Unifying data

I can recall back in the early 1990s hearing that the worlds of structured and unstructured data were about to converge. A decade on, despite the advent of XML, that prospect still looks a long way off. It is like watching two people who have known each other for years and are attracted to each other, yet never seem to find a way of getting together. Some have argued that the data warehouse should simply open up to store unstructured data, but does this really make sense? When DBMS vendors brought out features allowing them to store BLOBs (binary large objects), the question should have been asked: why is this useful? Can I query this and combine it usefully with other data? Data warehouses deal with numbers (usually business transactions) that can be added up in a variety of ways, according to various sets of business rules (such as cost allocation rules, or the sequence of a hierarchy), which these days can be termed master data. The master data gives the transaction data "structure". A PowerPoint slide or a Word document or an audio clip tends not to have much in the way of structure, which is why document management systems place emphasis on attaching keywords or tags to such files in order to give them structure (just as web pages are given similar tags, or at least they are if you want them to appear high up in the search engines).
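
To make that concrete, here is a minimal sketch in Python – the product names, hierarchy, and figures are all invented for illustration – of how master data supplies the structure along which transaction numbers get added up:

```python
# Master data: a product hierarchy, i.e. the business rules that say
# how the raw numbers may be rolled up (names invented for the example).
product_category = {
    "10W-40 Oil": "Lubricants",
    "5W-30 Oil": "Lubricants",
    "Brake Fluid": "Fluids",
}

# Transaction data: raw business numbers, which have no meaningful
# aggregate without the master data above.
transactions = [
    {"product": "10W-40 Oil", "amount": 1200.0},
    {"product": "5W-30 Oil", "amount": 800.0},
    {"product": "Brake Fluid", "amount": 450.0},
]

# "Adding the numbers up in a variety of ways" means grouping them
# along the structure the master data supplies.
totals = {}
for txn in transactions:
    category = product_category[txn["product"]]
    totals[category] = totals.get(category, 0.0) + txn["amount"]

print(totals)  # {'Lubricants': 2000.0, 'Fluids': 450.0}
```

A PowerPoint file offers no equivalent of that product column to group by, which is exactly why it resists this sort of treatment.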

You could store files of this type in a data warehouse, but given that these things cannot be added up, there is little point in treating them as transactions. Instead we can consider them to be master data of a sort. Hence it is reasonable to want to manage them from a master data repository, though this may or may not be relevant to a data warehouse application.

I am grateful to Chris Angus for pointing out that there is a problem with the terms 'structured data' and 'unstructured data'. Historically the terms came into being to differentiate between data that could at the time be stuffed into a database and data that could not. That distinction is nothing like as important now, and the semantics have shifted. The distinction is now more between data constrained by some form of fixed schema, whose structure is dictated by a computer application, versus data and documents not constrained in the same way. An interesting example of "unstructured data" that is a subject in its own right and needs managing is a health and safety notice. This is certainly not just a set of numbers, but it does have structure, and may well be related to other structured data, e.g. HSE statistics. Hence this type of data may well need to be managed in a master data management application. Another example is the technical data sheets that go with some products, such as lubricants; again, these have structure and are clearly related to a traditional type of master data, in this case "product", which will have transactions associated with it. Yet another would be a pharmaceutical regulatory document. Hence "structure" is more of a continuum than a "yes/no" state.

So, while the lines are blurring, the place to reconcile these two worlds may not be the data warehouse, but the master data repository. Just as with other master data, for practical purposes you may want to store the data itself elsewhere and maintain links to it: a DBMS might not be an efficient place to store a video clip, but you would want to keep track of it from within your master data repository.
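
As a rough sketch of that "link to it, don't store it" idea – the record layout, tags, and URL below are invented for illustration, not taken from any particular product – a master data entry might track an unstructured item by reference:

```python
# A minimal sketch (invented field names and URL) of a master data
# record that tracks an unstructured item by reference, rather than
# storing the bytes in the DBMS itself.
from dataclasses import dataclass, field

@dataclass
class MasterDataRecord:
    key: str          # the master data it relates to, e.g. a product
    kind: str         # e.g. "technical data sheet" or "video clip"
    tags: list[str] = field(default_factory=list)  # keywords giving it structure
    location: str = ""  # link to where the content actually lives

# The repository keeps track of the item; the bytes stay in a file
# store or content management system better suited to holding them.
record = MasterDataRecord(
    key="10W-40 Oil",
    kind="technical data sheet",
    tags=["lubricant", "viscosity", "safety"],
    location="https://files.example.com/datasheets/10w40.pdf",
)
```

This way the data sheet stays related to its product (and hence to the product's transactions) without asking the DBMS to be good at storing documents.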