I can recall back in the early 1990s hearing that the worlds of structured and unstructured data were about to converge. A decade on, and despite the advent of XML, that prospect still looks a long way off. It is like watching two people who have known each other for years and are attracted to each other, yet never seem to find a way of getting together.

Some have argued that the data warehouse should simply open up to store unstructured data, but does this really make sense? When DBMS vendors brought out features allowing them to store BLOBs (binary large objects), the question should have been asked: why is this useful? Can I query this and combine it usefully with other data? Data warehouses deal with numbers (usually business transactions) that can be added up in a variety of ways, according to various sets of business rules (such as cost allocation rules, or the sequence of a hierarchy), which these days can be termed master data. The master data gives the transaction data “structure”. A PowerPoint slide, a Word document or an audio clip tends not to have much in the way of structure, which is why document management systems place emphasis on attaching keywords or tags to such files in order to give them structure (just as web pages are given similar tags, or at least they are if you want them to appear high up in the search engine rankings).
You could store files of this type in a data warehouse, but given that these things cannot be added up, there is little point in treating them as transactions. Instead we can consider them to be master data of a sort. Hence it is reasonable to want to manage them from a master data repository, though this may or may not be relevant to a data warehouse application.
I am grateful to Chris Angus for pointing out that there is a problem with the terms ‘structured data’ and ‘unstructured data’. Historically the terms came into being to differentiate between data that could at that time be stuffed into a database and data that could not. That distinction is nothing like as important now, and the semantics have shifted. The distinction is now more between data constrained by some form of fixed schema, whose structure is dictated by a computer application, versus data and documents not constrained in the same way. An interesting example of “unstructured data” that is a subject in its own right and needs managing is a health and safety notice. This is certainly not just a set of numbers, but it does have structure, and may well be related to other structured data, e.g. HSE statistics. Hence this type of data may well need to be managed in a master data management application. Another example is the technical data sheets that go with some products, such as lubricants; again, these have structure and are clearly related to a traditional type of master data, in this case “product”, which will have transactions associated with it. Yet another would be a pharmaceutical regulatory document. Hence “structure” is more of a continuum than a “yes/no” state.
So, while the lines are blurring, the place to reconcile these two worlds may not be the data warehouse but the master data repository. Just as with other master data, for practical purposes you may want to store the data itself elsewhere and maintain links to it: a DBMS might not be an efficient place to store a video clip, but you would want to keep track of it from within your master data repository.
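By way of illustration, a master data record for such an asset might carry only a reference to the file, plus the tags that give it some structure, rather than the file itself. This is a minimal sketch of that idea; the class and field names are my own invention, not taken from any particular product:

```python
from dataclasses import dataclass


@dataclass
class MasterDataRecord:
    """Master data entry tracking an unstructured asset stored elsewhere."""
    key: str            # business identifier for the asset
    description: str
    tags: list          # keywords that give the asset some structure
    location: str       # URI of the actual file, e.g. on a media server

# The video clip itself lives on a file server; the repository holds the link.
clip = MasterDataRecord(
    key="VID-0042",
    description="Product launch video",
    tags=["marketing", "product launch"],
    location="file://media-server/clips/launch.mp4",
)

print(clip.location)
```

The repository can then relate the record to conventional master data (the product, say) through its key, while the bulky binary content stays outside the database.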
An article in CRM Buyer makes an important point: a key reason why customer data integration projects fail is the inflexibility of the data model that is often implemented. Although the article turns out to be a thinly disguised advert for Siperian, the point is very valid. Traditional entity-relationship modeling typically operates at too low a level of abstraction. For example, courses on data modeling frequently give examples like “customer” and “supplier” as separate logical entities. If your design is based on such an assumption, then applications built on it will struggle if one day a customer becomes a supplier, or vice versa. Better to have a higher-level entity called “organization”, which can have varying roles, such as customer or supplier, or indeed others that you may not have thought of at the time of the modeling. Similarly, rather than having an entity called “employee” it is better to have one called “person”, which itself can have a role of “employee” but also other roles, perhaps “customer” for example.
This higher level of data modeling is critical to retaining flexibility in systems, removing the “brittleness” that so often causes problems in reality. If you have not seen it, I highly recommend a paper on business modeling produced by Bruce Ottmann, one of the world’s leading data modelers, whose work has found its way into a number of ISO standards. Although Bruce works for Kalido, this whitepaper is not specific to Kalido but rather discusses the implications of a more generic approach to data models.
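The entity-plus-roles idea can be sketched in a few lines of code. This is purely my own illustrative sketch, not the model from Ottmann’s paper or any product; the names are invented. The point is that a party acquires or sheds roles as data, with no change to the underlying structure:

```python
from dataclasses import dataclass, field


@dataclass
class Party:
    """Higher-level entity: a person or organization, independent of role."""
    name: str
    roles: set = field(default_factory=set)

    def add_role(self, role: str) -> None:
        self.roles.add(role)


# An organization can take on new roles over time without remodeling.
acme = Party("Acme Ltd")
acme.add_role("customer")
acme.add_role("supplier")   # a customer later becomes a supplier too

print(acme.roles)
```

Contrast this with separate “customer” and “supplier” entities, where the same change would mean altering the schema and every application built on it.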
I very much hope that the so-called “generic modeling” approach that Bruce recommends will find its way into more software technologies. Examples where it has are Kalido and Lazy Software, and, in idea rather than product form, ISO standard 10303-11, which covers a modeling language called EXPRESS that can be used to represent generic data models. It came about through work that originated at Shell and was then extended to a broader community of data modelers, including various academics; it was particularly aimed at the problem of exchanging product models, and the overall standard is known as STEP. However, the generic modeling ideas developed with it have much broader application than product data. Given the very real advantages that generic modeling offers, it is to be hoped that more software vendors pick up on these notions, which make a real difference to the flexibility of data models, and hence improve the chances of projects, such as CDI projects, actually working in practice.
One of the perennial issues that dogs IT departments is the gap between customer expectations and the IT system that is actually delivered. There are many causes of this, e.g. the long gap between “functional spec” and actual delivery, but one that is rarely discussed is the language of the business model. When a systems analyst specifies a system, they will typically draw up a logical data model and a process model to describe the system to be built. The standard way of doing the former is with entity relationship modelling, which is well established but has one major drawback in my experience: business people don’t get it. Shell made some excellent progress in the 1990s at trying to get business people to agree on a common data model for the various parts of Shell’s business, a thankless task in itself. What was interesting was that they had to drop the idea of using “standard” entity relationship modelling to do it, as the business people at Shell just could not relate to it.
At that time two very experienced consultants at Shell, Bruce Ottmann and Matthew West, did some ground-breaking research into advanced data modelling that was offered to the public domain and became ISO standard 15926. One side effect of the research was a different notation for describing the data used by a business, which turned out to be a lot more intuitive than the traditional ER models implemented in tools like Erwin. This notation, and much else besides, is described in an excellent whitepaper by Bruce Ottmann (who is now with Kalido).
We use this notational form at Kalido when delivering projects to clients as diverse as Unilever, HBOS and Intelsat, and have found it very effective in communicating between IT and business people. The notation itself is public domain and not specific to Kalido, and I’d encourage you to read the whitepaper and try it out for yourself.