Gazing Behind the Data Mirror

I have been digging a little deeper into the DataMirror purchase by IBM that I wrote about yesterday.

It’s a good deal for IBM, and not only because the price was quite fair. With its Ascential acquisition IBM positioned itself directly against Informatica, yet Ascential’s technology lacked the serious real-time replication that is important for the future of ETL, and this is exactly what DataMirror has. DataMirror gives IBM a working product with real-time support for heterogeneous data sources, an important piece of the puzzle in achieving its vision of real-time operational BI and event-awareness.

A bigger question is whether IBM fully understands what it has bought and whether it will properly exploit it. DataMirror’s strengths were modest pricing, low-impact installation, neutrality across the sources it supports, and performance (via its log-scraping abilities and the speed with which it applies changes). IBM must keep its eye on the development ball to ensure these aspects of the DataMirror technology are continued if it is to really exploit its purchase. For example, on the last point, DataMirror’s partnerships with Teradata, Netezza and Oracle should be continued, despite the obvious temptation to snub rivals Oracle and Teradata.
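To make the log-scraping point concrete, here is a minimal sketch of log-based change data capture. All names and structures below are illustrative assumptions, not DataMirror’s actual interfaces: the core idea is simply to replay committed changes from a source’s transaction log against a target, rather than re-extracting whole tables in batch.

```python
# Illustrative sketch of log-based change data capture (CDC).
# A real product tails the database transaction log continuously;
# here we just replay a captured slice of hypothetical log entries.

def apply_change(target: dict, change: dict) -> None:
    """Apply one log entry (insert/update/delete) to a target table."""
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        target[key] = change["row"]
    elif op == "delete":
        target.pop(key, None)

def replicate(log_entries, target: dict) -> dict:
    # Changes are applied in log order, so the target converges on
    # the source's committed state with no batch re-extract.
    for change in log_entries:
        apply_change(target, change)
    return target

log = [
    {"op": "insert", "key": 1, "row": {"name": "Acme", "balance": 100}},
    {"op": "update", "key": 1, "row": {"name": "Acme", "balance": 250}},
    {"op": "insert", "key": 2, "row": {"name": "Globex", "balance": 50}},
    {"op": "delete", "key": 2},
]
print(replicate(log, {}))  # {1: {'name': 'Acme', 'balance': 250}}
```

Because only the changes flow, the load on the source system is tiny compared with batch extraction, which is what makes this approach suitable for real-time replication.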

Any acquisition creates uncertainty amongst staff, and IBM needs to move beyond motherhood reassurance to show staff that it understands the DataMirror technology and business and wants to see it thrive and grow. It needs to explain how the DataMirror technology fits within a broader vision for real-time integration, in combination with traditional batch-oriented ETL, business intelligence and enterprise service bus (not just MQSeries) integration, or else the critical technical and visionary people will dust off their resumes and start looking elsewhere.

I gather that IBM has already announced an internal town hall meeting next week, at which it needs to convince key technical staff that they have a bright future within the IBM family. I also hear that no hiring freeze has been imposed, which implies a decision to grow the business and should reassure people. IBM is an experienced company that will recognise that the true IP of a company is not in libraries of code but in the heads of a limited number of individuals, and no doubt it will recognise the need to retain and motivate critical staff. It used to be poor at this (think of the brilliant technology it acquired when it bought Metaphor many years ago, only to bungle the follow-up) but has got smarter in recent years; for example, I hear from DWL people that they have been treated well.

Hopefully IBM’s more recent, happier acquisition experiences will be repeated here.

Mirror, mirror on the wall, who is most blue of them all?

On Monday IBM announced it would buy DataMirror, a Canadian software company. DataMirror made its living by selling software that detects changes in data sources and then manages replication. It differed from other ETL technology in being designed from the ground up to work in real time rather than batch, which made it well suited to some customer situations, and the software was modestly priced. The technology was also used by some customers for backup and business continuity reasons. It had a large customer base (well over 2,000).

For IBM the acquisition adds some solid technology to its data warehouse offering and its “on demand” strategy, in this case replacing PowerPoint promises with something that actually works. DataMirror was publicly traded on the Toronto Stock Exchange. It did $46.5 million in revenue last year and was hoping for $55 million in fiscal year 2008, so this was a company delivering solid though unspectacular growth, though its share price had doubled in the last twelve months. IBM’s price of $162 million is over three times trailing revenues, a healthy valuation for the company and a small premium to its stock market valuation of last week.

If all you have is a hammer…

Claudia Imhoff raises an important issue in her blog regarding the cleansing of data. When dealing with a data warehouse it is important for data to be validated before being loaded into the warehouse in order to remove any data quality problems (of course, ideally you would also have a process to go back and fix the problems at source). However, as she points out, in some cases, e.g. for audit purposes, it is actually important to know what the original data was, not just a cleansed version. This gets to the heart of a vital issue surrounding master data, and neatly illustrates the difference between a master data repository and a data warehouse.

In MDM it is accepted (at least by those who have experience of real MDM projects) that master data will go through different versions before producing a “golden copy” suitable for putting into a data warehouse. A new marketing product hierarchy may have to go through several drafts and levels of sign-off before a new version is authorised and published, and the same is true of things like budget plans, which go through various iterations before a final version is agreed. This is quite apart from actual errors in data, which are all too common in operational systems. An MDM application should be able to manage the workflow of such processes, with a repository capable of going back in time and tracking the various versions as the master data is “improved” over time, not just storing the finished golden copy. Only the golden copy should be exported to the data warehouse, where data integrity is vital.
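The versioning behaviour described here can be sketched in a few lines. This is a hypothetical illustration, not any particular MDM product’s API: the repository keeps every draft for audit purposes, and only the signed-off golden copy is what you would export to the warehouse.

```python
# Hypothetical sketch of a versioned master data record: every draft is
# retained (the audit trail), and only the latest approved version is
# treated as the golden copy.

class MasterRecord:
    def __init__(self, key: str):
        self.key = key
        self.versions = []  # full history, oldest first

    def add_version(self, data: dict, status: str = "draft") -> None:
        self.versions.append({"data": data, "status": status})

    def golden_copy(self):
        """Latest signed-off version, or None if nothing is approved yet."""
        for v in reversed(self.versions):
            if v["status"] == "approved":
                return v["data"]
        return None

    def history(self):
        """All drafts, for audit: what did the data look like at each step?"""
        return [v["data"] for v in self.versions]

product = MasterRecord("PROD-42")
product.add_version({"hierarchy": "Lubricants > Marine"})           # draft 1
product.add_version({"hierarchy": "Lubricants > Marine > B2B"})     # draft 2
product.add_version({"hierarchy": "Lubricants > Marine > B2B"},
                    status="approved")                              # golden copy
print(product.golden_copy())  # only this version goes to the warehouse
```

The point is that `history()` and `golden_copy()` serve different consumers: auditors need the former, the data warehouse should only ever see the latter.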

People working on data warehouse projects may not be aware of such compliance issues, as they usually care only about the finished state of the warehouse data. MDM projects should always be considering this issue, and your technology selection should reflect the need for your MDM technology to track versions of master data over time.

I see a tall dark stranger in your future….

There is an interesting article in CIO Insight by Peter Fader, a professor of marketing at the top-rated Wharton Business School, in which he discusses the limitations of data mining; it is an article that anyone contemplating investing in this technology should read carefully. I set up a small data mining practice when I was running a consulting division at Shell, and found it a thankless job. Although I had an articulate and smart data mining expert and we invested in what at the time was a high-quality data mining tool, we found time and again that it was very hard to find real-world problems where the benefits of data mining could be shown. Either the data was such a mess that little sense could be made of it, or the insights shown by the data mining technology were, as Homer Simpson might say, of the “well, duh” variety.

Professor Fader argues that in most cases the best you can hope for is to develop simple probabilistic models of aggregate behaviour, and you simply cannot get down to the level of predicting individual behaviour using the level of data that we typically have, however alluring the sales demonstrations may be. Moreover, such models can mostly be built in Excel and don’t need large investments in sophisticated data mining tools.
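The point about simple aggregate models can be illustrated with a small sketch (all data here is simulated, purely for illustration): a single observed purchase rate forecasts group-level behaviour quite well, while saying almost nothing useful about any one individual.

```python
# Simulated illustration: aggregate prediction is easy, individual
# prediction is not. Customer propensities here are invented.
import random

random.seed(42)

# Each simulated customer has some true (unknown to us) propensity to buy.
propensities = [random.random() * 0.2 for _ in range(10_000)]
purchases = [1 if random.random() < p else 0 for p in propensities]

# The "model" is a single number: the observed purchase rate.
rate = sum(purchases) / len(purchases)
print(f"observed purchase rate: {rate:.3f}")

# Group-level forecast for a similar period is just rate * group size,
# and it will be close to right...
print(f"expected buyers next period: {rate * len(propensities):.0f}")

# ...but for any single customer, the best this model can say is
# "roughly a 1-in-10 chance", which is exactly the limitation the
# article describes: no data mining tool is needed for this.
```

A spreadsheet can do the same arithmetic, which is the professor’s point about Excel being sufficient for many of these models.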

While I am sure there are some very real examples where data mining can work well e.g. why some groups of people are better credit risks than others, the main point he makes is that the vision of 1-1 marketing via a data mining tool is a fantasy, and that the tools have been seriously oversold. Well, that is something that we in the software industry really do understand. We all want technology to provide magical insights into a messy and complex world that is hard to predict. Unfortunately the technology at present is generally as useful as a crystal ball when it comes to predicting individual behaviour. Yet there is still that urge to go into the tent and peer into the mists of the crystal ball in search of patterns.

A new twist to appliances

I wondered what Foster Hinshaw would get up to after he left Netezza, and now we know. He has set up the rather awkwardly named Dataupia, a data warehouse appliance with a difference. It is an important difference: his appliance runs on Oracle rather than on a proprietary database like Netezza’s, and it will also run on DB2 or SQL Server, for that matter. You just plug in MPP-capable hardware to take advantage of the appliance. This matters, since a proprietary database brings with it not only a certain amount of cost and a need for new skills, but also makes conservative corporate buyers nervous. If you are a telco with really vast amounts of transaction data then this trade-off may be worthwhile, as indeed can be seen in Netezza’s considerable success. But if you could get much of the benefit (and this is unclear, since at this stage there are no comparative performance figures) while still running on your existing mainstream database, this would soothe the nerves of corporate CIO types who might otherwise try to block the introduction of a new database. Just as importantly, it allows existing data warehouse applications to claim appliance-like performance boosts. While the vast bulk of data warehouses today are custom built, this ought to be of interest to true data warehouse applications such as Kalido, which could presumably run easily on top of Dataupia’s appliance.

I think this is a very interesting development, assuming that the new product delivers on its promise. The market for an appliance capable of running on a mainstream database platform ought to be much broader than the set of applications currently addressed by hardware appliances (or even software-based ones with their own database, like Kognitio).

EII – dead and now buried

The most widely publicised piece that I wrote was “EII Dead on Arrival” back in July 2004. Metamatrix was the company that launched the term on the back of heavy funding from top-end VCs, and I wrote previously about what seemed to me its almost inevitable struggles. There was some controversy over my article, which differed from the usual breathless press coverage associated with EII at the time (our industry does love a new trend and acronym, whatever the reality may be). I could never see how it could work outside a very limited set of reporting needs. Well, as they say on Red Dwarf: “smug mode”.

Gravity finally caught up with marketing hype this week, as Metamatrix is to be bought by Red Hat and made open source. It would have been interesting to know what the purchase price was, but Red Hat is keeping quiet about that. It is a fair bet that it was not a large sum of money. Kleiner Perkins won’t be chalking this one up as one of their smarter bets.

Philosophy and data warehouses

Database expert Colin White wrote a provocative article the other day in which he ponders whether a data warehouse is really needed for business intelligence. This is an interesting question; after all, why did we end up with data warehouses in the first place rather than just querying the data at source (which is surely a simpler idea)? There seem to me to be a few reasons:

(a) Technical performance. Early relational databases did not like dealing with mixed workloads of transaction update and queries, as locking strategies caused performance headaches.

(b) Storage issues. Data warehouses typically need several years of data, whereas transaction systems do not, so archiving transactions after a few months has a performance benefit and may allow use of cheaper storage media.

(c) Inconsistent master data between transaction systems (who owns “product” and “customer” and “asset” and the like) means that it is semantically difficult to query systems across departments or subsidiaries. Pulling the data together into a data warehouse and somehow mashing it into a consistent structure fixes this.

(d) You may want to store certain BI-related data, e.g. aggregates or “what-if” information, that is useful purely for analysis and is not relevant to transaction systems. A data warehouse may be a good place to do this.

(e) People have trouble finding existing pre-built reports, so having a single place where these live makes re-use easier.

(f) Data quality problems in operational systems mean that a data warehouse is needed to allow data cleansing before analysis.

I think you can make a case that technology is making some strides to address certain of these areas. The application of Google and similar search mechanisms (e.g. FAST) to the world of BI may reduce or eliminate problem (e) altogether. Databases have become a lot more tolerant of mixed workloads, addressing problem (a), and storage gets cheaper, attacking problem (b). It doesn’t seem to me that you necessarily have to store what-if data in a data warehouse, so maybe (d) can be tackled in other ways. Even problem (f), while a long way from being fixed, at least has some potential now that some data quality tools allow SOA-style embedding within operational systems, holding out the possibility of fixing many data quality issues at source.

If we then take all the master data out of the data warehouse and put it into a master data repository, would this not also fix (c)? Well, it might, but regrettably this discipline is still in its infancy, and it seems to me that plucking data out of transaction systems into specific hubs like a “customer hub” or a “product hub” may not be improving the situation at all, as indeed Colin acknowledges.

Where I differ from Colin is on his view that a series of data marts combined with a master data store may be the answer. Since data marts are subject-specific by definition, they may address a certain subset of needs very well, but they cannot address enterprise-wide analysis. That type of analysis can only be done by something with access to potentially all the data in an enterprise that is capable of resolving master data issues across the source systems. Here a data warehouse in conjunction with a master data store makes more sense to me than a series of marts plus a master data store – why perpetuate the issues? I have no problem if the data marts are dependent, i.e. generated from the warehouse, e.g. for convenience or performance. But as soon as they are maintained outside a controlled environment you come back to problem (c) again.

Sadly, though some of the recent technical improvements point the way to solving problems (a) through (f), the reality on the ground is a long way from allowing this. For example, data quality tools could be embedded via SOA into operational systems and linked up to a master data store to fix master data quality issues, but how many companies have done this at all, let alone across more than a pilot system? Master data repositories are typically still stuck in a “hub mentality” that means they are at best, as Colin puts it, “papering over the cracks of master data problems”. Moreover, most data warehouses are still poorly designed to deal with historical data and cope with business change.

Hence I can’t see data warehouses going away any time soon. Still, it is a useful exercise to remind ourselves why we built them in the first place. Questioning the meaning of existence is called ontology, which ironically has now been adopted as a term by computer science to mean a data model that represents concepts within a domain and the relationship between those concepts. We seem to have come full circle, a suitable state for the end of the week.
Have a good weekend.

Just Singing The Blues

There was a curious piece of “analysis” that appeared a few days ago in response to IBM’s latest data warehouse announcements.

In this gushing piece, Current Analysis analyst James Kobielus says: “IBM with these announcements becomes the premiere data warehousing appliance vendor, in terms of the range of targeted solutions they provide”. So, what were all these new innovative products that appeared?

Well, IBM renamed the hideous “Balanced Configuration Units” (you what now?) to “Balanced Warehouse”, a better name for sure. Also in the renaming frame was “Enterprise Class” becoming “E Class” (hope they didn’t spend too many dollars on that one). In fact the only supposedly “new” software apparent at all is the OmniFind Analytics Edition. The analysis credits this as a new piece of software, which will come as a surprise to many of us with memories longer than a mayfly’s: an announcement of OmniFind 8.3 appeared on the IBM website dated December 2005.

In fact the whole release seems to be around repackaging and repricing, which is all well and good but hardly transports IBM to some new level it wasn’t at, say, a week ago.

Let’s not forget about “new services” such as “implementation of an IBM data warehouse” – well, that certainly was something that never crossed IBM’s Global Services mind before last week. Now, I’m not a betting man, but I would be prepared to wager a dollar that IBM have a contract with Current Analysis – any takers against?

The excellent blog “The Cranky PM” does a fine job of poking fun at the supposedly objective views of analyst firms that are actually taking thick wads of dollars from the vendors that they are analysing.

I wonder what she would make of this particular piece of insight?

Deja vu all over again

There is some good old-fashioned common sense in an article by John Ladley in DM Review, in which he rightly points out that although most companies are now on their second or third attempt at data warehouses, they seem not to have learnt from the past and hence seem doomed to repeat prior mistakes. Certainly a common thread in my experience is the desire of IT departments to second-guess what their customers need, making life unnecessarily hard for themselves. If you ask a customer how long he needs access to his detailed data he will say “forever”, and if you ask how real-time it needs to be then of course he would love it to be instantaneous on a global basis. What is often not presented is the trade-off: “well, you can have all the data kept forever, but the project costs will go up 30% and your reporting performance will be significantly worse than if you can make do with one year of detailed data and summaries prior to that”. In such a case the user might well change his view on how critical the “forever” requirement was.

This disconnect between corporate IT departments and the business continues to be a wide one. I recently did some work for a global company where a strategy session was held to decide the IT architecture to support a major MDM initiative. None of the business people had even bothered to invite the internal IT department, such was the low regard in which it was held. Without good engagement between IT and the business, data warehouse projects will struggle, however up to date the technology used may be.

Mr Ladley is also spot on regarding data quality – it is always much worse than people imagine. “Ah, but the source is our new SAP system so the data is clean” is the kind of naive comment that many of us will recognise. At one project at Shell a few years ago it was discovered that 80% of the pack/product combinations being sold in Lubricants were duplicated somewhere else in the Group. At least that could be partially excused by a decentralised organisation. Yet it also turned out that of a commercial customer database of 20,000 records, only 5,000 were truly unique, and this was in one operating company. Imagine the fun with deliveries and invoice payment that could ensue.
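Numbers like these emerge from even the simplest matching. Below is a hedged sketch (hypothetical records and rules) of the normalise-and-group approach to duplicate detection; real data quality tools use fuzzy matching, but even an exact match on a cleaned-up key finds a surprising amount.

```python
# Illustrative duplicate detection: normalise the fields you match on,
# then group records that collapse to the same key. Records and rules
# here are invented for the example.
from collections import defaultdict

def match_key(record: dict) -> tuple:
    # Lower-case and strip punctuation/whitespace from the name.
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    # Strip legal suffixes that create spurious "new" customers.
    for suffix in ("ltd", "limited", "inc", "plc"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return (name, record["postcode"].replace(" ", "").upper())

customers = [
    {"name": "Acme Ltd",     "postcode": "SW1A 1AA"},
    {"name": "ACME Limited", "postcode": "SW1A1AA"},
    {"name": "Acme Ltd.",    "postcode": "SW1A 1AA"},
    {"name": "Globex Inc",   "postcode": "10001"},
]

groups = defaultdict(list)
for c in customers:
    groups[match_key(c)].append(c)

print(f"{len(customers)} records, {len(groups)} unique customers")
# 4 records, 2 unique customers
```

Once you see three spellings of the same company surviving as separate customer records, the invoicing and delivery chaos described above follows naturally.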

Certainly data warehouse projects these days have the advantage of more reliable technology and fuller function products than ten years ago, meaning less custom coding is required than used to be the case. However the basic project management lessons never change.

Appliances are proving popular

There is a useful overview of the growing appliance market in Computer Business Review.

The appliance market is nothing if not growing, with no fewer than ten appliance vendors now identified by analyst Madan Sheina (who, by the way, is one of the smarter analysts out there). Of course, apart from Teradata many of these are small or very new. Teradata accounts for about USD 1 billion in total revenue (the accounts will become much clearer once it separates from NCR), though this includes services and support, not just licences. The next largest vendor is Netezza, which does not publish its revenue (though I would estimate over USD 50M). Kognitio used to be around USD 5M in revenue, though they seemed perky when I last spoke to them so may be a little bigger now. DataAllegro will certainly be smaller than Netezza, as will the other new players. It is too early to say how well HP’s Neoview appliance will do, though clearly HP has plenty of brand and channel clout, especially now that it has acquired Knightsbridge.

Still, so many entrants to a market certainly tell you that plenty of people feel that money can be made. So far Teradata and Netezza have had the field pretty much to themselves, but the entrance of HP and the various newer vendors will create greater competition, which ultimately can only be of benefit to customers.