The MDM Blues

After living in denial for some time, IBM has got the “multi-domain” message about MDM which I have been bleating on about at length for years. It has just announced a repackaging of its MDM offerings under the banner “IBM InfoSphere MDM Server”. This puts IBM firmly on the path of a server architecture that can deal with multiple types of MDM data in a consistent manner: not just customer and product but all the many other kinds of master data, e.g. location, asset, contract, brand and financial profile. IBM has been sensibly enabling its MDM offerings in an SOA context, and MDM Server comes with 800 pre-packaged SOA services that can be invoked. IBM has bought high-quality MDM technology and now at last has a strong vision of how to bring it all together.

However it is worth emphasising that this is a roadmap. For now there will remain the separate CDI hub technology (bought from DWL) and the PIM Hub technology (bought from Trigo). Over time these technologies will be integrated with common services, but this is a multi-release strategy. It is great news that IBM has finally realised that multi-domain is the right way to go, but prospects and customers need to reassure themselves about whether the roadmap meets their time horizons.

Nowhere to hide

A Computerworld article highlights the risks that enterprise buyers run in an age of vendor consolidation. In this case the article talks about Peoplesoft and Oracle, but the point is a general one. Just how anxious should software buyers be about their vendor being acquired?

I would argue that the vendor risk issue is frequently overplayed. You may “never get fired by buying IBM”, but I recall when IBM dropped its “strategic” 4GL ADF in favour of CSP in the late 1980s, leaving plenty of seriously large customers in the lurch (I worked for Exxon at the time, which had standardised on ADF). There is a risk in any software purchase, not only about whether the vendor will go bust at some point, but also about whether the vendor will continue to maintain and enhance the particular product you are buying. People often agonise about buying software from small vendors, but in the case of a company with one product in its portfolio, you can at least be sure that it will care a lot about that product. An industry giant may have ultra-solid finances, but can decide to drop a product line if it does not do well commercially, or for other internal reasons, as in the IBM example I mentioned. There are numerous other cases, e.g. SAP MDM was dumped in favour of a new product based on acquired technology from A2i just a couple of years ago, while Oracle has plenty of “prior” in abandoning acquired product lines that did not meet its view of the world.

I believe that buyers should look at a few things in terms of risk. Look beyond the finances of the vendor to the installed base of the particular product they are buying. A product with hundreds or thousands of enterprise customers is likely to live a lot longer than one with a few. Moreover, what is the growth trajectory of the customer base? A fast-growing customer base will very likely receive continued investment, either internally in the case of an industry behemoth, or externally from venture capital firms in the case of smaller companies. The situation to be wary of, whatever the vendor size, is a small customer base that is not growing; that should set warning bells ringing. Of course vendors may be very coy about revealing figures, but you can for example try to talk to the chairman of a product user group to get a sense of how well the customer base is growing; a user group with shrinking numbers of attendees would be a worrying sign.

Above all, customers need to ensure that their investment has a clear and rapid payback. If you spend a million dollars in licences, with 20% annual support, and four million in services putting it in, you should be able to stack up on the other side of the balance sheet the benefits that you are expecting to see. If the benefit case has a payback period of (say) a year, then it is less of an issue to worry about whether the vendor will be around in ten years’ time. If you have a choice between a mediocre product from a “safe” vendor and a much more productive product from a smaller, riskier vendor, then you should be able to quantify what the difference in productivity is worth to you. If the better, riskier technology saves you millions of dollars a year and pays back in eight months versus the alternative, then what sense does it make to accept an inferior technology that will actually cost you many millions in poor productivity, however “safe” it may be?
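To make the arithmetic concrete, here is a minimal sketch of that payback comparison in Python. All of the figures are illustrative (the licence, services and benefit numbers are invented for the example, not taken from any real deal):

```python
# Illustrative payback comparison. All figures are invented for the
# example (not from any specific vendor deal): licence and services are
# paid up front, support is an ongoing annual cost, and the annual
# benefit is the saving the project is expected to deliver.

def payback_months(licence, services, annual_support, annual_benefit):
    """Months until cumulative benefit covers the up-front investment."""
    upfront = licence + services
    net_monthly = (annual_benefit - annual_support) / 12.0
    if net_monthly <= 0:
        return None  # the project never pays back
    return upfront / net_monthly

# "Safe" vendor: USD 1M in licences, 20% annual support, USD 4M in
# services, delivering USD 6M a year of benefit.
safe = payback_months(1_000_000, 4_000_000, 200_000, 6_000_000)

# Riskier but more productive product: cheaper to implement and saving
# millions more per year (again, hypothetical numbers).
risky = payback_months(750_000, 2_000_000, 150_000, 10_000_000)

print(f"safe vendor pays back in {safe:.1f} months")
print(f"riskier vendor pays back in {risky:.1f} months")
```

With these invented numbers the “safe” option pays back in roughly ten months and the riskier one in a little over three; the point is simply that the gap is quantifiable rather than a matter of gut feel.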

As discussed earlier, very few product lines are completely safe anyway, given the tendency of vendors to cull non-performing product lines and encourage “migration” to newer (read “profitable”) products. If you have a fast enough payback then you can be philosophical about a migration a few years down the road. It all comes down to rigorous cost-benefit analysis of the software life-cycle, sadly something all too few customers pay proper attention to.

Broadening Information Access

I saw an interesting demo today from Endeca, which bills itself as an “information access” company. Of course every self-respecting BI company would describe itself in a similar way, but Endeca’s technology is quite different in approach from that of the BI vendors. If you build a data warehouse and then add BI reporting to it, you quickly realise that “ad hoc” reporting by end-users is fine on the prototype with a few hundred records, but less amusing if there are a few hundred million records involved. Hence in real life aggregates are pre-calculated, predefined reports are carefully tuned and cubes (e.g. with Cognos PowerPlay or similar) are built on common subsets of data that the users are likely to want. There is always a careful trade-off between flexibility and performance. Moreover, the unstructured world of documents and emails is pretty much a separate dimension, however much in reality the context of a business transaction may be described by those emails and documents rather than by what is stored in the sales order system.

Endeca has a proprietary database engine which is designed to combine both structured and unstructured data in a flexible way. The MDEX engine does not just store metadata such as hierarchies and structures, but also master data such as lists of product codes. It also indexes documents and emails from corporate systems (a series of adaptors comes with the technology). The technology makes much use of in-memory searches and caching to optimise performance. Some of the implementations can be large and complex: one deployed pensions system has 800 million records, while a deployed electronic parts application has 20,000 distinct attributes.

An example of such a system that resonated with me was a “human capital” demo based on the idea of a consultancy practice manager. A screen was shown allowing filtering on a range of areas, e.g. consultants’ billing rates, availability, location etc. So far this looked just like the kind of thing you could prepare with a BI tool, e.g. you could select consultants available in the next two weeks, with a billing rate of such and such, etc., and the list of consultants would dynamically refresh. No big deal. However the next filter was “all consultants based within x miles of Detroit”; the consultant records had been tagged with geocodes and the engine calculated distances from this information. Next a query was made to find all those who also spoke French, this information not being a database index but something buried away in the consultants’ resumes, i.e. in unstructured document form. Good luck writing SQL to handle these kinds of filters!
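For illustration, here is roughly what those three filters look like if you have to hand-roll them; this is a toy Python sketch over invented consultant records (Endeca’s engine, of course, does this declaratively and at scale, which is the point):

```python
import math

# Toy version of the demo's combined filters, over an in-memory
# dataset. All names, rates and coordinates here are made up.

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    r = 3959.0  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

consultants = [
    {"name": "Ames", "rate": 150, "lat": 42.33, "lon": -83.05,   # Detroit
     "resume": "Ten years of ERP rollouts; fluent in French and German."},
    {"name": "Baker", "rate": 200, "lat": 41.88, "lon": -87.63,  # Chicago
     "resume": "Data warehouse specialist; conversational Spanish."},
]

DETROIT = (42.33, -83.05)

matches = [
    c["name"] for c in consultants
    if c["rate"] <= 175                                      # structured filter
    and haversine_miles(c["lat"], c["lon"], *DETROIT) <= 50  # geocode filter
    and "french" in c["resume"].lower()                      # unstructured filter
]
print(matches)  # only Ames passes all three filters
```

Even this toy version needs a distance formula and a full-text scan bolted onto the structured predicates, which is exactly what a SQL reporting stack does not give you out of the box.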

There are plenty of situations where this mix of structured and unstructured information is important, and Endeca has prospered as a company from this dawning realisation. The company has doubled its revenue for five years in a row, and in Q4 2007 did USD 30 million in revenue, two-thirds of this in software licences. With a strong base of retail customers such as Tesco and Walmart, other verticals strongly represented include government, with customers such as the FBI, CIA and NASA; financial services, e.g. ABN Amro; and manufacturing, e.g. Boeing and Schlumberger. There are now 500 enterprise customers in all.

The recent acquisition of arch-competitor FAST by Microsoft demonstrates how this market is increasingly recognised as key by the industry giants. While there are plenty of competitors out there the only others in the current Gartner Leaders quadrant for this market are FAST, IBM (with Omnifind) and Autonomy, which is much more established in unstructured enterprise search. Endeca has set an impressive pace of growth, and it seems to me that there are plenty of situations in other verticals e.g. healthcare, that could suit its technology.

Data quality whining

The data quality market is a paradoxical one, as I have discussed before. There is a plethora of vendors, yet few have revenues over USD 10 million. Despite this track record of marginalisation, more are popping up all the time. I am aware of 26 separate data quality vendors today, and this excludes the data quality offerings that have been absorbed into larger vendors such as SAS (DataFlux), Informatica (Similarity Systems), IBM (Ascential Quality Stage) and Business Objects (First Logic). Assuming that you care about data quality at all (and too few do) then how do you go about selecting one?

Well, one thing the industry has done itself no favours over is its confusing and technical terminology (if you don’t think terminology that the buyer understands matters, ask French and German wine producers why Australian and other wine producers are drinking their lunch). A data quality tool may cover several stages, such as profiling, standardisation, matching and enrichment.

Let’s just take one stage: matching. Vendors with data matching technology use a variety of techniques to match up candidate data records. These include:

heuristic matching (based on experience)
probabilistic (rules based)
deterministic (based on templates)
empirical (using dictionaries)

and this is not a comprehensive set. I saw an interesting technology today from Netrics which uses a different (patented) matching approach based on “bipartite graphs” (which in fact looked very impressive). How is an end-user buyer to make any sense of this maze? Certainly different data classes may demand different approaches; e.g. customer name and address data is highly structured and may suggest a different approach from much less structured or more complex data (such as product data, or asset data).
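To make the matching discussion slightly more concrete, here is a deliberately simple baseline: plain edit-distance matching. This is an illustrative sketch only, not any vendor’s algorithm, and the product names are invented:

```python
# Baseline fuzzy matcher: score candidates by Levenshtein edit distance
# and accept the closest one if it is within a small threshold.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def best_match(query, candidates, max_edits=2):
    """Return the closest candidate, or None if nothing is close enough."""
    scored = [(levenshtein(query.lower(), c.lower()), c) for c in candidates]
    dist, winner = min(scored)
    return winner if dist <= max_edits else None

products = ["Widget Mk II", "Widget Mk III", "Gadget Mk I"]
print(best_match("Widgit Mk II", products))  # → Widget Mk II (one substitution)
```

Even a crude measure like this catches simple misspellings; the vendors’ argument, of course, is that their more exotic techniques do better on the hard cases, which is exactly the claim a buyer struggles to verify.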

I am not sure of the merits of introducing something like a TPC-A benchmark for data quality (such benchmark exercises are tricky to pin down, and vendors make great efforts to “game” them). However it would not seem that hard to take some common data quality issues, construct a set of common errors (transposed letters, missing letters or numbers, spurious additional letters or common misspellings) and try to match these up to a sample dataset in a way that compared the various algorithmic approaches, or indeed directly compared the effectiveness of vendor products. By ensuring that different data types (not just customer name and address) are covered, such an approach may not result in a single “best” approach or product, but would show where certain approaches shine and others are less well suited. This in itself would be useful information for potential buyers, who at present must try to set up such bake-off comparisons themselves.
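A sketch of such a benchmark harness is not hard to write. The one below generates those three common error types against an invented reference list and scores a stand-in matcher (Python’s difflib); a real exercise would plug each vendor’s algorithm into the `match` step:

```python
import random
import difflib

# Toy data quality benchmark: corrupt known-good values with common
# data-entry errors, then measure how often a matcher recovers the
# original. The reference list is invented; difflib stands in for
# whatever matching algorithm a vendor ships.

def corrupt(word, rng):
    """Apply one common error: transposed, missing or spurious letter."""
    i = rng.randrange(len(word) - 1)
    kind = rng.choice(["transpose", "delete", "insert"])
    if kind == "transpose":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "delete":
        return word[:i] + word[i + 1:]
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]

def match(query, reference):
    """Stand-in matcher: highest similarity ratio wins."""
    return max(reference, key=lambda r: difflib.SequenceMatcher(None, query, r).ratio())

reference = ["birmingham", "manchester", "liverpool", "sheffield", "newcastle"]
rng = random.Random(42)  # fixed seed so the benchmark is repeatable

trials = 200
hits = 0
for _ in range(trials):
    truth = rng.choice(reference)
    if match(corrupt(truth, rng), reference) == truth:
        hits += 1

print(f"matcher recovered {hits} of {trials} corrupted names")
```

Run the same harness over several matchers and several data types and you have the beginnings of exactly the comparative information buyers currently lack.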

In the absence of any industry-wide benchmarks, each potential customer must set up their own benchmarks and attempt to navigate the maze of arcane terminology, approaches and the large number of vendors themselves each time. Such complexity of terminology must lengthen sales cycles and make the data quality industry less appealing to buyers, who may just give up and wait for a larger vendor to add data quality as a feature (possibly in a manner that is sub-optimal for their particular needs).

Consider the wine analogy. If you buy a French wine you must navigate the subtleties of region, village, grower and vintage. For example I am looking right now at a bottle with the label “Grand Vin de Leoville Marquis de Las Cases St Julien Medoc Appellation St Julien Controlee 1975” (it is from Bordeaux, but actually omits this from the label). Alternatively I can glance over to a (lovely) Italian wine from Jermann with the label “Where Dreams have No End”. Both are fine wines, but which is more likely to appeal to the consumer? Which is more inviting? The data quality industry has something to learn about marketing, in my view, just as the French wine industry has.

The Brits are coming

Not the Oscars this time, but a data warehouse appliance. Teradata carved out a successful high-end niche in database and hardware technology specifically aimed at analytic rather than transactional processing, succeeding where previous attempts (e.g. Red Brick, Britton Lee) had faltered. However it is the rapid rise of Netezza that caused a flurry of look-alike appliance vendors to sprout up in the last couple of years, such as DatAllegro, Dataupia, ParAccel etc. I believe that it will be much easier to convince conservative buyers about appliances if they do not come with proprietary hardware, and indeed this is the approach taken by Dataupia. However the software-only appliance route was taken a couple of years earlier by Kognitio (a re-brand of Whitecross). Kognitio initially had a proprietary hardware link and had built up some impressive references in the UK such as BT (who have serious data volumes), but had not enjoyed the broad commercial success it might have done; in my view it was held back by the proprietary hardware issue (especially in a conservative UK market). This has been addressed, and a major re-engineering exercise has now allowed its WX2 V6 product to run on commodity x86 hardware such as blade servers.

WX2 uses scanning technology with no indexes, and is an RDBMS that uses hardware parallelism and smart use of memory in preference to disk access where possible to achieve its performance. The product reads in data from a flat file, loads it quickly (1 terabyte an hour) and can then achieve extremely fast read performance: in one test 23 billion rows were read in two seconds. This approach differs from column-oriented databases (e.g. Sybase IQ, ParAccel), whose design can also achieve high performance for certain analytic queries but is inherently less flexible. A typical Kognitio implementation may involve 80 servers in groups of four. Resilience is obviously a key issue for such large data volumes, and the company claims that if you pull a server out of the rack and so artificially crash the system, it is able to restart in just a few minutes.

The technology does not compete with data quality tools, as it assumes that pre-validation of data has been completed prior to loading. It could be characterised in philosophy as ELT (rather than ETL), since with such fast performance at its disposal it may be more efficient to carry out transformations within the database engine than to pre-process the data prior to loading. An ODBC interface allows the loaded data to be queried by any normal reporting tool. Against conventional databases such as Oracle, appliances can show dramatic results: in one recent proof of concept on a half-terabyte sample database, some queries were demonstrated to be 40 times faster than on the existing warehouse.
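As a toy illustration of the ELT pattern (using SQLite as a stand-in for the appliance; the table and column names are invented for the example), the raw rows are loaded untouched and the transformation then runs as SQL inside the engine:

```python
import sqlite3

# ELT sketch: load first, transform inside the database afterwards,
# rather than transforming in a pre-processing step before the load.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (region TEXT, amount TEXT)")

# 1. Extract + Load: bulk-insert the flat-file rows exactly as they come.
raw_rows = [("north ", "100.5"), ("SOUTH", "200.0"), ("north ", "50.25")]
conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", raw_rows)

# 2. Transform in the engine: trim, normalise case, cast and aggregate.
conn.execute("""
    CREATE TABLE sales AS
    SELECT upper(trim(region)) AS region,
           SUM(CAST(amount AS REAL)) AS total
    FROM staging_sales
    GROUP BY upper(trim(region))
""")

for row in conn.execute("SELECT region, total FROM sales ORDER BY region"):
    print(row)  # → ('NORTH', 150.75) then ('SOUTH', 200.0)
```

With an engine that scans a terabyte an hour, step 2 is cheap, which is what makes pushing the transformation inside the database attractive.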

Kognitio already has nearly half its customers on its software as a service model, which I wrote about previously. The more traditional licences result in orders typically in the GBP 300k – 1.2M range. The company has added more solid customer references such as Marks and Spencer and Scottish Power (it has a few dozen customers now), and has grown to 78 employees and around GBP 8 million in revenue, having been profitable for three years. This solid commercial performance has now given it the base to branch out into the massive US market, and it is about to open a head office in Chicago with sales offices in Boston and San Francisco.

Kognitio has the advantage of non-proprietary hardware ties (unlike Netezza) and a solid and lengthy track record of successful reference customers (unlike more recent appliance start-ups), which should be a potent combination if it can crack sales and marketing to the US market.

Finding reports, naturally

Another example of innovation in the seemingly mature world of BI can be found lurking within the unlikely setting of Progress Software (Progress acquired EasyAsk in May 2005). EasyAsk is a product which combines search capability with a natural language interface that can generate SQL to run against data warehouses. This unusual combination has led it to be used in many eCommerce sites, allowing natural language inquiries to be translated into product offerings on web sites.

However the technology is a natural (excuse the pun) fit for a rather understated but very real problem in large organisations: actually finding existing reports or pieces of analysis. Most large companies have invested in licences of Cognos, Business Objects or other reporting and analysis software, but what happens after the initial project set-up? The implementation consultants typically set up some pre-configured environments (e.g. a Business Objects universe) and perhaps a little training, and end-user analysts then supposedly have at the data warehouse with glee. In reality most end users have no desire to learn a tool beyond Excel, so most rely on pre-built reports, e.g. monthly sales figures, being set up for them by the IT department. A subset of end-users, typically people with “analyst” somewhere in their job title, are happy to do “ad hoc reporting”, though to be honest most of these characters could make do with a command line SQL interface rather than a fancy reporting tool if push came to shove.

The big issue is one of wasted effort due to lack of re-use. If one analyst spends a few hours coming up with a new take on sales profitability, surely this would be useful for others? Yet generally if a request comes down to produce a report, people start from scratch even if there are perfectly good reports already produced by someone else in the company. They just do not know they are there.

This is where tools with strong search capability can help. Certainly this is not new, and Autonomy, FAST, Endeca etc can be helpful in tracking down existing information. Yet such tools are really designed for unstructured data rather than structured data. EasyAsk has the advantage that it provides end-users with the ability to do natural language queries if they don’t quite find what they need. The leading BI players have begun to realise how much of an issue this is in recent years, e.g. Business Objects’ purchase of Inxight. However there is plenty of room for a pure-play alternative, as this is a problem that is barely addressed in most large companies.

One complication that EasyAsk will encounter is a natural hostility in IT departments to natural language interfaces, since hoary DBA types (I started as a DBA, so can say this kind of thing) are never going to trust that a generated piece of SQL from a question like “find me the most profitable sales region” is going to get the right answer. EasyAsk addresses this concern somewhat by having subject dictionaries that are compiled with a domain expert (e.g. in HR this might equate the phrases “laid off”, “let go”, “fired” and “terminated”) in order to give its technology a better chance of formulating the right answer, and of course you can always switch on a trace to see the SQL generated, check what is going on and get it looked over by an IT type. However if a DBA has to check the generated SQL every time before approving a new report then this rather defeats the object of the exercise in the first place.
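The subject-dictionary idea can be sketched in a few lines. This is not EasyAsk’s actual implementation; the synonym entries and the table and column names below are invented for illustration, and the generated SQL is returned as text precisely so a wary DBA can inspect it:

```python
# Sketch of a subject dictionary: map the phrases a user might type
# onto one canonical database term, then generate SQL from the
# normalised question. All entries and schema names are hypothetical.

SYNONYMS = {
    "laid off": "terminated",
    "let go": "terminated",
    "fired": "terminated",
}

def normalise(question: str) -> str:
    """Lower-case the question and collapse synonyms to canonical terms."""
    q = question.lower()
    for phrase, canonical in SYNONYMS.items():
        q = q.replace(phrase, canonical)
    return q

def to_sql(question: str) -> str:
    """Generate SQL for the (single) question shape this sketch handles."""
    q = normalise(question)
    if "terminated" in q:
        # Returned as text so the generated SQL stays visible for review.
        return "SELECT * FROM employees WHERE status = 'terminated'"
    raise ValueError(f"don't know how to answer: {question!r}")

print(to_sql("show me everyone who was laid off"))
```

The same trace-style visibility is what lets an IT department spot-check the tool’s output without vetting every single report.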

For this reason EasyAsk probably need to target end-users rather than IT departments, who will probably always be a tough crowd for them. If they can get to the right audience, then addressing the problem of making better use of all those pre-existing canned reports is a very real problem to which a large dollar value can be attached. They seem to have made an impression with customers like GSK, Forbes and BASF, and their technology is already embedded within several other companies’ applications. I recall from my days at Shell that this is a widespread issue in large companies, so exploiting existing BI investment should be a happy hunting ground for companies with the right value proposition.

Last exits?

Happy New Year. There was a useful post regarding 2007 technology IPOs. Some of these are outside the scope of this column, but what I found interesting was that it seems that enterprise software companies were able to tap the capital markets at levels of revenue and profitability unseen over the last few years. A couple of years back the message from investment bankers was that you needed quarterly revenues of not less than USD 20 million (and preferably more) and several quarters of profitability before even considering an IPO. Yet Netezza’s IPO got away OK despite a lack of profitability (though strong growth), while Sourcefire had quarterly revenue of under USD 15 million and was still not profitable, yet also managed an IPO. It looks as if the markets have taken a slightly harder view since then judging by the early performance of these shares, but these IPOs would simply not have happened in 2004 or 2005.

What is less clear is whether 2008 will show the same softening of view, or whether the financial debt crisis afflicting banks will have collateral damage in the IPO market. Software companies and their backers will be hoping for a continued thaw rather than a return to the wintry outlook the capital markets have seen in the past few years.