Broadening Information Access

I saw an interesting demo today from Endeca, which bills itself as an “information access” company. Of course every self-respecting BI company would describe itself in a similar way, but Endeca’s technology takes quite a different approach from that of the BI vendors. If you build a data warehouse and then add BI reporting to it, you quickly realise that “ad hoc” reporting by end-users is fine on the prototype with a few hundred records, but less amusing if there are a few hundred million records involved. Hence in real life aggregates are pre-calculated, predefined reports are carefully tuned and cubes (e.g. with Cognos Powerplay or similar) are built on common subsets of data that the users are likely to want. There is always a careful trade-off between flexibility and performance. Moreover the unstructured world of documents and emails is pretty much a separate dimension, however much in reality the context of a business transaction may be described by those emails and documents rather than by what is stored in the sales order system.
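To make the flexibility versus performance trade-off concrete, here is a minimal sketch in Python. It is purely illustrative and has nothing to do with how Cognos or any other BI product is implemented; the sales rows and the function names `build_aggregate` and `ad_hoc_total` are invented for the example.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, amount). Invented for illustration.
SALES = [
    ("EMEA", "widgets", 120.0),
    ("EMEA", "gadgets", 80.0),
    ("APAC", "widgets", 50.0),
    ("EMEA", "widgets", 30.0),
]


def build_aggregate(rows):
    """Pre-calculate the total amount per (region, product), much as a cube build would."""
    totals = defaultdict(float)
    for region, product, amount in rows:
        totals[(region, product)] += amount
    return dict(totals)


def ad_hoc_total(rows, predicate):
    """Flexible but slow: scan every row for whatever filter the user invents."""
    return sum(
        amount
        for region, product, amount in rows
        if predicate(region, product, amount)
    )


# Built once, ahead of time; predefined questions become a dictionary lookup.
CUBE = build_aggregate(SALES)
```

A predefined question such as `CUBE[("EMEA", "widgets")]` is answered by a lookup, however many rows sit underneath it. An unplanned question such as `ad_hoc_total(SALES, lambda region, product, amount: amount > 60)` cannot be answered from the aggregate at all, because the row-level amounts were summed away, so it has to go back and scan the detail.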

Endeca has a proprietary database engine, MDEX, which is designed to combine both structured and unstructured data in a flexible way. The MDEX engine does not just store metadata such as hierarchies and structures, but also master data such as lists of product codes. It also indexes documents and emails from corporate systems (a series of adaptors comes with the technology). The technology makes much use of in-memory searches and caching to optimise performance. Some of the implementations can be large and complex: one deployed pensions system has 800 million records, while a deployed electronic parts application has 20,000 distinct attributes.

An example of such a system that resonated with me was a “human capital” demo which was based on the idea of a consultancy practice manager. A screen was shown allowing filtering on a range of areas, e.g. consultants’ billing rates, availability, location etc. So far this looked just like the kind of thing you could prepare with a BI tool, e.g. you could select consultants available in the next two weeks, with a billing rate of such and such, etc., and the list of consultants would dynamically refresh. No big deal. However the next filter was “all consultants based within x miles of Detroit”; the consultant records had been tagged with geocodes and the engine calculated distances from this information. Next a query was made to find all those who also spoke French, this information not being held in an indexed database field but buried away in the consultants’ resumes, i.e. in unstructured document form. Good luck writing SQL to handle these kinds of filters!
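The shape of that combined query can be sketched in a few lines of Python. This is a toy illustration of the idea only, not Endeca's engine or API: the `Consultant` record, the sample people, the coordinates and the naive substring match on the resume are all assumptions made for the example, and a real engine would use proper indexes rather than a linear scan.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8
DETROIT = (42.3314, -83.0458)  # approximate latitude, longitude


@dataclass
class Consultant:
    name: str
    billing_rate: float     # structured field (hypothetical daily rate)
    available_in_days: int  # structured field: days until the consultant is free
    lat: float              # geocode tagged onto the record
    lon: float
    resume: str             # unstructured text


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles, using the haversine formula."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (
        sin(dlat / 2) ** 2
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def find_consultants(people, max_rate, within_days, max_miles, resume_term):
    """Apply structured, geographic and free-text filters in a single pass."""
    term = resume_term.lower()
    return [
        p
        for p in people
        if p.billing_rate <= max_rate
        and p.available_in_days <= within_days
        and miles_between(p.lat, p.lon, *DETROIT) <= max_miles
        and term in p.resume.lower()  # naive stand-in for a text index
    ]


# Invented sample data.
CONSULTANTS = [
    Consultant("Ann", 900, 7, 42.2808, -83.7430, "Fluent in French and German."),
    Consultant("Bob", 850, 10, 41.8781, -87.6298, "Speaks French; retail analytics."),
    Consultant("Cy", 700, 3, 42.3314, -83.0458, "Spanish speaker; data warehousing."),
    Consultant("Dee", 1500, 5, 42.4000, -83.1000, "Native French speaker."),
]

matches = find_consultants(
    CONSULTANTS, max_rate=1000, within_days=14, max_miles=50, resume_term="french"
)
```

With this sample data only Ann survives all four filters: Bob is based too far away, Cy's resume does not mention French, and Dee is over the rate limit. The point is that the rate and availability filters, the distance calculation and the resume search are all just predicates over the same record, which is awkward to express when the resume lives outside the database.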

There are plenty of situations where this mix of structured and unstructured information is important, and Endeca has prospered as a company from this dawning realisation. The company has doubled its revenue for five years in a row, and in Q4 2007 did USD 30 million in revenue, two-thirds of this in software licences. It has a strong base of retail customers such as Tesco and Walmart, and other strongly represented verticals include government (with customers such as the FBI, CIA and NASA), financial services (e.g. ABN Amro) and manufacturing (e.g. Boeing, Schlumberger). There are now 500 enterprise customers in all.

The recent acquisition of arch-competitor FAST by Microsoft demonstrates how this market is increasingly recognised as key by the industry giants. While there are plenty of competitors out there, the only others in the current Gartner Leaders quadrant for this market are FAST, IBM (with Omnifind) and Autonomy, which is much more established in unstructured enterprise search. Endeca has set an impressive pace of growth, and it seems to me that there are plenty of situations in other verticals, e.g. healthcare, that could suit its technology.