Data quality whining

The data quality market is a paradoxical one, as I have discussed before. There is a plethora of vendors, yet few have revenues over USD 10 million. Despite this track record of marginalisation, more are popping up all the time. I am aware of 26 separate data quality vendors today, and this excludes the data quality offerings that have been absorbed into larger vendors such as SAS (DataFlux), Informatica (Similarity Systems), IBM (Ascential QualityStage) and Business Objects (Firstlogic). Assuming that you care about data quality at all (and too few do), how do you go about selecting a tool?

Well, one thing the industry has done itself no favours over is its confusing and technical terminology (if you don’t think terminology that the buyer understands matters, ask French and German wine producers why Australian and other wine producers are drinking their lunch). A data quality tool may cover several stages:

discovery
profiling
matching
enrichment
consolidation
monitoring

Let’s take just one of these stages: matching. Vendors with data matching technology use a variety of techniques to match up candidate data records. These include:

heuristic matching (based on experience)
probabilistic (based on statistical likelihood)
deterministic (rules based)
empirical (using dictionaries)

This is not a comprehensive set. I saw an interesting technology today from Netrics, which uses a different (patented) matching technology based on “bipartite graphs” (and which in fact looked very impressive). How is an end-user buyer to make any sense of this maze? Certainly different data classes may demand different approaches, e.g. customer name and address data is highly structured and may suggest a different approach from much less structured or more complex data (such as product data, or asset data).

I am not sure of the merits of introducing something like a TPC-A benchmark for data quality (such benchmark exercises are tricky to pin down and vendors make great efforts to “game” them). However, it would not seem that hard to take some common data quality issues, build a test set of typical errors (transposed letters, missing letters or numbers, spurious additional letters or common misspellings) and try to match these against a sample dataset in a way that compares the various algorithmic approaches, or indeed directly compares the effectiveness of vendor products. By ensuring that different data types (not just customer name and address) are covered, such an exercise may not yield a single “best” approach or product, but it would show where certain approaches shine and where others are less well suited. This in itself would be useful information for potential buyers, who at present must try to set up such bake-off comparisons themselves.
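To give a flavour of how small such a test harness could be, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample records are invented, the function names are hypothetical, and the two matchers (an exact comparison and a similarity score from Python's standard difflib module) are simple stand-ins, not any vendor's algorithm. It injects three of the error types mentioned above into each record and reports how often each matcher recovers the original.

```python
import difflib
import random

# Invented sample data covering more than one data type (assumption).
SAMPLE_RECORDS = [
    "Jonathan Smith, 12 High Street, Leeds",
    "Margaret O'Brien, 4 Station Road, Bristol",
    "Acme Industrial Supplies Ltd, Unit 7, Slough",
    "Hex bolt M8 x 40mm zinc plated, box of 100",
    "Centrifugal pump 3kW stainless steel, 50mm inlet",
]


def transpose(text, rng):
    """Swap two adjacent characters (transposed letters)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def drop_char(text, rng):
    """Delete one character (missing letter or number)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def insert_char(text, rng):
    """Insert one random lowercase letter (spurious additional letter)."""
    i = rng.randrange(len(text) + 1)
    return text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + text[i:]


ERRORS = {"transposed": transpose, "missing": drop_char, "spurious": insert_char}


def exact_match(query, reference):
    """Stand-in for a strict rule: equal after trimming and lowercasing."""
    wanted = query.strip().lower()
    for record in reference:
        if record.strip().lower() == wanted:
            return record
    return None


def similarity_match(query, reference, cutoff=0.8):
    """Stand-in for fuzzy matching: best difflib similarity above a cutoff."""
    hits = difflib.get_close_matches(query, reference, n=1, cutoff=cutoff)
    return hits[0] if hits else None


MATCHERS = {"exact": exact_match, "similarity": similarity_match}


def benchmark(reference, matchers, seed=0):
    """Return {(error type, matcher name): share of records recovered}."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    results = {}
    for error_name, corrupt in ERRORS.items():
        # Corrupt once per error type so every matcher sees the same inputs.
        corrupted = [(corrupt(rec, rng), rec) for rec in reference]
        for matcher_name, matcher in matchers.items():
            hits = sum(matcher(bad, reference) == good for bad, good in corrupted)
            results[(error_name, matcher_name)] = hits / len(reference)
    return results


if __name__ == "__main__":
    for (error_name, matcher_name), rate in benchmark(SAMPLE_RECORDS, MATCHERS).items():
        print(f"{error_name:<11} {matcher_name:<11} {rate:.0%}")
```

A real bake-off would need far larger and messier datasets, common misspellings as well as single-character errors, and a measure of false matches as well as missed ones, but the shape of the exercise would be much the same: one table of error type against approach, per data type.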

In the absence of any industry-wide benchmarks, each potential customer must set up their own and attempt to navigate the maze of arcane terminology, competing approaches and numerous vendors themselves, every time. Such complexity of terminology must lengthen sales cycles and make the data quality industry less appealing to buyers, who may simply give up and wait for a larger vendor to add data quality as a feature (possibly in a manner that is sub-optimal for their particular needs).

Consider the wine analogy. If you buy a French wine you must navigate the subtleties of region, village, grower and vintage. For example I am looking right now at a bottle with the label “Grand Vin de Leoville Marquis de Las Cases St Julien Medoc Appellation St Julien Controlee 1975” (it is from Bordeaux, but actually omits this from the label). Alternatively I can glance over to a (lovely) Italian wine from Jermann with the label “Where Dreams have No End”. Both are fine wines, but which is more likely to appeal to the consumer? Which is more inviting? The data quality industry has something to learn about marketing, in my view, just as the French wine industry has.