Size isn’t everything

An October 2005 survey by IT Toolbox shows that, even amongst large companies, the size of the corporate data warehouse is, er, not that big. Out of 156 responses (40% from the US), only 12% had enterprise data warehouses larger than 4TB, 18% had warehouses between 1TB and 4TB, and the rest had less than 1TB; indeed, 25% had data warehouses smaller than half a terabyte. Admittedly only 20% of respondents had just one data warehouse, with 26% having more than five, but these figures may seem odd when you hear about gigantic data warehouses in the trade press. Winter Group publish a carefully checked list of the ten largest data warehouses in the world, and their 2005 survey shows the winner, at Yahoo, weighing in at 100TB. The tenth largest (at Nielsen), however, is just 17TB, which shows that such mammoths are still a rarity.

Why are IT folks so obsessed with this? I recall speaking at a data warehouse conference a few years ago where speaker after speaker eagerly quoted the size of his data warehouse as some sort of badge of courage: “Well, you should see how big mine is…”. Of course companies that sell hardware and disk storage love such things, but why is there such a big discrepancy between the behemoths in the Winter Group survey and the less-than-a-terabyte brigade? The answer is quite simple: business-to-business companies don’t have large transaction volumes. If you are a large retailer or a high-street bank, then you may have thousands of branches, each one contributing thousands of individual transactions a day. These add up, and constitute the vast majority of the volume in a data warehouse (perhaps 99% of it). The rest is the pesky master data (or reference data, or dimension data – choose your jargon) such as “product”, “customer”, “location”, “brand”, “person”, “time” etc that provides the context for these business transactions. You may have millions of transactions a day as a retailer, but how many different products do you stock? 80,000 for a convenience store chain? 300,000 for a department store? Certainly not tens of millions. Similarly, McDonald’s has 27,000 retail outlets, not millions. The same goes for organizational units, employees etc. One exception that can be very large is “customer”, but this is true only for business-to-consumer enterprises, e.g. retailers or telcos. Companies like Unilever are very large indeed, but primarily sell to other businesses, so the number of direct customers they deal with is measured in the thousands, not millions.
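To put some flesh on that 99% claim, here is a back-of-the-envelope sketch in Python. The branch count, transaction rate and retention period are illustrative assumptions of mine, not figures from the surveys above, but they show why transactions dwarf master data:

    # Hypothetical large retailer: transaction rows vs. master data rows.
    # All figures below are illustrative assumptions, not survey data.
    branches = 2_000                  # stores in the chain
    transactions_per_day = 5_000      # till transactions per store per day
    days_retained = 3 * 365           # three years of history kept

    transaction_rows = branches * transactions_per_day * days_retained

    # Master (dimension) data is tiny by comparison:
    master_rows = (
        300_000    # products (department-store scale)
        + 2_000    # locations
        + 50_000   # employees
        + 5_000    # brands, organisational units, etc.
    )

    share = 100 * transaction_rows / (transaction_rows + master_rows)
    print(f"{transaction_rows:,} transaction rows")   # 10,950,000,000
    print(f"{master_rows:,} master data rows")        # 357,000
    print(f"transactions = {share:.3f}% of rows")     # 99.997%

Even with generous estimates for the master data, the transactions account for well over 99% of the rows.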

So B2B enterprises usually have quite small data warehouses in volume terms, even though they may have extremely sophisticated and complex master data, e.g. elaborate customer segmentation or product and asset classification. One way to measure such complexity is to add up the types of business entity in the data model: each level of a product hierarchy might count as one “class of business entity” (CBE), “customer” as another. Some data warehouses that are very large in volume terms have quite simple business models to support, perhaps with 50 CBEs. On the other hand, a marketing system for a company like BP may have 400 or more CBEs. This dimension of complexity is actually just as important as raw transaction volume when estimating likely data warehouse performance. A data warehouse with 1TB of data but 50 CBEs may be a lot less demanding than one with 200GB of data but 350 CBEs (just think of all those database joins). Oddly, this complexity measure never seems to feature in league tables of data warehouse size, perhaps because it doesn’t sell much disk storage. I feel a new league table coming on. Anyone out there got a model with more than 500 CBEs?
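As a crude illustration of why the CBE count bites, consider how the number of potential pairwise joins grows with the number of entity types. A real schema is a sparse graph rather than a fully connected one, of course, but the quadratic growth is the point of this sketch:

    # Potential pairwise joins between n classes of business entity (CBE):
    # n choose 2. Real models join far fewer pairs, but growth is quadratic.
    from math import comb

    for cbes in (50, 350, 500):
        print(f"{cbes:>3} CBEs -> {comb(cbes, 2):>7,} possible entity pairs")

    #  50 CBEs ->   1,225 possible entity pairs
    # 350 CBEs ->  61,075 possible entity pairs
    # 500 CBEs -> 124,750 possible entity pairs

A 350-CBE model has roughly fifty times as many potential join paths as a 50-CBE one, which is why the 200GB warehouse can be the harder beast to optimise.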

How much data is there?

Bob Zurek rightly questions a statement from Eric Schmidt of Google that “there are 5 million TB of data in the world”. A study by the University of California at Berkeley estimated that some 1,500 million GB (about 1.5 exabytes) of data were produced in the world in 1999 alone.
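Putting the two estimates into the same units makes the discrepancy obvious (a quick sketch, taking the Berkeley figure as roughly 1.5 exabytes for 1999):

    # Convert both estimates to exabytes (1 EB = 1,000,000 TB = 10**9 GB).
    schmidt_world_total_tb = 5_000_000    # "5 million TB in the world"
    berkeley_1999_gb = 1_500_000_000      # ~1,500 million GB produced in 1999

    print(schmidt_world_total_tb / 1_000_000)    # 5.0 EB, claimed world total
    print(berkeley_1999_gb / 1_000_000_000)      # 1.5 EB, produced in 1999 alone

If a single year in the late 1990s produced 1.5 exabytes, then a world total of 5 exabytes several years later looks implausibly low.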

Another interesting question is how much structured information there is in large companies. Even if you restrict the question dramatically, to structured data on servers (not PCs) in large corporations, it is still a large number. With the largest data warehouses in the world weighing in these days at around 100TB, and 1TB data warehouses being quite commonplace, even if each Global 5,000 company had just 10TB of data in total, we arrive at 50,000 TB. This excludes the vast number of companies in the world below the Global 5,000, and very large companies in industries like retail, telco and retail banking will certainly have hundreds of terabytes of structured data each. Those with long memories will recall the puny disk drive sizes we used to deal with even on mainframes in the 1980s, which does make you wonder, given that all those big companies worked just fine then. As they say about processors, Intel giveth, and Microsoft taketh away… perhaps there should be a similar truism for disk storage?
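For what it is worth, the arithmetic behind that 50,000 TB figure is easy to check (the 10TB per company is, as noted above, a deliberately conservative assumption):

    # Conservative estimate of structured server data across large companies.
    companies = 5_000       # the Global 5,000
    tb_per_company = 10     # deliberately low; big retailers and telcos hold far more

    print(f"{companies * tb_per_company:,} TB")   # 50,000 TB, i.e. ~50 petabytes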