A Christmas Message

I must have nodded off in the armchair after a glass or two of seasonal cheer when I was confronted by an apparition – a ghost of data warehouses past. He was not a pretty sight – dating back to the mid-1990s and with a grumpy attitude. In those days all you had was a database and a compiler, and a customer who wanted to somehow bring together data from multiple, incompatible systems. There were no ETL tools, no data warehouse design books and little in the way of tools for viewing the data once you had managed to wrestle it into the new data warehouse – only reporting tools that just about saved you coding SQL, but confronted users with the names of the tables and columns, usually restricted to eight characters. Those were the days, a salutary reminder of how primitive things were.

Following this was a ghost of data warehouses present. This was a much chirpier-looking fellow, who could gain access to data from even devious and recalcitrant source systems via ETL tools like Ascential, and had at least some idea how to design the warehouse. These designs had suitably festive names: “snowflake schema”, “star schema”. If we had a reindeer schema then it would have completed the festive scene. This wraith had a sackful of reporting tools to access the data, count it, mine it and graph it. Yet not all was well – the pale figure seemed troubled. Every time the source systems changed, the warehouse design was impacted, and teams of DBA elves had to scurry around building tables, sometimes even dropping them. Moreover, the pretty reporting tools were completely unable to deal with history: when reports were needed from past seasons there was no record of the master data as it stood at the time, and the reports did not reflect the current structures either. The only ones really happy here were the DBA elves, who busied themselves constructing ever more tables and indices – they love activity, and they were content.

Last came a ghost of data warehouse future. In this idyllic world, changes in the source systems have no effect on the warehouse schema at all. The warehouse just absorbs the change like a sponge, and can produce reports to reflect any view of structure: present, past or even future. There are fewer DBA elves around, but they have discovered new things to do with their free time – with SAP running to some 32,000 tables these days, there is no unemployment in DBA elf land. Best of all, the customers are happy, as they can finally make sense of the data as quickly as their business changes. New master data can be propagated throughout their enterprise like so much fairy dust. I have seen the future of data warehousing, and it is adaptive.

Well, enough of that. Time to talk turkey, or at least eat it. While doing so, it may be relaxing to look back at some of the turkeys that our industry has managed to produce over the years. Human nature being what it is, there is more than one web site devoted to the follies of the technology industry, and even to the worst software bugs. Have fun reading these. I recall being told by a venture capitalist that the dot-com boom of the late 1990s proved that “in a strong enough wind, even turkeys can fly”. These are some that did not.

I would like to wish readers of this blog a very happy Christmas.

Open up your email – it’s the feds!

I recall a few days ago being sent an email that was so transparently a virus that I thought “who on earth would click on such an obviously dodgy-looking attachment?” It was an email from the FBI (yeah, right, email being the FBI’s most likely form of communication to me) saying that “your IP address had been logged looking at illegal web sites” and then inviting you to click on an attachment of unknown file type asking you to “fill in a form about this activity”. I’m guessing that if the FBI were troubled about my web site browsing they would be more likely to burst through the front door than send an email. At the time I chuckled to myself and deleted it, but apparently millions of people actually did decide to fill in the form about their dubious web browsing, immediately infecting their PCs with the “Sober” virus.

Against this depressing demonstration of human gullibility was at least the entertainment value of the deadpan statement from the real FBI, which read: “Recipients of this or similar solicitations should know that the FBI does not engage in the practice of sending unsolicited e-mails to the public in this manner.” Quite so.

Thanks


A big thank you to those of you who nominated this blog for the blog awards. I am pleased to say that it has been short-listed as one of the ten finalists for the independent tech blog of the year. I am so glad that you are finding the blog interesting. Please register your vote for the final round.

Lies, damned lies, and Excel formulae

I made a discovery the other day. Not one of those “eureka” moments beloved of bathing Greeks, but something that prompted me to wonder about the accuracy of some of the figures we take for granted. We are so used to Excel on every desktop that we trust it implicitly, and so when I needed to work out the standard deviation of some figures, I naturally turned to Excel. For those of you whose maths is rusty, standard deviation is a measure of how spread out a set of numbers is. For example: the average of 1,3,5,7,9 is 5, and so is the average of 3,4,5,6,7, but it can be seen that the latter sequence has its numbers more closely bunched. Standard deviation is just a mathematical measure of how close or otherwise that bunching is (in the examples above the standard deviation of the first set is 2.83 and that of the second set 1.41, i.e. the second set is more closely bunched than the first).

In Excel to use a function you just type into a cell something like “=AVERAGE(1,3,5,7,9)” and magically you get the answer (5 in this case). So, what could be easier than typing in “=STDEV(1,3,5,7,9)” and seeing the answer appear? The trouble is that the answer which pops up is 3.16, not the 2.83 I was expecting. Just in case you doubt my ability to calculate a standard deviation, feel free to do it the old-fashioned way by hand, and you will see that you get 2.83, not 3.16, so what is going on? After digging around the Excel help and chatting to a mathematician friend to check I was not going completely mad, I discovered that there are actually two different standard deviation functions in Excel: one designed for when you want to measure the whole set of numbers you have, and one for when you want to estimate the spread of a large population from a sample, which has a slightly different formula. Now I may be getting a bit slow these days, but I did do a maths degree and yet this distinction had eluded me all these years, so I doubt I’m the only person out there unaware of this difference. If you were the person at Microsoft naming Excel functions, which do you think people would assume was the “normal” version: “STDEV” or “STDEVP”? The latter is what they actually named the function that calculates the standard deviation of a whole population. I am guessing that not too many of us go “aha, I’ll try =STDEVP, I expect that will be it”.
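For anyone who wants to check the arithmetic outside Excel, here is a minimal sketch in Python (using the standard library’s statistics module rather than Excel itself) showing the two calculations side by side: the sample formula divides by n-1 and the population formula divides by n, which is exactly the gap between 3.16 and 2.83 on the numbers above.

import statistics

data = [1, 3, 5, 7, 9]

# Sample standard deviation: divides by n-1, the formula behind Excel's STDEV
print(statistics.stdev(data))   # 3.1622... the surprising answer

# Population standard deviation: divides by n, the formula behind Excel's STDEVP
print(statistics.pstdev(data))  # 2.8284... the answer I was expecting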

Now this may seem like a lot of fuss about an esoteric mathematical function, but be aware that standard deviation is one of the most commonly used statistical measures, used to look at samples of populations, mechanical failure rates, delivery errors, temperatures, patient response rates, you name it. People take serious decisions based on statistics: which drug to put forward for clinical trials, traffic planning, machine maintenance and endless others; standard deviation is the most commonly used tool in “statistical process control”, widespread in the manufacturing industry. Given that most of the modern world uses Excel, I find it pretty surprising that a sizeable proportion of the world has been using the wrong standard deviation function for the last twenty years, all because some idiot in Seattle chose a “precise” name rather than the obvious name that most of us would have chosen.

I suppose this is only what I should have expected from a product that thinks 1900 is a leap year. Try the formula “=DATE(1900,2,29)” and watch it happily display the 29th February 1900. As we should all be aware after the fuss over Y2K, 1900 is NOT a leap year (leap years are every four years, except centuries, which are not, except every fourth century, which is, so 1600 and 2000 are leap years, but not 1800 or 1900). The moral of this little story: don’t take everything on trust!
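For what it is worth, the full rule fits in a single line of code. A quick sketch in Python (the function name is mine, purely for illustration):

def is_leap_year(year: int) -> bool:
    # Gregorian rule: divisible by 4, except centuries, except every fourth century
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

print(is_leap_year(1900))  # False, whatever Excel's DATE function may claim
print(is_leap_year(2000))  # True
print(is_leap_year(1600))  # True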

A Halloween Tale

It’s a busy week in the master data management world, with big scary monsters out in the night eating up smaller prey. We have seen Tibco acquire Velosel, and just today SAP acquire moribund EII vendor Callixa, apparently for its “customer data integration efforts”. I’m not quite sure what potion SAP have been imbibing recently, but I could have sworn that they recently abandoned their own MDM offering, which after two years of selling into their massive user base had managed just 20 sites, and bought vendor A2i in order to replace this gooey mess with a new master data management offering based on A2i’s technology. Perhaps those with crystal balls available as part of their costume for their Halloween party this evening could inquire through the mists as to how buying a second vendor in the space matches up with the coherent vision of master data management that SAP is presumably trying to portray? At the moment this seems as clear to me as pumpkin soup.

Every vendor worth its salt now seems to be under the MDM spell, with hardly a week going by without a niche player getting gobbled up by one of the industry giants. Yet I continue to be surprised by the disjointed approach that many have taken, tackling the two most common key areas, customer and product, with separate technology. Sure, CDI and PIM grew up independently, but there are many, many other kinds of master data to be dealt with in a corporation, e.g. general ledgers, people data, pricing information, brands and packaging data, and manufacturing data, to name just a few. One of our customers, BP, uses KALIDO MDM to manage 350 different types of master data. Surely vendors can’t really expect customers to buy one product for CDI, another for PIM, another for financial data, another for HR, etc.? This would result in a witches’ brew of technology, and most likely a mess of new master data technologies which will themselves need some kind of magic wand waved over them in order to integrate the rival master data offerings. Just such a nightmare is unfolding, with the major vendors each trying to stake out their offering as being the one true source of all master data, managing all the other vendors’ offerings. I certainly understand that if any one vendor could truly own all this territory then it would be very profitable for them, but surely history has taught us that this simply cannot be done. What customers want is technology that allows master data to be shared and managed between multiple technology stacks, whether IBM, SAP, Oracle, Microsoft or whatever, rather than being forced into choosing one (which, given their installed bases, is just a mirage anyway). Instead the major vendors seem to be lining up to offer tricks rather than treats.


Hiring Top Programmers

At Kalido we want to hire the best 1% of programmers. This is for a very good reason: the top 1% of programmers produce 10 times as much code as the average ones, and yet their defect rates are half the average. This is a pretty amazing productivity difference, yet it has been found consistently over the years, e.g. by IBM. In order to try and seek out these elusive people, we use a couple of different tests in addition to interviews. Firstly we use ability tests from a commercial company called SHL. In particular their “DIT5” test, aimed at programming ability, has proved very useful. We found a very high correlation between the test results and our existing programming team when we tried this on ourselves, and we now use it for all new recruits. Another is a software design test that we developed ourselves. We find that very few people do a decent version of this, which allows us to screen out a lot of people prior to interview, saving time for all involved.

I actually find it encouraging that some people don’t like to have to do such tests, thinking themselves above such things or (more likely) fearing that they won’t do well. This is an excellent screening mechanism in itself – as a company we want the very best, and in my experience talented people enjoy being challenged at interview, rather than being asked bland HR questions like “what are your strengths and weaknesses” (yeah, yeah we know, you are too much of a perfectionist and work too hard, yawn). Partly as a result of these tests, as well as detailed technical interviews, we have assembled a top class programming team.

I am encouraged that a similar view is shared by Joel Spolsky, who writes a fine series of essays on software, “Joel on Software”:

http://www.joelonsoftware.com/articles/HighNotes.html

which I highly recommend.