Andy on Enterprise Software

Should ETL really be ELT?

March 16, 2006

Traditionally, ETL (extract/transform/load) products such as Informatica, Ascential and others have fulfilled the role of getting data out of source systems, dealing with inconsistencies between those source systems (the transform) and then loading the resultant transformed data into a set of database tables (perhaps an operational data store, data marts or directly a data warehouse).

However, in the process of doing the “transform” a number of issues crop up. Firstly, you are essentially embedding a set of business rules (how different business hierarchies, such as product classifications, actually relate) directly into the transformation rules. That is a poor place for them to live should you want to make sense of them in other contexts. If the rules are complex, which they may well be, you can create a Frankenstein’s monster of transform rules that becomes difficult to maintain, held in a set of metadata that may be hard to share with other applications.

Moreover, this is a one-way process. Once you have taken your various product hierarchies (say) and reduced them to a lowest-common-denominator form, you can certainly start to analyze the data in this new form, but you have lost the component elements to all intents and purposes. These different product hierarchies did not end up different without reason; they may reflect genuine market differences between countries, for example, and they may contain a level of richness that is lost when you strip everything down to a simpler form.

Ideally, in a data warehouse you would like to be able to take an enterprise view, but also retain the individual perspectives of different business units or countries. For example, it may be interesting to see the overall figures in the format of a particular business line or country. Of course there are limitations here, since data from other businesses may not have sufficient granularity to support the views required, but in some cases this can be fixed (for example by providing additional allocation rules), and at least you have a sporting chance of doing something useful with the data if you have retained its original richness. You have no chance if it is gone.

Hence there is a strong argument to be made for an “ELT” approach, whereby data is copied from source systems pretty much untouched into a staging area, and transformation work is done only from there to produce cross-enterprise views. If this staging area is controlled by the data warehouse then it is possible to provide other, alternate views and perspectives, possibly involving additional business metadata at this stage. The only real cost of this approach is some extra storage, which is hardly a major issue these days. Crucially, the transformation logic is held within the data warehouse, where it is open to interrogation by other applications, rather than buried away in the depths of a proprietary ETL format. Moreover, the DBMS vendors themselves have added more transformation capability over the last few years; let’s face it, a SQL SELECT statement can do a lot of things. Since the DBMS processing is likely to be pretty efficient compared to a transformation engine, there may be performance benefits too.
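
To make the ELT flow concrete, here is a minimal sketch in Python using SQLite; the table and column names and the category mapping are purely illustrative assumptions, not drawn from any particular tool. The point is simply that the source extracts are landed untouched, and the cross-enterprise view is then built inside the database with ordinary SQL, with the mapping held in a table that other applications can query.

    import sqlite3

    # Minimal ELT sketch: land source rows untouched in staging tables, then
    # let the database do the transform with plain SQL. All names illustrative.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE stg_sales_uk (product_code TEXT, local_category TEXT, amount REAL);
        CREATE TABLE stg_sales_de (artikel_nr TEXT, warengruppe TEXT, betrag REAL);
        CREATE TABLE category_map (source TEXT, local_category TEXT, global_category TEXT);
        CREATE TABLE dw_sales (source TEXT, product_code TEXT, global_category TEXT, amount REAL);
    """)

    # "E" and "L": copy the extracts as-is, so the original local hierarchies
    # are preserved and remain available for alternative views later.
    conn.executemany("INSERT INTO stg_sales_uk VALUES (?, ?, ?)",
                     [("P100", "Lubricants", 250.0), ("P200", "Fuels", 900.0)])
    conn.executemany("INSERT INTO stg_sales_de VALUES (?, ?, ?)",
                     [("A-77", "Schmierstoffe", 410.0)])
    conn.executemany("INSERT INTO category_map VALUES (?, ?, ?)",
                     [("UK", "Lubricants", "Lubricants"),
                      ("UK", "Fuels", "Fuels"),
                      ("DE", "Schmierstoffe", "Lubricants")])

    # "T": the cross-enterprise view is produced inside the DBMS, and the
    # mapping lives in a table rather than in proprietary ETL metadata.
    conn.executescript("""
        INSERT INTO dw_sales
        SELECT 'UK', s.product_code, m.global_category, s.amount
        FROM stg_sales_uk s JOIN category_map m
          ON m.source = 'UK' AND m.local_category = s.local_category;
        INSERT INTO dw_sales
        SELECT 'DE', s.artikel_nr, m.global_category, s.betrag
        FROM stg_sales_de s JOIN category_map m
          ON m.source = 'DE' AND m.local_category = s.warengruppe;
    """)

    print(conn.execute(
        "SELECT global_category, SUM(amount) FROM dw_sales GROUP BY global_category"
    ).fetchall())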

This approach has been taken by more modern ETL tools like Sunopsis, which is explicitly ELT in nature. Intriguingly, Informatica added an ELT option in PowerCenter 8 (called “pushdown optimization”), which suggests that this approach is indeed gaining traction. So far, good on Sunopsis for taking the ELT approach, which I believe is inherently superior to ETL in most cases. It will be interesting to see whether Ascential also responds in a future release.

The unbearable brittleness of data models

March 14, 2006

An article in CRM Buyer makes an important point: a key reason why customer data integration projects fail is the inflexibility of the data model that is often implemented. Although the article turns out to be a thinly disguised advert for Siperian, the point is very valid. Traditional entity-relationship modeling is typically at too low a level of abstraction. For example, courses on data modeling frequently give examples like “customer” and “supplier” as separate logical entities. If your design is based on such an assumption, then applications built on it will struggle if one day a customer becomes a supplier, or vice versa. Better to have a higher-level entity called “organization”, which can have varying roles, such as customer or supplier, or indeed others that you may not have thought of at the time of the modeling. Similarly, rather than having an entity called “employee” it is better to have one called “person”, which itself can have a role of “employee” but also other roles, perhaps “customer” for example.
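
As a purely illustrative sketch (my own names, not taken from the article or from the modeling paper mentioned below), the difference amounts to modeling parties that play roles, rather than hard-wiring customer, supplier and employee as separate entities:

    from dataclasses import dataclass, field

    # Illustrative "generic" modeling sketch: a party plays roles, so a customer
    # becoming a supplier (or a person gaining a role) needs no schema change.
    @dataclass
    class Party:
        name: str
        party_type: str               # e.g. "organization" or "person"
        roles: set[str] = field(default_factory=set)

        def add_role(self, role: str) -> None:
            # Roles not anticipated at modeling time can be added later.
            self.roles.add(role)

    acme = Party("Acme Ltd", "organization", {"customer"})
    acme.add_role("supplier")         # a customer becomes a supplier

    jane = Party("Jane Smith", "person", {"employee"})
    jane.add_role("customer")         # a person can hold several roles at once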

This higher level of data modeling is critical to retaining flexibility in systems, removing the “brittleness” that so often causes problems in reality. If you have not seen it, I highly recommend a paper on business modeling produced by Bruce Ottmann, one of the world’s leading data modelers, whose work has found its way into a number of ISO standards. Although Bruce works for Kalido, this whitepaper is not specific to Kalido but rather discusses the implications of a more generic approach to data models.

I very much hope that the so-called “generic modeling” approach that Bruce recommends will find its way into more software technologies. Examples where it already has are Kalido and Lazy Software and, in idea rather than product form, the ISO standard 10303-11, which covers a modeling language called EXPRESS that can be used to represent generic data models. That standard came about through work originated at Shell and then extended to a broader community of data modelers, including various academics; it was particularly aimed at addressing the problem of exchanging product models and is known as STEP. However, the generic modeling ideas developed with it have much broader application than product data. Given the very real advantages that generic modeling offers, it is to be hoped that more software vendors pick up on these notions, which make a real difference to the flexibility of data models, and hence improve the chances of projects, such as CDI projects, actually working in practice.

ETL moves into the database

March 13, 2006

With SQL Server 2005 Microsoft has replaced its somewhat limited DTS ETL offering with SQL Server Integration Services (SSIS), which bears comparison with IBM’s offerings (based on the Ascential acquisition). I have written previously about the shrinking of the ETL vendor space, and the enhanced Microsoft offering will merely accelerate this. Oracle, too, has its Warehouse Builder technology (despite its name this is really an ETL tool), and as the tools from Microsoft, IBM and Oracle improve it will be tough for the remaining ETL vendors. Informatica has broadened into the general data integration space, and seems to be doing quite well, but there are not many others.

Sunopsis is innovative with its “ELT” approach, which sensibly relies on, rather than competes with, the native DBMS capabilities, but it remains to be seen how long it can flourish, given that the DBMS ETL capabilities will just keep getting better and eat away at its value. The surreal Ab Initio is reportedly doing well at the high-volume end, but given its secretive nature it is hard to say anything with certainty about this company, other than that its business practices and CEO are truly eccentric (a fascinating account of its predecessor Thinking Machines can be found at the following link). Data Junction has a strong reputation and is OEMed by many companies (it is now part of Pervasive Software). There are a few other survivors, like ETI, who have just recapitalised after struggling for some years, but it is hard to see how ETL can remain a sustainable separate market in the long term. Indeed, Gartner has recently stated that they are to drop their “magic quadrant” for ETL entirely.

The future of ETL would appear to be in broader offerings, either as part of wider integration software or as just a feature of the DBMS.

The hollowing out of ERP

March 9, 2006

Now that there are effectively two enterprise ERP vendors bestriding the world, it may seem that they can just sit back and count the spoils. Both have huge net profit margins derived from their market leadership, so it may seem churlish to contemplate their eventual demise, yet a number of factors are combining that should cause a few flutters in Redwood City and Walldorf. Consider for a moment what a transaction system application such as ERP actually does, or used to do:

  • business rules/workflow
  • master data store
  • transaction data store
  • transaction processing
  • user interface
  • (and perhaps some business content e.g. pre-built reports)

This edifice is under attack, like a house being undermined by termites. Transaction processing itself has long been mostly taken care of elsewhere, by old-fashioned TP monitors like IBM CICS or by new-fashioned ones like BEA’s WebLogic or IBM’s WebSphere. These days alternative workflow engines are popping up, like BizTalk from Microsoft, or even a slew of open source ones. Moreover, more than half of the ERP functionality purchased goes unused. The storage of data itself is of course done in the DBMS these days (though SAP tries hard to blur this line with its clustered table concept). As the idea of separate master data hubs catches on, e.g. customer data hubs like Siperian’s, or product data hubs, or more general ones, and the serving up of such data becomes possible through EAI technology, this element too is starting to slip away from the ERP vendors. The user interface for update screens should hardly be that complicated (though you’d never guess it if you have ever had the joy of using SAP as an end user), and these days can be generated from applications, e.g. from a workflow engine or a master data application. This does not leave a great deal.

If, and it is a big if, SOA architecture takes off, then you will also be able to plug in your favorite cost allocation module (say) from a best of breed vendor, rather than relying on the probably mediocre one of your ERP supplier. Combine this with the emergence of “on demand” hosted ERP services from emerging companies like Ataio and Intacct as alternatives, and the vast ERP behemoth looks a lot less secure up close than it may do from a distance. If the master data hubs and business workflow engines continue to grow in acceptance and chip away further at key control points of ERP vendors, then at some point might it be reasonable to ask: exactly what is it that I am paying all those dollars to ERP vendors for?

This line of reasoning, even if it is very early days, explains why SAP and Oracle have been so anxious to extend their product offerings into the middleware space, with NetWeaver and Fusion respectively. It is also why SAP has been trying, falteringly, to launch an MDM application (the rumor is that after the botched initial SAP MDM, the acquisition of A2i isn’t going that well either; maybe a third attempt is in the works?) and why Oracle has been keen to promote its customer hub.

Of course it is too soon to be writing the obituaries of ERP yet, but a combination of evolving technologies is starting to illuminate a path for how you would eventually migrate away from dependence on the giant ERP vendors, rather than endlessly trying to consolidate on fewer vendors, and fewer instances of each. Now that would be radical thinking.

Information as a service?

March 8, 2006

I see in our customer base the stirrings of a movement to take a more strategic view of corporate information. At present there is rarely a central point of responsibility for a company’s information assets; perhaps finance has a team that owns “the numbers” in terms of high-level corporate performance, but information needed in marketing and manufacturing will typically be devolved to analysts in those organizations. Internal IT groups may have a database team that looks after the physical storage of corporate data, but this group rarely has responsibility for even the logical data models used within business applications, let alone how those data models are supposed to interact with one another. Of course things are complicated by the fact that application packages will have their own version of key data, and may be the system of record for some of it. Yet how to take a view across the whole enterprise?

Organizationally, what is needed is a business-led (not IT-led) group with enough clout to be able to start to get a grip on key corporate data. This team would be responsible for the core definitions of corporate data and its quality, and would be the place that people come to when corporate information is needed. In practice, if this is not to become another incarnation of a 1980s data dictionary team, the group should also have responsibility for the applications that serve up information to multiple applications, and this last point will be an interesting political battle. The reason that such a team may actually succeed this time around is that the technologies now exist to prevent the “repository” (or whatever you want to call it) of master data from being a passive copy. The advent of EAI tools, enterprise buses, and the more recent master data technologies (from Oracle, Kalido, Siperian, IBM etc.) means that master data can become “live”, synchronized back to the underlying transaction systems. Pioneers in this area were Shell Lubricants and Unilever, for example.

However, technology is necessary but not sufficient. The team needs to be granted ownership of the data, a notion sometimes called “data stewardship”. Even if this ownership is virtual, it is key that someone can arbitrate disputes over whose definition of gross margin is the “correct” one, and can drive the implementation of a new product hierarchy (say) despite the fact that such a hierarchy touches a number of different business applications. It is logical that such a group would also own the enterprise data warehouse, since that (if it exists) is the place where much corporate-wide data ends up right now. This combination of owning the data warehouse and the master data hub(s) would allow infrastructure applications to be developed that can serve up the “golden copy” data back to applications that need it. The messaging infrastructure already exists to allow this to happen.
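
As a toy illustration of that last point (purely hypothetical, not any vendor’s API), a master data hub can hold the golden copy and push changes back out to subscribing systems over whatever messaging infrastructure is in place:

    from typing import Callable

    # Hypothetical sketch: a hub holds the "golden copy" of master data and
    # notifies subscribing applications when the stewardship team changes it.
    class MasterDataHub:
        def __init__(self) -> None:
            self._golden: dict[str, dict] = {}
            self._subscribers: list[Callable[[str, dict], None]] = []

        def subscribe(self, callback: Callable[[str, dict], None]) -> None:
            # Transaction systems register to be told about golden-record changes.
            self._subscribers.append(callback)

        def publish(self, key: str, record: dict) -> None:
            # Update the golden copy and synchronize every subscriber, so master
            # data stays "live" rather than being a passive copy.
            self._golden[key] = record
            for notify in self._subscribers:
                notify(key, record)

        def lookup(self, key: str) -> dict:
            return self._golden[key]

    hub = MasterDataHub()
    hub.subscribe(lambda k, r: print(f"ERP updated: {k} -> {r}"))
    hub.subscribe(lambda k, r: print(f"CRM updated: {k} -> {r}"))
    hub.publish("CUST-001", {"name": "Acme Ltd", "segment": "Industrial"})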

A few companies are establishing such groups now, and I feel it is a very positive thing. It is time that information came out of its back-room closet and moved to centre stage. Given the political hurdles that exist in large companies, the ride will not be smooth, but the goal is a noble one.

Broaden your horizons

March 6, 2006

In a talk at the recent TDWI show, consultant Joshua Greenbaum, an analyst with Enterprise Applications Consulting (who?), managed to bemoan the cost of data warehouses, but then demonstrated a seeming lack of understanding of exactly what one is by claiming that the alternative is to do “simple analyses of transactional data”. Well Joshua, that is called an operational data store, and indeed it has a perfectly respectable role if all you want to do is look at a single operational system for operational purposes. However, a data warehouse fulfils quite a different role: it takes data from many different sources, allows analysis across these inconsistent sources and should also provide historical context, e.g. allowing comparisons of trends over time. You can’t do these things with an operational data store.

Hence it is not a case of “ODS good, data warehouse bad” – both structures have their uses. Of course Joshua is right in saying that the data warehouse success rate is not great but, as I have written elsewhere, it is not clear whether data warehouse projects are really any worse than IT projects in general (admittedly, that is not setting the bar very high). Perhaps Joshua was misquoted, but I would have expected something more thoughtful from someone who was an analyst at Hurwitz. Admittedly he was an ERP (specifically SAP) analyst, so perhaps has a tendency to think in operational terms rather than look wider than ERP. Perhaps he is suffering from the same disease that seems to affect people who spend too much time on SAP.

Interest in MDM grows

Last week I was a speaker at the first CDI (customer data integration) conference, held in San Francisco. Although the CDI Institute (set up by Aaron Zornes, ex META Group) started off with customer data integration, looking at products like Siperian and DWL, the general movement towards MDM as a more generic subject has overtaken it, and indeed Aaron mused in his introductory speech whether they might change the title to the MDM Institute. For a first conference it was well attended, with 400 people there and supposedly 80 turned away due to unexpectedly high demand. There was the usual crowd of consultants happy to advise expertly on a topic they had never heard of a year ago. Most of the main MDM vendors put in an appearance, e.g. IBM, Oracle and i2 (but no SAP), as well as specialists like Siperian and Purisma, plus those like HP who just have too big a marketing budget and so have a booth everywhere, whether or not they have a product (those printer cartridges generate an awful lot of profit).

The conference had a rather coin-operated feel, as sponsoring vendors duly got speaker slots in proportion to the money they put in – with IBM getting two plenary slots – but there were at least a few customer case studies tucked away amongst the six concurrent conference tracks. My overall impression was that MDM is a bit like teenage sex: everyone is talking about it, people are eager to know all about it, but not that many are actually doing it. As time passes and MDM moves into adolescence there will presumably be less foreplay and more consummation.

Further conferences are planned in London, Sydney and Amsterdam, demonstrating, if nothing else, that plenty of vendors are willing to pay Aaron to speak at the shows.