Wednesday, March 19, 2008

Backwards compatible? One more lie by omission

Stéphane Rodriguez, March 2008

Previous articles :
- Bad surprise in Microsoft Office binary documents : interoperability remains impossible
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design

Ever since Microsoft Office XML file formats were introduced, the excuse for using such poorly engineered markup was that it was designed to be backwards compatible with the so-called billions of existing documents out there.

Is it really the truth? Or is Microsoft using it to fuel more fire and motion?

Lying by omission, as per Wikipedia :
Lying by omission is when an important fact is omitted, deliberately leaving another person with a misconception. This includes failures to correct pre-existing misconceptions. One may by careful speaking contrive to give correct but only partial answers to questions, thus never actually lying.

Backwards and forwards compatibility

What is backwards compatibility, first of all? For everyone in the world except Microsoft, file backwards compatibility means that the product in version N can work seamlessly with files produced by the product in version N+1. For Microsoft, however, it is the opposite: backwards compatibility means that product version N+1 can work seamlessly with files produced by the product in version N. "Work seamlessly" means no visible degradation and no noticeable loss in functionality.

To claim that a file format is backwards compatible is therefore quite an ambitious goal, especially if the file format in question is complex. It's so ambitious that it's almost an impossible task in practice. In fact, it behaves much like a theorem in mathematics. To prove the theorem is wrong, you only need one counter-example. To prove backwards compatibility is not met, you just need to come up with one counter-example.

So let's take a simple chart and see how backwards compatible it is.

  • Start Excel 2003 (or any Excel version older than Excel 2007)
  • Type jan, hit <TAB>, 3, <TAB> 4, <TAB>, 5
  • Select those cells, click the chart creation button in the main toolbar and choose a column type bar

The chart shows up like this :

A simple chart made with an older Excel version

Now resize the chart, and you can see that everything gets scaled accordingly : text in all chart elements, plot area, ...

Resizing a simple chart made with an older Excel version

  • Now start Excel 2007
  • Open this file
  • The chart shows up

The chart file (.xls) opened in Excel 2007

Now resize the chart, and you can see that fonts don't scale :

Resizing the chart file (.xls) in Excel 2007 does not scale the fonts

If you would like to resize the fonts, you have to select each chart element individually, and then apply the font size with user interface elements. Quite counter-productive. In addition, since that is done manually, proportions are not kept between chart elements and you cannot easily reproduce the same effect as with older versions of Excel.

Now if you save this file as a .xlsx file using Excel 2007, and then convert it back to a .xls file, then open it with an older version of Excel (effectively executing a round-trip scenario), the chart shows up but it does not scale when it is resized :

After the conversion back to (.xls), the fonts do not scale if we resize the chart

So much for backwards compatibility...

In case you wonder what's going on, what happens is that Excel 2007 does not support the auto scale option, something very pervasive in chart formatting options. Here is the guilty option :

Root of the problem : Excel 2007 does not support "auto scale" anymore (screenshot from Excel 2003)

It is not only a feature that is gone from the user interface, it is an option that is lost in the document itself. In other words, sharing this file with any person inside or outside your company will result in discrepancies.

For your information, the internal Excel BIFF record that Excel 2007 does not support is the FBI record (part of the chart records, and therefore left unspecified in Microsoft's binary BIFF documentation, as explained in my previous article) :

(excerpt from MSDN Library, Feb 1998)

FBI: Font Basis (1060h)
The FBI record stores font metrics.

Offset | Name | Size | Contents
4 | dmixBasis | 2 | Width of basis when font was applied
6 | dmiyBasis | 2 | Height of basis when font was applied
8 | twpHeightBasis | 2 | Font height applied
10 | scab | 2 | Scale basis
12 | ifnt | 2 | Index number into the font table
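
To make the layout concrete, here is a minimal Python sketch that decodes such a record. It assumes the usual BIFF framing (a 2-byte record identifier followed by a 2-byte data length, which is why the first field sits at offset 4), little-endian byte order, and unsigned 16-bit fields; the excerpt only states offsets and sizes, so the signedness is an assumption. The names parse_fbi and FbiRecord and the sample values are made up for illustration.

```python
import struct
from typing import NamedTuple

FBI_RECORD_ID = 0x1060  # record identifier given in the excerpt


class FbiRecord(NamedTuple):
    dmix_basis: int        # width of basis when font was applied
    dmiy_basis: int        # height of basis when font was applied
    twp_height_basis: int  # font height applied
    scab: int              # scale basis
    ifnt: int              # index number into the font table


def parse_fbi(record: bytes) -> FbiRecord:
    """Decode one FBI record, 4-byte header included (assumed framing)."""
    if len(record) < 14:
        raise ValueError("an FBI record needs 4 header bytes and 10 data bytes")
    # Assumption: 2-byte record id, then 2-byte data length, little-endian.
    record_id, size = struct.unpack_from("<HH", record, 0)
    if record_id != FBI_RECORD_ID:
        raise ValueError(f"not an FBI record: 0x{record_id:04X}")
    if size < 10:
        raise ValueError("FBI data is shorter than the five documented fields")
    # Five 2-byte fields at offsets 4, 6, 8, 10 and 12, as in the table.
    return FbiRecord(*struct.unpack_from("<5H", record, 4))


# Made-up sample values, for illustration only.
sample = struct.pack("<HH5H", FBI_RECORD_ID, 10, 6000, 4000, 200, 1, 5)
fbi = parse_fbi(sample)
```

An application that drops this record on load has nothing left to write back on save, which is consistent with the round-trip loss shown earlier.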

If you are interested in taking a look at the corresponding Excel files :

Note that this is just one simple example. There are many examples of all kinds. For instance, in a previous article, I showed that encrypting a document now encrypts the metadata as well, contrary to older Office application versions, and that doing so breaks all applications expecting to access the file's metadata whether the file is encrypted or not.

As for forwards compatibility, the same article already showed (BIFF11+ section) that any Excel 2007 feature would be lost or partially lost when converted back to older Excel versions as part of a round-trip scenario. So forwards compatibility is not true either, generally speaking.

Behind the scenes, what's happening?

Behind the scenes, what's happening is that Microsoft replaced the chart subsystem in Excel 2007 with a new library but failed to test it enough, presumably because they wanted to rush Office 2007 out the door. Since charts can also be linked to Word and PowerPoint documents, this actually affects the backwards compatibility of all 3 applications.

The real mistake they made was to kill the old chart subsystem source code. What they should have done to meet their backwards compatibility claim was to use the old chart subsystem for the compatibility mode (old Excel files), and the new chart subsystem for charts in new files. There is no other way they can live up to the claims.

If backwards compatibility is not proven technically speaking, what is Microsoft trying to achieve?

In the previous section, it has been demonstrated with a simple counter-example that backwards compatibility leaves much to be desired. So if this capability is not technically sound, how come Microsoft keeps hammering it every single day?

A Microsoft Office spokesperson made the following statement on March 16, 2008 :

An Open Letter from Chris Capossela, Senior Vice President, Microsoft Office

In today's digital world, there is a critical need to access, share and archive information. People want to share information with co-workers, business partners, family members and others regardless of the technology platform or software application being used. People want to be able to store and archive documents so that the information they contain can be accessed well into the future. Long-term access to documents is particularly important in the case of governments given the critical and historical nature of government information and the associated requirements for preservation and access moving forward.

Why choose the Ecma Office Open XML File Formats?
Office Open XML fulfills an important customer need - the file format was designed to be backward compatible with the content and functionality in billions of existing documents. (...)

The same thing has been hammered a gazillion times by Microsoft since the "new" file formats were announced back in 2005. If you are a software developer, then it does not sound like anything special. In fact, you may even think there is no substance at all in it. But if you wear a CIO cap, it's a different thing.

Remember, IT people are responsible for running the systems. For them, backwards compatibility is not heard in the technical sense; it is synonymous with "won't break systems, no down time". See where it goes?

A little more help. By talking over and over about backwards compatibility, Microsoft is actually telling decision makers that it pledges not to break systems when everyone in the organization uses the applications and those new file formats (remember that for anyone, .docx, .xlsx and .pptx files are alien files unless they install Office 2007).

In fact, it goes a little deeper. And that's when it gets really interesting. Microsoft is not only saying their stuff is backwards compatible, they are actually saying that they are the only ones who are backwards compatible when it comes to Office documents.

It is another way to say that the only way to ensure no down time is to purchase licenses of Microsoft Office software. As simple as that.

To better understand the picture, consider the following. During the 2007-20xx period, Microsoft is taking a major risk by introducing new file formats that cannot be natively used by older versions of Microsoft Office. It's something they did back in 1995 and it did not go very well. If you remember, this caused havoc in IT organizations because Office 95 and Office 97 file formats were substantially different and Office 95 could not open Office 97 documents. Microsoft broke the file formats without preparing anyone for it.

So it is without surprise that Microsoft is trying to make sure that CIOs get the message right this time. That is the reason why Microsoft has been investing so much in all kinds of additional components :
  • Office 2007 compatibility mode
  • Microsoft Office compatibility pack
  • Office migration manager
  • Office MOICE

By saying what they say, Microsoft is trying to keep the competition aside, despite the very vocal "open standards" claims. The "open standards" claim is actually a diversion. "Open" is not in the Microsoft Office DNA and culture. For instance, nobody in that team ever said publicly that the only reason we are dealing with VML in the file formats is that they rushed Office 2007 out the door (presumably to ship simultaneously with Windows Vista), and as a result did not fully transition to DrawingML. All this debate about VML out there is entirely the result of Microsoft's own incompetence in shipping a finished product. Simply shameful, because third parties now have to support both VML and DrawingML. There would be no such surprise with a company that does things "openly".

In the period of time (spanning a number of years) where Microsoft takes the risk of introducing new file formats, CIOs could just as well decide to switch to an alternative Office suite for a fraction of the budget. Microsoft knows it very well, hence the media blitz. Hence the characterization of the alternative Office file formats such as ODF as simplistic. Simplistic is supposed to mean : not backwards compatible. That is what CIOs understand, and that is FUD.

In this period of time, Microsoft wants to minimize the ratio of organizations switching to an alternative Office suite. To Microsoft Office, OpenOffice is Netscape. This is not a problem anymore once the new "OOXML" file formats reach critical mass. We are not there yet, but that's the plan : dry up the competition as much as possible, which includes fake activities such as starting so-called open source projects. Why should CIOs care about those? They are not exactly music to CIOs' ears. It's the opposite : when a CIO hears "open source project", he understands "something other than backwards compatible", and therefore unreliable stuff. The decision is a no-brainer then.

So what we are talking about here is a well-orchestrated media communication targeted at CIOs, not software developers. Until you get that, you don't understand.

The consequences of this attitude

Now that we know that Microsoft is not interested in building a modern Office document model (after all, there is already an international ISO/IEC standard for that : ODF), it gets easier to understand why OOXML is such a mess.

It was recently brought up by people with vested interests in Office file formats that OOXML is a big mess :

  • Rob Weir (IBM) : "The Disharmony of OOXML"

    Format | Text Color | Text Alignment
    OOXML Text | <w:color w:val="FF0000"/> | <w:jc w:val="right"/>
    OOXML Sheet | <color rgb="FFFF0000"/> | <alignment horizontal="right"/>
    OOXML Presentation | <a:srgbClr val="FF0000"/> | <a:pPr algn="r"/>
    ODF Text | <style:text-properties fo:color="#FF0000"/> | <style:paragraph-properties fo:text-align="end"/>
    ODF Sheet | <style:text-properties fo:color="#FF0000"/> | <style:paragraph-properties fo:text-align="end"/>
    ODF Presentation | <style:text-properties fo:color="#FF0000"/> | <style:paragraph-properties fo:text-align="end"/>

  • Henning Brinkmann (OpenOffice contributor) : "OOXML Import In Writer: A Shape Is a Shape, Is a Shape?"

    When we started importing shapes from OOXML the Impress team already was able to import some shapes from OOXML files produced by PowerPoint 2007. This is DrawingML as described in chapter 5 of the Markup Language Reference for OOXML. But, if you use Word to insert a rectangle into a Word document (DOCX), you end up with VML as described in chapter 6 of the Markup Language Reference for OOXML. The Markup Language Reference tags VML as a deprecated format in OOXML, which is only included to the standard for backward compatibility reasons. Despite, Word 2007 uses VML to store shapes. (...) OOXML seems to be designed with the application model in mind. There may be different syntaxes for the same semantics, if it fits the already present application model better. But, if you want to create an alternative implementation for the format, this introduces additional effort.

Fire and motion, anyone? I also gave a few examples of that in my introductory "OOXML is defective by design" article, where I showed that Excel SpreadsheetML uses no fewer than 6 ways to do basic text formatting, for no reason. If you do anything meaningful with those files, you need to implement all of them, which clearly means you'll be spending years doing so instead of concentrating on your own business.
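
To give a feel for what that means in code, here is a minimal Python sketch that reads the same red text color from the three OOXML fragments in the table quoted from Rob Weir above. The element and attribute names come from that table; the namespace declarations are added so that each fragment parses on its own, and the function name text_color is made up for illustration. It is not how any particular product does it, only a small picture of the three code paths a reader ends up writing.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of the three OOXML vocabularies.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
S = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"


def text_color(fragment: str) -> str:
    """Return the color carried by one formatting element as '#RRGGBB'."""
    el = ET.fromstring(fragment)
    if el.tag == f"{{{W}}}color":    # text document: namespaced w:val
        return "#" + el.get(f"{{{W}}}val")
    if el.tag == f"{{{S}}}color":    # spreadsheet: AARRGGBB, alpha dropped
        return "#" + el.get("rgb")[2:]
    if el.tag == f"{{{A}}}srgbClr":  # presentation: plain val attribute
        return "#" + el.get("val")
    raise ValueError(f"unsupported element: {el.tag}")


fragments = {
    "text": f'<w:color xmlns:w="{W}" w:val="FF0000"/>',
    "sheet": f'<color xmlns="{S}" rgb="FFFF0000"/>',
    "presentation": f'<a:srgbClr xmlns:a="{A}" val="FF0000"/>',
}
colors = {kind: text_color(xml) for kind, xml in fragments.items()}
```

Three branches for one property, and this covers only the color column. In the same table, the three ODF rows share a single attribute, fo:color.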

For anyone except Microsoft, it's fire and motion, and it guarantees there won't be a substantial implementation of the file formats for a number of years. That is exactly the same strategy as the one behind the older Microsoft Office formats.

Note that Microsoft claims a number of third-party applications support OOXML. But they never go into details such as how much support there is. For a reason that is trivial.

To the CIO's ear, the fact that backwards compatibility is never associated in any shape or form with those third-party applications makes it very clear that they are not worth considering seriously. So there.

Why is it a problem?

Let's take Gnumeric. Gnumeric chose to implement some of OOXML by having import and export filters. This is a very important architecture decision. It means that the in-memory representation is not a high-fidelity representation of the imported file until Gnumeric directly supports all the tiny details in OOXML. Something that won't happen for a number of years. It is therefore without much surprise that exporting the in-memory representation to, say, an .xlsx file will lose all features that Gnumeric does not directly support.

To see the effect of that, simply try to open and save the following .xlsx file in Gnumeric : Book_2.xlsx

Gnumeric's internal architecture is simply wrong : it should instead preserve the imported file. That is after all what XML fragments are for. It should be able to make in-place replacements of XML fragments and leave the rest of the file untouched. That's not what Gnumeric does. Gnumeric is an example of an application that is not XML native.
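
As a rough illustration of the preserving approach, here is a minimal Python sketch that rewrites one part of a ZIP package and copies every other part byte for byte, without parsing it. It works at the granularity of whole parts, which is coarser than the fragment-level replacement described above. The function name replace_part is made up, and the demo package is a tiny stand-in whose part contents are invented, not a valid .xlsx file.

```python
import io
import zipfile


def replace_part(package: bytes, part_name: str, new_content: bytes) -> bytes:
    """Return a copy of a ZIP package with exactly one part replaced.

    Every other part is copied as raw bytes, so content this code knows
    nothing about survives the save.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        if part_name not in src.namelist():
            raise KeyError(part_name)
        for info in src.infolist():
            if info.filename == part_name:
                data = new_content
            else:
                data = src.read(info)  # untouched, never parsed
            dst.writestr(info.filename, data)
    return out.getvalue()


def build_demo_package() -> bytes:
    """Build an in-memory stand-in package (invented contents)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("xl/workbook.xml", "<workbook/>")
        z.writestr("xl/charts/chart1.xml",
                   "<chartSpace><unknownFeature/></chartSpace>")
    return buf.getvalue()


original = build_demo_package()
updated = replace_part(original, "xl/workbook.xml", b"<workbook edited='1'/>")
```

The chart part comes out of the round trip unchanged even though the code has no idea what unknownFeature means. A filter-based design that rebuilds the whole file from its own in-memory model cannot make that promise.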

Why Microsoft is making so much good press for Gnumeric and its so-called rich OOXML support is left for one to guess...

...especially when Office 2007 is not much XML native either. In fact, that is actually why we end up with so many versions of the truth in Office 2007 files. The reason we have many different and incompatible ways to align a piece of text, describe a vector shape, or format text is the following :

  • It is a direct reflection of the fact that Microsoft Office engineering teams don't share any of their work across teams. The trivial example of that is that the only shared spec, Office drawing library (mso.dll), ends up being a mix of VML and DrawingML despite the fact that VML is only a markup version of MSO, whereas DrawingML is supposed to be MSO version next (internally known as E2O).

  • What Microsoft calls XML streams are in fact XML fragments, where each individual fragment is angle brackets around a stream of content that reflects the internal binary representation (and in fact, the binary records of older files), so that it can be converted back and forth in full fidelity.

    Let's say in one of those fragments, an attribute relates to a horizontal text alignment option, one value out of four, we end up with an attribute of the form <w algn="r"> where algn is one of r, l, c, j. In another fragment, a horizontal text alignment option is one out of eight, in which case we may have <a:alignment>left</a:alignment>, where the value is left or general, right, center, justify, distributed, fill or center across selection.

    Here is the problem. The fact that the text alignment has several distinct definitions in the same document is a bug. Engineering, peer-review and interoperability practices inherently put enough pressure to resist having such a thing done in the first place; that's why both alignment options should be the same. There is no reason to have more than one in the document. To standardize bugs is simply not excusable. It has to be stopped.

    The reason why we end up like this with OOXML is that it just reflects the bugs in the binary records, and that it works as if the Office team had developers writing compatibility code independently from each other, that is, without ever giving code reuse a chance.

    So, due to the fragments being just another representation of the binary records, Office 2007's use of XML terminology is absolutely misleading. Press material filled with the XML acronym all over the place is very appealing to CIOs, however. It is fairer to say it's angle brackets around complex stuff than actual XML. If that were truly native XML, it would be factored to maximize its reuse across libraries, components and applications. Just like ODF does.
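
The four-valued and eight-valued alignment example above can be put into a few lines of Python. This is a sketch built only on the illustrative values given in that example, assuming that r, l, c and j stand for right, left, center and justify; the names SHORT, LONG and to_short are made up.

```python
from typing import Optional

# Four-valued vocabulary from the first fragment in the example.
SHORT = {"l": "left", "r": "right", "c": "center", "j": "justify"}

# Eight-valued vocabulary from the second fragment in the example.
LONG = ["left", "general", "right", "center", "justify",
        "distributed", "fill", "center across selection"]


def to_short(long_value: str) -> Optional[str]:
    """Map an eight-valued alignment to its four-valued code, if any."""
    for code, name in SHORT.items():
        if name == long_value:
            return code
    return None  # no equivalent in the smaller vocabulary


# Values that cannot move from one fragment type to the other unchanged.
unmappable = [value for value in LONG if to_short(value) is None]
```

Half of the larger vocabulary has no counterpart in the smaller one, so any code that moves content between the two kinds of fragment has to decide what to drop or approximate. With a single shared definition, that decision would not exist.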

So there you have it. It is without much surprise that Office 2007 does not ship with any XML tooling support.