Monday, May 10, 2010

The OOXML interoperability scam

St├ęphane Rodriguez, May 2010

Every time the Microsoft Office team pushes a comment on the wire, there is another pledge for interoperability. It has been so common for the last few years that if you haven't actually watched what it might mean, pretty much OOXML is synonym with interoperability.

Of course, it does not matter that the word interoperability alone does not mean anything. That is why Microsoft uses it so much. You can pretty much put an interoperability label to anything as long as it is not accurately defined. Does it mean document-level interoperability? Application-level interoperability? Or, perhaps is it just Microsoft-only interoperability (a good guess!)?

The pledge for interoperability cannot possibly mean document-level interoperability since we are not there : OOXML is full of non-XML streams, barely defined at all (the official papers lack everything related to international features, and that is just one example), so that ends any serious discussion precociously. In the remainder of his article, I'll be taking a look at application-level interoperability, in case Microsoft means that.

Just out of generosity I'll be doing tests with Office 2010, the latest version their crappy product suite, allowing them to take the time (no less than 3 years) worth of improvements for features that are supposed to be part of Office 2007 already (which RTM'ed in Nov 2006). Specifically I'll be focusing on Excel 2010 since I've already mentioned a lack of application-level interoperability in my article about content controls in Word (pointing out that so-called XML features like this actually require running Word instances to work, contrary to the open standard pledge).

A good example, when it comes to application-level interoperability, is what is stored in the Windows clipboard when you copy/paste content to or from an Excel spreadsheet. Indeed, Microsoft Office has been used to store binary formats in the clipboard, i.e. proprietary, therefore limiting interoperability across applications. Scenarios like this include the ability to copy/paste content with high fidelity between a running instance of Excel and a running instance of another application (related to spreadsheets or not, assumging it's a OOXML client). Therefore a good measure of whether or not Microsoft has improved interoperability in the OOXML timeframe is whether Excel 2010 actually stores OOXML in the clipboard. Let's see if that is the case.

To run the test, we simply create a trivial spreadsheet with a few values in cells and a couple graphics :

A simple spreadsheet to run the test

The test consists in selecting this cell area and hitting Ctrl-C. That stores the content in the clipboard (hosted by the Windows shell). We know that if we hit Ctrl-V in a new spreadsheet, the content will be pasted with high fidelity, leading us to believe that not only everything needed to do just that is stored in the clipboard, but given the advances and openness of OOXML, we might be able to do that from a separate application.

Let's see what is in the clipboard at that moment. To do that, we simply use the built-in Windows clipboard viewer : hit Windows+R, enter clipbrd.exe and hit enter. We can see, by default, a text version of what is supposed to be there :

The content of the clipboard

The clipboard stores content in one or more formats, either standard or proprietary. The clipboard viewer lists them :

Storing part of a spreadsheet : internal file formats in the clipboard

  • Enhanced metafile (0 byte)
  • Metafile (16 bytes)
  • Bitmap (0 byte)
  • Unicode text (254 bytes)
  • Text (127 bytes)
  • Displayed text (14 bytes)
  • Regional parameters (4 bytes)
  • OEM text (127 bytes)
  • DIB Bitmap (829320 bytes)
  • DataObject (4 bytes)
  • Biff12 (6219 bytes)
  • Biff8 (15872 bytes)
  • Biff5 (10752 bytes)
  • Sylk (1615 bytes)
  • DIF (848 bytes)
  • XML Spreadsheet (1972 bytes)
  • HTML Format (13763 bytes)
  • Csv (127 bytes)
  • Hyperlink (156 bytes)
  • Rich Text Format (32768 bytes)
  • Embed source (161348 bytes)
  • Native (161348 bytes)
  • OwnerLink (35 bytes)
  • Object Descriptor (154 bytes)
  • Link Source (188 bytes)
  • Link Source Descriptor (154 bytes)
  • Link (45 bytes)
  • ObjectLink (54 bytes)
  • Ole Private Data (728 bytes)

In the screen capture above, non-grayed formats are standard and implemented by pretty much all applications, whereas grayed-out formats are proprietary and application-specific (Excel). Of all formats, only a couple can actually contain the relevant content in high fidelity. These are :

  • Biff12
  • Biff8
  • Biff5
  • XML Spreadsheet
  • HTML Format

Let's take a look at the list we have.

  • BIFF is the acronym of the binary Excel file format. 5, 8 and 12 are revisions of the file format.
  • XML Spreadsheet is actually Excel 2003's data-only XML file format, a file format deemed so poor that it was completely rewritten to what became OOXML. And of course, this is just data so we lose the graphics.
  • HTML format : this is spaghetti HTML-MSO markup for pure display purposes. Graphics are expressed in terms of bitmaps so we would lose the definition.

As a result, only the BIFF formats are really at our disposal. It goes even worse when you realize that BIFF12 is actually not BIFF at all. In fact, in every single Excel release, Microsoft creates a new revision of the BIFF file format and for some reason it is still called BIFF8. BIFF12, on the other hand, is a completely new file format. It is Zipped and respects the OPC packaging, but it is made of .bin entries which store a new file format which is neither the regular BIFF nor XML.

The conclusion is that there is actually no way for an OOXML consumer application to rely on the "standard" OOXML to interoperate with Excel 2010 (and Excel 2007). It's all back to binary formats even though, by registering proprietary formats in the clipboard, the Excel team had years to implement this opportunity to store real OOXML in there. What is a "standard" good for if it's not used by the one and only reference application out there? This is lousy engineering at its worst. Or, for what matters, lock-in strategy.

The article could end here, but just to show how backwards is just about everything Microsoft does, let's take a look at how Excel interoperates with itself when it comes to actually realizing a Ctrl-C / Ctrl-V.

Let's take our test spreadsheet again. Select the cell area and hit Ctrl-C. Now create a new workbook in the same Excel instance while leaving the other workbook open, then hit Ctrl-V. Everything is copied there, in high fidelity including graphics.

Now if you create the new workbook in a new Excel instance (so you have two running Excel instances), hitting Ctrl-V will not copy the graphics and will go as far as incorrectly copy the data : see for instance we had the list of values {2,3,5} and they have been turn to {2-Jan-00, 3-Jan-00, 5-Jan-00}. Formulas are lost and everything. Which is utterly ridiculous.

Copying content across two Excel instances is inaccurate, incorrect and incomplete

At this point, we can conclude that either copying content accurately across Excel instances does not work at all (so a few hundred million clients out there are using a broken product), or it incorrectly uses the content in the clipboard, or the way it works is when the SAME Excel instance communicates across workbooks in order to pass internal data structures and graphics structures. Which means the clipboard is just for second class or third class citizens, and copy/paste is simply unreliable unless you are doing it with Excel. Which is the very definition of a lock-in product. Microsoft has every right to sell lock-in products, but they should not try to lie governments and the public out there : the OOXML standard is no evidence of any form of application-level interoperability. We have just proven it with a trivial scenario.

Just for the record, the BIFF12 file in the clipboard does not contain graphics at all. So even what is stored in the clipboard is just second class or third class citizenship.

Comparing OpenOffice and Microsoft Office

Microsoft Office is just a disaster. Out of sheer curiosity, and out of fairness, it is interesting to see how OpenOffice compares. Remember that, in the public out there, Microsoft apologists call OpenOffice a simplistic product that does not even support formulas.

Again, we'll be sticking with OpenOffice calc, the counterpart of Microsoft Excel. Tests will be done with OpenOffice 3.2, the latest major release of the product suite.

Comparing the application-level interoperability in OpenOffice 3.2

The content in question is a replica of what has been used so far. In OpenOffice, we select the cell area around a few cells, formulas and graphics. And we hit Ctrl-C. And then looking up the clipboard, we have the following formats in there :

  • DataObject (4 bytes)
  • Star Embed Source (XML) (10563 bytes)
  • Star Object Descriptor (XML) (77 bytes)
  • GDIMetaFile (28597 bytes)
  • CF_ENHMETAFILE (0 bytes)
  • CF_METAFILEPICT (16 bytes)
  • CF_DIB (632648 bytes)
  • Windows Bitmap (632662 bytes)
  • HTML (HyperText Markup Language) (10679 bytes)
  • HTML Format (10785 bytes)
  • CF_SYLK (164 bytes)
  • Link (36 bytes)
  • CF_DIF (1422 bytes)
  • CF_UNICODETEXT (472 bytes)
  • CF_TEXT (236 bytes)
  • Rich Text Format (5876 bytes)
  • Ole Private Data (568 bytes)
  • CF_LOCALE (4 bytes)
  • CF_OEMTEXT (236 bytes)
  • CF_BITMAP (0 bytes)

Off these, only a couple can represent the content in full fidelity (given their size and name). The potential candidates are :

  • Star Embed Source (XML) (10563 bytes)
  • HTML (HyperText Markup Language) (10679 bytes)
  • HTML Format (10785 bytes)

Obviously HTML won't be kept much longer for further inspection since HTML is by design spaghetti markup designed for display purposes only. Graphics are replaced by bitmaps so we lose the definitions.

In fact only one format remains, Star Embed Source. Chances are that this name comes from Star Office, the original name of the OpenOffice project. And indeed, upon inspection, it's an .ODS file, that is a full fidelity copy of the content.

If we grab the content of the clipboard, make sure to call it a .ODS file and open it in OpenOffice, it shows the following :

Grabbing the content of the clipboard : full fidelity .ODS file

It has everything you inspect, including formulas, formatting and graphics themselves. So if you were an ODS client, just reading the clipboard, you are able to interoperate with OpenOffice. This is the very definition of application-level interoperability. And it's very surreal that it works with OpenOffice, and not Microsoft Office...

Now for the second test, let's see if we copy the content and paste it in a new workbook of another instance of OpenOffice, does it remain in full fidelity?

Simply select the cell area again, hit Ctrl-C, then go in the OpenOffice install folder (double-clicking on the OpenOffice shortcut in the desktop will not start a new instance so we have to workaround that), unfold c:\Program files\OpenOffice 3\program and double-click on scalc.exe. A new OpenOffice instance starts. Hit Ctrl-V. And you can see for yourself that everything is there : cells, formatting, formulas, graphics.

Simple tests like this leave me a bit speechless when you see that Microsoft Office is supposed to be the rolls royce of Office programs in the world, the de facto standard. And in fact it's just crap. On the contrary OpenOffice, the free suite, is actually a more serious product when it comes to application-level interoperability. This had to be said...