Tuesday, November 24, 2009

Office 2010, "operation Barbarossa" edition

Stéphane Rodriguez, November 2009

Previous articles :
- Microsoft's latest aggression on ODF, codenamed "cast lead"
- Beating a dead horse
- Follow up on Microsoft latest bullshit announcement
- Microsoft latest bullshit : native support of ODF in Office 2007
- Custom XML? What Custom XML?
- Backwards compatible? One more lie by omission
- Bad surprise in Microsoft Office binary documents : interoperability remains impossible
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design

With the public release of the latest crap from Redmond in the form of a beta preview of Office 2010, I thought it was a good time to revisit some blatant design and interoperability flaws in the file format, a file format which in and of and itself is more a farce and a fraud, an embarassment for whoever with a software engineering grade. Coming from Microsoft however, the digital nazis of our time, that was expected.

What is discussed here is not an exhaustive list of flaws which, if fixed, would make it a proper file format. There is no "just a few bad apples" thingy going on here. It is fatally flawed. This file format is defective by design. Furthermore it was designed and maintained by persons who should not be allowed to touch a keyboard ever again.

As we are going to see, even though no less than four years have passed (the beta of Office 2007 was made available in late 2005), the situation hasn't changed a bit. Microsoft has obviously concentrated their budgets elsewhere, thereby exemplifying that their agenda was not to come up with an XML-based file format for Office files which would be useful for the general interest, but instead to rubber timestamp whatever piece of crap they had and then move on. As can be seen by anyone putting his hands on it, XML is in fact a pretext. The very design of the file format makes them not suitable at all in a standards world revolving around XML tools, mappings and data/application platform interoperabilities.

Lest not forget that OOXML was defined not as an XML format per se, but as a mapping meant to preserve a so-called binary legacy. A mapping whose absence in the ECMA/ISO papers made the ISO blitz a fraud. Now Microsoft has since gone even further by beginning a global ethnic cleansing policy, the introduction of a malware called MS-ODF, i.e. a Microsoft canada dry version of the ODF file format, whose intent is not secret : what can be the purpose of deliberately storing Microsoft own proprietary formulas and other details in the ODF files (obviously expecting ODF formulas), and to limit the support of ODF to just a limited set of features ? Interestingly, while they had resources to cover for the planned genocide of existing ODF assets, they apparently did not have such resources for coming up with a Save As ISO 29500 format in Microsoft Office 2007 or Microsoft Office 2010, i.e. the claim with which they began the aggression war on civil people. This basic fact is now routinely hidden behind words. The Gauleiters at Microsoft claim to support transient OOXML files not strict OOXML files. The only problem is that "transient" means just about anything they want, i.e. including undocumented stuff, breaking changes, whatever pleases them. While "strict" means the actual ISO papers they fought so much for, to the point of bribing people left and right when the disastrous quality of the draft was pointed out by good willing parties. Microsoft does not intend to support what they fought for. Redmond-Nuremberg trials pending.

Anyway, here is a summary of flaws I came up with two years ago, and for each flaw I revisit whether its status has improved in Office 2010 beta.

1) Self-exploding spreadsheets
==> problem remains as is. More about it below.

2) Entered versus stored values
==> problem remains as is. Excel derailed the IT world two years ago with a floating-point flaw.

3) Optimization artefacts become a feature instead of an embarrasment
==> problem remains as is

4) VML isn't XML
==> problem remains as is

5) Open packaging parts minefield
==> problem remains as is

6) International, but US English first and foremost
==> problem remains as is. More to say about it below.

7) Many ways to get in trouble
==> problem remains as is

8) Windows dates
==> problem made even worse. See below.

9) All roads lead to Office 2007
==> problem remains as is

10) A world of ZIP+OLE files
==> problem remains as is

11) Document security is a (bad) joke
==> problem remains as is

12) BIFF is gone...not!
==> problem made even worse. Introduction of a new binary file format, BIFF14, with .XLS file extension to account for new features

13) Document backwards compatibility subject to neutrino radio-activity
==> problem made even worse. See below

14) ECMA 376 documents just do not exist
==> problem remains as is. Oviously ECMA 376 continues to be just vaporware. There is simply no "Save as ECMA 376" or "Save as strict ISO 29500" option available.

15) How the ISO OpenDocument format (ODF) compares?
==> problem made even worse. with the introduction of the MS-ODF file format, incompatible with the ISO ODF standard, the situation was intentionally made worse.

Now that we have a pretty good idea of how "good" (NOT!) Office 2010 is, let's add a few more flaws.

16) Self-exploding charts
17) International, but US English first and foremost
18) Worsening the issue with Windows dates
19) XML as bad as binary

16) Self-exploding charts

All it takes to get an idea of how much Microsoft pays attention to backwards compatibility is a counter-example. I show this case of a trivial .XLS spreadsheet including a chart, which opens well in all Excel versions including Excel 2007, but explodes in Excel 2010 beta. This spreadsheet is generated by a third-party component. Obviously Excel 2010 beta introduces breaking changes in how it parses .XLS spreadsheets. In this case, the legend is destroyed and the plot area bounces off the boundaries.

A trivial .XLS spreadsheet including a chart, opened in Excel 2003

The same file, opened in Excel 2007 SP1

The same file, open in Excel 2010 beta

17) International, but US English first and foremost

All it takes to get an idea of how much Microsoft respects the assets of their own customers is to use some of the functions exposed by Excel. Just creating a spreadsheet using an Excel language version, say in English, and handing it to someone owning a different language, say in French, results in a corrupt spreadsheet with no way to fix it.

Let's thus take the CELL() function. It's a helper function meaning to extract content or information from an arbitrary cell. The problem is, by design, the first parameter passed to this function is a string which is only evaluated in the corresponding language version of Excel.

In a spreadsheet created using a French version of Excel, let's insert in cell A1 the following function : CELLULE("adresse";B2:B5). The function returns $B$2. Now if you open this spreadsheet in an English version of Excel, the A1 cell now returns an error : #VALUE! and a floating tooltip which says : "A value used in the formula is of the wrong data type".

Well, not only the error message is incorrect, but one immediately sees the distribution effect of poorly designed cell functions.

Guess what, this has not been fixed in OOXML.

As a side note, here is how Excel 2007 or Excel 2010 beta store the mentioned function : CELL("adresse",B2:B5). Which means Excel went through several legitimate conversion steps in order to make it US-English, but not everything. "adresse" should have been converted to "address". It is amazing that this tool is used outside the USA. Any serious global compliance draft for platform interoperabilities would crush Excel as not suitable for such purpose.

Of course, a true fix is to create a CELL.ADDRESS() function to address the matter. And to make sure in the public documentation that the locales of Excel functions are listed.

You would think not only all of the locale flavors of the function and "address" strings would be documented. They are not. You would also think that Excel's user interface warns the user against using such function. It does not.

If you are interested in the (fake) ongoing debate between ODF and OOXML, you must read Microsoft apologists pretend that ODF does not support formulas and therefore OpenOffice Calc is just a simplistic spreadsheet tool unlike Microsoft Excel. Not only is this a lie from a user point of view, but from a standards point of view, there is much left to be desired in the Excel OOXML documentation. Where is the map of function names in the 35+ Excel locales?

If that's not amateur work, it's intentionally defective.

18) Worsening the issue with Windows dates

Microsoft has recently made a huge return of investment in terms of public image (in technical circles) on the subject of Windows dates, you know, the fact that dates in Excel spreadsheets are using a Windows encoding, flawed in multiple ways. Microsoft was quick to announce they took great care of the issue by implementing in Excel 2010 a truely standard date type, namely something that would respect ISO standards for once, i.e. ISO 8601.

Well the implementation is far worse than one can imagine.

  • Excel 2010 does implement cell date types for spreadsheets, but dates are found in multiple other places including in spreadsheet files. So this is not a converging implementation meant to avoid interoperability issues. Rather, it just gives more work to implementers.
  • Second, Excel 2010 does implement cell date types for spreadsheets, but it's off by default. It's an option, and it's unchecked by default. The mechanism is enforced only if the user checks the corresponding flag in advanced Excel options. Expecting users to find and check a vague option hidden in the mass of Excel options is an insult to those who expected Microsoft to do something good at least once. This pretty much guarantees that spreadsheets with truely ISO 8601 dates will never exist out there. Clever Microsoft.
  • Third, Excel 2010 does implement cell data types for spreadsheet files without letting know the file implementer, which means any application that is not ready to read and parse "d" types in cells will break. The implementation could not be further from implicit XML rules. Instead of having a cell child of date type, Microsoft Excel simply implants a date type right where it isn't expected according to the previous implementation (Excel 2007). Application break guaranteed.

The level of hypocrisy from Microsoft is astounding. If you are not speechless already, you should.

19) XML as bad as binary

Yes another flawed claim is that Microsoft pushes XML, this idea that their software is now more XML native. This is simply not true. Let's take a case with spreadsheet where changing an unrelated attribute in a XML part describing the workbook simply corrupts the charts described in other XML parts. It goes against the premise of XML, what makes XML better than a binary file format, which is that modifying an XML fragment somewhere should not in principle affect other XML fragments.

Here is the flaw :

A simple chart created with Excel 2010 beta

The same file in Excel 2010 beta after a minor modification in the workbook part

Notice how the plot area exploded off the boundaries, killing all the automatic positioning of the chart title and the legend.

Here is how to reproduce the flaw :

- start Excel 2010 beta
- create a spreadsheet, add a few values, and create a chart from those values
- save the file as Chart.xlsx
- close and quit Excel 2010 beta
- rename the file as Chart.xlsx.zip
- unzip it and edit the part called xl/workbook.xml
- in this part, replace lastEdited="5" by lastEdited="1", as in :

<fileVersion appName="xl" lastEdited="1" lowestEdited="5" rupBuild="9114"/>

- zip the file again
- double-click on it

A flaw like this anihilates the reason why an OOXML file was broken into "independent" pieces, which otherwise would have allowed template based scenarios. It's not like it's impossible. It's needlessly complicated and reasons for the file to be actually corrupt just based on some tiny modifications simply arise. It's not better than what used to be the problems with binary files.

Let's what the official specification tells us about the lastEdited attribute. In section 3.2.13, page 1897 :

lastEdited (Last Edited Version)

Specifies the version of the application that last saved the workbook. This attribute is application-dependent.
The possible values for this attribute are defined by the XML Schema string datatype.

So what do we learn? Nothing. Furthermore, "application-dependent" is Microsoft novlang for "undocumented". How clever indeed. Good luck if you are an implementer.

Microsoft, knowing that much of the actual file format information is left for one to guess, has published a site that is supposed to document flaws and misses in the documentation. Related to the lastEdited attribute, the additional note lets us know that the integer should be between 0 and 32767. How helpful indeed...