tag:blogger.com,1999:blog-2833906749239405322024-03-12T16:09:20.830-07:00OOXML is defective by designOOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comBlogger13125tag:blogger.com,1999:blog-283390674923940532.post-70122834980344734782010-05-10T13:18:00.000-07:002010-05-11T06:37:07.695-07:00The OOXML interoperability scam<span style="font-style:italic;">Stéphane Rodriguez, May 2010</span><br /><br />Every time the Microsoft Office team pushes a comment on the wire, there is another pledge for interoperability. It has been so common for the last few years that if you haven't actually watched what it might mean, pretty much OOXML is synonym with interoperability.<br /><br />Of course, it does not matter that the word interoperability alone does not mean anything. That is why Microsoft uses it so much. You can pretty much put an interoperability label to anything as long as it is not accurately defined. Does it mean document-level interoperability? Application-level interoperability? Or, perhaps is it just Microsoft-only interoperability (a good guess!)?<br /><br />The pledge for interoperability cannot possibly mean document-level interoperability since we are not there : OOXML is full of non-XML streams, barely defined at all (the official papers lack everything related to international features, and that is just one example), so that ends any serious discussion precociously. In the remainder of his article, I'll be taking a look at application-level interoperability, in case Microsoft means that.<br /><br />Just out of generosity I'll be doing tests with Office 2010, the latest version their crappy product suite, allowing them to take the time (no less than 3 years) worth of improvements for features that are supposed to be part of Office 2007 already (which RTM'ed in Nov 2006). Specifically I'll be focusing on Excel 2010 since I've already mentioned a lack of application-level interoperability in my <a href="http://ooxmlisdefectivebydesign.blogspot.com/2010/01/shaving-off-standard-xml-for.html">article</a> about content controls in Word (pointing out that so-called XML features like this actually require running Word instances to work, contrary to the open standard pledge).<br /><br />A good example, when it comes to application-level interoperability, is what is stored in the Windows clipboard when you copy/paste content to or from an Excel spreadsheet. Indeed, Microsoft Office has been used to store binary formats in the clipboard, i.e. proprietary, therefore limiting interoperability across applications. Scenarios like this include the ability to copy/paste content with high fidelity between a running instance of Excel and a running instance of another application (related to spreadsheets or not, assumging it's a OOXML client). Therefore a good measure of whether or not Microsoft has improved interoperability in the OOXML timeframe is whether Excel 2010 actually stores OOXML in the clipboard. Let's see if that is the case.<br /><br />To run the test, we simply create a trivial spreadsheet with a few values in cells and a couple graphics :<br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability1.jpg"><br /><i>A simple spreadsheet to run the test</i><br /><br />The test consists in selecting this cell area and hitting Ctrl-C. That stores the content in the clipboard (hosted by the Windows shell). We know that if we hit Ctrl-V in a new spreadsheet, the content will be pasted with high fidelity, leading us to believe that not only everything needed to do just that is stored in the clipboard, but given the advances and openness of OOXML, we might be able to do that from a separate application.<br /><br />Let's see what is in the clipboard at that moment. To do that, we simply use the built-in Windows clipboard viewer : hit Windows+R, enter clipbrd.exe and hit enter. We can see, by default, a text version of what is supposed to be there :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability2.jpg"><br /><i>The content of the clipboard</i><br /><br />The clipboard stores content in one or more formats, either standard or proprietary. The clipboard viewer lists them :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability3.jpg"><br /><i>Storing part of a spreadsheet : internal file formats in the clipboard</i><br /><br /><ul><br /><li>Enhanced metafile (0 byte)<br /><li>Metafile (16 bytes)<br /><li>Bitmap (0 byte)<br /><li>Unicode text (254 bytes)<br /><li>Text (127 bytes)<br /><li>Displayed text (14 bytes)<br /><li>Regional parameters (4 bytes)<br /><li>OEM text (127 bytes)<br /><li>DIB Bitmap (829320 bytes)<br /><li>DataObject (4 bytes)<br /><li>Biff12 (6219 bytes)<br /><li>Biff8 (15872 bytes)<br /><li>Biff5 (10752 bytes)<br /><li>Sylk (1615 bytes)<br /><li>DIF (848 bytes)<br /><li>XML Spreadsheet (1972 bytes)<br /><li>HTML Format (13763 bytes)<br /><li>Csv (127 bytes)<br /><li>Hyperlink (156 bytes)<br /><li>Rich Text Format (32768 bytes)<br /><li>Embed source (161348 bytes)<br /><li>Native (161348 bytes)<br /><li>OwnerLink (35 bytes)<br /><li>Object Descriptor (154 bytes)<br /><li>Link Source (188 bytes)<br /><li>Link Source Descriptor (154 bytes)<br /><li>Link (45 bytes)<br /><li>ObjectLink (54 bytes)<br /><li>Ole Private Data (728 bytes)<br /></ul><br /><br /><br />In the screen capture above, non-grayed formats are standard and implemented by pretty much all applications, whereas grayed-out formats are proprietary and application-specific (Excel). Of all formats, only a couple can actually contain the relevant content in high fidelity. These are :<br /><ul><br /><li>Biff12<br /><li>Biff8<br /><li>Biff5<br /><li>XML Spreadsheet<br /><li>HTML Format<br /></ul><br /><br />Let's take a look at the list we have.<br /><br /><ul><br /><li>BIFF is the acronym of the binary Excel file format. 5, 8 and 12 are revisions of the file format.<br /><li>XML Spreadsheet is actually Excel 2003's data-only XML file format, a file format deemed so poor that it was completely rewritten to what became OOXML. And of course, this is just data so we lose the graphics.<br /><li>HTML format : this is spaghetti HTML-MSO markup for pure display purposes. Graphics are expressed in terms of bitmaps so we would lose the definition.<br /></ul><br /><br />As a result, only the BIFF formats are really at our disposal. It goes even worse when you realize that BIFF12 is actually not BIFF at all. In fact, in every single Excel release, Microsoft creates a new revision of the BIFF file format and for some reason it is still called BIFF8. BIFF12, on the other hand, is a completely new file format. It is Zipped and respects the OPC packaging, but it is made of .bin entries which store a new file format which is neither the regular BIFF nor XML.<br /><br />The conclusion is that <b>there is actually no way for an OOXML consumer application to rely on the "standard" OOXML to interoperate with Excel 2010 (and Excel 2007)</b>. It's all back to binary formats even though, by registering proprietary formats in the clipboard, the Excel team had years to implement this opportunity to store real OOXML in there. What is a "standard" good for if it's not used by the one and only reference application out there? This is lousy engineering at its worst. Or, for what matters, lock-in strategy.<br /><br />The article could end here, but just to show how backwards is just about everything Microsoft does, let's take a look at how Excel interoperates with itself when it comes to actually realizing a Ctrl-C / Ctrl-V.<br /><br />Let's take our test spreadsheet again. Select the cell area and hit Ctrl-C. Now create a new workbook in the same Excel instance while leaving the other workbook open, then hit Ctrl-V. Everything is copied there, in high fidelity including graphics.<br /><br />Now if you create the new workbook in a new Excel instance (so you have two running Excel instances), hitting Ctrl-V will not copy the graphics and will go as far as incorrectly copy the data : see for instance we had the list of values {2,3,5} and they have been turn to {2-Jan-00, 3-Jan-00, 5-Jan-00}. Formulas are lost and everything. Which is utterly ridiculous.<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability4.jpg"><br /><i>Copying content across two Excel instances is inaccurate, incorrect and incomplete</i><br /><br />At this point, we can conclude that either copying content accurately across Excel instances does not work at all (so a few hundred million clients out there are using a broken product), or it incorrectly uses the content in the clipboard, or the way it works is when the SAME Excel instance communicates across workbooks in order to pass internal data structures and graphics structures. Which means the clipboard is just for second class or third class citizens, and copy/paste is simply unreliable unless you are doing it with Excel. Which is <b>the very definition of a lock-in product</b>. Microsoft has every right to sell lock-in products, but they should not try to lie governments and the public out there : the OOXML standard is no evidence of any form of application-level interoperability. We have just proven it with a trivial scenario.<br /><br />Just for the record, the BIFF12 file in the clipboard does not contain graphics at all. So even what is stored in the clipboard is just second class or third class citizenship.<br /><br /><h2>Comparing OpenOffice and Microsoft Office</h2><br /><br />Microsoft Office is just a disaster. Out of sheer curiosity, and out of fairness, it is interesting to see how OpenOffice compares. Remember that, in the public out there, Microsoft apologists call OpenOffice a simplistic product that does not even support formulas.<br /><br />Again, we'll be sticking with OpenOffice calc, the counterpart of Microsoft Excel. Tests will be done with OpenOffice 3.2, the latest major release of the product suite.<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability5.jpg"><br><br /><i>Comparing the application-level interoperability in OpenOffice 3.2</i><br /><br />The content in question is a replica of what has been used so far. In OpenOffice, we select the cell area around a few cells, formulas and graphics. And we hit Ctrl-C. And then looking up the clipboard, we have the following formats in there :<br /><br /><ul><br /><li>DataObject (4 bytes)<br /><li>Star Embed Source (XML) (10563 bytes)<br /><li>Star Object Descriptor (XML) (77 bytes)<br /><li>GDIMetaFile (28597 bytes)<br /><li>CF_ENHMETAFILE (0 bytes)<br /><li>CF_METAFILEPICT (16 bytes)<br /><li>CF_DIB (632648 bytes)<br /><li>Windows Bitmap (632662 bytes)<br /><li>HTML (HyperText Markup Language) (10679 bytes)<br /><li>HTML Format (10785 bytes)<br /><li>CF_SYLK (164 bytes)<br /><li>Link (36 bytes)<br /><li>CF_DIF (1422 bytes)<br /><li>CF_UNICODETEXT (472 bytes)<br /><li>CF_TEXT (236 bytes)<br /><li>Rich Text Format (5876 bytes)<br /><li>Ole Private Data (568 bytes)<br /><li>CF_LOCALE (4 bytes)<br /><li>CF_OEMTEXT (236 bytes)<br /><li>CF_BITMAP (0 bytes)<br /></ul><br /><br />Off these, only a couple can represent the content in full fidelity (given their size and name). The potential candidates are :<br /><br /><ul><br /><li>Star Embed Source (XML) (10563 bytes)<br /><li>HTML (HyperText Markup Language) (10679 bytes)<br /><li>HTML Format (10785 bytes)<br /></ul><br /><br />Obviously HTML won't be kept much longer for further inspection since HTML is by design spaghetti markup designed for display purposes only. Graphics are replaced by bitmaps so we lose the definitions.<br /><br />In fact only one format remains, Star Embed Source. Chances are that this name comes from Star Office, the original name of the OpenOffice project. And indeed, upon inspection, it's an .ODS file, that is a full fidelity copy of the content.<br /><br />If we grab the content of the clipboard, make sure to call it a .ODS file and open it in OpenOffice, it shows the following :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability6.jpg"><br /><i>Grabbing the content of the clipboard : full fidelity .ODS file</i><br /><br />It has everything you inspect, including formulas, formatting and graphics themselves. So if you were an ODS client, just reading the clipboard, you are able to interoperate with OpenOffice. This is the very definition of application-level interoperability. And it's very surreal that it works with OpenOffice, and not Microsoft Office...<br /><br />Now for the second test, let's see if we copy the content and paste it in a new workbook of another instance of OpenOffice, does it remain in full fidelity?<br /><br />Simply select the cell area again, hit Ctrl-C, then go in the OpenOffice install folder (double-clicking on the OpenOffice shortcut in the desktop will not start a new instance so we have to workaround that), unfold c:\Program files\OpenOffice 3\program and double-click on scalc.exe. A new OpenOffice instance starts. Hit Ctrl-V. And you can see for yourself that everything is there : cells, formatting, formulas, graphics.<br /><br />Simple tests like this leave me a bit speechless when you see that Microsoft Office is supposed to be the rolls royce of Office programs in the world, the de facto standard. And in fact it's just crap. On the contrary OpenOffice, the free suite, is actually a more serious product when it comes to application-level interoperability. This had to be said...OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-87096273771037713052010-01-08T05:14:00.000-08:002010-01-08T09:54:27.617-08:00Shaving off standard XML for proprietary stuff<i>Stéphane Rodriguez, Jan 2010</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2009/11/office-2010-operation-barbarossa.html">Office 2010, "operation Barbarossa" edition</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2009/05/microsofts-latest-aggression-on-odf.html">Microsoft's latest aggression on ODF, codenamed "cast lead"</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/07/beating-dead-horse.html">Beating a dead horse</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/follow-up-on-microsoft-latest-bullshit.html">Follow up on Microsoft latest bullshit announcement</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/microsoft-latest-bullshit-native.html">Microsoft latest bullshit : native support of ODF in Office 2007</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/custom-xml-what-custom-xml.html">Custom XML? What Custom XML?</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/backwards-compatible-one-more-lie-by.html">Backwards compatible? One more lie by omission</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br />To get a feel of what happened when Microsoft shaved Word 2003 and 2007 off the XML feature which is, as we are being told, infringing one of <a href="http://www.i4ilp.com/">I4I's patents</a>, all you have to do is start Word 2003 and show the help menus.<br /><br />Unfold the XML section, and depending if you are online or not, you'll see a different thing. First start with the offline Word 2003 as shown below :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/ImageWord2003OfflineHelp.jpg" border=1><br /><i>Word 2003's offline help uncovers the XML features</i><br /><br /><br />And then to go online, only to see the entire XML section shaved off :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/ImageWord2003OnlineHelp.jpg" border=1><br /><i>Word 2003's online help disclosing shaved off XML features</i><br /><br /><br />Ironically enough is the fact that all XML features before the cut off were regular, shall we say interoperable features, such as XML documents, XML schemas (standard XSD), document binding (standard XSLT), validation and so on. Pretty much your perfect XML toolkit.<br /><br />It's now shaved off. And Microsoft has made it mandatory by pushing updates of their Office suite touching present and previous Word releases.<br /><br />(as a side note, only Word is touched because neither Excel or Powerpoint ever implemented a standard XML toolkit. All their built-in or canada dry XML was already proprietary crap à la Microsoft).<br /><br />Since those features are shaved off, one would expect XML to be entirely off of Word. Nope.<br /><br />In fact XML is still implemented there, since in fact the Word team (the horse behind it being Mr Brian Jones, whose name appears in a gazillion patents related to Office Open XML formats, the ISO file format bought by and for Microsoft) tried many times to come up with some kind of XML stuff. They never got there on the first try, but the feature remained there when they came up with another attempt. And the second try remained there when they came up with the third try and so on. Long story short, there is XML crap all over the place in Word, resulting from previous failed attempts.<br /><br />At the time of writing they still have smart tags, smart documents, content controls. And more.<br /><br />Now what is of interest is Microsoft is touting as the alternative XML features made available for customers and that does not infringe, at least they tell us so, with the I4I's patents.<br /><br />The stuff in question is called content controls. It used to be called Custom XML. See a previous <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/custom-xml-what-custom-xml.html">article</a> I wrote in which I have excerpts from a few Microsoft Office guys telling us what this thing is, including Brian Jones.<br /><br />Of course Custom XML is now called "content controls" since Custom XML is too close to I4I's patents. It's obvious that it touches the patents and therefore should be removed from the product as well.<br /><br />So what are content controls you ask? The Office Word team tells us in a couple articles they wrote. You can look <a href="http://blogs.msdn.com/microsoft_office_word/archive/2006/11/17/meet-the-controls.aspx">here</a> for the definition, and <a href="http://blogs.msdn.com/microsoft_office_word/archive/2007/01/10/separate-yet-equal.aspx">here</a> for a few tutorials for developers. Those articles were written with much innocence back in 2006.<br /><br /><br /><img src="http://tristand.officeisp.net/wordblog/Pictures/010907_0151_SeparateYet2.png" border=1><br /><i>Microsoft's response to the infringing XML : more infringing XML...</i><br /><br /><br />Content controls are an extension of ActiveX controls, itself based on the 2-decade old OLE technology. See the kind of water we are treading here...<br /><br />Content controls are for instance rich text controls, date controls and so on. It's 100% proprietary. To render, activate, bind a content control at run-time is Microsoft proprietary, undocumented and therefore subject to implementation restrictions. Again, for more, take a look at Brian Jones's and al. patents on the subject <a href="http://www.patentgenius.com/inventedby/JonesBrianRedmondWA.html">here</a>.<br /><br />Running content controls requires a license of Word.<br /><br />Let me repeat. $Running$ $content$ $controls$ $requires$ $a$ $license$ $of$ $Word$.<br /><br />To summarize, Microsoft shaved off interoperable and regular XML technologies (XSD schemas, XSLT transforms) in favor a proprietary alternative, content controls.<br /><br />Who really benefits? That is left as an exercise to the reader.OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-69977580794256701782009-11-24T05:27:00.000-08:002010-01-09T01:46:47.693-08:00Office 2010, "operation Barbarossa" edition<i>Stéphane Rodriguez, November 2009</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2009/05/microsofts-latest-aggression-on-odf.html">Microsoft's latest aggression on ODF, codenamed "cast lead"</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/07/beating-dead-horse.html">Beating a dead horse</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/follow-up-on-microsoft-latest-bullshit.html">Follow up on Microsoft latest bullshit announcement</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/microsoft-latest-bullshit-native.html">Microsoft latest bullshit : native support of ODF in Office 2007</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/custom-xml-what-custom-xml.html">Custom XML? What Custom XML?</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/backwards-compatible-one-more-lie-by.html">Backwards compatible? One more lie by omission</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br />With the public release of the latest crap from Redmond in the form of a beta preview of Office 2010, I thought it was a good time to revisit some blatant design and interoperability flaws in the file format, a file format which in and of and itself is more a farce and a fraud, an embarassment for whoever with a software engineering grade. Coming from Microsoft however, the digital nazis of our time, that was expected. <br /><br />What is discussed here is not an exhaustive list of flaws which, if fixed, would make it a proper file format. There is no "just a few bad apples" thingy going on here. It is fatally flawed. This file format is defective by design. Furthermore it was designed and maintained by persons who should not be allowed to touch a keyboard ever again.<br /><br />As we are going to see, even though no less than four years have passed (the beta of Office 2007 was made available in late 2005), the situation hasn't changed a bit. Microsoft has obviously concentrated their budgets elsewhere, thereby exemplifying that their agenda was not to come up with an XML-based file format for Office files which would be useful for the general interest, but instead to rubber timestamp whatever piece of crap they had and then move on. As can be seen by anyone putting his hands on it, XML is in fact a pretext. The very design of the file format makes them not suitable at all in a standards world revolving around XML tools, mappings and data/application platform interoperabilities.<br /><br />Lest not forget that OOXML was defined not as an XML format per se, but as a mapping meant to preserve a so-called binary legacy. A mapping whose absence in the ECMA/ISO papers made the ISO blitz a fraud. Now Microsoft has since gone even further by beginning a global ethnic cleansing policy, the introduction of a malware called MS-ODF, i.e. a Microsoft canada dry version of the ODF file format, whose intent is not secret : what can be the purpose of deliberately storing Microsoft own proprietary formulas and other details in the ODF files (obviously expecting ODF formulas), and to limit the support of ODF to just a limited set of features ? Interestingly, while they had resources to cover for the planned genocide of existing ODF assets, they apparently did not have such resources for coming up with a Save As ISO 29500 format in Microsoft Office 2007 or Microsoft Office 2010, i.e. the claim with which they began the aggression war on civil people. This basic fact is now routinely hidden behind words. The Gauleiters at Microsoft claim to support transient OOXML files not strict OOXML files. The only problem is that "transient" means just about anything they want, i.e. including undocumented stuff, breaking changes, whatever pleases them. While "strict" means the actual ISO papers they fought so much for, to the point of bribing people left and right when the disastrous quality of the draft was pointed out by good willing parties. Microsoft does not intend to support what they fought for. Redmond-Nuremberg trials pending.<br /><br />Anyway, here is a summary of flaws I came up with <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">two years ago</a>, and for each flaw I revisit whether its status has improved in Office 2010 beta.<br /><br />1) Self-exploding spreadsheets<br />==> <font color=red>problem remains as is. More about it below.</font><br /><br />2) Entered versus stored values<br />==> <font color=red>problem remains as is. Excel derailed the IT world two years ago with a <a href="http://blogs.zdnet.com/hardware/?p=812">floating-point flaw</a>.</font><br /><br />3) Optimization artefacts become a feature instead of an embarrasment<br />==> <font color=red>problem remains as is</font><br /><br />4) VML isn't XML<br />==> <font color=red>problem remains as is</font><br /><br />5) Open packaging parts minefield<br />==> <font color=red>problem remains as is</font><br /><br />6) International, but US English first and foremost<br />==> <font color=red>problem remains as is. More to say about it below.</font><br /><br />7) Many ways to get in trouble<br />==> <font color=red>problem remains as is</font><br /><br />8) Windows dates<br />==> <font color=red>problem made even worse. See below.</font><br /><br />9) All roads lead to Office 2007<br />==> <font color=red>problem remains as is</font><br /><br />10) A world of ZIP+OLE files<br />==> <font color=red>problem remains as is</font><br /><br />11) Document security is a (bad) joke<br />==> <font color=red>problem remains as is</font><br /><br />12) BIFF is gone...not!<br />==> <font color=red>problem made even worse. Introduction of a new binary file format, BIFF14, with .XLS file extension to account for new features</font><br /><br />13) Document backwards compatibility subject to neutrino radio-activity<br />==> <font color=red>problem made even worse. See below</font><br /><br />14) ECMA 376 documents just do not exist<br />==> <font color=red>problem remains as is. Oviously ECMA 376 continues to be just vaporware. There is simply no "Save as ECMA 376" or "Save as strict ISO 29500" option available.</font><br /><br />15) How the ISO OpenDocument format (ODF) compares?<br />==> <font color=red>problem made even worse. with the introduction of the MS-ODF file format, incompatible with the ISO ODF standard, the situation was intentionally made worse.</font><br /><br /><br /><br /><br />Now that we have a pretty good idea of how "good" (NOT!) Office 2010 is, let's add a few more flaws.<br /><br />16) Self-exploding charts<br />17) International, but US English first and foremost<br />18) Worsening the issue with Windows dates<br />19) XML as bad as binary<br /><br /><br /><br />16) Self-exploding charts<br /><br />All it takes to get an idea of how much Microsoft pays attention to backwards compatibility is a counter-example. I show this case of a trivial .XLS spreadsheet including a chart, which opens well in all Excel versions including Excel 2007, but explodes in Excel 2010 beta. This spreadsheet is generated by a third-party component. Obviously Excel 2010 beta introduces breaking changes in how it parses .XLS spreadsheets. In this case, the legend is destroyed and the plot area bounces off the boundaries.<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/ImageDefectiveOffice1_Excel2003.jpg" border=1><br /><i>A trivial .XLS spreadsheet including a chart, opened in Excel 2003</i><br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/ImageDefectiveOffice1_Excel2007.jpg" border=1><br /><i>The same file, opened in Excel 2007 SP1</i><br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/ImageDefectiveOffice1_Excel2010.jpg" border=1><br /><i>The same file, open in Excel 2010 beta</i><br /><br /><br /><br />17) International, but US English first and foremost<br /><br />All it takes to get an idea of how much Microsoft respects the assets of their own customers is to use some of the functions exposed by Excel. Just creating a spreadsheet using an Excel language version, say in English, and handing it to someone owning a different language, say in French, results in a corrupt spreadsheet with no way to fix it.<br /><br />Let's thus take the CELL() function. It's a helper function meaning to extract content or information from an arbitrary cell. The problem is, by design, the first parameter passed to this function is a string which is only evaluated in the corresponding language version of Excel. <br /><br />In a spreadsheet created using a French version of Excel, let's insert in cell A1 the following function : <code>CELLULE("adresse";B2:B5)</code>. The function returns $B$2. Now if you open this spreadsheet in an English version of Excel, the A1 cell now returns an error : #VALUE! and a floating tooltip which says : "A value used in the formula is of the wrong data type".<br /><br />Well, not only the error message is incorrect, but one immediately sees the distribution effect of poorly designed cell functions.<br /><br />Guess what, this has not been fixed in OOXML.<br /><br />As a side note, here is how Excel 2007 or Excel 2010 beta store the mentioned function : <code>CELL("adresse",B2:B5)</code>. Which means Excel went through several legitimate conversion steps in order to make it US-English, but not everything. "adresse" should have been converted to "address". It is amazing that this tool is used outside the USA. Any serious global compliance draft for platform interoperabilities would crush Excel as not suitable for such purpose.<br /><br />Of course, a true fix is to create a <code>CELL.ADDRESS()</code> function to address the matter. And to make sure in the public documentation that the locales of Excel functions are listed. <br /><br />You would think not only all of the locale flavors of the function and "address" strings would be documented. They are not. You would also think that Excel's user interface warns the user against using such function. It does not.<br /><br />If you are interested in the (fake) ongoing debate between ODF and OOXML, you must read Microsoft apologists pretend that ODF does not support formulas and therefore OpenOffice Calc is just a simplistic spreadsheet tool unlike Microsoft Excel. Not only is this a lie from a user point of view, but from a standards point of view, there is much left to be desired in the Excel OOXML documentation. Where is the map of function names in the 35+ Excel locales?<br /><br />If that's not amateur work, it's intentionally defective.<br /><br /><br />18) Worsening the issue with Windows dates<br /><br />Microsoft has recently made a huge return of investment in terms of public image (in technical circles) on the subject of Windows dates, you know, the fact that dates in Excel spreadsheets are using a Windows encoding, flawed in multiple ways. Microsoft was quick to announce they took great care of the issue by implementing in Excel 2010 a truely standard date type, namely something that would respect ISO standards for once, i.e. ISO 8601.<br /><br />Well the implementation is far worse than one can imagine.<br /><br /><ul><br /><li>Excel 2010 does implement cell date types for spreadsheets, but dates are found in multiple other places including in spreadsheet files. So this is not a converging implementation meant to avoid interoperability issues. Rather, it just gives more work to implementers.<br /><li>Second, Excel 2010 does implement cell date types for spreadsheets, but it's off by default. It's an option, and it's unchecked by default. The mechanism is enforced only if the user checks the corresponding flag in advanced Excel options. Expecting users to find and check a vague option hidden in the mass of Excel options is an insult to those who expected Microsoft to do something good at least once. This pretty much guarantees that spreadsheets with truely ISO 8601 dates will never exist out there. Clever Microsoft.<br /><li>Third, Excel 2010 does implement cell data types for spreadsheet files without letting know the file implementer, which means any application that is not ready to read and parse "d" types in cells will break. The implementation could not be further from implicit XML rules. Instead of having a cell child of date type, Microsoft Excel simply implants a date type right where it isn't expected according to the previous implementation (Excel 2007). Application break guaranteed.<br /></ul><br /><br />The level of hypocrisy from Microsoft is astounding. If you are not speechless already, you should.<br /><br /><br /><br />19) XML as bad as binary<br /><br />Yes another flawed claim is that Microsoft pushes XML, this idea that their software is now more XML native. This is simply not true. Let's take a case with spreadsheet where changing an unrelated attribute in a XML part describing the workbook simply corrupts the charts described in other XML parts. It goes against the premise of XML, what makes XML better than a binary file format, which is that modifying an XML fragment somewhere should not in principle affect other XML fragments.<br /><br />Here is the flaw :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/ImageDefectiveOffice2_Excel2010_1.jpg" border=1><br /><i>A simple chart created with Excel 2010 beta</i><br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/ImageDefectiveOffice2_Excel2010_2.jpg" border=1><br /><i>The same file in Excel 2010 beta after a minor modification in the workbook part</i><br /><br />Notice how the plot area exploded off the boundaries, killing all the automatic positioning of the chart title and the legend.<br /><br /><br />Here is how to reproduce the flaw :<br /><br />- start Excel 2010 beta<br />- create a spreadsheet, add a few values, and create a chart from those values<br />- save the file as Chart.xlsx<br />- close and quit Excel 2010 beta<br />- rename the file as Chart.xlsx.zip<br />- unzip it and edit the part called xl/workbook.xml<br />- in this part, replace lastEdited="5" by lastEdited="1", as in :<br /><br /><fileVersion appName="xl" lastEdited="1" lowestEdited="5" rupBuild="9114"/><br /><br />- zip the file again<br />- double-click on it<br /><br />A flaw like this anihilates the reason why an OOXML file was broken into "independent" pieces, which otherwise would have allowed template based scenarios. It's not like it's impossible. It's needlessly complicated and reasons for the file to be actually corrupt just based on some tiny modifications simply arise. It's not better than what used to be the problems with binary files.<br /><br />Let's what the official specification tells us about the <code>lastEdited</code> attribute. In section 3.2.13, page 1897 :<br /><br /><table border=1><tr><td><br /><b>lastEdited (Last Edited Version)</b><br /><br /><i>Specifies the version of the application that last saved the workbook. This attribute is application-dependent.<br />The possible values for this attribute are defined by the XML Schema string datatype.</i></td></tr></table><br /><br />So what do we learn? Nothing. Furthermore, "application-dependent" is Microsoft novlang for "undocumented". How clever indeed. Good luck if you are an implementer.<br /><br />Microsoft, knowing that much of the actual file format information is left for one to guess, has published a <a href="http://www.documentinteropinitiative.org/ECMA-376/reference.aspx">site</a> that is supposed to document flaws and misses in the documentation. Related to the <code>lastEdited</code> attribute, the additional note lets us know that the integer should be between 0 and 32767. How helpful indeed...OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-33164574869018151152009-05-17T01:45:00.000-07:002009-05-17T01:54:18.101-07:00Microsoft's latest aggression on ODF, codenamed "cast lead"<i>Stéphane Rodriguez, May 2009</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/07/beating-dead-horse.html">Beating a dead horse</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/follow-up-on-microsoft-latest-bullshit.html">Follow up on Microsoft latest bullshit announcement</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/microsoft-latest-bullshit-native.html">Microsoft latest bullshit : native support of ODF in Office 2007</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/custom-xml-what-custom-xml.html">Custom XML? What Custom XML?</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/backwards-compatible-one-more-lie-by.html">Backwards compatible? One more lie by omission</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br /><br /><br /><br />Once again they did it. Microsoft is telling the world that they are improving interoperability across existing office formats and applications thanks to their native support for the ODF file format, a leading office file format based on existing ISO standards. But it could not be further from the truth.<br /><br />Microsoft are actually killing ODF, like the digital nazis that they are. Kissinger is proud of their spiritual sons.<br /><br />What kind of white phosphorus are they using ?<br /><br />First they don't write to ODF but to a <i>canada dry</i> version that we shall call MS-ODF, a variant filled with countless exploding mines, thrown from the air like any coward would do. Namely they are implanting the proprietary Excel formula syntax right inside files expecting the ODF formula syntax as exposed by all the ODF compatible applications out there. Since formulas are used in many elements such as charts, conditional formattings and so on, it wrecks any serious spreadsheet.<br /><br />Second this <i>canada dry</i> version only barely tries to look like ODF, by implementing a tiny subset of it (listed <a href="http://office.microsoft.com/en-us/excel/HA102877221033.aspx?pid=CH100648071033">here</a> for spreadsheets) and thereby making sure that it is a one-way trip only.<br /><br />Third, and that is where it shows their true spirit, is that no matter how proprietary MS-ODF is, they make sure to read back what they write, and that alone, not what ODF applications write, thus faking the whole round-trip thing with utter lack of respect for actual users facing various files in their daily lives. In other words, it tells them they'd better keep using a Microsoft Office license and make a particular attention to only accepting MS-ODF files as correct or face the wrath of arbitrarily corrupt files. Here Microsoft is building 8-meter tall walls and every single user becomes a Palestinian.<br /><br />It would not be so bad if Microsoft, the digital nazis of our time, had called this new proprietary file format MS-ODF, and put a new file extension to it, <code>.odf.microsoft</code>, or something like that. But calling MS-ODF .odf is all intentional, and fully budgeted by the neo-cons of the 4th reich, what they really do is steal the .odf brand, across the shell implementation. Which brings me to resistance.<br /><br />How resistants should fight back the aggressor ? First, by telling the peaceful world the genocide that is being committed. Second, by making sure to distribute ODF applications that better take advantage of shell associations. Microsoft will not tell, but that is their achille heel.<br /><br />Good luck!<br /><br /><strong></strong>OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-31808380789116548502008-07-07T09:14:00.000-07:002008-07-16T08:56:04.481-07:00Beating a dead horse<i>Stéphane Rodriguez, July 2008</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/follow-up-on-microsoft-latest-bullshit.html">Follow up on Microsoft latest bullshit announcement</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/microsoft-latest-bullshit-native.html">Microsoft latest bullshit : native support of ODF in Office 2007</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/custom-xml-what-custom-xml.html">Custom XML? What Custom XML?</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/backwards-compatible-one-more-lie-by.html">Backwards compatible? One more lie by omission</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br />It's been 3 months since ISO made that April 1st gag when they declared OOXML a valid candidate for an "open standard", even though it's riddled with patents.<br /><br />Microsoft made the situation even more ridiculous by making available, after April 1st, documents that are absolutely necessary in order to fully implement their file formats. Well, if those documents were not part of the ISO proposal in the first place, then what is the ISO proposal good for? Isn't an "open standard" meant to be implemented by more than one vendor?<br /><br />Completely, utterly, shamelessly, ridiculous. Typical Microsoft.<br /><br />Let's get on with the ridicule. Remember the days before April 1st? A day could not pass without a number of so-called independent companies claiming support for OOXML in one way or another, and telling how good it was. Well, since April 1st, it's like not a single freaking person cares about it. Silence. How so? Wasn't it a fraud to begin with?<br /><br />Let's take the binary migration project that Microsoft launched back in February. Here is a refresher :<br /><br /><blockquote><i>The "Office Binary (doc, xls, ppt) Translator to Open XML" project is now live on sourceforge: <a href="http://b2xtranslator.sourceforge.net/">http://b2xtranslator.sourceforge.net/</a><br /><br />As you may remember, this was a request from a number of national bodies, and while Ecma TC45 believed it was outside of the scope of DIS 29500, they did talk with Microsoft and come to this agreement:<br /><br />Nonetheless, Ecma International discussed this subject with Microsoft Corporation, the author of the Binary Formats. To make it even easier for third party conversion of Binary Format-to-DIS 29500, Microsoft agreed to:<br /><br /> * Initiate a Binary Format-to-ISO/IEC JTC 1 DIS 29500 Translator Project on the open source software development web site SourceForge (http://sourceforge.net/ ) in collaboration with independent software vendors. The Translator Project will create software tools, plus guidance, showing how a document written using the Binary Formats can be translated to DIS 29500. The Translator will be available under the open source Berkeley Software Distribution (BSD) license, and anyone can use the mapping, submit bugs and feedback, or contribute to the Project. The Translator Project will start on February 15, 2008. <br /> * Make it even easier to get access to the Binary Formats documentation by posting it and making it available for a direct download on the Microsoft web site no later than February 15, 2008. The Binary Formats have been under a covenant not to sue and Microsoft will also make them available under its Open Specification Promise (see www.microsoft.com/interop/osp) by the time they are posted. <br /><br />We will modify DIS 29500 to include an informative reference to the SourceForge project.<br /><br />While the project is still in its infancy, you can see what the planned project roadmap is, as well as an early draft of a mapping table between the Word binary format (.doc) and the Open XML format (.docx). </i></blockquote><br /><br />How is this project going on? Let's see for yourself on this <a href="http://b2xtranslator.svn.sourceforge.net/viewvc/b2xtranslator/trunk/">web page</a>. The project is still a 1st revision source code dump, and it's 4-month old. It's hard not to laugh.<br /><br />Who thought Microsoft was serious when they started this project? Everyone worth his salt knows that a project like this involves an almost complete rewrite of both engines, and it could take a decade to do so. It's ridiculous to think that a company or independent people would spend their lives essentially rewriting Microsoft Office code base (the non UI part). After all, isn't it what was essentially done already with OpenOffice? Why isn't Microsoft instead pledging support for the OpenOffice suite by helping implement the undocumented stuff? Alternatively, why don't they instead open source their compatibility pack, a component that migrates Office documents back and forth?<br /><br />It gets better.<br /><br />Earlier this month, Microsoft released another 5000+ pages of documentation. This additional documentation is a direct acknowledgement that what I have been saying on this blog was spot on, which is that the documentation that was made available earlier was just a fraction of what was needed to implement a full run-time of Office documents. At least Microsoft gives way to a so-called anti-Microsoft person. How ironic that is. Well until you understand that the Microsoft bloggers were actually backing me and my products (<a href="http://diffopc.arstdesign.com/">diffopc+</a>, <a href="http://xlsgen.arstdesign.com/">xlsgen</a>, ...) until I started becoming vocal and a critic on the subject. In other words, they were backing me when I was saying positive things about it. Those are not technical people, those are stinking marketing people ready to bribe.<br /><br />If we take a look at the new documentation, it gets a little interesting. First of all, let's say that if you are not one of the 5 persons on the planet who has been involved in hardcore BIFF/Word/PPT/MSO, you are just wasting your time here. This puzzle (smells like a typical PM document) can only be understood by people in the trenches.<br /><br />Then, Microsoft does the typical thing, they make it almost impossible to use the document by sorting everything alphabetically instead of logically, by theme. This attitude is kind of, realising valuable information while making life miserable to anyone who'll try to read it. Again, you cannot fully comprehend a behavior like this without a good dose of cynism.<br /><br />Next, this documentation is still lacking plenty of information. For instance, in [MS-XLS].pdf, page 364, we learn :<br /><blockquote>dwBuild (4 bytes): An unsigned integer that specifies the recalculation engine identifier of the recalculation engine that performed the last recalculation. If the value is less than the recalculation engine identifier associated with the application, the application will recalculate the results of all formulas on this workbook immediately after loading the file.</blockquote><br /><br />It is clear that it's up to everyone I guess to figure out the build numbers used by shipped Excel products over the years. Failing to find those numbers implies a recalculation of the spreadsheet next time it's open, needless to say, something you want to avoid at all costs.<br /><br />Above all, what I find most shameful is that earlier this year was the time Microsoft not only was buying people's voice left and right to obtain the $$ISO$$ rubber timestamp, they were also putting the last touch to Office 2009 (codenamed Office 14). And they have never talked about it. Whatever they have already baked in won't be part of the ISO proposal, making it as useful as an old sock. And doing so, iterating though their fire and motion war strategy which is to come up with new stuff and have competitors spend their time catching up instead of concentrating on their own applications.<br /><br />The ISO episode hasn't changed things a bit in this regard.<br /><br /><font color=red>[Update, july 9] : I wrote this blog post on july 7. When I wrote it, the b2xtranslator project I mention was showing a 4-month old developer branch, with just a single submission in it. Guess what happened in just the matter of two days! Monopoly employees have probably urgently asked their contractors to push something new in the developer branch, to avoid the ridicule. Just two days after I mentioned this blatant deception that Monopoly had created in the first place (a project that will never complete anyway given the amount of work that is needed, and the fact that it makes little sense to do it when the Office compatibility pack already does it), the thing gets "fixed". Very interesting...It smells as if Monopoly is extremely sensitive on this subject while the OOXML appeals are being processed.</font>OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-55882855613011160382008-05-27T04:50:00.000-07:002008-05-28T23:43:45.782-07:00Follow up on Microsoft latest bullshit announcement<i>Stéphane Rodriguez, May 2008</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/microsoft-latest-bullshit-native.html">Microsoft latest bullshit : native support of ODF in Office 2007</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/custom-xml-what-custom-xml.html">Custom XML? What Custom XML?</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/backwards-compatible-one-more-lie-by.html">Backwards compatible? One more lie by omission</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br />A quick article to follow up on the latest <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/05/microsoft-latest-bullshit-native.html">Microsoft bullshit</a>. There is a lot of floating rot in the air at the moment. The blogosphere is breezing with wild rumours, and tech reporters take great care echoing Microsoft PR press pass without a slight of skepticism. I want to take the time to explain what is really going on.<br /><br />Microsoft has won. They wanted the ISO timestamp. They got it. They needed it since governments (and the EU) want such thing for documents now.<br /><br />OOXML? It’s just a rough spec, as it has always been with Office file format specs in the past. The share of undocumented stuff in that document is the same than in the past, and I don’t think that it is even intentional. To understand why this is the case, you only have to consider the fact that Microsoft was not prepared to go to ISO when they were working on Office 2007. They got caught by surprise. Then, when government pressure increased, they rushed the spec through with some of the internal documents they had. But it is important to understand that those documents are handy when you have the source code with it, otherwise it’s just a puzzle and any implementer has to go through reverse engineering to really implement anything significant of it. I know this all too well, having been playing this game for many years.<br /><br />Of course, the ISO process was a farce. With all the money they have, Microsoft went through $$ECMA$$$ rubber timestamp organization, and then ISO, an even more grotesque clique.<br /><br />The delay on supporting OOXML? If you believe Microsoft is having a hard time supporting their own file formats, I think you did not get the memo. Most changes checked in through the February 2008 BRM (final step of ISO) were just cosmetic ones. Those who blogged on the BRM made it very clear that <b>Microsoft insisted that no change that would break Office 2007 documents would be accepted</b>. Estimation of time needed to implement the "changes" : a few days. Remember also that the Office team at Microsoft is a thousand people group. In fact, those changes are more appropriately called "migration changes", since they are not "breaking changes".<br /><br />Let's take two examples :<br />- ISO 8601 dates : ISO 8601 dates were implemented in Excel 2003 already. You can easily try that by saving a spreadsheet file as "XML spreadsheet (.xml)" or "XML data (.xml)" in Excel 2003 or Excel 2007. Microsoft changed their mind in Excel 2007.<br />- VML : Microsoft will simply put VML under a DrawingML namespace and call it part of DrawingML despite the fact that implementers will continue to be burdened by a redundant graphics file format. In addition, the existing documents containing VML generated by Microsoft Office (MHTML, WordML 2003, ExcelML 2003, PowerpointML 2003, some substantial features in WordML 2007, some substantial features in ExcelML 2007, some substantial features in PowerpointML 2007) guarantee that VML has to implemented by any serious competitor. Worse, with VML and its variants over the years, this is as much variants to implement. This has a name : fire and motion. And the goal is to protect Microsoft Office's bottom line.<br /><br />The native (and naive) support of ODF in Office 2007? That is a trick to please governments (and the EU) so that Microsoft appears as a nice group of folks, not the digital nazis that they are. The implementation of ODF will be horrible, the inter-application capabilities such as copy/paste will be inexistent. And implementing ODF 1.1 when ODF 1.2 is already available is a move intended to stall ODF forever.<br /><br />Future of OOXML? There are two answers. Frankly, who freaking cares the paper? This paper will ALWAYS be at odds with the actual Office implementation. We have a good example for the time being, but it has always been the case. What about the actual file format then? It will be the subject of reverse engineering from implementers whose only recourse is to catch up all the undocumented stuff. Make no mistake though, now it's about applications, not documents anymore.<br /><br />Conclusion : if you are still in the OOXML conspiracy game, about time to move on guys.OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-87874439410446169402008-05-22T01:06:00.000-07:002008-05-22T09:04:35.021-07:00Microsoft latest bullshit : native support of ODF in Office 2007<i>Stéphane Rodriguez, May 2008</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/custom-xml-what-custom-xml.html">Custom XML? What Custom XML?</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/backwards-compatible-one-more-lie-by.html">Backwards compatible? One more lie by omission</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br />I wanted to post a quick reaction to the latest Microsoft bullshit <a href="http://www.microsoft.com/Presspass/press/2008/may08/05-21ExpandedFormatsPR.mspx">announcement</a>, in which they reportedly plan to "add native support for ODF 1.1". The way they put is very succinct, intentionally probably, and it opens the door for wild guesses.<br /><br />First of all, Microsoft is a huge Office licensing monopoly. It's so big it even surpasses Windows in sales. Any decline in Office licensing would be dramatic for Microsoft's future. With that alone, you know that any announcement from Microsoft that they are willing to interoperate with other people's software, namely applications, should be taken with a grain of salt.<br /><br />Here is how, with the release of Office 2007, Microsoft intends to keep their monopoly in Office licensing :<br /><br /><strong>Phase 1</strong> - as long as there is not enough Office 2007 documents out there, make sure that customers understand that only Office 2007 can reliably migrate binary files to the new file formats. Hence the backwards compatibility claim which are part of the OOXML ISO marketing diversion (ironically inflated by critics).<br /><br />During this phase, which began in 2006 and should take at least 5 years (at least one major organic corporation upgrade cycle), Microsoft bottom line is at risk. The strategy was therefore to make sure to retain the exclusivity when it comes to migrating file formats, and spending money to get this message heard well by the customers. This is exactly what the infuriating OOXML ECMA and then OOXML ISO was about, not anything remotely related to an international standard meant to "improve interoperability across platforms".<br /><br />Technically speaking, only Microsoft can reliably migrate binary files since only they know the implementation required to do so. Have you noticed that Office 97 shipped 11 years ago, and we have yet to hear about a non-Microsoft application that would strictly interoperate with those files? Besides this, the so-called interoperability documents that were made available back in February 2008 are a farce : this is a bare update of the old MSDN documentation, and everything that was undocumented back in Office 97 days remains exactly as much undocumented. No improvement was made with this, as I explained in a previous article, but sure enough Microsoft exploited it to lure the sheeps who did not actually read it. The real message : <b>keep buying Office 2007 licenses, otherwise you will be hurt</b>. Or, somewhat more verbose, if you deploy non-Microsoft Office software, soon enough you'll get document fidelity issues which will damage your business. We provide the only application that ensures full-fidelity, so it's suicidal to use competing products. In other words, <b>competitor exclusion</b>.<br /><br />Needless to say, this fire and motion strategy has been going on for two decades. Microsoft adds a number of features in their spaghetti codebase, announces it when it ships the product, then competitors have to catch up (years of work usually) instead of concentrating on their products capabilities. The irony is, with Office 2007 for instance, Microsoft has been themselves guilty of lack of full-fidelity with the chart engine that they replaced with a new one killing a number of existing features of the old one (rendering and chart options).<br /><br /><strong>Phase 2</strong> - there is enough Office 2007 documents out there. Game over.<br /><br /><br /><br />With that said, a few more words. <br /><br />In regards to competition between file formats, it has to be understood that the point of Microsoft is to ensure that going forward their internal representation of a file in memory remains what encompasses every other file format. In other words, ODF will become a second-class citizen and it is expected that it will be exploited to downgrade or significally hamper the fidelity of open/save ODF files. Classic Embrace, Extend, Extinguish. Likewise, since Office 2007 is not a native XML application (the internal representation is a bunch of binary structure, not XML DOM), it will never be able to become a basis for ODF-based applications that would really take advantage of XML. In other words, XML itself is being deceptively exploited in order to simply preserve the monopoly.<br /><br />In regards to scenarios that would be enabled by "adding native support for ODF 1.1", what about external references from an ODS file to a XLS/XLSX file? What about copy/pasting back and forth preserving data, context and formatting? The list goes on and on.<br /><br />"native support for ODF 1.1" would also imply that the open source projects that were heavily part of the marketing machinery were just that, marketing. Of course, the message was heard by corporate customers since day one (to them "open source projects" means "unsupported"), so there is nothing really surprising here.<br /><br />Last but not least, with ODF 1.2 just around the corner and its dramatic improvements, there has never been a more strategic time for an enemy of ODF to announce support for ODF...1.1. In case you think the last sentence contradicts itself, consider the fact that the massive Microsoft distribution power amassed from two decades of pushing hard Microsoft Office onto everyone's desktop (reportedly 500+ million licenses) guarantees that whatever version of ODF shipping with Microsoft Office, it will be very hard for those actually contributing to ODF in the OpenOffice project to migrate the user base to whatever next version of ODF, say ODF 1.2. This is how you can make any change to ODF and are unable to get it in the hands of the user base. Yet another win for Microsoft.OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-82510369727122258292008-03-21T07:39:00.000-07:002010-01-19T01:50:29.332-08:00Custom XML? What Custom XML?<i>Stéphane Rodriguez, March 2008</i><br /><br />More recent article on Custom XML (Jan 2010) :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2010/01/shaving-off-standard-xml-for.html">Shaving off standard XML for proprietary stuff</a><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/backwards-compatible-one-more-lie-by.html">Backwards compatible? One more lie by omission</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br />Other than backwards compatibility, the expression "Custom XML" plays an important role in Microsoft ISO OOXML evangelism. It's interesting that Microsoft bloggers don't even seem to be embarassed by ridiculous expressions such as "Custom XML". Custom XML is indeed just as silly as "Office Open XML" : the reason is X in XML already means Custom. So there cannot be a meaningful sense for Custom XML...unless Custom XML is short for Custom XML applications. And from that it makes sense. But there is a problem : Custom XML is part of the ISO proposal of a <b>document</b> file format, whereas Custom XML applications implies the apparatus and logic related to <b>applications</b>, inherently tied to products, platforms and operating systems, not documents. Did Microsoft feel guilty, removed the word <b>applications</b> knowing it wouldn't stand a chance otherwise ? That's for anyone to guess. In this article, we are going to delve into the so-called "Custom XML", and how little useful it is in practice.<br /><br /><br /><br /><H3>Custom XML definition, as per Microsoft</H3><br />Straight from the horse behind it, <a href="http://blogs.msdn.com/brian_jones/archive/2005/11/04/integrating-with-business-data-store-custom-xml-in-the-office-xml-formats.aspx">Brian Jones</a> :<br /><br /><blockquote><i>Custom XML is the support for custom defined schemas. It's that support that allows you truly integrate your documents with business processes and business data. You can define your data using XML Schema syntax, and then you can use that data in your Office documents. By opening up our formats with our reference schemas, and supporting your custom defined schemas, you get true interoperability of your documents. Sorry if this is currently sounding more like a marketing pitch, but I wanted to make sure I reiterated our vision for XML support in Office documents and hopefully that will help you see the power that we see. (...) Up until now we've talked about all the parts that we in Office have defined to create our documents. You as a developer also have the ability to add your own parts though. You can take any XML file and put it inside the ZIP package.</i></blockquote><br /><br />Translation : by XML, we actually mean several completely different things, and we've put all of it in the same pot. We think storing XML-based data inside the ZIP package is an efficient way to share your confidential corporate data to the outside world and we are proud to make it easy to do so. Likewise, storing business data and the document together illustrates the grand Microsoft vision about independent layers.<br /><br /><br />Straight from the marketing people at Microsoft, here Doug Mahugh (<a href="http://www.ibiblio.org/bosak/v1mail/200705/2007May17-104943.eml">answering</a> Patrick Durusau during the INCITS V1 review of OOXML back in <a href="http://www.ibiblio.org/bosak/v1mail/200705/">April 2007</a>) :<br /><br /><blockquote><i>DIS 29500 (OOXML) serves other purposes that are not served by ISO/IEC 26300 (ODF), especially in the area of integration options for external schemas. (By "external" I mean schemas that are not part of the spec itself -- in common usage we tend to call these "custom schemas" as opposed to the "reference schemas" in DIS 29500).</i><br /></blockquote><br /><br />Translation : at Microsoft, we ship products to make it possible to do what you can already do without.<br /><br /><br />Straight from a top Microsoft Office <a href="http://www.microsoft.com/interop/letters/ChrisCapOpenLetter.mspx">brass</a> :<br /><br /><blockquote><i>Open XML allows for custom XML markup within the body of a document which is a handy way to allow users to tag their content for interoperability with other types of software such as a custom line of business system.</i></blockquote><br /><br />Translation : we dare you put foreign XML markup into an existing XML, with no agreed-upon semantics between the two languages, in order to improve the interoperability. What kind of interoperability we are talking about is left as an exercise to the reader.<br /><br /><br /><br /><H3>Enough marketing fluff. What it really is.</H3><br />First, let's get out of the way that "Custom XML" actually means "Custom" "XML" at all.<br /><br /><ul><li>Start Word 2007.<li>Create a new document.<li>Type "test".<li>Save and close the document.<li>Unzip it.<li>Grab the part word/document.xml, you should see the following :<br><br /><pre><br /><w:p><br /> <w:r><br /> <w:t>test</w:t><br /> </w:r><br /></w:p> <br /></pre><br /><li>Now add some custom XML markup :<br><br /><pre><br /><w:p><br /> <w:r><br /> <w:t <b>mytag="myvalue"</b>>test</w:t><br /> </w:r><br /></w:p> <br /></pre><br /><li>Put the edited part back into the ZIP file and open it in Word 2007. It opens perfectly well.<br /><li>Close it. Unzip the file again. Grab the part word/document.xml, you should see the following :<br><br /><pre><br /><w:p><br /> <w:r><br /> <w:t <b>mytag="myvalue"</b>>test</w:t><br /> </w:r><br /></w:p> <br /></pre><br /><li>Now remove the custom attribute and instead add some other custom XML markup, this time an element :<br><br /><pre><br /><w:p><br /> <w:r><br /> <b><mytag>myvalue</mytag></b><br /> <w:t>test</w:t><br /> </w:r><br /></w:p> <br /></pre><br /><li>Put the edited part back into the ZIP file and open it in Word 2007. This time it brings the following error message :<br><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeCustomXML1.gif"><br /><i>Corrupting the document by adding custom XML markup</i><br /><br /><li>Perhaps this is a namespace issue. Let's prefix our custom XML with <b>w</b> so that the fragment looks like the following :<br><br /><pre><br /><w:p><br /> <w:r><br /> <b><w:mytag>myvalue</w:mytag></b><br /> <w:t>test</w:t><br /> </w:r><br /></w:p> <br /></pre><br /><li>Put the edited part back into the ZIP file and open it in Word 2007, and indeed this time it opens well.<br /></ul><br /><br /><br />From those simple tests, we can infer the following :<br /><ul><br /><li>There is no such thing as strict XML schema validation in Word 2007.<br /><br /><li>You can add a custom XML attribute with no namespace (therefore the attribute uses the element's namespace, in our case <b>w</b>), and Word 2007 will not complain even though the <b>mytag</b> attribute is not part of the OOXML reference schemas. See ECMA 376, part 4, page 199, section 2.3.2.23, the w:r element is defined as follows :<br><br /><pre><br /><complexType name="CT_R"><br /> <sequence><br /> <group ref="EG_RPr" minOccurs="0"/><br /> <group ref="EG_RunInnerContent" minOccurs="0" maxOccurs="unbounded"/><br /> </sequence><br /> <attribute name="rsidRPr" type="ST_LongHexNumber"/><br /> <attribute name="rsidDel" type="ST_LongHexNumber"/><br /> <attribute name="rsidR" type="ST_LongHexNumber"/><br /></complexType><br /></pre><br />and therefore does not allow the <b>mytag</b> attribute to be present.<br /><br /><li>You cannot add a custom XML element unless it is prefixed by one of the namespaces declared in the header of the XML document, in our case <b>w</b> (xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"). Rather interesting again since <b>w:mytag</b> does not belong to the OOXML reference schemas, and yet Word 2007 does not complain about it.<br /></ul><br /><br />Needless to say, this is a broken implementation of XML that neither satisfies the "strict XML" camp, nor the "loose XML" camp.<br /><br />It can be summarized by the following table :<br /><table border=1><br /><tr><td colspan=3>Word 2007</tr><br /><tr><td><B>Test</B></td><td><B>Result</B></td><td><B>Strict validation</B></td></tr><br /><tr><td>mytag="myvalue"</td><td>OK</td><td>NO</td></tr><br /><tr><td><mytag>myvalue</mytag></td><td><font color=red>FAIL</font></td><td>N/A</td></tr><br /><tr><td><w:mytag>myvalue</w:mytag></td><td>OK</td><td>NO</td></tr><br /></table><br /><br />And for the other two applications, Excel 2007 and Powerpoint 2007, the results are as follows :<br /><br /><table border=1><br /><tr><td><B>Test</B></td><td><B>Word 2007 result</B></td><td><B>Excel 2007 result</B></td><td><B>Powerpoint 2007 result</B></td></tr><br /><tr><td>mytag="myvalue"</td><td>OK</td><td><font color=red>FAIL</font></td><td><font color=red>FAIL</font></td></tr><br /><tr><td><mytag>myvalue</mytag></td><td><font color=red>FAIL</font></td><td><font color=red>FAIL</font></td><td><font color=red>FAIL</font></td></tr><br /><tr><td><w:mytag>myvalue</w:mytag></td><td>OK</td><td><font color=red>FAIL</font></td><td><font color=red>FAIL</font></td></tr><br /></table><br /><br />Since Word 2007, Excel 2007 and Powerpoint 2007 do not handle Custom XML the same way, as the table above shows, it's very hard to trust Microsoft when they claim that Custom XML is a feature of OOXML.<br /><br />Either application will see the document as a corrupt one is guaranteed to be random, Microsoft cannot possibly imply that the Custom XML they are talking about is what we take for granted when we say "Custom XML", i.e. the ability to add our own XML within the document.<br /><br /><br /><br /><H3>Custom XML, as per ECMA 376</H3><br />Since "Custom XML" does not mean "Custom" "XML", we have to rely on ECMA 376's definition of such thing. The first surprise is that there is a notion of "Custom XML markup" and a notion of "Custom XML data".<br /><br />The second surprise is that the notion of "Custom XML markup" only appears in the documentation for Word documents. If we assume that this "Custom XML markup" will be used to bind the "Custom XML data" to the document at run-time, we can infer from ECMA 376 that this is only made possible for Word documents.<br /><br />In other words, whatever "Custom XML" is, it is only fully implemented for Word, so it should be called "Custom XML in Word". When Microsoft marketing people are trying to sell us "Custom XML" as a feature of OOXML, it is a lie. Let's create a table of what we have just learned.<br /><br /><table border=1><br /><tr><td> </td><td><B>Word 2007</B></td><td><B>Excel 2007</B></td><td><B>Powerpoint 2007</B></td></tr><br /><tr><td>Custom XML data</td><td>YES</td><td>YES</td><td>YES</td></tr><br /><tr><td>Custom XML markup</td><td>YES</td><td><font color=red>NO</font></td><td><font color=red>NO</font></td></tr><br /></table><br /><br />"Custom XML data", ECMA 376 part 4 section 8, supported in both documents, is the ability to store an independent XML stream in the ZIP package. In fact, it is not a feature of OOXML at all, it is a feature of any ZIP archive. After all, a ZIP entry in a ZIP package can be anything, including an XML stream. We can infer from that, that "Custom XML data" is in fact nothing custom : the ability to store an independent XML stream is not something we should thank Microsoft for allowing us to do so.<br /><br />"Custom XML markup", ECMA 376 part 4 section 2.5, supported only in Word documents, is the ability to bind the "Custom XML data" to the document's content, at run-time. Interestingly enough, the Word team at Microsoft haven't quite managed to merge this concept and the old "smart tag" concept. That's why in the ECMA 376 specification, we end up with several flavors of "Custom XML markup", one of which is smart tags, the ability to add metadata to content (eg. stock quote).<br /><br />An interesting element is "run-time". If you write the Custom XML markup that make it possible to do the said data binding, it has to be reminded that the data binding is done by an instance of Word, not a third-party application. So data binding is just a lock-in. That's the difference between standardizing on a document versus standardizing on an application!<br /><br />There is no mechanism for doing such thing in Excel spreadsheets and Powerpoint presentations. In Excel spreadsheets, the XML data binding, a feature available from the user interface, is a special case of data source querying where the data source is an XML stream. The XML stream is external to the ZIP package. In other words, the "Custom XML data" in spreadsheets is useless. In Powerpoint presentations, it's even more trivial since there is no such thing as an XML data binding mechanism from the user interface.<br /><br /><br /><br /><H3>The merit of "Custom XML data"</H3><br />Something interesting to note is that Microsoft thinks that storing data inside the ZIP package independently of the document is a good thing. From a pure technical point of view, you can view this "Custom XML data" as a cache of values thanks to which the consumer is able to drill into the data without a connection to the actual data source (corporate data). But there is a major flaw. Anybody using this feature will end up storing arbitrary data in ZIP packages shared across colleagues and others inside and outside the organization. Eventually, confidential information from the corporate databases will end up there, and a PR disaster automatically follows. You don't want to use this feature.<br /><br /><br /><br /><H3>Conclusion</H3><br />"Custom XML" does not mean much, despite Microsoft ample evangelism of said feature. Technically speaking it has no merit within the enterprise space because you end up sharing corporate data. An interesting fact is that "Custom XML" is actually only implemented in Word 2007. For instance, the ECMA 376 specification provides a data binding for Word 2007 documents, exclusively. Ironically enough, the ability to store an independent XML stream as part of a ZIP package, is just a feature of the ZIP library, not Microsoft's innovation.OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-15258704343158799732008-03-19T23:45:00.000-07:002008-03-20T09:13:27.005-07:00Backwards compatible? One more lie by omission<i>Stéphane Rodriguez, March 2008</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">Bad surprise in Microsoft Office binary documents : interoperability remains impossible</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br />Ever since Microsoft Office XML file formats were introduced, the excuse for using such poorly engineered markup was that it was designed to be backwards compatible with the so-called billions of existing documents out there.<br /><br />Is it really the truth? Or is Microsoft using it to fuel more fire and motion?<br /><br /><br />Lying by omission, as per Wikipedia :<br /><blockquote><i>Lying by omission is when an important fact is omitted, deliberately leaving another person with a misconception. This includes failures to correct pre-existing misconceptions. One may by careful speaking contrive to give correct but only partial answers to questions, thus never actually lying.</i></blockquote><br /><br /><br /><H3>Backwards and forwards compatibility</H3><br />What's backwards compatibility first of all? For everyone in the world, except for Microsoft, file backwards compatibility means that product in version N can work seamlessly with files produced by the product in version N+1. For Microsoft however, it's the opposite, backwards compatibility means that product version N+1 can work seamlessly with files produced by the product in version N. By "work seamlessly" is meant no visible degradation, no noticeable loss in functionality.<br /><br />To claim that a file format is backwards compatible is therefore quite an ambitious goal, especially if the file format in question is complex. It's so ambitious that it's almost an impossible task in practice. In fact, it behaves much like a theorem in mathematics. To prove the theorem is wrong, you only need one counter-example. To prove backwards compatibility is not met, you just need to come up with one counter-example.<br /><br />So let's take a simple chart and see how backward compatibile it is.<br /><br /><ul><li>Start Excel 2003 (or any Excel version older than Excel 2007)<li>Type jan, hit <TAB>, 3, <TAB> 4, <TAB>, 5<li>Select those cells, click the chart creation button in the main toolbar and choose a column type bar</ul><br /><br />The chart shows up like this :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeBackwardsCompatible1.gif" border=1><br /><i>A simple chart made with an older Excel version</i><br /><br />Now resize the chart, and you can see that everything gets scaled accordingly : text in all chart elements, plot area, ...<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeBackwardsCompatible2.gif" border=1><br /><i>Resizing a simple chart made with an older Excel version</i><br /><br /><ul><li>Now start Excel 2007<li>Open this file<li>The chart shows up</ul><br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeBackwardsCompatible3.gif" border=1><br /><i>The chart file (.xls) opened in Excel 2007</i><br /><br />Now resize the chart, and you can see that fonts don't scale :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeBackwardsCompatible4.gif" border=1><br /><i>Resizing the chart file (.xls) in Excel 2007 does not scale the fonts</i><br /><br />If you would like to resize the fonts, you have to select each chart element individually, and then apply the font size with user interface elements. Quite counter-productive. In addition, since that's manually done, proportions are not kept between chart elements and you cannot easily reproduce the same effect than with older versions of Excel.<br /><br />Now if you save this file as a .xlsx file using Excel 2007, and then convert it back to a .xls file, then open it with an older version of Excel (effectively executing a round-trip scenario), the chart shows up but it does not scale when it is resized :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeBackwardsCompatible5.gif" border=1><br /><i>After the conversion back to (.xls), the fonts do not scale if we resize the chart</i><br /><br /><br /><br /><br />So much for backwards compatibility...<br /><br />In case you wonder what's going on, what happens is that Excel 2007 does not support the auto scale option, something very pervasive in chart formatting options. Here is the guilty option :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeBackwardsCompatible6.gif" border=1><br><br /><i>Root of the problem : Excel 2007 does not support "auto scale" anymore (screenshot from Excel 2003)</i><br /><br /><b>It is not only a feature that is gone from the user interface, it is an option that is lost in the document itself. In other words, sharing this file with any person inside or outside your company will result in discrepancies.</b><br /><br />For your information, the internal Excel BIFF record that Excel 2007 does not support is the FBI record (part of the chart records, and therefore left unspecified in Microsoft's binary <a href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">BIFF documentation</a>, as explained in my previous <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/03/bad-surprise-in-microsoft-office-binary.html">article</a>) :<br /><br /><blockquote>(excerpt from MSDN Library, Feb 1998)<br /><br /><B>FBI: Font Basis (1060h)</B><br />The FBI record stores font metrics.<br /><br /><I>Offset</I> | <I>Name</I> | <I>Size</I> | <I>Contents</I><br /> <br />4 | dmixBasis | 2 | Width of basis when font was applied<br /> <br />6 | dmiyBasis | 2 | Height of basis when font was applied<br /> <br />8 | twpHeightBasis | 2 | Font height applied<br /> <br />10 | scab | 2 | Scale basis<br /> <br />12 | ifnt | 2 | Index number into the font table</blockquote><br /><br /><br />If you are interested in taking a look at the corresponding Excel files :<br /><ul><li><a href="http://www.arstdesign.com/BBS/picsupload/Chart1_stretchtest.xls">Original .xls file</a><li><a href="http://www.arstdesign.com/BBS/picsupload/Chart1_stretchtest.xlsx">.xlsx file (upgraded from the .xls file)</a><li><a href="http://www.arstdesign.com/BBS/picsupload/Chart1_stretchtest_to972003.xls">.xls file (converted from the .xlsx file)</a></ul><br /><br /><br />Note that this is just one simple example. There are many examples of all kinds. For instance, in a previous <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">article</a>, I showed how encrypting a document did encrypt the metadata as well, contrary to older Office application versions, and doing so breaks all applications expecting to access the file's metadata whether the file is encrypted or not.<br /><br />As for forwards compatibility, the same <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">article</a> already showed (BIFF11+ section) that any Excel 2007 feature would be lost or partially lost when converted back to older Excel versions and part of a round-trip scenario. So forwards compatibility is not true either generally speaking.<br /><br /><br /><br /><H3>Behind the scene, what's happening?</H3><br />Behind the scene, what's happening is that Microsoft replaced the chart subsystem in Excel 2007 with a new library but <b>failed to test it enough</b>, presumably because they wanted to rush Office 2007 out the door. Since charts can also be linked to Word and Powerpoint documents, this actually affects all 3 applications' backwards compatibility.<br /><br />The real mistake they made is to kill the old chart subsystem source code. What they should have done to meet their backwards compatibility claim was to use the old chart subsystem for the compatibility mode (old Excel files), and the new chart subsystem for charts in new files. There is no other way they can achieve the claims.<br /><br /><br /><br /><H3>If backwards compatibility is not proven technically speaking, what is Microsoft trying to achieve?</H3><br />In the previous section, it has been demonstrated with a simple counter-example that backwards compatibility leaves much to desire. So if this <i>capability</i> is not technically sound, how come Microsoft keeps hammering it every single day?<br /><br />A Microsoft Office spokeperson made the <a href="http://www.microsoft.com/interop/letters/ChrisCapOpenLetter.mspx">following statement</a> on March 16, 2008 :<br /><br /><blockquote><i>An Open Letter from Chris Capossela, Senior Vice President, Microsoft Office<br /><br />In today's digital world, there is a critical need to access, share and archive information. People want to share information with co-workers, business partners, family members and others regardless of the technology platform or software application being used. People want to be able to store and archive documents so that the information they contain can be accessed well into the future. Long-term access to documents is particularly important in the case of governments given the critical and historical nature of government information and the associated requirements for preservation and access moving forward.<br /><br />Why choose the Ecma Office Open XML File Formats?<br />Office Open XML fulfills an important customer need - the file format was designed to be backward compatible with the content and functionality in billions of existing documents. (...)</i></blockquote><br /><br />The same thing has been hammered a gazillion times by Microsoft since the "new" file formats were announced back in 2005. If you are a software developer, then it does not sound like anything special. In fact, you may even think there is no substance at all in it. But <b>if you wear a CIO cap, it's a different thing</b>.<br /><br />Remember, IT people are responsible for running the systems. For them backwards compatibility is not heard in the technical sense, it is synonym of "won't break systems, no down time". See where it goes?<br /><br />A little more help. Microsoft, by talking over and over about backwards compatibility is actually telling decision makers that they are pledging not to break systems when everyone in the organization uses the applications and those new file formats (remember that for anyone, .docx .xlsx and .pptx files are alien files unless they install Office 2007).<br /><br />In fact, it goes a little deeper. And that's when it gets really interesting. Microsoft is not only saying their stuff is backwards compatible, they are actually saying that they are <b>the only backwards compatible when it comes to Office documents</b>.<br /><br />It is another way to say that the only way to ensure no down time is to purchase licenses of Microsoft Office software. As simple as that.<br /><br />To better understand the picture, consider the following. During the 2007-20xx period, Microsoft is taking a major risk by introducing new file formats that cannot be natively used by older versions of Microsoft Office. It's something they did back in 1995 and it did not go very well. If you remember, this caused havok in IT organizations because Office 95 and Office 97 file formats were substantially different and Office 95 could not open Office 97 documents. Microsoft did break file formats unprepared.<br /><br />So it is without surprise that Microsoft is trying to make sure that CIOs get the message right this time. That is the reason why Microsoft has been investing so much in all kinds of additional components :<br /><ul><li>Office 2007 compatibility mode<li>Microsoft Office compatibility pack<li>Office migration manager<li>Office MOICE</ul><br /><br />By saying what they say, Microsoft is trying to keep the competition aside, despite the very vocal "open standards" claims. The "open standards" claims is actually a diversion. "open" is not in Microsoft Office DNA and culture. For instance, nobody in that team ever said publicly that the only reason we are dealing with VML in the file formats is because they rushed Office 2007 out the door (presumably to ship simultaneously with Windows Vista), and as a result they did not fully transition to DrawingML. All this debate about VML out there is entirely the result of Microsoft's own incompetence in shipping a finished product. Simply shameful because third parties now have to support both VML and DrawingML. There would not be such surprise with a company that does things "openly".<br /><br />In the period of time (that spans through a number of years) where Microsoft takes the risk of introducing new file formats, CIO could decide to just as well switch to an alternative Office suite for a fraction of the budget. Microsoft knows it very well, hence the media blitz. Hence the characterization of the alternative Office file formats such as ODF as simplistic. Simplistic is supposed to mean : not backwards compatible. That is what CIOs understand, and that is FUD.<br /><br />In this period of time, Microsoft wants to minimize the ratio of organizations switching to an alternative Office suite. To Microsoft Office, OpenOffice is Netscape. This is not a problem anymore when the new "OOXML" file formats reach critical mass. We are not there yet, but that's the plan : dry the competition as much as possible, which includes fake activities such as starting so-called open source projects. Something that is not exactly music to CIOs' hears, why should CIOs care? It's the opposite. When a CIO hears "open source project", he understands something other than <i>backwards compatible</i>, therefore unreliable stuff. The decision is a no brainer then.<br /><br />So what we are talking about here is a well orchestred media communication targeted to CIOs, not software developers. Until you get that, you don't understand.<br /><br /><br /><br /><H3>The consequences of this attitude</H3><br />Now that we know that Microsoft is not interested in building a modern Office document model (after all, there is already an international ISO/IEC standard for that : ODF), it gets easier to understand why OOXML is such a mess.<br /><br />It was recently brought up from people with vested interests in Office file formats that OOXML is a big mess :<br /><br /><ul><li><a href="http://www.robweir.com/blog/2008/03/disharmony-of-ooxml.html">Rob Weir</a> (IBM) : "The Disharmony of OOXML"<br /><table border=1><br /><tr><td><B>Format</B></td><td><B>Text Color</B></td><td><B>Text Alignment</B></td></tr><br /><tr><td>OOXML Text</td><td> <w:color w:val="FF0000"/></td><td> <w:jc w:val="right"/></td></tr><br /><tr><td>OOXML Sheet</td><td> <color rgb="FFFF0000"/></td><td> <alignment horizontal="right"/></td></tr><br /><tr><td>OOXML Presentation</td><td> <a:srgbClr val="FF0000"/></td><td> <a:pPr algn="r"/></td></tr><br /><tr><td>ODF Text</td><td> <style:text-properties fo:color="#FF0000"/></td><td> <style:paragraph-properties fo:text-align="end" /></td></tr><br /><tr><td>ODF Sheet</td><td> <style:text-properties fo:color="#FF0000"/></td><td> <style:paragraph-properties fo:text-align="end"/></td></tr><br /><tr><td>ODF Presentation</td><td> <style:text-properties fo:color="#FF0000"/></td><td> <style:paragraph-properties fo:text-align="end"/></td></tr><br /></table><br /><br /><li><a href="http://blogs.sun.com/GullFOSS/entry/ooxml_import_in_writer_a">Henning Brinkmann</a> (OpenOffice contributor) : "OOXML Import In Writer: A Shape Is a Shape, Is a Shape?"<br /><br /><i>When we started importing shapes from OOXML the Impress team already was able to import some shapes from OOXML files produced by PowerPoint 2007. This is DrawingML as described in chapter 5 of the Markup Language Reference for OOXML. But, if you use Word to insert a rectangle into a Word document (DOCX), you end up with VML as described in chapter 6 of the Markup Language Reference for OOXML. The Markup Language Reference tags VML as a deprecated format in OOXML, which is only included to the standard for backward compatibility reasons. Despite, Word 2007 uses VML to store shapes. (...) OOXML seems to be designed with the application model in mind. There may be different syntaxes for the same semantics, if it fits the already present application model better. But, if you want to create an alternative implementation for the format, this <b>introduces additional effort</b>.</i><br /></ul><br /><br />Fire and motion, anyone? I also gave a few examples of that as well in my introductory <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a> article where I showed that Excel SpreadsheetML uses no less than 6 ways to do basic text formatting, for no reason. If you do anything meaningful with those files, you need to implement both, which clearly means you'll be spending years doing so instead of concentrating on your own business.<br /><br />For anyone except Microsoft, it's fire and motion and it guarantees there won't be a substantial implementation of file formats before a number of years. That is exactly the same thing than the strategy behind with older Microsoft Office formats.<br /><br />Note that <b>Microsoft claims a number of third-parties applications support OOXML. But they never go into details such as how much support there is. For a reason that is trivial.</b><br /><br />To the CIO's hear, the fact that <i>backwards compatibility</i> is never associated in any shape or form to those <i>third-party applications</i> makes it very clear that they are not worth considering seriously. So there.<br /><br />Why is it a problem?<br /><br />Let's take Gnumeric. Gnumeric chose to implement some of OOXML by having import and export filters. This is a very important architecture decision. It means that the in-memory representation is not high-fidelity representation of the imported file until Gnumeric directly supports all tiny details in OOXML. Something that won't happen before a number of years. And it is therefore without much surprise that exporting the in-memory representation to say an .xlsx file will lose all features that Gnumeric does not directly support.<br /><br />To see the effect of that, simply try to open and save the following .xlsx file in Gnumeric : <a href="http://www.arstdesign.com/BBS/picsupload/Book2_s.xlsx">Book_2.xlsx</a><br /><br />Gnumeric's internal architecture is simply wrong : it should instead preserve the imported file. That is after all what XML fragments are for. It should be able to make in-place replacements of XML fragments and leave the rest of the file untouched. That's not what Gnumeric does. <b>Gnumeric is an example of an application that is not XML native</b>.<br /><br />Why is Microsoft making so much good press for Gnumeric and it's so-called rich OOXML support is left for one to guess...<br /><br />...especially when Office 2007 is not much XML native either. In fact, that is actually why we end up with so many versions of the truth in Office 2007 files. Why we have many different and incompatible ways to align a piece of text, or describe a vector shape, or text formatting is due to the following :<br /><br /><ul><li>It is a direct reflection of the fact that Microsoft Office engineering teams don't share any of their work across teams. The trivial example of that is that the only shared spec, Office drawing library (mso.dll), ends up being a mix of VML and DrawingML despite the fact that VML is only a markup version of MSO, whereas DrawingML is supposed to be MSO version next (internally known as E2O).<br /><br /><li>What Microsoft calls XML streams are in fact XML fragments where each individual fragment are angle-brackets around a stream of content that reflects the internal binary representation (and in fact, the binary records of older files) to be converted back and forth in full-fidelity. <br /><br />Let's say in one of those fragments, an attribute relates to a horizontal text alignment option, one value out of four, we end up with an attribute of the form <w algn="r"> where algn is one of r, l, c, j. In another fragment, a horizontal text alignment option is one out of eight, in which case we may have <a:alignment>left</a:alignment>, where the value is left or general, right, center, justify, distributed, fill or center across selection.<br /><br />Here is the problem. The fact that the text alignment has several distinct definitions in the same document is a bug. Engineering, peer-review and interoperability practices inherently put enough pressure to resist from having such thing done in the first place, that's why both alignment options should be the same. There is no reason to have more than one in the document. To standardize bugs is simply not excusable. It has to be stopped.<br /><br />The reason why we end up like this with OOXML is because it just reflects the bugs in the binary records, and that it works as if the Office team had developers writing compatibility code independently from each other, that is without ever giving a chance of code reuse. <br /><br />So, due to the fragments being just another representation of the binary records, <b>Office 2007 use of XML terminology is absolutely misleading. Press pass filled with the XML acronym all over the place are very appealing to CIOs, however</b>. It is more fair to say it's angle brackets around complex stuff than actual XML. It that were truly native XML, that would be factored in to maximize the reuse of it across libraries, components and applications. Just like ODF does.</ul><br /><br />So there you have it. It is without much surprise that Office 2007 does not ship with any XML tooling support.OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-51609660757068270802008-03-13T02:13:00.000-07:002008-03-13T05:05:25.922-07:00Bad surprise in Microsoft Office binary documents : interoperability remains impossible<i>Stéphane Rodriguez, March 2008</i><br /><br />Previous articles :<br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/typical-bs-in-technical-articles-about.html">Typical B.S. in technical articles about OOXML</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br />- <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br />When Microsoft made a number of Microsoft Office binary documents available for <a href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">direct download</a> last month, I thought this was probably a good thing for those (including me) struggling with undocumented pieces with which only Microsoft can reliably migrate legacy documents back and forth. But this sentiment has long passed, what a disappointment.<br /><br /><H4>Direct download, a sharp change from the obligation to...send an email</H4><br /><br />In fact, this gift is a small change to what was being made available so far. If you read Microsoft knowledge base <a href="http://support.microsoft.com/kb/840817">article 840817</a>, in the binary document specification sections, it says :<br /><br /><blockquote>Microsoft Office Binary File Formats<br /><br />Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.<br /><br />If you want to receive the documentation, contact Microsoft at the <br />following e-mail address to initiate the agreement sign-up process:<br />officeff@microsoft.com (mailto:officeff@microsoft.com)</blockquote><br /><br />It is therefore a little disingenuous from Microsoft to make it as if those documents were not available.<br /><br /><H4>Forensic purposes</H4><br />The above specifically mentions <i>forensic purposes</i>, which is not the same thing than selling a product competing with Microsoft Office. It is unclear whether the new covenant not-to-sue terms are more open than those limiting <i>forensic purposes</i> claims, but clearly it's a problem. This specifically prevents competition.<br /><br /><br /><H4>Where is the VBA specification?</H4><br />Interestingly enough, you can compare what was available at the expense of an email, and what is now being made available under the <i>open specification promise</i>, where the covenant not-to-sue applies. In particular, even though VBA is listed in the Microsoft knowledge base article, <br /><br /><blockquote>Microsoft Office Binary File Formats<br /><br />Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not-to-sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and <b>Visual Basic for Applications (VBA) File Format for 2007 are also available</b> under this program.<br /><br />If you want to receive the documentation, contact Microsoft at the <br />following e-mail address to initiate the agreement sign-up process:<br />officeff@microsoft.com (mailto:officeff@microsoft.com)</blockquote><br /><br />the corresponding specification is not being made available under the <a href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">open specification promise</a> :<br /><br /><blockquote>Microsoft Word<br /><br />Word 97-2007 Binary File Format (.doc) Specification <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf">PDF</a> | <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.xps">XPS</a><br /><br />Microsoft PowerPoint<br /><br />PowerPoint 97-2007 Binary File Format (.ppt) Specification <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/PowerPoint97-2007BinaryFileFormat(ppt)Specification.pdf">PDF</a> | <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/PowerPoint97-2007BinaryFileFormat(ppt)Specification.xps">XPS</a><br /><br />Microsoft Excel<br /><br />Excel 97-2007 Binary File Format (.xls) Specification <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.pdf">PDF</a> | <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.xps">XPS</a><br /><br />Excel 2007 Binary File Format (.xlsb) Specification <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel2007BinaryFileFormat(xlsb)Specification.pdf">PDF</a> | <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel2007BinaryFileFormat(xlsb)Specification.xps">XPS</a><br /><br />Office Drawing<br /><br />Office Drawing 97-2007 Binary Format Specification <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/OfficeDrawing97-2007BinaryFormatSpecification.pdf">PDF</a> | <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/OfficeDrawing97-2007BinaryFormatSpecification.xps">XPS</a><br /><br /><a href="">Supporting Technologies</a> (Windows structured storage, WMF, Ink)<br /></blockquote><br /><br /><br /><H4>Missing Excel binary BIFF specification</H4><br /><br />The Excel binary BIFF specification was made available until 1998, that was when Bill Gates decided that was no longer a good thing to make the file format specification available to third parties. Because the latest public document was made available in 1998, this document included the specification of BIFF8 records, i.e. used in Excel 97. No other specification update had ever been made available publicly, most notably none of the new records added in Excel 2000 (BIFF9), Excel XP (BIFF10), Excel 2003 (BIFF11), Excel 97-2003-2007 compatibility mode (BIFF11+) and Excel 2007 binary spreadsheets (BIFF12, with BIFF12 a new binary file format incompatible with previous BIFF versions) were documented, meaning that it was only possible at most to implement support for Excel file formats up to Excel 97. Not very friendly. But it gets worse.<br /><br />If you take a look at the BIFF binary document being made available under Microsoft's open specification promise, you'll notice it includes a patchwork of all BIFF versions. Why Microsoft chose to put everything together instead of making sections for each BIFF version is subject to discussion. One can argue that those interested in the very latest don't care when a particular record was introduced or updated. But to anyone willing to read and write BIFF records consistently with a target Excel version, it is impossible to know which record belongs to which version. It gets far worse. If you take a look at what was made available until 1998 in MSDN Library, you'll notice two sections for BIFF :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability1.gif"><br><br /><i>A screenshot of the old binary specifications (MSDN Library, Feb 1998)</i><br /><br />There are two big sections :<br /><ul><li>BIFF main records : cells, pivot tables, formatting styles<li>BIFF chart records</ul><br /><br />Here is the list of chart records (excerpt from MSDN Library, Feb 1998) :<br /><pre><br />Chart BIFF Records: Alphabetical Order<br /><br />Number<br /> Record<br /> <br />103A<br /> 3D: Chart Group Is a 3-D Chart Group<br /> <br />1051<br /> AI: Linked Data<br /> <br />1050<br /> ALRUNS: Text Formatting<br /> <br />101A<br /> AREA: Chart Group Is an Area Chart Group<br /> <br />100A<br /> AREAFORMAT: Colors and Patterns for an Area<br /> <br />100C<br /> ATTACHEDLABEL: Series Data/Value Labels<br /> <br />1062<br /> AXCEXT: Axis Options<br /> <br />1046<br /> AXESUSED: Number of Axes Sets<br /> <br />101D<br /> AXIS: Axis Type<br /> <br />1021<br /> AXISLINEFORMAT: Defines a Line That Spans an Axis<br /> <br />1041<br /> AXISPARENT: Axis Size and Location<br /> <br />1017<br /> BAR: Chart Group is a Bar or Column Chart Group<br /> <br />1033<br /> BEGIN: Defines the Beginning of an Object<br /> <br />1061<br /> BOPPOP: Bar of Pie/Pie of Pie Chart Options<br /> <br />1067<br /> BOPPOPCUSTOM: Custom Bar of Pie/Pie of Pie Chart Options<br /> <br />1020<br /> CATSERRANGE: Defines a Category or Series Axis<br /> <br />1002<br /> CHART: Location and Overall Chart Dimensions<br /> <br />1014<br /> CHARTFORMAT: Parent Record for Chart Group<br /> <br />1022<br /> CHARTFORMATLINK: Not Used<br /> <br />101C<br /> CHARTLINE: Drop/Hi-Lo/Series Lines on a Line Chart<br /> <br />1063<br /> DAT: Data Table Options<br /> <br />1006<br /> DATAFORMAT: Series and Data Point Numbers<br /> <br />1024<br /> DEFAULTTEXT: Default Data Label Text Properties<br /> <br />103D<br /> DROPBAR: Defines Drop Bars<br /> <br />1034<br /> END: Defines the End of an Object<br /> <br />1060<br /> FBI: Font Basis<br /> <br />1026<br /> FONTX: Font Index<br /> <br />1032<br /> FRAME: Defines Border Shape Around Displayed Text<br /> <br />1066<br /> GELFRAME: Fill Data<br /> <br />104E<br /> IFMT: Number-Format Index<br /> <br />1015<br /> LEGEND: Legend Type and Position<br /> <br />1043<br /> LEGENDXN: Legend Exception<br /> <br />1018<br /> LINE: Chart Group Is a Line Chart Group<br /> <br />1007<br /> LINEFORMAT: Style of a Line or Border<br /> <br />1009<br /> MARKERFORMAT: Style of a Line Marker<br /> <br />1027<br /> OBJECTLINK: Attaches Text to Chart or to Chart Item<br /> <br />103C<br /> PICF: Picture Format<br /> <br />1019<br /> PIE: Chart Group Is a Pie Chart Group<br /> <br />100B<br /> PIEFORMAT: Position of the Pie Slice<br /> <br />1035<br /> PLOTAREA: Frame Belongs to Plot Area <br /> <br />1064<br /> PLOTGROWTH: Font Scale Factors<br /> <br />104F<br /> POS: Position Information<br /> <br />103E<br /> RADAR: Chart Group Is a Radar Chart Group<br /> <br />1040<br /> RADARAREA: Chart Group Is a Radar Area Chart Group<br /> <br />1048<br /> SBASEREF: PivotTable Reference<br /> <br />101B<br /> SCATTER: Chart Group Is a Scatter Chart Group<br /> <br />105B<br /> SERAUXERRBAR: Series ErrorBar<br /> <br />104B<br /> SERAUXTREND: Series Trendline<br /> <br />105D<br /> SERFMT: Series Format<br /> <br />1003<br /> SERIES: Series Definition<br /> <br />1016<br /> SERIESLIST: Specifies the Series in an Overlay Chart<br /> <br />100D<br /> SERIESTEXT: Legend/Category/Value Text<br /> <br />104A<br /> SERPARENT: Trendline or ErrorBar Series Index<br /> <br />1045<br /> SERTOCRT: Series Chart-Group Index<br /> <br />1044<br /> SHTPROPS: Sheet Properties<br /> <br />1065<br /> SIINDEX: Series Index<br /> <br />103F<br /> SURFACE: Chart Group Is a Surface Chart Group<br /> <br />1025<br /> TEXT: Defines Display of Text Fields<br /> <br />101E<br /> TICK: Tick Marks and Labels Format<br /> <br />1001<br /> UNITS: Chart Units<br /> <br />101F<br /> VALUERANGE: Defines Value Axis Scale<br /><br /></pre><br /><br />If you take a look at what Microsoft makes available under the covenant not-to-sue, the <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.pdf">BIFF documentation</a> and all others, you'll soon realize that chart records are not included. In other words, a third-party can only, at best, implement support for Excel BIFF spreadsheets with no charts. Not exactly good news!<br /><br />It actually gets worse.<br /><br /><br /><H4>A number of records remain absent</H4><br /><br /><H5>Obvious misses</H5><br /><br />Records such as those identified by 0x00EF and 0x01BA remain absent from the documentation. Those are records automatically created when a new Excel spreadsheet gets created with any Excel 97 (or more recent) version. These are obvious records, but a number of others are absent.<br /><br /><br /><H5>Trickier misses</H5><br /><br />Function byte codes are not documented. For instance, if you use a function such as <code>CEILING(number, significance)</code>, you have no way to know with the documentation what bytecode this function gets encoded to since the table of function bytecodes isn't provided. There are over 300 built-in functions.<br /><br />Even trickier is the fact that, in order to read/store or migrate a function through an extension mechanism (NAME or EXTERNALNAME records), or to build an Excel-like application, the actual function name string is used, but :<br /><ul><li>the list of function names is not in the documentation (note : it is given in another document, ECMA 376)<li>the list of <b>actually localized</b> function names isn't provided anywhere. (note : ECMA 376 lists the functions in English-US only)</ul><br /><br />It is problematic for a third-party not to have those localized function names for instance, since there is no way an application will be able to support such functions like Excel does without that. In fact, in order to know the names, you'll have to purchase <a href="http://buy1.trymicrosoftoffice.com/buyus/product.aspx?family=langpack&culture=en-US">Microsoft Office language packs</a> :<br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability2.jpg"><br><br /><i>Purchasing Office language packs, the only way to know the localized function names in Excel</i><br /><br />A language pack is $24.95 each, and there are 36 of them! ($900 in total). Anytime you install a language pack, you can open a spreadsheet where the function of interest will appear under its localized name.<br /><br />It's interesting that Microsoft claims to be committed to an open standard like that! If you intend to support Excel functions in any meaningful way, don't forget to include a $900 budget.<br /><br /><br /><br /><H4>Many non-trivial records remain partially documented</H4><br /><br />Some records are missing, but even documented records remain partially documented records. Here are two simple examples :<br /><br /><ul><li><a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.pdf">BIFF documentation</a>, OBJ record, page 153,<br /><br /><blockquote><b>ftCblsData (12h) check box data</b><br><br /><i>Offset</i> | <I>Name</I> | <I>Size</I> | <I>Contents</I><br><br />0 | ft | 2 | =ftCblsData (12h)<br><br />2 | cb | 2 | Length of ftCblsData<br><br />4 | (Reserved) | var | Reserved</blockquote><br /><br />The check box data remains undocumented. And so does most of internal objects stored inside the OBJ records. The OBJ records holds data for Form controls (listbox, edit, ...).<br /><br /><li><a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/OfficeDrawing97-2007BinaryFormatSpecification.pdf">Office drawing documentation</a>, page 7,<br /><br /><blockquote>Associated with each shape is a piece of client data that keeps the shape‘s anchor, text and OLE data, as well as host specific properties. The format of this structure is host-defined.</blockquote><br /><br />And indeed, the msofbtClientData (F011) record isn't documented. Page 12, we learn :<br /><br /><blockquote><b>msofbtClientData (F011)</b><br />host-defined<br />host-specific data</blockquote><br /></ul><br />Note that since this record has no section linked to it in the documentation, unlike a number of other records, it should be obvious for anyone from Microsoft to verify that this record isn't documented. This is unfortunate since this is an important record.<br /><br />The two examples are a reflection of problems throughout all documents. What is meant by that is that, from Microsoft point of view, those documents may be "reference" documents in the sense that if you are taking a look at the corresponding source code, those documents are handy. But only they can do that. For everybody else, those documents are <b>hints</b>, not <b>actual references</b>. It is fire and motion, and it's been going on for the best part of two decades.<br /><br /><br /><H4>The migration scenario : here comes Microsoft's secret sauce</H4><br /><br />Let's take two examples showing how impossible a task it is to map binary records to XML elements. In this case, it will be calculation chains in spreadsheets.<br /><br /><H5>The return of the calculation chains</H5><br /><br />If we do a full text search about calculation chains over the documents Microsoft has made available, we find two occurences in the <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.pdf">BIFF documentation</a> :<br /><br /><ul><li>page 35 : ARRAY record, "chn, 4 bytes" ; Ignore the chn field when reading the BIFF file. If a BIFF file is written, the chn field must be 00000000h.<br /><li>page 123 : FORMULA record, "chn, 4 bytes" ; The chn field should be ignored when you read the BIFF file. If you write a BIFF file, the chn field must be 00000000h.</ul><br /><br />That's about it. There is no actual definition and encoding algorithm.<br /><br />In ECMA 376, part 4, a full text search about calculation chains returns the following :<br /><br /><ul><li>page 2084, section 3.6 calculation chain ; The cells in a workbook can be calculated in different orders depending on various optimizations and dependencies. The calculation chain specifies the order in which the cells in a workbook were last calculated. The calculation chain only deals with cells that require calculation - i.e., it only deals with cells that contain formulas. It does not track or express dependencies amongst the formulas, but rather only records the order in which the cells were last calculated. (...) Since the xml data described in this section is not strictly required, the spreadsheet application is free to ignore the order in which the calculation chain specifies calculations.<br /></ul><br /><br />And that "definition" is followed by an example.<br /><br />So here are the problems :<br /><br /><ul><li>How do you map the "chn" fields to a calculation chain ? In other words, how do you decode these?<br /><li>Since "chn" fields are attached to formula and array records (in binary files), and that the calculation chain part is not in any way associated to the spreadsheet sheet parts (in SpreadsheetML files), it means a mapping needs it occur. What is it?<br /><li>Both documentations make it clear that the calculation chain fields and calculation chain parts are essentially optional. But in practice, this is far from being the truth. In a previous <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">article</a>, I described what happens if you change the value in a cell without touching the calculation chain part : the spreadsheet is seen by Excel 2007 as a corrupt spreadsheet, it self explodes. On the other hand, deleting the calculation chain part will force a recalculation next time the spreadsheet is opened by the user, which is arbitrarily lengthy. If Excel is the only application that can ever avoid making recalculations, because it knows the secret sauce about calculation chains, no other application can be seen as an equivalent application. Quite the opposite, Excel will be seen as a first-class citizen, and everything else will be seen as simplistic (i.e. unreliable) spreadsheet applications.<br /></ul><br /><br /><br /><H5>Caching values to show drop lines in charts!</H5><br /><br />In binary files, adding a dropline to a chart is managed by a simple chart related record, CHARTLINE. Note that since chart records are not documented in the <a href="http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Excel97-2007BinaryFileFormat(xls)Specification.pdf">BIFF documentation</a>, the following CHARTLINE definition comes from MSDN Library, Feb 1998 :<br /><br /><blockquote><B>CHARTLINE: Specifies Drop/Hi-Lo/Series Lines on a Line Chart (101Ch)</B><br /><br />This record specifies drop lines, hi-lo lines, or series lines on a line chart. If the chart has both drop lines and hi-lo lines, two CHARTLINE records will be present.<br /><br />Record Data<br /><br /><I>Offset</I> | <I>Name</I> | <I>Size</I> | <I>Contents</I><br />4 | id | 2 | Drop lines/hi-lo lines. 0 = drop lines 1 = hi-lo lines 2 = series lines (the lines that connect the columns in a stacked column chart)<br /></blockquote><br /><br />The way droplines are added is by having such record present or not. Everything else in the chart record stream is independent of the dropline.<br /><br />Now let's do the following :<br /><br /><ul><li>start Excel 2007<br /><li>click on B2 and enter : jan, hit <TAB>, 3, <TAB>, 4, <TAB>, 5<br /><li>select those cells, click on the Insert ribbon tab and click on the line chart button. Select whatever line chart type. The line chart shows up. <br /><li>click on the Layout ribbon tab, then on Analysis, then Lines, then Drop lines. The drop lines shows up. You should see the following (the droplines are the vertical segments) :<br /><center><img border=1 src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability2.gif"><br><br /><i>Adding droplines to a chart</i></center><br><br /><li>save this file as Droplines.xlsx, close it, quit Excel 2007.<br /><li>unzip the file and take a look at the xl/charts/chart1.xml part. You should see the following :<br /><center><img border=1 src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability3.gif"><br><br /><i>The XML droplines element is indeed added to the chart description</i></center><br><br /><li>The <c:dropLines> element is the exact equivalent of the binary CHARTLINE record. Or so it seems. now edit this part and remove the <c:strCache> and <c:numCache> elements, which are cached values. It should result in the following :<br /><center><img border=1 src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability4.gif"><br><br /><i>Removing cached values in a chart description</i></center><br><br /><li>put the edited part back in the zip file and open it in Excel 2007. The following appears :<br /><center><img border=1 src="http://www.arstdesign.com/BBS/picsupload/OfficeInteroperability5.gif"><br><br /><i>Droplines are gone!</i></center><br><br /></ul><br /><br />Droplines are gone! In other words, droplines in XML charts are linked to cached values, contrary to how droplines used to work in binary spreadsheets (namely, cached values are not needed). Not only is this not documented anywhere, it puts a huge burden on implementers who now have, at minimum, to actually compute and store a cache of values from the chart data source only to be able to show droplines, despite the fact that the XML description is designed in a way such that the presence of element <c:dropLines> is all what is expected to add droplines to a chart.<br /><br />Here are the Excel 2007 files (.xslx) :<br /><ul><li>Chart+droplines, <a href="http://www.arstdesign.com/BBS/picsupload/Book1_droplines_cachedvalues.xlsx">original .xlsx file</a> (i.e. with cached values)<li>Chart+droplines, <a href="http://www.arstdesign.com/BBS/picsupload/Book1_droplines_withoutcachedvalues.xlsx">manually updated .xlsx file</a> (i.e. cached values removed)</ul><br /><br />And here are the binary files (.xls) :<br /><ul><li>Chart+droplines, <a href="http://www.arstdesign.com/BBS/picsupload/Book1_droplines_cachedvalues.xls">original .xls file</a> (i.e. with cached values)<li>Chart+droplines, <a href="http://www.arstdesign.com/BBS/picsupload/Book1_droplines_withoutcachedvalues.xls">manually updated .xls file</a> (i.e. cached values removed)</ul><br /><br />In a binary file, cached values are stored in the chart footer, as follows :<pre><br />[DIMENSIONS 000E] 00 00 00 00 03 00 00 00 00 00 01 00 00 00<br />[1065 0002] 02 00<br />[1065 0002] 01 00<br />[NUMBER 000E] 00 00 00 00 00 00 00 00 00 00 00 00 08 40<br />[NUMBER 000E] 01 00 00 00 00 00 00 00 00 00 00 00 10 40<br />[NUMBER 000E] 02 00 00 00 00 00 00 00 00 00 00 00 14 40<br />[1065 0002] 03 00<br /></pre><br /><br /><br /><br /><H4>Conclusion</H4><br /><br />A couple of counter examples have demonstrated that Microsoft Office document interoperability is wishful thinking at this point. The documents made available by Microsoft for direct download won't and shouldn't impress third-party implementers since it does not help much.<br /><br />What is being shown is that in addition to missing documentation, the binary documentations sometimes conflict with the ECMA 376 documentation, itself not a full documentation of the new XML-based formats anyway.<br /><br />Who other than Microsoft is able to reliably migrate formats from binary to XML, and back ?<br /><br />Microsoft seems to have anticipated the issue by creating...an open source project in Feb 15 whose task is to migrate files. But it's unclear that, short of rewriting the existing Microsoft Office source code, which could take a decade, how this project is supposed to achieve any meaningful goal. Obviously, ISO national bodies should take this open source project with a grain of salt...<br /><br />A national body should probably ask instead Microsoft to <b>open up the source code of the Microsoft Office compatibility pack</b> since, after all, that is exactly what this component does. It migrates Microsoft Office documents back and forth, contains what's needed to migrate calculation chains, droplines, etc. and is therefore a perfect candidate for third-party implementations. Even better, Microsoft can make it available now!OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-55326230530200060262008-02-27T09:20:00.000-08:002008-02-27T09:48:55.384-08:00Typical B.S. in technical articles about OOXML<span style="font-style:italic;">Stéphane Rodriguez, February 2008</span><br /><br />Previous articles :<br /><a href="http://ooxmlisdefectivebydesign.blogspot.com/2008/02/truth-about-microsoft-office.html">The truth about Microsoft Office compatibility</a><br /><a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">OOXML is defective by design</a><br /><br /><br /><br />Microsoft has posted an article on their community website related to OOXML. That <a href="http://openxmldeveloper.org/articles/2683.aspx">technical article</a> is supposed to compare OOXML's DrawingML markup for drawing objects and SVG, an open standard (W3C) developed by independent parties for drawing objects.<br /><br />That article tries to appear neutral and fair, even though the author uses plenty of hyperbole to paint OOXML's DrawingML markup as the best thing since sliced bread, and demonize SVG as a simplistic markup, a point of view strikingly similar to Burton Group's study that painted OOXML as a wonderful file format, and demonized ODF as a simplistic file format. (we know Burton Group is a Microsoft client and business partner).<br /><br />I recommend to read the pointed article before reading further.<br /><br />To say it is a mess is a euphemism.<br /><br /> It is supposed to be a technical comparison, but the author never bothers mention what kind of comparison he's trying to make. There are indeed several comparison types :<br /> - strict markup and semantics compatibility<br /> - simple mapping to achieve markup and semantics compatibility<br /> - blocking incompatibilities<br /><br /> Even in the blocking incompatibilies case, since DrawingML and SVG are both based on XML, there cannot be core incompatibilities preventing scenarios such as round-tripping. Indeed, the point of using XML is that you can always add missing markup from one language into another in order to account for the missing markup and semantics.<br /><br /> With that said, in the introduction, the document incorrectly defines SVG, portrayed as "something for the web", and uses this mischaracterization in the conclusion to actually show his affinity towards DrawingML : "a richer and nicer fit for Office documents".<br /><br /> The simple fact that the author uses "Office documents" instead of "Microsoft Office documents" is telling how little the author is willing to take into account the needs of general purpose Office documents, as opposed to a single vendor's Office document model.<br /><br /> What that document <b>actually reflects is the author's own biased knowledge of the domain at hand</b>.<br /><br /><br /> 1) The SVG definition : what it's about<br /><br /> According to the author, "<i>Scalable Vector Graphics (SVG) is a graphics file format and Web development language based on XML. With SVG, user can describe the output (i.e. look and feel) of the content. SVG describes the common and extended feature set of today's graphical authoring environments, both tools and program. SVG is a common export format in today's graphical authoring environments. SVG is a language for describing two-dimensional graphics in XML. It allows three types of graphic objects: vector graphic shapes (e.g., paths consisting of straight lines and curves), images and text. With SVG, developers can create Web applications based on data-driven, interactive, and personalized graphics.</i>"<br /><br /> On the W3C website (http://www.w3.org/Graphics/SVG/About.html), SVG however is defined as follows :<br /><br /> "<i>SVG is a language for describing two-dimensional graphics and graphical applications in XML. SVG 1.1 is a W3C Recommendation and forms the core of the current SVG developments. SVG Tiny 1.2 is the specification currently being developed as the core of the SVG 1.2 language (comments welcome). The SVG Mobile Profiles: SVG Basic and SVG Tiny are targeted to resource-limited devices and are part of the 3GPP platform for third generation mobile phones. SVG Print is a set of guidelines to produce final-form documents in XML suitible for archiving and printing.<br /><br /> Furthermore,<br /><br /> SVG is a platform for two-dimensional graphics. It has two parts: an XML-based file format and a programming API for graphical applications. Key features include shapes, text and embedded raster graphics, with many different painting styles. It supports scripting through languages such as ECMAScript and has comprehensive support for animation.<br /><br /> SVG is used in many business areas including Web graphics, animation, user interfaces, graphics interchange, print and hardcopy output, mobile applications and high-quality design.<br /><br /> SVG is a royalty-free vendor-neutral open standard developed under the W3C Process. It has strong industry support; Authors of the SVG specification include Adobe, Agfa, Apple, Canon, Corel, Ericsson, HP, IBM, Kodak, Macromedia, Microsoft, Nokia, Sharp and Sun Microsystems. SVG viewers are deployed to over 100 million desktops, and there is a broad range of support in many authoring tools.<br /><br /> SVG builds upon many other successful standards such as XML (SVG graphics are text-based and thus easy to create), JPEG and PNG for image formats, DOM for scripting and interactivity, SMIL for animation and CSS for styling.<br /><br /> SVG is interoperable. The W3C release a test suite and implementation results to ensure conformance.</i>"<br /><br /><br /> As the definition tells, SVG can be seen as an ongoing work with a core set of markup to which an arbitrary number of so-called profiles can be attached to target specific devices for instance. i.e.<br /><br /> SVG Basic<br /> SVG Tiny<br /> SVG Mobile<br /> SVG Print<br /> ...<br /><br /> SVG Basic itself uses fundamental standards such as CSS and Javascript to provide styling as well as programmability.<br /><br /> This definition provides ground for more profiles, such as an hypothetical "SVG chart", a profile that would define a charting package. In other words, <b>SVG is fundamentally extensible to accomodate today's and tomorrow's needs</b>.<br /><br /> Contrary to what the author claims, SVG is not specifically for the web.<br /><br /><br /><br /> 2) DrawingML's definition.<br /><br /> In ECMA 376, Part 1<br /><br /> "<i>DrawingML specifies the location and appearance of drawing elements in a package. For example, these elements could be, but are not limited to, shapes, pictures, and tables. The root element of a DrawingML XML fragment specifies the presence of a drawing at this location in the document.<br /> A shape is a geometric object such as a circle, square, or rectangle; a picture is an image presented inside the document; and a table is a two-dimensional grid of cells organized into rows and columns. Cells and whole tables can have associated properties. A cell can contain text, for example. DrawingML also specifies the location and appearance of charts in a package. The root element of a chart part is chart, and specifies the appearance of the chart at this location in the document. In addition, DrawingML specifies package-wide appearance characteristics, such as the package's theme. The theme of a document specifies the color scheme, fonts, and effects, which can be referenced by parts of the document—such as text, drawings, charts, and diagrams—in order to create a consistent visual presentation.<br /> A chart is a presentation of data in a graphical fashion, such as a pie chart, bar chart, line chart, in order to make trends and exceptions in the data more visually apparent.</i>"<br /><br /> This definition makes it very clear that, in constrast with SVG, <b>DrawingML ties together hardcoded drawing object representations that no matter how they suit a particular vendor's product, entrenches anyone who would rely on DrawingML</b>.<br /><br /> Just like elsewhere in ECMA 376, DrawingML is not based on any ISO standard. For instance, even though SVG can be tied to ISO-based ICC color profiles, used by industrial manufacturers, DrawingML colors are just the reflection of a vendor's own internal API.<br /><br /> No matter how quote, "rich and nice", end quote, this internal API is, it cannot compare with markup that can be adapted to arbitrary independent party profiles.<br /><br /> Since SVG is being built by a number of independent parties, whereas DrawingML is built by a single vendor, comparing SVG to DrawingML only contributes to putting the light on DrawingML proprietary developments. It actually goes against portraying DrawingML as a good thing to have.<br /><br /><br /> 3) Comparisons<br /><br /> 3.1 Given the extensibility nature of SVG, there is no discussion whether SVG can support all of DrawingML elements. It can. It's a direct consequence of XML-based markup.<br /><br /> Now, that would be for markup compatibility.<br /><br /> 3.2 Another comparison type is simple mapping to achieve markup and semantics compatibility. By that, the author would want to know if there is a way to map any DrawingML fragment to any *EXISTING* SVG fragment. But even that draws a few cases : which way, and is it both way?<br /> - which way : if we have a specific markup element in DrawingML that we are willing to represent in existing SVG markup, this task is bound to fail if we are trying to map intricacies of DrawingML as opposed to general purpose drawing markup. For instance, in section 3.1.2.2 of that document, the author tries to match the DrawingML's a:comp markup with an equivalent in the existing SVG markup. And fails to find it. Well, there are good reasons for that. First of all, it puts a <b>severe burden on implementers who have to compute and maintain color stacks</b> (to resolve a:comp into an actual color when it comes to rendering that element), which is probably not desirable on devices where SVG is supposed to be thin and fast. For instance, if SVG is used for printing purposes, the printer driver using the equivalent of a:comp would have to compute and maintain color stacks. It is not sure that it is in either the printer manufacturer's interest to do so (implementation tax), or to manufacturers in general to do so. A preferred way is to use SVG profiles which provide room for selecting explicit colors among alternate lists (CSS based, most notably). What this allows is cheaper device implementation support of SVG based documents.<br /><br /> While this example of the a:comp markup is just one example, the entire document goes like this using the same logic.<br /><br /> - the other way : since DrawingML is hardcoded, an arbitrary number of SVG fragments may not be represented in DrawingML with markup and semantics compatibility. This problem is directly tied to the fact that 1) DrawingML is not based on SVG, it defines his own world 2) DrawingML lacks a number of SVG concepts and therefore cannot directly map markup and semantics. And there is one problem, <b>today's Office 2007 runtime does not allow ANY FOREIGN XML ELEMENT in any of the core ECMA 376 languages, including DrawingML</b>. In other words, if you take a DrawingML fragment, and add a few markup to tag it with information that defines some SVG concept from a SVG fragment, that would be seen by Office 2007 as a case of <b>corrupt document</b>.<br /><br /> - both ways : these are both cases above combined.<br /><br /><br /> 3.3 The last case of comparison is strict markup and semantics compatibility. This case is automatically bound to fail since DrawingML was developed by a single vendor and is not based on other standards.<br /><br /><br /><br /> 4) Conclusion<br /><br /> It does not demonstrate anything other than :<br /><br /> - there is no direct mapping between DrawingML and SVG, and that has the unfortunate consequence of pointing out the single vendor's proprietary design that went into DrawingML. Should the single vendor have designed DrawingML on top of SVG, the situation would be drastically different.<br /><br /> - the lack of the author's own understanding of the fundamental reason why people use XML markup : in essence you can add all the missing markup from one language into another to satisfy scenarios such as round-tripping (DrawingML ==> SVG ==> DrawingML) without loss.<br /><br /><br /><br />Some additional background information. The author of that document is Sheela E N, an employee of Sonata software, an Indian consulting company. Sonata software has made a name for themselves since 2006 by contracting with...Microsoft for helping them test a Microsoft-sponsored open source project ODF-OOXML, under the guidance of French-based CleverAge, and a number of other consulting companies. Those ties define a biased source (directly paid) when it comes to discussing Microsoft file formats.OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-44144085768347664922008-02-08T13:24:00.000-08:002010-01-09T01:45:42.231-08:00The truth about Microsoft Office compatibility<i>Stéphane Rodriguez, February 2008</i><br /><br /><p> </p><br /><br /><p>The Microsoft Office product team rationale for designing such a <a href="http://ooxmlisdefectivebydesign.blogspot.com/2007/08/microsoft-office-xml-formats-defective.html">poorly engineered</a> file format known as OOXML is that, according to public statements they made, they do this not to advance the state of the art in Office document models, but to bring the two-decade worth of legacy and bugged if not broken features into the future.</p><br /><br /><p>They have a commercial reason to do so. According to their numbers, there is a 400,000,000 user install base. But with the introduction of the new file format OOXML and the accompanying application that they rushed out the door in order to ship in line with Windows Vista, is it true that Office 2007 is Microsoft's best thing since sliced bread when it comes to compatibility?</p><br /><br /><H4>1) There is no Office 2007 64-bit edition</H4><H4>2) No support for 64-bit addins, user defined functions, existing ActiveX controls, OLE servers or even managed code (.NET).</H4><H4>3) No support for VBA in Excel services, part of their server suite</H4><H4>4) No support for VBA in Office 2008, the Mac version</H4><H4>5) Reduced speed of Office 2007 on 64-bit operating systems</H4><H4>6) Limited memory space of Office 2007 on 64-bit operating systems</H4><H4>7) Reduced legacy code compatibility of Office 2007 on 64-bit operating systems</H4><H4>8) OOXML supports 64-bit at all?</H4><H4>9) An open standard excluding competitors</H4><H4>10) Reverse engineering a necessity for compatibility reasons</H4><H4>11) No compliant file format</H4><br /><br /><br /><p> </p><br /><H3>1) There is no Office 2007 64-bit edition</H3><br /><br /><p>As amazing as it sounds, Microsoft didn't bother ship a 64-bit version of Office 2007, a version that would run natively on one of the 64-bit operating systems that they have made available on Windows (XP 64 edition, 2003 x64 edition, 2003 Itanium edition, 2008 server).</p><br /><br /><p>64-bit computers have become mainstream, and demand is growing very fast.</p><br /><br /><p>Microsoft mentions it on their <a href="http://office.microsoft.com/en-us/products/HA101668651033.aspx">website</a> (Office system requirements) : </p><br /><blockquote><i>System requirements overview<br><br /><br><br />The 2007 Microsoft Office system programs client is a <b>32-bit application</b> and can run on a Windows 64-bit platform (Windows XP, Windows Server 2003, and Windows Vista) but <b>there may be some feature limitations</b> as noted in the system requirements below. (...)</i></blockquote><br /><br /><p>The company that ships both Office and Windows couldn't find a good reason to provide a native support to their customers using 64-bit operating systems. Customers will have to run in emulated mode, otherwise known as WOW64 so that Office runs at all.</p><br /><br /><p>A number of limitations make it impossible once for all to use this product with the confidence that existing applications and add-ins work as intended. Despite compatibility claims. To add insult to injury, a number of performance issues arise as well, making hard to talk about compatibility without sounding like a liar. If you are in the process of purchasing Microsoft software, you may want to hold back a bit and perhaps consider an alternative.</p><br /><br /><img src="http://www.arstdesign.com/BBS/picsupload/OfficeCompatibilityLies1.gif"><br><br /><i>The Microsoft Office compatibility matrix (true for all Office versions)</i><br><br /><br /><p> </p><p>We are going to get in such details.</p><br /><br /><p> </p><br /><H3>2) No support for 64-bit addins, user defined functions, existing ActiveX controls, OLE servers or even managed code (.NET).</H3><br /><br /><p>Microsoft has not touched VBA, the scripting language that so many business users are fond of, for a number of years. In fact, it's been 7 years that they simply chose to outsource the maintenance of VBA (i.e. no more development) to Summit Software, a consulting company. VBA is at the heart of the so-called <i>Office platform</i>. It's what appeals to customers in need of custom applications built around Office. The fact that VBA is not being developed anymore should give anyone the assurance that anytime Microsoft ships a revision of Office, all applications based on VBA automatically work. Well, there is one problem : there isn't a 64-bit version of VBA.</p><br /><br /><p>If you are using a VBA-based third-party solution that depends on anything native of the 64-bit operating system, and get it loaded as part of Office, it simply won't work.</p><br /><br /><p>In retrospect, one has to ask, is the lack of 64-bit VBA the reason why Microsoft does not ship a 64-bit edition of Office ? This question is worth asking. If the answer is yes, it means Microsoft is lagging behind its own technology, and they are unable to meet the needs of anyone using a 64-bit operating system to take advantage of the 64-bit improvements over 32-bit.</p><br /><br /><p>What applies to VBA, applies to all user defined functions and even managed code (.NET) as well. No 64-bit user defined function or 64-bit managed code assembly can be loaded as part of a 32-bit image (Winword.exe, Excel.exe, Powerpoint.exe). As a result, your solutions and third-party solutions cannot take advantage of Office and the 64-bit operating system advantages simultaneously.</p><br /><br /><p> </p><br /><H3>3) No support for VBA in Excel services, part of their server suite</H3><br /><br /><p>Microsoft ships Excel in two editions, a desktop edition (Excel 2007) and a server edition (Sharepoint server 2007 which includes a server, Excel services). Even though Microsoft ships a 64-bit edition of the server suite, you still cannot use VBA as part of Excel services.</p><br /><br /><p>Why it is so? Simply VBA was designed as a <i>STA COM apartment</i>, Microsoft novlang to mean <i>single-threaded</i>. Excel services is designed to distribute calculations and serve users across threads (fundamentally multi-threaded architecture) and machines, so a single-threaded component cannot be part of this architecture.</p><br /><br /><p>The consequence is, should you consider using Excel services instead of the desktop of Excel to meet your needs, you automatically lose the ability to open reliably any spreadsheet embedding solutions and third-party solutions using VBA. For instance, <i>Financial related add-ins</i> shipping by default with Excel 2007 won't run.</p><br /><br /><p> </p><br /><H3>4) No support for VBA in Office 2008, the Mac version</H3><br /><br /><p>When Microsoft shipped the Mac version of Office 2007 in January this year, known as Mac Office 2008, it had no qualms taking away the support for VBA. Making any VBA-based solution broken on the Mac.</p><br /><br /><p>In the top 5 issues of the Microsoft <a href="http://www.microsoft.com/mac/help.mspx?target=e2088fea-cdb6-47ec-af72-3b675d64a4f71033&clr=99-1-0&ep=11&CTT=PageView&MODE=pv&locale=en-US&usid=5b53a598-b16d-4add-9156-c1398603b5ba">website</a> related to Mac Office 2008, it says :<br /><br /><blockquote><i>My Visual Basic macros don't work<br><br /><br><br />Cause: Office 2008 for Mac cannot run Visual Basic macros or load add-ins that contain Visual Basic macros.<br><br /><br><br /><br><br /> Solution: Keep the macro in the file. <br><br /> Solution: Remove the macro from the file. <br><br /> Solution: Save the macro in another macro-enabled file format. <br><br /> Solution: Create a new macro by using AppleScript. </i><br /></blockquote><br /><br /><p>Apparently, it does not matter so much that the ease and essence of VBA-embedded macros in documents employed by business users met their needs. Remember, VBA macros are a clever way for users to customize solutions without having IT people interfering. A remarkable feature is that, by embedding VBA macros, the deployment of solutions is trivial and does not involve the hassles of IT department. A sharp contrast with .NET, Microsoft's proposal for the future, which uses external files (assemblies) and additional rules for running : impossible to deploy without IT people.</p><br /><br /><p> </p><br /><H3>5) Reduced speed of Office 2007 on 64-bit operating systems</H3><br /><br /><p>When an application runs in WOW64, the emulation mode, it marshals memory access back and forth to adjust pointer sizes. Thereby creating a performance hit. Especially crucial for Word pagination, Excel calculations and Powerpoint animations.</p><br /><br /><p>Compatibility is therefore somewhat subjective. If your spreadsheet used to calculate within 10 minutes, and now it take twice the time, it's clear the in companies where speed is crucial, it is regarded as a bug that must be fixed.</p><br /><br /><p>One of the advantages of the 64-bit CPU is that whenever an application is natively 64-bit it maximizes the use of the increased register set. But when the application is running in WOW64, it has to restrict itself with the typical 32-bit CPU registers, and therefore unable to take advantage of the speed increase obtained by an increased register sets (as a result of less stacking).</p><br /><br /><p> </p><br /><H3>6) Limited memory space of Office 2007 on 64-bit operating systems</H3><br /><br /><p>Another advantage of a 64-bit operating system and CPU is the ability to run processes using a 64-bit wide address space. We are talking an infinite memory space, so to speak. But in 32-bit, and thefore in the WOW64 emulation mode, none of that is possible and a running process is restricted to 4GB of memory space, with 2GB preempted by the kernel.</p><br /><br /><p>Expecting to work with larger spreadsheets or documents gets stopped in the starting blocks. Making a purchase of such system is therefore not economically sound.</p><br /><br /><p> </p><br /><H3>7) Reduced legacy code compatibility of Office 2007 on 64-bit operating systems</H3><br /><br /><p>As per Microsoft own <a href="http://www.microsoft.com/servers/64bit/faqs.mspx">website</a>,</p><br /><br /><blockquote><i> Q. Are there any features in the 32-bit versions of Windows that are not in Windows Server 2003 x64 editions?<br><br><br />A. <br><br /><br><br />A small number of features are not included in x64 Windows, including DOS, POSIX, <b>16-bit support</b>, and a few legacy networking protocols no longer in active use. However, we do not expect most customers to be affected by these differences. Based on customer feedback, we expect the initial x64 usage scenarios for Windows Server 2003 to be databases, business applications, Terminal Server, Active Directory, Internet Information Services (IIS), and technical computing.</i></blockquote><br /><br /><p>If you are using old Excel 4.0/5.0 modules, user-defined functions built on 16-bit code, that work with a 32-bit operating system, you better not expect it to work on the 64-bit operating system with Microsoft Office. The reason why is that the WOW64 emulation mode does not marshall 16-bit pointer sizes. No compatibility, period.</p><br /><br /><p>In the real world, many businesses are using old 16-bit solutions on a daily basis. There is no migration path for them on 64-bit operating systems.</p><br /><br /><p> </p><br /><H3>8) OOXML supports 64-bit at all?</H3><br /><br /><p>If anything, the previous sections have illustrated that "Microsoft Office" and "64-bit operating systems" are strangers. As such, how does one make a relevant compatibility case of OOXML consumed or produced by a native non-Microsoft 64-bit application? How do we know that today's OOXML makes sense at all across platforms?</p><br /><br /><p> </p><br /><H3>9) An open standard excluding competitors</H3><br /><br /><p>The ECMA 376 proposal tries to make a case for compatibility with legacy formats, such as binary formats. Forgetting a moment that the new XML formats still contain a number of <a href="http://www.codeproject.com/KB/cs/office2007bin.aspx">binary blobs</a> and are therefore in direct contradiction with the proposal, the goal states that :</p><br /><br /><blockquote>ECMA 376, Part 1, Section 2.1 <i>The goal of this clause is to define conformance, and to provide interoperability guidelines in a way that fosters broad and innovative use of the Office Open XML file format, while maximizing interoperability and <b>preserving investment in existing files and applications</b>.</i></blockquote><br /><br /><p>It's pretty clear that the justification for the proposal is the preservation of the investment in files and applications. The preservation of the investment in applications has been addressed in the sections above of this article, where we have seen that the reality is a little different : Microsoft can't seem to be able to interoperate across their own platforms. With regards to binary files, it is just the same, as we are going to see.</p><br /><br /><p>Microsoft Office 2007 keeps secret how it migrates a binary file into a new file. The use case is neither described, documented, illustrated in ECMA 376 even though that is the stated goal of...ECMA 376. <b>Competitors are excluded from doing the same reliably.</b> Is that what we call exclusive openness?</p><br /><br /><p>We are in a case where the stated goal and content of the body of the ECMA 376 proposal don't match. This case was brought by a number of national bodies. The only national body where the discussions were made public (Microsoft originally vehemently opposed the exposure) was the US INCITS review group. This critical question remained unanswered, and it is known and widely reported that the US INCITS ballot brutally changed from a negative vote (overwhelmingly, first ballot in July 2007) to a positive vote (almost unanimously, in August 2007) after Bill Gates made a phone call to the US secretary.</p><br /><br /><p>To review the US INCITS discussions, head over to :</p><br /><pre><a href="http://www.ibiblio.org/bosak/v1mail/200704/">US INCITS V1 archive, April</a><br /><a href="http://www.ibiblio.org/bosak/v1mail/200705/">US INCITS V1 archive, May</a><br /><a href="http://www.ibiblio.org/bosak/v1mail/200706/">US INCITS V1 archive, June</a><br /><a href="http://www.ibiblio.org/bosak/v1mail/200707/">US INCITS V1 archive, July</a><br /><a href="http://www.ibiblio.org/bosak/v1mail/200708/">US INCITS V1 archive, August</a><br /><a href="http://www.ibiblio.org/bosak/v1mail/200709/">US INCITS V1 archive, September</a><br /><a href="http://www.ibiblio.org/bosak/v1mail/200710/">US INCITS V1 archive, October</a></pre><br /><br /><br /><p> </p><br /><H3>10) Reverse engineering a necessity for compatibility reasons</H3><br /><br /><p>Microsoft made clear publicly that the design of the XML vocabulary was not done at the expense of performance (performance is in the DNA of their engineering techniques, or so they say), and that they worked towards minimizing the performance hits due to the XML parsing, whose footprint is actually hidden thanks to the ZIP compression, a technique introduced by Open Office, the open source Office suite.</p><br /><br /><p>While this induced considerations such as keeping XML names as small as possible, and avoiding redundancies through redirections, both at the expense of implementors, what Microsoft won't say is that they went as far as making an extensive use of default values, that they consequently do not even store in the file.</p><br /><br /><p>It is a problem. You cannot find information when it is not in the file! Microsoft has thousands of default values that they use in a running instance of Office 2007, values that can be overridden by explicit attribute values, with the implication that non-Microsoft applications must reproduce the same in order to be compatible. All this is implementation-specific at the expense of the file format, that's where document design interferes with application design. Performance is in their DNA, but two decades of closed product development have consequences. Obviously Microsoft never designed Office 2007 with the idea that they would submit their internal specifications to an international organization.</p><br /><br /><p>In other words, short of Microsoft providing the exhaustive list of such default values, the combinations not only trivial ones, there is no way a competing application can be reliably compatible without reverse engineering Office 2007. Which defies the point of the ISO proposal.</p><br /><br /><br /><p> </p><br /><H3>11) No compliant file format</H3><br /><br /><p>There is no strict "Save As ECMA 376" in Office 2007. There is no way an agent (application, third-party, SOX) is going to be able to produce such files. If that was not enough, ECMA 376 itself is a moving target due to Microsoft's own incompetence in proposing a format that was ready for international standards bare minimum requirements.</p><br /><br /><p>How does one achieve a file format compliance under those conditions?</p><br /><br /><p> </p><br /><H3>Conclusion : this article has demonstrated that any angle you take the problem, compatibility is more wishful thinking than reality. In fact, experienced personnel will recognize much of the evolution of Microsoft Office over the years. The problem is that Microsoft is going to ISO with it. It contradicts even the most trivial assumptions. <br />Fast-tracking the ECMA 376 proposal made no sense at all. Microsoft is urged to go back to the drawing board.</H3>OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.comtag:blogger.com,1999:blog-283390674923940532.post-77729358780765241052007-08-28T08:27:00.000-07:002014-10-30T01:41:19.417-07:00Microsoft Office XML formats? Defective by design<P> </P><br /> <P>Microsoft is trying to push new file formats that are using ZIP and XML. Are those new file formats any good for Office developers ? In other words, should anyone feel safe to make direct access to file parts, and start getting free of running instances of Microsoft Office and its COM object model, usually through VBA ?</P><br /> <P>Microsoft does not run out of teasing. There is ton of videos, see <a href="http://channel9.msdn.com/ShowPost.aspx?PostID=73329">here</a>, and <a href="http://channel9.msdn.com/Showpost.aspx?postid=313246">here</a> for example, screencasts, articles and blog posts (self-serving Microsoft blog posts mostly) about how much they are opening up. It boils down to the following, excerpt from Microsoft Office 12 introduction <a href="http://blogs.msdn.com/brian_jones/archive/2005/06/01/424085.aspx"> white paper</a> :<br /> </P><br /> <P style="background-color:#EEEEEE"><SPAN><EM>(...) <br /> The use of XML offers the benefits of greater <STRONG>transparency and openness</STRONG> than were possible with the previous binary file formats. The new formats allow Office documents to easily integrate with existing and future line-of-business systems, as the contents are now open and accessible. The new formats are also <br />designed with long-term robustness and accessibility in mind. (...) </EM><SPAN> <EM>The binary file formats in use currently were designed in 1994—before the advent of XML and before widespread exchange of documents and data that is common today. These file formats, .doc, .xls, and .ppt, were introduced with the release of Microsoft Office 97, at a time when it was important to optimize the files for storage on slow hard drives and “floppy” disks; it was not as crucial to <STRONG>focus on easy access to data within the files for better content <br />reuse, document generation, and seamless integration of the documents into business processes</STRONG>. (...) </EM><SPAN><EM>The new XML-based file formats in these programs enable broader integration and interoperability between Office documents and enterprise applications. <br />Additionally, “Office 12” files are all wrapped using ZIP technologies, which allows for easy access to the content parts as well as standard compression, reducing file sizes and improving reliability and data recovery. (...) <SPAN>Because documents stored in the Open XML Formats are machine-readable and editable by any text editor or XML processor, <STRONG>solutions need not use <br />Microsoft Office programs to view or edit content within the documents.</STRONG> Enterprise business solutions can access document contents easily and efficiently. Technology providers can utilize the Microsoft Office System and Office authoring applications within their solutions, reuse Microsoft Office documents as other Office documents, or open and <STRONG>act on Office documents on other platforms and in other applications</STRONG>.</SPAN></EM></P><br /> <P> <br /> </P><br /> <P>They insist on the fact that, provided you make a valid use of the XML, pretty much changing the content of anything in an existing document can be achieved by sequentially 1) unzipping the content 2) making appropriate changes to one or more XML parts that are compatible with the provided XML schemas and open packaging relationships 3) zipping the content back .</P><br /> <P>Let's see if that's true.</P><br /> <P><FONT size="3">1) Self-exploding spreadsheets</FONT></P><br /> <P><FONT size="3">2) Entered versus stored values</FONT></P><br /> <P><FONT size="3">3) Optimization artefacts become a feature instead of an embarrasment</FONT></P><br /> <P><FONT size="3">4) VML isn't XML</FONT></P><br /> <P><FONT size="3">5) Open packaging parts minefield</FONT></P><br /> <P><FONT size="3">6) International, but US English first and foremost</FONT></P><br /> <P><FONT size="3">7) Many ways to get in trouble</FONT></P><br /> <P><FONT size="3">8) Windows dates</FONT></P><br /> <P><FONT size="3">9) All roads lead to Office 2007</FONT></P><br /> <P><FONT size="3">10) A world of ZIP+OLE files</FONT></P><br /> <P><FONT size="3">11) Document security is a (bad) joke</FONT></P><br /> <P><FONT size="3">12) BIFF is gone...not!</FONT></P><br /> <P><FONT size="3">13) Document backwards compatibility subject to neutrino radio-activity</FONT></P><br /> <P><FONT size="3">14) ECMA 376 documents just do not exist</FONT></P><br /> <P><FONT size="3">15) How the ISO OpenDocument format (ODF) compares?</FONT></P><br /> <P><FONT size="3"></FONT></SPAN> </P><br /> </SPAN></SPAN><br /> <H2>1) Self-exploding spreadsheets</H2><br /> <P>To reproduce the scenario :</P><br /> <UL><LI>start Excel 2007 and create a new spreadsheet<LI>insert a value 10 in a cell<LI>insert a value 20 next to it<LI>select the two cells and click on the "sum" button to create a sum of the two cells<LI>save the spreadsheet (xlsx file)<LI>close it and unzip it</UL><br /> <P>The relevant XML in the corresponding part xl/worksheets/sheet1.xml is :</P><br /> <pre><row r="2" spans="3:5"><br /> <c r="C2"><br /> <v>10</v><br /> </c><br /> <c r="D2"><br /> <v>20</v><br /> </c><br /> <c r="E2"><br /> <f>SUM(C2:D2)</f><br /> <v>30</v><br /> </c><br /></row></pre><br /> <P>Pretty simple XML. Now say we want to edit cell E2 and set a constant value of <br />40 in place of a formula. But instead of doing that with Excel 2007 interactively, we are going to do it manually :</P><br /> <UL><LI>unzip the file<LI>grab a zip part known as xl/worksheets/sheet1.xml<LI>make the edit described below<LI>put the updated zip part back in the zip package</UL><br /> <P>The corresponding valid (and carefully changed) XML for setting the constant <br /> value of 40 in cell E2 is :</P><br /> <pre><row r="2" spans="3:5"><br /> <c r="C2"><br /> <v>10</v><br /> </c><br /> <c r="D2"><br /> <v>20</v><br /> </c><br /> <c r="E2"><br /> <v>40</v><br /> </c><br /></row></pre><br /> <P>Now open the file in Excel 2007. You get a blocking error message which says :</P><br /> <P><IMG src="http://www.codeproject.com/KB/office/ooxml_is_defective/DefectiveByDesign1.gif"> <BR><br /> <EM>Excel 2007 cannot open the file we have manually updated</EM></P><br /> <P>Followed by another even more frightening message :</P><br /> <P><IMG src="http://www.codeproject.com/KB/office/ooxml_is_defective/DefectiveByDesign2.gif"><br /> <BR><br /> <EM>According to Excel 2007, the problem is that the calculation chain is now corrupt</EM></P><br /> <P>Interestingly enough, we thought parts of a spreadsheet file were updatable as long as we did not touch elements that are, according to the ECMA 376 documentation (the official <a href="http://www.ecma-international.org/publications/standards/Ecma-376.htm">paper</a> from Microsoft), indexes to other parts.<br /> </P><br /> <P>Now that's an interesting issue. The ECMA 376 documentation says that the <br />calculation chain is the graph of formulas, sorted wrt to their dependencies. Suddenly, the little change we would like to make looks way more expensive. Rebuilding the graph of formulas ourselves is what Excel itself does, and sure enough it involves parsing the entire spreadsheet, discovering formulas, and applying formula parsing algorithms to induce a graph of the dependencies. It certainly sounds like we are going to have to rewrite a portion of Excel itself. Not every employer can afford waiting that much...Or perhaps that's Microsoft's way to tell us : <i>you shall not touch this without our approbation</i>.</P><br /> <P>We can perhaps get rid of the calculation chain, if we carefully delete the part and associated relationship. But then, there are three problems :</P><br /> <P>- Parts are intertwined together through implicit and explicit relationships <br />(relationships are separate zip entries). It gets even more grainy, and the risk of corruption increases. Again, we have to further take into account a number of details that are out of our scope, like how the calculation chain is defined as per the workbook part.</P><br /> <P>- If I were a programmer, the question would be : Microsoft gives no library <br />that I can use, at least not a library that I could use no matter the execution environment. Microsoft provides an API which works in a recent .NET run-time, and installs on Windows XP SP2 and Windows Vista. There goes the platform independence.</P><br /> <P>- If I delete the calculation chain, I am admitting that the resulting spreadsheet is being degraded, something that Excel 2007 does not suffer from since it takes care of that thanks to its infrastructure. In other words, making changes outside of Excel 2007 doesn't look as first-class citizen and safe and robust as it sounded first, and in either case either the application takes care of a lot of details, replicating what Excel itself does (there is no known non-Microsoft Excel 2007 implementation available out there), or Excel 2007 will have to do that for me next time the spreadsheet is opened. If the changes are made on a server without Excel 2007 installed and the resulting spreadsheet is distributed to employees throughout the organization, then every single employee will have to get through a full spreadsheet recalculation (arbitrarily lengthy) next time they open the spreadsheet. The application is a second-class citizen compared to Excel 2007, that's what everybody thinks : me and the employees.</P><br /> <P>What this shows pretty clearly is that either we lack the tools, or Microsoft <br />does not think we should be doing that in the first place. With that being said, if making a simple change to a cell is too much to ask, then what is this new format good for? The prospect of getting our recipients facing those dreaded message boxes is not exactly a change compared to the well known ugly stories with corrupted binary file formats.</P><br /> <p>Let's play the devil's advocate now and see how far it will take us. Here is the contents of the calculation chain, xl/calcChain.xml :</p><br /><pre><br /><?xml version="1.0" encoding="UTF-8" standalone="yes"?><br /><calcChain xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><br /> <c r="E2" i="1"/><br /></calcChain><br /></pre><br /><p>The reference to formula in cell E2 is what seems to be causing the problem. All right, then since it's just XML, let's remove that reference. Edit the calculation chain xl/calcChain.xml so it looks like this :</p><br /><pre><br /><?xml version="1.0" encoding="UTF-8" standalone="yes"?><br /><calcChain xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"><br /></calcChain><br /></pre><br /><p>After the modification, if we open the file in Excel 2007, it still complains :</p><br /> <P><IMG src="http://www.codeproject.com/KB/office/ooxml_is_defective/DefectiveByDesign6.gif"> <BR><br /> <EM>Excel 2007 cannot open the file even if the calculation chain is cleaned up manually</EM></P><br /><p>I particularly like the wording : <i>catastrophic failure</i>. Now it looks like a lot of blood is spilling. Well, then let's stop the hemorrhage, and remove the contents of this part altogether. Make the update, and open the file in Excel 2007, it continues to complain :</p><br /> <P><IMG src="http://www.codeproject.com/KB/office/ooxml_is_defective/DefectiveByDesign7.gif"> <BR><br /> <EM>Excel 2007 cannot open the file even if the calculation chain is made empty</EM></P><br /><p>I guess it's time to take a look at the ECMA 376 documentation. In Part 4, page 2087, it says (emphasis mine) : </p><br /><pre style="background-color:#EEEEEE"><br />3.6.2 calcChain (Calculation Chain Info)<br />This element represents the root of the calculation chain.<br /><br /><complexType name="CT_CalcChain"><br /> <sequence><br /> <element name="c" type="CT_CalcCell" <b>minOccurs="1"</b> maxOccurs="unbounded"/><br /> <element name="extLst" minOccurs="0" type="CT_ExtensionList"/><br /> </sequence><br /></complexType><br /></pre><br /><p>I guess there we have it, we can't have a calculation chain part with no cell reference in it. Excel guts spilling through the specs here, aren't it? Wait a minute, is this supposed to make into an international standard?</p><br /> <p>Let's do a quick summary : we had Excel 2007 complain that the calculation chain was left with a reference to a formula that did not exist anymore, but in fact Excel 2007 complains even with this manually fixed. The solution is to delete the physical calculation chain part and the relationship it has with the workbook (defined in the workbook relationship part xl/_rels/workbook.rels), and to update some other parts as well.</p><br /><br /><p>Let's get a visual diff of what it takes to make a "proper" change in a cell :</p><br /><br /><TABLE BORDER=0 CELLSPACING=1 BGCOLOR=#AAAAAA><TR><TD><TABLE BORDER=0 BGCOLOR=white CELLPADDING=1 CELLSPACING=1><TR><TD WIDTH=350><FONT FACE="Verdana"><B>BookSample.xlsx (original)</B></FONT></TD><TD WIDTH=60 ALIGN=CENTER><FONT FACE="Verdana"><B>DIFF</B></FONT></TD><TD WIDTH=350><FONT FACE="Verdana"><B>BookSample.xlsx (updated)</B></FONT></TD></TR><TR CLASS="diff"><TD CLASS="bot" BGCOLOR="#FFD71D"><a href="http://www.arstdesign.com/articles/workdir/[Content_Types].xml.html">[Content_Types].xml</a></TD><TD ALIGN=CENTER BGCOLOR="#FFD71D">1</TD><TD CLASS="bot" BGCOLOR="#FFD71D"><a href="http://www.arstdesign.com/articles/workdir/[Content_Types].xml.html">[Content_Types].xml</a></TD></TR><TR CLASS="std"><TD CLASS="bot" BGCOLOR="#FFFFFF"><a href="http://www.arstdesign.com/articles/workdir/_rels..rels.html">_rels/.rels</a></TD><TD ALIGN=CENTER BGCOLOR="#FFFFFF"></TD><TD CLASS="bot" BGCOLOR="#FFFFFF"><a href="http://www.arstdesign.com/articles/workdir/_rels..rels.html">_rels/.rels</a></TD></TR><TR CLASS="std"><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/docProps.app.xml.html">docProps/app.xml</a></TD><TD ALIGN=CENTER BGCOLOR="#FFFFFF"></TD><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/docProps.app.xml.html">docProps/app.xml</a></TD></TR><TR CLASS="std"><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/docProps.core.xml.html">docProps/core.xml</a></TD><TD ALIGN=CENTER BGCOLOR="#FFFFFF"></TD><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/docProps.core.xml.html">docProps/core.xml</a></TD></TR><TR CLASS="std"><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/xl.workbook.xml.html">xl/workbook.xml</a></TD><TD ALIGN=CENTER BGCOLOR="#FFFFFF"></TD><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/xl.workbook.xml.html">xl/workbook.xml</a></TD></TR><TR CLASS="diff"><TD CLASS="bot" BGCOLOR="#FFD71D"> <a href="http://www.arstdesign.com/articles/workdir/xl._rels.workbook.xml.rels.html">xl/_rels/workbook.xml.rels</a></TD><TD ALIGN=CENTER BGCOLOR="#FFD71D">1</TD><TD CLASS="bot" BGCOLOR="#FFD71D"> <a href="http://www.arstdesign.com/articles/workdir/xl._rels.workbook.xml.rels.html">xl/_rels/workbook.xml.rels</a></TD></TR><TR CLASS="std"><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/xl.styles.xml.html">xl/styles.xml</a></TD><TD ALIGN=CENTER BGCOLOR="#FFFFFF"></TD><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/xl.styles.xml.html">xl/styles.xml</a></TD></TR><TR CLASS="std"><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/xl.theme.theme1.xml.html">xl/theme/theme1.xml</a></TD><TD ALIGN=CENTER BGCOLOR="#FFFFFF"></TD><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/xl.theme.theme1.xml.html">xl/theme/theme1.xml</a></TD></TR><TR CLASS="diff"><TD CLASS="bot" BGCOLOR="#FFD71D"> <a href="http://www.arstdesign.com/articles/workdir/xl.worksheets.sheet1.xml.html">xl/worksheets/sheet1.xml</a></TD><TD ALIGN=CENTER BGCOLOR="#FFD71D">2</TD><TD CLASS="bot" BGCOLOR="#FFD71D"> <a href="http://www.arstdesign.com/articles/workdir/xl.worksheets.sheet1.xml.html">xl/worksheets/sheet1.xml</a></TD></TR><TR CLASS="diff"><TD CLASS="bot" BGCOLOR="#E0686E"> <a href="http://www.arstdesign.com/articles/workdir/xl.calcChain.xml.html">xl/calcChain.xml</a></TD><TD ALIGN=CENTER BGCOLOR="#E0686E">4</TD><TD CLASS="bot" BGCOLOR="#FFFFFF"> <a href="http://www.arstdesign.com/articles/workdir/xl.calcChain.xml.html"></a></TD></TR></TABLE></TD></TR></TABLE><i>Original versus updated file : 40% of parts need a change!</i> (note : click the links to get the actual changes)<br /><br /><br /><ul><li>all we wanted to do was to make a change in a cell and we end up butchering the package.<li>remember, we don't want to have to deal with formulas or anything even further far away from our concern, we have no notion of formulas nor should have, the idea is to change a value in a cell. If the file format was properly designed, all it would take is an in-place replacement of the cell contents.<li>with VBA, to make a change in a cell, we do : <code>Range("E2").Value = 40</code>. Period. End of story.<li>40% of parts have to change in order to take into account our minor change. That certainly looks like having to care of the format's own dirty laundry, this is not acceptable by any stretch of the imagination for a modern format reportedly (Microsoft source) aimed to improve interoperability across platforms and applications. This is unfortunately exactly the situation in which we are with binary formats, something the new formats were supposed to fix!<br /><li>deleting a physical part is not a minor operation. Most ZIP libraries can't delete a physical part without uncompressing everything in a temp folder and rebuilding the package (thereby simulating a delete function). If your spreadsheet is big, this will take some time and space.<li>a full recalculation in the case there were formulas elsewhere.<li>can you imagine making more complex changes like replacing a pivot table with another ? Pivot tables have many (like hundreds) ties with everything in the spreadsheet.</ul><br /> <P> </P><br /> <H2>2) Entered versus stored values</H2><br /> <P>We all take for granted that when we type a value such as <br /> 1234.1234 in a cell of a spreadsheet, that's what actually gets stored. Excel has this auto-number format matching capability where it tries to make sense of what is manually entered in order to deduce if that's a string, a number, a boolean or a date and applies a number format accordingly, but what's being stored as a value is what is entered. By the way, if you hit Alt+F11 in Excel, and enter something like Range("C3").Value and run the macro, you'll get the entered value in cell C3 ; if you enter something like Range("C3").Text, you'll get the formatted value in cell C3, where the number format has been applied to the value. Note that the return value takes advantage of the locale and number formatting which means you may get "1234,1234" (note the comma) instead of "1234.1234".</P><br /> <P>Is this storage neutrality true with the new formats?</P><br /> <P>To reproduce the scenario :</P><br /> <UL><LI>start Excel 2007 and create a new spreadsheet<LI>type value 123456.123456<br /> <LI>resize the column manually so that the value entirely shows<br /> <LI>hit return and type 12345.12345<br /> <LI>hit return and type 1234.1234<br /> <LI>hit return and type 123.123<br /> <LI>hit return and type 12.12<br /> <LI>save the spreadsheet (xlsx file)<br /> <LI>close it, unzip it</LI><br /> </UL><br /> <p>Here is a screenshot of what you should see at this point :</p><br /> <IMG src="http://www.codeproject.com/KB/office/ooxml_is_defective/DefectiveByDesign3.gif"><br><br /> <i>Typing a few numeric values in Excel 2007</i><br><br /> <P>The corresponding XML in the main part <br /> xl/worksheets/sheet1.xml is :</P><br /> <pre><sheetData><br /> <row r="2" spans="4:4"><br /> <c r="D2"><br /> <v>123456.123456</v><br /> </c><br /> </row><br /> <row r="3" spans="4:4"><br /> <c r="D3"><br /> <v>12345.123449999999</v><br /> </c><br /> </row><br /> <row r="4" spans="4:4"><br /> <c r="D4"><br /> <v>1234.1233999999999</v><br /> </c><br /> </row><br /> <row r="5" spans="4:4"><br /> <c r="D5"><br /> <v>123.123</v><br /> </c><br /> </row><br /> <row r="6" spans="4:4"><br /> <c r="D6"><br /> <v>12.12</v><br /> </c><br /> </row><br /></sheetData> </pre><br /> <P>The problem is that Excel 2007 does not store what we entered. If we read the XML, we are going to grab numbers that have rounding errors compared to the actual numbers we typed. Let's see how far the problem goes :</P><TABLE id="Table1" cellSpacing="1" cellPadding="1" width="300" border="1"><TR><TD><STRONG>Entered value</STRONG></TD><TD><STRONG>Stored value</STRONG></TD><TD><STRONG>Rounding error</STRONG></TD></TR><TR><TD>123456.123456</TD><TD>123456.123456</TD><TD>0</TD></TR><TR><TD>12345.12345</TD><TD>12345.123449999999</TD><TD>o(1e-5)</TD></TR><br /><TR><TD>1234.1234</TD><TD>1234.1233999999999</TD><TD>o(1e-4)</TD></TR><TR><TD>123.123</TD><TD>123.123</TD><TD>0</TD></TR><TR><TD>12.12</TD><TD>12.12</TD><br /><TD>0</TD></TR></TABLE><P>Not only there is a rounding error, but its order of magnitude changes depending on the value. Ironically enough, if you entered 4321.4321, it would be stored as is, with no rounding error.</P><br /> <P>It is absolutely lost on me how implementers are expected to deal with this mess. The spreadsheet does not reflect the proper values, and you can easily see where it goes. Imagine non-Microsoft applications used in healthcare and critical systems relying on the spreadsheet data. Not only the rounding error seems arbitrary (one would have to go back and study the artefacts of IEEE floating-point values, several decades of work), but it changes. There is no way we can possibly take advantage of this, with one notable exception : if we are able to be in an execution environment for which reading those floating-point values does not produce those artefacts, and returns the proper entered values, then we are good. Problem : Microsoft does not document the execution environment. We can fairly assume its Windows, but what else? And if I am using Linux, how do I work with this?</P><br /> <P>It's important to understand that if we open the spreadsheet in Excel 2007, we see the proper values. No loss (based on the values entered) seem to have occured, the problem is that the data in XML just cannot be used as is.</P><br /> <P>As an aside, the stored value does not use the locale (it always uses the dot as decimal separator), therefore we have to assume this is all US English. If we wrote software in Excel VBA that grabs the value in cells, then processes it, there is no way we could migrate our VBA code to work with this XML part without substantial rework. We are left with Excel's own international implementation artefacts, undocumented.</P><br /> <P> </P><br /> <H2>3) Optimization artefacts become a feature instead of an embarrasment</H2><br /> <P>Historically, the BIFF file format used in Excel spreadsheets was designed to be small and fast. But this design decision goes all way back to early 90s, when the Pentium CPU did not exist yet. Regular desktop computers we use everyday are at least several orders of magnitude faster and memory-friendly that those early computers were. Yet, Microsoft chose to keep those optimization artefacts as is, with the side effect that they are now exposed to the surface as part of the XML.</P><br /> <P>Among interesting optimizations is Excel insistence in trying to factor formulas as much as possible. This happens with Excel when you create a formula, then drag the cell to replicate it in other cells. That's shared formulas. Excel chooses to declare the formula itself only once, and then creates a mechanism to infer the formula in other cells (the relative <br />position counts as an offset).</P><br /> <P>Shared formulas are supposed to be transparent for developers. Is it true?</P><br /> <P>To reproduce the scenario :</P><br /> <UL><LI>start Excel 2007 and create a new spreadsheet<br /> <LI>enter value 2 in cell C4<br /> <LI>click on cell C4, click on the bottom-right corner, and drag cell C4 down to C10 to replicate the content<LI>enter value 3 in cell D4<br /> <LI>click on cell D4, click on the bottom-right corner, and drag cell D4 down to D10 to replicate the content<br /> <LI>enter formula =C4-D4 in cell E4<br /> <LI>click on cell E4, click on the bottom-right corner, and drag cell E4 down to E10 to replicate the formula<br /> <LI>save the spreadsheet (xlsx file)<br /> <LI>close it, unzip it</UL><br /> <P>Here is a screenshot of what you should see at this point :</P><br /> <img src="http://www.codeproject.com/KB/office/ooxml_is_defective/DefectiveByDesign4.gif"><br><br /> <i>Typing a few numbers and formulas in Excel 2007</i><br><br /> <P>The corresponding XML in the main part <br /> xl/worksheets/sheet1.xml is :</P><br /> <pre><br /><sheetData><br /> <row r="4" spans="3:5"><br /> <c r="C4"><br /> <v>2</v><br /> </c><br /> <c r="D4"><br /> <v>3</v><br /> </c><br /> <c r="E4" s="1"><br /> <f>C4-D4</f><br /> <v>-1</v><br /> </c><br /> </row><br /> <row r="5" spans="3:5"><br /> <c r="C5"><br /> <v>2</v><br /> </c><br /> <c r="D5"><br /> <v>3</v><br /> </c><br /> <c r="E5" s="1"><br /> <f t="shared" ref="E5:E10" si="0">C5-D5</f><br /> <v>-1</v><br /> </c><br /> </row><br /> <row r="6" spans="3:5"><br /> <c r="C6"><br /> <v>2</v><br /> </c><br /> <c r="D6"><br /> <v>3</v><br /> </c><br /> <c r="E6" s="1"><br /> <f t="shared" si="0"/><br /> <v>-1</v><br /> </c><br /> </row><br /> <row r="7" spans="3:5"><br /> <c r="C7"><br /> <v>2</v><br /> </c><br /> <c r="D7"><br /> <v>3</v><br /> </c><br /> <c r="E7" s="1"><br /> <f t="shared" si="0"/><br /> <v>-1</v><br /> </c><br /> </row><br /> <row r="8" spans="3:5"><br /> <c r="C8"><br /> <v>2</v><br /> </c><br /> <c r="D8"><br /> <v>3</v><br /> </c><br /> <c r="E8" s="1"><br /> <f t="shared" si="0"/><br /> <v>-1</v><br /> </c><br /> </row><br /> <row r="9" spans="3:5"><br /> <c r="C9"><br /> <v>2</v><br /> </c><br /> <c r="D9"><br /> <v>3</v><br /> </c><br /> <c r="E9" s="1"><br /> <f t="shared" si="0"/><br /> <v>-1</v><br /> </c><br /> </row><br /> <row r="10" spans="3:5"><br /> <c r="C10"><br /> <v>2</v><br /> </c><br /> <c r="D10"><br /> <v>3</v><br /> </c><br /> <c r="E10" s="1"><br /> <f t="shared" si="0"/><br /> <v>-1</v><br /> </c><br /> </row><br /></sheetData> <br /></pre><br /> <P>In cell E5, we see a "ref "attribute where the shared formula range applies, a "si" attribute which identifies the shared formula range (itself redundant with the "ref" attribute), and the actual formula for that cell. But in cell E6, the cell below E5 in the grid, we see a definition with just a "si" attribute. In other words, cell E6 is linked to cell E5.</P><br /> <P>Here is the problem, let's say we make a manual change to cell E5 and remove the formula. We've seen in the first section of this article that the calculation chain is left unsynched, but an additional problem is that cell E6 is also left unsynched because its "si" attribute now points to nowhere. Note that the situation isn't any better if we merely update the formula, by doing so cell E5 is fine, but linked cells will reference the new formula, not the old one. So a simple change in cell E5 actually spreads unintentionally. It goes without saying that it's very expensive to make a simple change.</P><br /> <P>That is a direct result of the optimization artefact. Someone willing to make a change cannot proceed without taking care of depending cells if the cell defines a shared formula range. The problem gets only bigger since we have to either remove the shared formula altogether in all cells, or translate the shared formula definition accordingly, which implies <br />parsing the formula (the goal was only make a change in a cell!) and making a number of offset changes, many of which are left for the user to discover (formula tokens are complex). To remove the shared formula, the actual formula definitions in linked cells have to be built, and that's where an algorithm has to find a way to create these, essentially rolling back Excel's optimization as a preliminary step.</P><br /> <P>We have a case where an optimization artefact becoming an embarassment, not a feature.</P><br /> <p>Let's see how the situation compares to the old Excel binary file format (BIFF internal format is stored inside an OLE container). Here are the corresponding BIFF records :</p><br /> <pre><br /><font color=green>// BIFF : a shared formula, and cells linked to the shared formula</font><br />[SHRFMLA 0015] 03 00 08 00 03 03 00 06 0B 00 4C 00 00 FE C0 4C 00 00 FF C0 04<br />[FORMULA 001B] 04 00 03 00 3E 00 00 00 00 00 00 00 00 40 08 00 05 00 03 FE 05 00 01 03 00 03 00<br />[FORMULA 001B] 05 00 03 00 3E 00 00 00 00 00 00 00 00 00 08 00 06 00 03 FF 05 00 01 03 00 03 00<br />[FORMULA 001B] 06 00 03 00 3E 00 00 00 00 00 00 00 F0 BF 08 00 07 00 03 FF 05 00 01 03 00 03 00<br />[FORMULA 001B] 07 00 03 00 0F 00 00 00 00 00 00 00 F0 3F 08 00 08 00 03 FF 05 00 01 03 00 03 00<br />[FORMULA 001B] 08 00 03 00 0F 00 00 00 00 00 00 00 00 C0 08 00 03 00 03 FF 05 00 01 03 00 03 00<br /><br /><font color=green>// BIFF : same than above, made understandable</font><br />[SHRFMLA 0015] (ref = 03 00 08 00 03 03) (formula = [currentrow][currentcolumn-2]-[currentrow][currentcolumn-1])<br />[FORMULA 001B] (cell row = 5 col = E) (formula is a link to the shared formula)<br />[FORMULA 001B] (cell row = 6 col = E) (formula is a link to the shared formula)<br />[FORMULA 001B] (cell row = 7 col = E) (formula is a link to the shared formula)<br />[FORMULA 001B] (cell row = 8 col = E) (formula is a link to the shared formula)<br />[FORMULA 001B] (cell row = 9 col = E) (formula is a link to the shared formula)<br /> </pre><br /> <p>With the Excel binary format, the design is right. The shared formula is defined outside a cell, so you can very simply remove the formula in cell E5 without breaking other cells. From a programming perspective, <b>the new XML format qualifies as a regression compared to the binary file format</b>.</p><br /> <P> </P><br /> <H2>4) VML isn't XML</H2><br /> <P>Contrary to what the ECMA 376 documentation says in many places, VML drawing parts are not deprecated at all. VML is in fact very pervasive in Word, Excel and Powerpoint documents, so it's even more a blatant problem.</P><br /><p>In the ECMA 376 documentation, part 4, page 4343, we learn : (emphasis mine)</p><br /><pre style="background-color:#EEEEEE"><br />[Note: The VML format is a legacy format originally introduced with Office 2000 <br />and is included and fully <b>defined in this Standard for backwards compatibility <br />reasons</b>. The DrawingML format is a newer and richer format created with the <br />goal of eventually replacing any uses of VML in the Office Open XML formats. VML <br />should be considered a deprecated format included in Office Open XML for legacy <br />reasons only and new applications that need a file format for drawings are strongly <br />encouraged to use preferentially DrawingML .end note]<br /></pre><br /> <P>Here is a way to create a VML part in a <b>new</b> document :</P><br /> <UL><LI>start Excel 2007 and create a new spreadsheet<br /> <LI>right-click and choose Insert Comment<br /> <LI>enter a comment<br /> <LI>save the spreadsheet (xlsx file)<br /> <LI>close it, unzip it</UL><br /> <P>The corresponding XML in the drawing part <br /> xl/drawings/vmlDrawing1.vml is :</P><br /> <pre><br /><?xml version="1.0" encoding="UTF-8" standalone="yes"?><br /><xml xmlns:v="urn:schemas-microsoft-com:vml" <br /> xmlns:o="urn:schemas-microsoft-com:office:office" <br /> xmlns:x="urn:schemas-microsoft-com:office:excel"><br /> <o:shapelayout v:ext="edit"><br /> <o:idmap v:ext="edit" data="1"/><br /> </o:shapelayout><br /> <v:shapetype id="_x0000_t202" coordsize="21600,21600" o:spt="202" path="m,l,21600r21600,l21600,xe"><br /> <v:stroke joinstyle="miter"/><br /> <v:path gradientshapeok="t" o:connecttype="rect"/><br /> </v:shapetype><br /> <v:shape id="_x0000_s1025" type="#_x0000_t202" style="position:absolute;<br />margin-left:203.25pt;margin-top:37.5pt;width:96pt;height:55.5pt;z-index:1;<br />visibility:hidden" fillcolor="#ffffe1" o:insetmode="auto"><br /> <v:fill color2="#ffffe1"/><br /> <v:shadow on="t" color="black" obscured="t"/><br /> <v:path o:connecttype="none"/><br /> <v:textbox style="mso-direction-alt:auto"><br /> <div style="text-align:left"/><br /> </v:textbox><br /> <x:ClientData ObjectType="Note"><br /> <x:MoveWithCells/><br /> <x:SizeWithCells/><br /> <x:Anchor><br /> 4, 15, 2, 10, 6, 15, 6, 4</x:Anchor><br /> <x:AutoFill>False</x:AutoFill><br /> <x:Row>3</x:Row><br /> <x:Column>3</x:Column><br /> </x:ClientData><br /> </v:shape><br /></xml></pre><br /> <P>From a pure markup perspective, it is XML, but there are application-encoded values such as <code>="m,l,21600r21600,l21600,xe"</code> and <code>4, 15, 2, 10, 6, 15, 6, 4</code> which contradict proper XML design, ie in a way that it is both poor XML, and cannot be used by XSLT transforms. VML allows far more complicated values in the general case, including conditionals expressed in their own language.</p><br /><pre style="background-color:#EEEEEE"><br /><v:shape style='top: 0; left: 0; width: 250; height: 250'<br /> stroke="true" strokecolor="red" strokeweight="2" fill="true"<br /> fillcolor="green" coordorigin="0 0" coordsize="175 175"><br /><v:path v="m 8,65<br /> l 72,65,92,11,112,65,174,65,122,100,142,155,92,121,42,155,60,100<br /> x e"/><br /><formulas><br /> <f eqn="sum #0 0 10800"/><br /> <f eqn="prod #0 2 1"/><br /> <f eqn="sum 21600 0 @1"/><br /> <f eqn="sum 0 0 @2"/><br /> <f eqn="sum 21600 0 @3"/><br /> <f eqn="if @0 @3 0"/><br /> <f eqn="if @0 21600 @1"/><br /> <f eqn="if @0 0 @2"/><br /> <f eqn="if @0 @4 21600"/><br /> <f eqn="mid @5 @6"/><br /> <f eqn="mid @8 @5"/><br /> <f eqn="mid @7 @8"/><br /> <f eqn="mid @6 @7"/><br /> <f eqn="sum @6 0 @5"/><br /></formulas><br /></v:shape><br /></pre><br /><p>If that is XML, then I propose writing C code the following way :</p> <br /><pre style="background-color:#EEEEEE"><br /><v:shape style='top: 0; left: 0; width: 250; height: 250'<br /> stroke="true" strokecolor="red" strokeweight="2" fill="true"<br /> fillcolor="green" coordorigin="0 0" coordsize="175 175"><br /><v:path v="m 8,65<br /> l 72,65,92,11,112,65,174,65,122,100,142,155,92,121,42,155,60,100<br /> x e"/><br /><formulas><br /> <f eqn="<br />int main(int argc, char** argv)<br />{<br /> printf("Hello World\n");<br /> return 0;<br />}<br /> "/><br /></formulas><br /></v:shape><br /></pre><br /><p>This is XML, right? There are angle brackets, it conforms to the XML W3C recommendation, therefore it's XML.</p><br /><p>VML also contains <b>application-specific</b> markup, with no documentation associated to it, for instance in the example above, <code>4, 15, 2, 10, 6, 15, 6, 4</code>. Because only Microsoft used this markup, there was no need to define it in its own namespace and so on. But now that VML is part of ECMA 376, either ISO accepts a vendor-specific markup, which defies the point of ISO standards in the first place, or it just contradicts ISO standards.</p><br /> <P>The implication for an implementer, or for someone willing to make a change is that there is no way someone can possibly edit this thing without a proper implementation of the VML library itself. The risk of corruption is extremely high. Obviously, Microsoft expects that VML parts are replaced by other VML parts as a whole, without a finer granularity. The problem is that VML too can contain references to objects and other parts, so that contradicts even a simple <EM>template</EM> scenario.</P><br /> <P>VML is an old, undocumented, library that speaks volume of past Microsoft lock-in strategy. Mr Bill Gates in person sent in 1998 a <a href="http://antitrust.slated.org/www.iowaconsumercase.org/011607/2000/PX02991.pdf">memo</a> to the Office product group (led by Steven Sinofsky at the time), memo undisclosed to the public thanks to the IOWA consumer case :</P><br /> <P><TABLE id="Table2" cellSpacing="1" cellPadding="1" border="1"><br /><TR><TD valign="top" style="background-color:#EEEEEE"><br /> <P>From: Bill Gates<BR><br /> Sent: Saturday, December 5 1998<BR><br /> To: Bob Muglia, Jon DeVann, Steven Sinofsky<BR><br /> Subject : Office rendering</P><br /> <P><EM>One thing we have got to change in our strategy - allowing Office documents to be rendered very well by other peoples browsers is one of <br />the most destructive things we could do to the company.</EM></P><br /> <P><EM>We have to stop putting any effort into this and make sure that Office documents very well depends on PROPRIETARY IE capabilities.</EM></P><br /> <P><EM>Anything else is suicide for our platform. This is a case where Office has to avoid doing something to destroy Windows.</EM></P><br /> <P><EM>I would be glad to explain at a greater length.</EM></P><br /> <P><EM>Likewise this love of DAV in Office/Exchange is a huge problem. I would also like to make sure people understand this as well.</EM></P><br /> </TD><br /> </TR><br /> </TABLE><br /> </P><br /> <P>The undocumented VML library shipped in Internet Explorer 5 in 2000, and has been part of Internet Explorer ever since. The DAV protocol (Distributed Authoring and Versioning) is an international cross-platform standard, open to everybody. God only knows why Bill Gates likes so much VML, and dislikes so much DAV...</P><br /> <P>For the record, the ECMA 376 documentation describes the VML markup, but it does not specify it. Much of application-defined behaviors are left for one to guess.</P><br /> <P> </P><br /> <H2>5) Open packaging parts minefield</H2><br /> <P>The underlying architecture of how zip entries relate together is called by Microsoft "open packaging conventions". What it means is that zip entries are not independent, or even related by way of a single master zip entry which would work as a directory of all zip entries of relevance. There is a logical tree of entries which uses separate zip entries to define relations between zip entries. The logical tree has nothing to do with the physical tree of zip entries in a package, despite Microsoft continuously using screenshots of Windows XP's built-in ZIP folders to mimic a folder hierarchy.</P><br /> <P>The problem with such an architecture is that a part may or may not relate to another and there is no standard way to know. Often, there is a <STRONG>r:id</STRONG> attribute right in the content of some XML part that tells the application that there is a relation, but this is not standard. By the way, Microsoft's PDF fixed format competitor called XPS is also based on the same underlying architecture, except that the team who developed XPS did not quite want to play by the same rules than the Office team. For instance the XPS main zip document entry related to one or more XPS pages with an attribute such as : <STRONG>Source="Pages/1.fpage"</STRONG>. In other words, they are not using the <STRONG>r:id</STRONG> attribute, instead relying on their own mechanism. This makes it impossible for a generic library to know which part relates to which part, and it has an unfortunate consequence.</P><br /> <P>The unfortunate consequence is being unable to know whether a part relates or not to another part makes it impossible to know, when you delete a part, if you are going to corrupt the document or not. The document becomes corrupt if it points or relates (implicitely or explicitely) to a missing part. It's unclear why Microsoft chose this way of doing things, obviously leading to an internal chaos, instead of just copying the research from the OpenOffice project, where a central directory is used (OpenOffice ZIP initiative predates Microsoft's by at least three years, despite Microsoft stealing the thunder).</P><br /> <P>When you don't know the dependencies of a part, the consequence is obvious, you leave those parts alone. If you do this enough times, it clutters up the package, and soon enough you end up with a package containing any number of parts god only knows why they are there. Add to this you can add a part of any content type (arbitrary MIME type), and you have a recipe for disaster. Among other things, virus could proliferate.</P><br /> <P>Microsoft's deletePart() function which is available in their System.IO.Packaging library (itself part of .NET), does not solve this problem. We have a case of poor engineering, creating unnecessary problems for others to worry about.</P><br /> <P> </P><br /> <H2>6) International, but US English first and foremost</H2><br /> <P>An important ongoing tension with Office documents is the support for locales. Microsoft historically used a number of mechanisms to address this need, but they kept evolving and Microsoft aggregated all mechanisms to keep compatibility with older versions. What was hidden is being surfaced with the new XML. Anything that gets displayed, calculated, rendered or stored depends one way or another on an complex and undocumented combination of locale settings including : the Office application language, the Office application language settings (per application), the Office document language settings (per document), the system locale of the operating system.</P><br /> <P>To save them time, Microsoft chose to store XML using the US English locale regardless of all settings above.</P><br /> <P>This has an unfortunate consequence for implementers or those willing to make a manual change. Indeed, Microsoft is imposing everybody else to adapt to US English locale options (separators, date formats, formula conventions, ...) despite the fact that when using Office interactively, this fact is hidden to the user. The Office application infrastructure manages to abstract it away from users, which is a good thing. <b>Office developers using VBA all over the world are used to work with localized functions, the complexity is hidden to them</b>. But since the XML resurfaces this US English locale, all the complexity is left for one to implement. We are talking two decade worth of internationalization issues, for Office-related locale issues and Windows-related locale issues. To get an idea of how bad the situation is, suffice to say that a Microsoft employee part of the internationalization team in Windows has a <a href="http://blogs.msdn.com/michkap/default.aspx">blog</a> where he posts daily horror stories.</P><br /> <P>Also, for Excel formulas, it means the formula names are US English formula names, which you'll never see in Excel if you are using a locale version such as French or Brazilian. It's left for one to guess how to map function names one way to another, and of course the ECMA 376 documentation does not provide those localized formula names. If you intend not to implement a mapping to a locale, ideally your customer's locale, it implies you are willing to work with US English function names (plus US English separators, ...). <B>If your company has invested in libraries or developed libraries in-house, they cannot be used anymore.</B></P><br /> <P>Can it get any worse than that?</P><br /> <P>Unfortunately yes.</P><br /> <P>Despite Microsoft insistence to store everything using the US English locale, they still manage to store a number of contradicting country/encoding flags in the XML. Examples of that are DrawingML and VML languages. They store encoding tags for storing text chunks, but text chunks in document itself does not use any such encoding tag. It is in fact entirely possible that DrawingML and VML are implementations which involve nothing localized itself but which store localized tags in the document, while the rest of languages (WordML, SpreadsheetML, ...) are implemented otherwise : their implementation is chockful of encoding settings, but they need not store anything in the document itself. In other words, everything gets localized at run-time with WordML, SpreadsheetML, ... except DrawingML and VML.</P><br /><P>It's clear at this point that the legacy shows...One would have expected Microsoft to fix this once for all, provide a consistent framework. They chose not do so, and as a result, it's left for any implementer or someone willing to make a change to do the heavy lifting. What we are talking about here is entire internationalization implementation stacks which can represent years of work and stabilization. Ironically enough, you will not only have to implement this stuff (reverse engineering since it is not addressed by the ECMA 376 documentation), you will have to implement in a way that reproduces current Office flaws. No matter how correct your implementation is, you have to retrofit it to work just like Office does.</P><br /><p>To get a flavor of non-US English within US-English (thereby violating ECMA 376's own rules), all you have to do is to insert a chart :</p><br /><ul><li>start Excel 2007 and create a new spreadsheet<li>enter 10, hit tab, 20, hit tab, 30<li>select those 3 cells and insert a chart<li>insert a chart title<li>edit the chart title and give it whatever name you fancy<li>save the spreadsheet (xlsx file)<li>close it, unzip it</ul><p>Here is an excerpt of the chart part xl/charts/chart1.xml :</p><br /><pre style="background-color:#EEEEEE"><br /><c:chartSpace xmlns:c="http://schemas.openxmlformats.org/drawingml/2006/chart"<br /> xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"<br /> xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"> <br /> <c:lang <b>val="fr-FR"</b>/><br /> <c:chart><br /> <c:title><br /> <c:tx><br /> <c:rich><br /> <a:bodyPr/><br /> <a:lstStyle/><br /> <a:p><br /> <a:pPr><br /> <a:defRPr/><br /> </a:pPr><br /> <a:r><br /> <a:rPr <b>lang="fr-FR"</b>/><br /> <a:t>Some title</a:t><br /> </a:r> <br />...<br /></pre><br /><p>A reader of the article also mentions that within a strict US-English stream, you can get localized text formatting. Here is how :</p><br /><ul><li>start Excel 2007 and create a new spreadsheet<li>click on the Insert ribbon tab and click on Header and Footer<li>type some text in the edit area<li>select some of this text, click on the Home ribbon tab, and put this selection on bold and italic<li>save the spreadsheet (xlsx file)<li>close it, unzip it</ul><p>I am using a French version of Excel 2007. Here is an excerpt of the chart part xl/worksheets/sheet1.xml :</p><br /><pre style="background-color:#EEEEEE"><br /><headerFooter><br /> <oddHeader>&Ctrt&"-,<b>Gras Italique</b>"uiy tuieyrtui</oddHeader><br /></headerFooter> <br /></pre><br /><p><b>Gras</b> and <b>Italique</b> are French for <b>Bold</b> and <b>Italic</b>. Just because I am using a French version of Excel 2007, the format produced inserts French localized fragments, therefore anyone willing to read Excel 2007 files in the most general case must be ready to parse non-US English. The ECMA 376 documentation says, in section 3.3.1.36, Part 4, page 1965 (emphasis mine) :</p><br /><pre style="background-color:#EEEEEE"><br />&"font name,font type" - code for "text font name" and "text font type", where <br />font name and font type are strings specifying the name and type of the font, separated <br />by a comma. When a hyphen appears in font name, it means "none specified". Both of font <br />name and font type <b>can be localized values</b>.</pre><br /><p>What <b>can be</b> is supposed to mean? Who reviewed this documentation? And where are the localized values to expect?</p><br /><p>Reading the documentation I have the feeling that :</p><br /><ul><li><b>can be</b> is short for <i>it's complicated, we couldn't get this written on paper in a small space</i><li>the localized names are not provided, so even if I try to do the heavy lifting myself despite the fact that SpreadsheetML violates its own rules (it's supposed to be encoded using US English), where can I get the material to write this library?</ul><br /> <P> </P><br /> <H2>7) Many ways to get in trouble</H2><br /> <P>The extensiveness of the ECMA 376 documentation, over 6000 pages, is telling how much legacy Microsoft is willing to bring into the future. Taking an example of such legacy clarifies what it takes to implement even a portion of the documentation. The example is <EM>text formatting</EM>. Any of the 3 applications, Word, Excel and Powerpoint uses its own text formatting markup. Worse, the shared libraries themselves (VML, DrawingML, MathML, ...) also use separate text formattings, each different. Even worse, if that's possible, Word has many own ways to do text formatting. Excel has many own ways to do text formatting. Powerpoint has many own ways to do text formatting.</P><br /> <P>By "many ways" is meant different markup, sometimes drastically different : in one you could have no country/encoding at all, and in another it's cluttered up with country/encoding markup.</P><br /> <P>If Microsoft were to design a general purpose Office document model (note : ECMA 376 is a description of one specific Office document : Microsoft's), they would have factorized all of this into a single text formatting markup. God only knows why they chose not to do so, keep all the legacy, and try to get away with this mess by making as little publicity as <br />possible about it.</P><br /> <P>Now enter the implementer, or someone willing to make a change to a document. There are three scenarios :</P><br /> <UL><LI>write a document<LI>read a document<LI>read and write the document</UL><br /> <P>The third scenario is just a combination of the others, so there is nothing interesting to say about it.</P><br /> <P>The first scenario is the most simple. To write a document, of a given type, including a given set of objects (from shared languages or not), you only need to write the document in a way that is compatible with the expected XML. In other words, you can use only one text formatting markup model. It's you who decides which one, whether you implement one or more, and so on. So from a writer perspective, you don't suffer the problem very much.</P><br /> <P>Now consider the second scenario. To read a document, you cannot assume what's in that document, therefore you've got to implement all possible combinations of objects that may be part of the document. In particular, you've got to implement all ways to get text formatting markup models because that may well be the XML you face. This is a horrible scenario. To support this scenario, either you are Microsoft, or you have a number of years of work ahead on the subject with plenty of implementation done already. There is no way around, the barrier to entry to this scenario is sky high.</P><br /> <P>Of course, if you read a document, read the markup, and do nothing with it, or nothing of substance with it, it's not quite the same problem. But then, remember that even reading a small chunk of markup can be complicated because of the implicit semantics. You don't need a lot of XML markup to find yourself unable to process it in any meaningful way.</P><br /> <P>To give you an example of how bad the situation is, here is 4 different Excel text formatting markup chunks, all meant to do the same thing (not entirely accurate here, but you get the idea) :</P><br /> <P>1) regular cell formatting</P><br /> <p><xf numFmtId="0" fontId="2" fillId="0" borderId="0" xfId="0" <br /> applyFont="1"/><font><sz val="11"/><color theme="6" <br /> tint="-0.249977111117893"/><name val="Calibri"/><family <br /> val="2"/><scheme val="minor"/></font></p><br /> <p>2) shared-string cell formatting (note that the shared-string is a technical <br /> artefact surfacing as everyone's problem now)</p><br /> <p><br /> <r><rPr><sz val="11"/><color rgb="FFFF0000"/><rFont <br /> val="Calibri"/><family val="2"/><scheme <br /> val="minor"/></rPr><t>ruir</t></r></p><br /> <p>3) cell formatting in a conditional alert (note: the conditional alert itself is <br /> declared elsewhere)</p><br /> <p><dxf><font><b/><i val="0"/></font><numFmt <br /> numFmtId="2" formatCode="0.00"/><fill><patternFill <br /> patternType="solid"><fgColor auto="1"/><bgColor <br /> rgb="FFFFFFFF"/></patternFill></fill></dxf></p><br /> <p>4) text formatting in charts</p><br /> <p><br /> <c:rich><a:bodyPr/><a:lstStyle/><a:p><a:pPr><a:defRPr/></a:pPr><a:r><a:rPr <br /> lang="en-US"/><a:t>t t</a:t></a:r><a:r><a:rPr <br /> lang="en-US" sz="1850" u="dash" <br /> baseline="0"><a:solidFill><a:schemeClr <br /> val="accent4"><a:lumMod val="60000"/><a:lumOff <br /> val="40000"/></a:schemeClr></a:solidFill><a:uFill><a:solidFill><a:schemeClr <br /> val="accent1"><a:lumMod val="60000"/><a:lumOff <br /> val="40000"/></a:schemeClr></a:solidFill></a:uFill></a:rPr><a:t>ruiry</a:t></a:r><a:r><a:rPr <br /> lang="en-US"/><a:t>t <br /> gfgfgfg</a:t></a:r></a:p></c:rich></p><br /> <P> </P><br /><p>The beauty about a file format that is impossibly hard to read and update, and is decently easy to write from scratch, is that it fits perfectly in the <i>read-only model</i> that is exactly the reason why Microsoft has a monopoly in Office documents. As a side effect to its proprietary-ness (Office 2007 documents are extensions of ECMA 376 documents), it provides zero interoperability with anything else that exists.</p><br /><p>From a technical marketing perspective, you can always try to start a project that will support ECMA 376, but your chances to complete the project is exactly zero. You can hear about projects starting here and there, you can hear about programs that write documents that can be opened in Office 2007 (note that the only test that can be possibly be made is whether or not Office 2007 opens it, regardless of the flaws of said program, regardless of the impedance mismatch between Office 2007 documents and ECMA 376 documents), but that is not evidence of progress in interoperability across applications and platforms, contrary to what the claims are.</p><br /><p>For instance, to this date there isn't a computer program that perfectly mimics any of the old Microsoft Office versions 97-2000-XP-2003. It just does not exist. In addition to the impossibility of perfectly mimicing a complex and proprietary software, third parties that would go too far supporting the binary file formats would face the wrath of violating Microsoft intellectual property. Example : VBA macros. With Office 2007, Microsoft is bringing all of this forward for backwards compatibility reasons, therefore "opening up" changes nothing. What was proprietary is still proprietary, and barely referenced in ECMA 376.</p><br /><p> </p><br /> <H2>8) Windows dates</H2><br /> <P>Microsoft uses all kinds of date types, not just from one Office application to next, but even at a library level, there are differences. Despite storing document metatadata properties of type date in an ISO compatible format (metadata properties are for instance the creation date of the document), Microsoft insists on using their legacy date types elsewhere. Unfortunately, this has consequences.</P><br /> <P>It's well documented elsewhere that the date type Microsoft uses for spreadsheets is basically flawed for a number of reasons. But what isn't often said is that the date type is actually the OLE date type. Here are the corresponding Windows OLE API functions :</P><br /> <UL><LI><a href="http://msdn2.microsoft.com/en-us/library/ms221440.aspx">VariantTimeToSystemTime</a><br /> <LI><a href="http://msdn2.microsoft.com/en-us/library/ms221646.aspx">SystemTimeToVariantTime</a><br /> </UL><br /> <P>Here is an excerpt of the VariantTimeToSystemTime function documentation, this may sound familiar if you have ever worked with Excel dates.</P><br /> <P><TABLE id="Table3" cellSpacing="1" cellPadding="1" border="1"><TR><br /> <TD valign="top" style="background-color:#EEEEEE"><br /> <P>Converts the variant representation of time to system time values.</P><br /> <PRE>INT VariantTimeToSystemTime( <br /> double vtime, <br /> LPSYSTEMTIME lpSystemTime <br />);</PRE><br /> <P>A variant time is stored as an 8-byte real value (<B>double</B>), representing a date between January 1, 100 and December 31, 9999, inclusive. The value 2.0 represents January 1, 1900; 3.0 represents January 2, 1900, and so on. Adding 1 to the value increments the date by a day. The fractional part of the value represents the time of day. Therefore, 2.5 represents noon on January 1, 1900; 3.25 represents 6:00 A.M. on January 2, 1900, and so on. Negative numbers represent the dates prior to December 30, 1899.</P><br /> <P>Using the SYSTEMTIME structure is useful because: </P> <UL> <LI>It spans all time/date periods. MS-DOS date/time is limited to representing only those dates between 1/1/1980 and 12/31/2107.<LI>The date/time elements are all easily accessible without needing to do any bit decoding.<LI>The National Language Support data and time formatting functions <B>GetDateFormat </B>and <B>GetTimeFormat</B> take a SYSTEMTIME value as input.<LI>It is the default Win32 time and date data format supported by Windows NT and Windows 95.</UL><br /> <P>The <B>VariantTimeToSystemTime</B> function will accept invalid dates and try to fix them when resolving to a VARIANT time. For example, an invalid date such as 2/29/2001 will resolve to 3/1/2001. Only days are fixed, so invalid month values result in an error being returned. Days are checked to be between 1 and 31. Negative days and days greater than 31 results in an error. A day less than 31 but greater than the maximum day in that month has the day promoted to the appropriate day of the next month. A day equal to zero resolves as the last day of the previous month. For example, an invalid dates such as 2/0/2001 will resolve to 1/31/2001.</P><br /> </TD><br /> </TR><br /> </TABLE><br /> </P><br /> <P> </P><br /> <P>As explained, the function makes all kinds of fixup internally, and that's exactly the problem. The date type used in Excel, which is that one, is incompatible with everything else out there, platform-dependent, and undocumented (this documentation describes behaviors, but it does not specify the date type).</P><br /> <p>If Windows dates were replaced by ISO dates, a dependency on Windows would be gone.</p><br /> <P>So, to read a cell containing a date in a spreadsheet and make sense of it implies you are using one of these two OLE API function calls. Even better, there is no cell date type. A cell is either a string, a number, ... but a date is a number with no associated type. In fact, the only way to know the cell contains a date is to lookup and parse the number format that may be associated to that cell. And that's where you enter another black hole.</P><br /> <P>Here is an example of number format :</P><br /> <pre><br /><numFmt numFmtId="171" formatCode="_-* #,##0.00\ _F_-;\-* #,##0.00\ _F_-;_-* "-"??\ _F_-;_-@_-"/></pre><br /> <P> </P><br /> <P>And another :</P><br /> <pre><numFmt numFmtId="171" formatCode="[$-40C][Red][>120]_-* #,##0.00\ _F_-;\-* #,##0.00\ _F_-;_-* "-"??\ _F_-;_-@_-"/></pre><br /> <P> </P><br /> <P>The second one is particularly interesting, it says : use the French locale (40C), if the number is greater than 120, apply red, and do formatting as follow, with plenty of padding/layout special characters (described in the ECMA 376 documentation, but not specified). Implementing those undocumented patterns requires north of 10,000 lines of code, all subject to wild guesses and platform interoperability problems.</P><br /><p>An even more interesting bit is that the XML used in ECMA 376 departs in a number of non-standard ways compared to the XML produced by Excel 2003 (data-only XML). Let's make a comparison.</p><br /><P>To reproduce the scenario :</P><br /><UL><LI>start Excel 2007 and create a new spreadsheet<li>enter <i>11 Jan 2001</i> in a cell (note : translate this date according to your own regional settings)<li>save the spreadsheet (xlsx)<li>save the spreadsheet as Excel 2003-compatible (the exact file type option is <i>XML Spreadsheet 2003 (*.xml)</i>)<li>close it<li>unzip the xlsx package and view the part called xl/worksheets/sheet1.xml<li>open the Excel 2003-compatible XML file</ul><br /><p>Here is the relevant XML you'll see in xl/worksheets/sheet1.xml :</p><br /><pre style="background-color:#EEEEEE"><br /><row r="3" spans="3:3" ht="15" customHeight="1"><br /> <c r="C3" s="1"><br /> <v>36902</v><br /> </c><br /></row> <br /></pre><br /><p>If you are puzzled by what the value <i>36902</i> might be, it's Excel's Windows date encoding! It is not possible to infer the cell contains a date. You have to explore the number format attached to style s="1" which in turn is described in another part of the package, xl/styles.xml :</p><br /><pre style="background-color:#EEEEEE"><br /><cellXfs count="2"><br /> <xf id="0" numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/><br /> <xf id="1" numFmtId="168" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"/><br /></cellXfs><br /></pre><br /><p>In turn, numFmtId is defined in a separate collection of the same part xl/styles.xml :</p><br /><pre style="background-color:#EEEEEE"><br /><numFmts count="1"><br /> <numFmt numFmtId="168" formatCode="\D\a\t\e\ \:\ d\ mmm\ yyyy"/><br /></numFmts> <br /></pre><br /><p>If you parse the proprietary encoding in the number format adequately, you might be able to infer it's a date. To compute the actual date involves more effort, obviously.</p><br /><p>Let's take a look at our Excel-2003 compatible XML :</p><br /><pre style="background-color:#EEEEEE"><br /><Row ss:Index="3" ss:AutoFitHeight="0"><br /> <Cell ss:Index="3" ss:StyleID="s63"><br /> <Data ss:Type="DateTime">2001-01-11T00:00:00.000</Data><br /> </Cell><br /></Row><br /></pre><br /><p>This old Excel 2003 XML is not nearly as bad as the one in ECMA 376. We can infer it's a date thanks to the <i>ss:Type</i> attribute. And, goodness, the date is encoded using <a href="http://en.wikipedia.org/wiki/ISO_8601">ISO 8601</a>, which is definitely a good thing for interoperability purposes.</p><br /><p>The obvious question : why is the new SpreadsheetML using a proprietary date encoding, when Microsoft managed to ship Excel 2003 (this Excel 2003 XML can also be generated with Excel 2007) with dates supporting an international standard encoding, and no need to go into proprietary parsing? Why isn't SpreadsheetML an extension to the old Excel 2003 XML following the same good principles?</p> <br /> <P> </P><br /> <H2>9) All roads lead to Office 2007</H2><br /> <p>In part 2 of ECMA 376, page 96, we learn :</p><br /> <pre style="background-color:#EEEEEE"><br />Table H–1. Package model conformance requirements<br /><br />M1.30 The package implementer shall<br />name relationship parts<br />according to the special<br />relationships part naming<br />convention and require that<br />parts with names that conform<br />to this naming convention have<br />the content type for a<br />Relationships part<br /> </pre><br /><p>In theory, this requirement allows unattended part renaming. But this means in practice that some process in the processing chain may reshuffle an entire package on its own, which may break assumptions from other processes of the processing chain.</p><br /><p>Interestingly enough, Excel 2007 takes it to heart to do this reshuffling. If you create a package with part names and relationships that perfectly conform to <i>open packaging conventions</i> as defined by ECMA 376 part 2, then this by no means provide neutrality throughout the entire processing chain, especially if you open the package in Excel 2007 and save it without making changes.</p><br /> <P>To reproduce the scenario :</P><br /> <UL><LI>start Excel 2007 and create a new spreadsheet<br /> <LI>save the spreadsheet (xlsx file)<br /> <LI>close it<br /> <LI>unzip it<br /> <LI>edit [Content_Types].xml part manually, and replace /xl/worksheets/sheet1.xml by /xl/custom/sheet1.xml<br /> <LI>edit xl/_rels/workbook.xml.rels, and replace worksheets/sheet1.xml by custom/sheet1.xml<LI>grab xl/worksheets/sheet1.xml locally as a file<br /> <LI>delete the worksheets "folder"<br /> <LI>create "folder" xl/custom in the package<br /> <LI>add the worksheet part in "folder" xl/custom<br /> <LI>zip it<br /> <LI>open it in Excel 2007<br /> <LI>click save<br /> <LI>close it<br /> <LI>unzip it</UL><br /><p>What you should see is that our custom structure has been replaced with Excel 2007 hardcoded structure. If a business process assumes the existence of a custom structure, it won't work. Why is Excel 2007 reshuffling the package? It's ours!</p><br /><br /> <P> </P><br /> <H2>10) A world of ZIP+OLE files</H2><br /> <P>To reproduce the scenario :</P><br /> <UL><LI>start Excel 2007 and create a new spreadsheet<br /> <LI>click on the Office button, select the Prepare menu option, and then Properties<br /> <LI>enter metadata such as author, title, ...<br /> <LI>click on the Office button, select the Prepare menu option, and then Encrypt Document<LI>enter a password, re-enter the password<LI>save the spreadsheet (xlsx file)<br /><LI>close it</UL><br /> <p>If you try to unzip it, you'll get an error. When you password-protect any Office 2007 document, it becomes an OLE document. Wait a minute, isn't Microsoft moving to ZIP files ?</p><br /> <P>Here is a screenshot of what you should see in an OLE document viewer :</P><br /> <img src="http://www.codeproject.com/KB/office/ooxml_is_defective/office2007bin9.gif"><br><br /> <i>A password-protected MS Office 2007 document is a...OLE file, not a ZIP file</i><br><br /><br /> <p>If you roll-over your mouse on the file in Windows Explorer (note : behavior not tested on Windows Vista), you should get a flying tool-tip with just minimal information that is provided by the disk file system, but not the document metadata (author, subject, title, keywords, last modification date, ...)</p><br /><br /> <p>The OLE container contains a number of OLE streams. The document metadata is not available for consumption, meaning that you cannot retrieve this information either. This is a regression compared to binary file formats.</p><br /> <p>In fact, there are two regressions : the document type changes ; the metadata is not accessible. And the password-protection mechanism is undocumented as a whole.</p><br /> <p>Let's see how exactly it compares to an old binary password-protected Excel file.</p><br /> <P>To reproduce the scenario :</P><br /> <UL><LI>start Excel 97 or 2000 or XP or 2003 and create a new spreadsheet<br /> <LI>go in File / Properties<br /> <LI>enter metadata such as author, title, ...<br /> <LI>go in Tools / Options, select the Security tab<br /> <LI>enter a password in field <i>Password to open</i> and click OK, re-enter the password<LI>save the spreadsheet (xls file)<LI>close it</UL><br /><br /> <p>If you roll-over your mouse on the file in Windows Explorer, you should get a flying tool-tip with <b>all</b> the information that is provided by the disk file system plus <b>all</b> the document metadata (author, subject, title, keywords, last modification date, ...)</p><br /> <br /> <p>Indeed, if we open the xls file in an OLE document viewer, the file structure is kept intact, only the OLE document stream itself is encrypted (contains the encrypted BIFF content). The stream holding the metadata, <i>ISummaryInformation</i>, is left unencrypted. Here is a screenshot of a password-protected binary Excel file in an OLE document viewer :</p><br /><br /> <img src="http://www.codeproject.com/KB/office/ooxml_is_defective/DefectiveByDesign5.gif"><br><br /> <i>A legacy password-protected MS Office document is still an OLE file</i><br><br /><br /><p>This has unfortunate consequences for Content Management Servers (CMS). By making the metadata of password-protected Office 2007 documents unavailable, existing CMS workflows are arbitrarily broken. By changing the content type of the document, only because it's password-protected, existing CMS workflows have another reason to break.</p><br /><p>Note : if Microsoft Sharepoint server, or Windows Vista, manages to show the metadata of a password-protected Microsoft Office 2007 document, perhaps it knows something that we don't...</p><br /><br /> <P> </P><br /> <H2>11) Document security is a (bad) joke</H2><br /><P>To reproduce the scenario :</P><br /><UL><LI>start Excel 2007 and create a new spreadsheet (or open an existing spreadsheet).<br /><LI>right-click on the first sheet tab, and select Protect Sheet...<LI>enter a password, hit enter, re-enter the password, hit enter.<LI>try editing cells, you should get an error message.<LI>save, close, unzip.<br /> <LI>edit the zip part named xl/worksheets/sheet1.xml<LI>find the XML element called <b>sheetProtection</b><LI>remove the element <b>sheetProtection</b> and the accompanying attributes (<b>password</b>, <b>sheet</b>, ...).<LI>save the changes and put that part back in the zip file.<LI>open the file in Excel 2007.<LI>try editing some cells, <b><font color=red>it works!</font></b></UL><br /><p>Can you think of a worse document security? We have a format that encrypts buffers into OLE containers when it shouldn't, and leaves XML password hashes right in the ZIP file in such a way that anyone can manually edit the file and remove it.</p><p>When you consider the amount of professional spreadsheet workers out there relying on sheet protection to ensure the integrity of their spreadsheets, you have no idea the impact of this flaw when they learn it...</p><p>There goes another regression with those formats. Note : the Excel binary format had a similar PASSWORD record stored on a per worksheet basis, but the big difference is that none of the target audience is able to edit a binary Excel file. It's binary, therefore it's a form of security by obscurity. And since binary files contain pointers to the stream, removing a record using an hexadecimal editor without taking care of the dirty laundry would automatically result in a corrupt spreadsheet.</p><p>I guess it's fair to say it's a case where document security needs a calculated chain to make such manual changes less easy ! (see the first section of that article to get the irony in full perspective).</p><br /><br /> <P> </P><br /> <H2>12) BIFF is gone...not!</H2><br /><p>BIFF (Binary Interchange File Format) is the name of the Excel binary file format. A BIFF buffer is contained within an OLE container, and consists in a sequence of BIFF records. A BIFF record consists in an identifier, a length to follow, and the corresponding buffer of said BIFF record. BIFF itself is platform-neutral, but of course OLE isn't. Anything stored within BIFF records referencing Windows-specific functions (such as the DEVMODE structure of printer settings) voids any platform-neutral claims. Every Excel release has its own BIFF-specific records. Those set of collective records are called BIFF8 for Excel 97, BIFF9 for Excel 2000, BIFF10 for Excel XP and BIFF11 for Excel 2003.</p><br /><p>One of the claims from the introduction is that ECMA 376 documents are moving to ZIP and XML exclusively. This implies that BIFF is gone. Let's see if that's true.</p><br /><p>(emphasis mine)<br /><pre style="background-color:#EEEEEE"><br />From: Doug Mahugh (Microsoft)<br />Sent: August 21, 2006<br />To: Stephane Rodriguez<br /><br />Stephane,<br /><br />I shared the link with some of the people over in the BIFF12 group today, <br />and they liked the article too. They had some additional info for you which <br />may be of interest. These sorts of details <b>will all be documented</b> in <br />the BIFF12 documentation that will come out when Office ships, but for now <br />there's <b>no way</b> you could know what's going on for sure (as you explain <br />in the article)<br /><br />(...)<br /><br />Doug Mahugh<br />Office 2007 Technical Evangelist<br />425-707-1182 | MSDN Blog | OpenXmlDeveloper.org<br /></pre><br /></p><br /><p>The subject of matter? This <a href="http://www.codeproject.com/cs/library/office2007bin.asp">article</a>.</p><br /><p>Ironically enough, to this date, the only article shedding a light on BIFF12 available on the internet is this article. Microsoft made no such document available.</p><br /><p>There is more than meets than eye.</p><br /><p>Besides the point that BIFF is actually not gone at all, an even bigger shocker is that BIFF12 is a departure from BIFF11. It's not an extension of BIFF11 with new BIFF records, it's a complete rewrite of all BIFF records, new BIFF record model and new content encoding. So any person who would have invested in a BIFF library (or grabbed it elsewhere), will have to redo that work from scratch. Where is backwards compatibility gone ?</p><br /><p>According to Microsoft's Excel <a href="http://blogs.msdn.com/excel/archive/2006/07/20/671995.aspx">blog</a> :</p><br /><pre style="background-color:#EEEEEE">File Format Number 2 - Excel Binary (XLSB files)<br />The Excel binary format is the second full fidelity format for Excel 2007. It is<br />similar to the Office Open XML format in structure - a set of related parts, in a <br />zip container - except that instead of each part containing XML, each part contains <br />binary data.<br /><br />Even though we've done a lot of work to make sure that our XML formats open quickly <br />and efficiently, this binary format is still more efficient for Excel to open and <br />save, and can lead to some performance improvements for workbooks that contain a lot <br />of data, or that would require a lot of XML parsing during the Open process.<br /></pre><br /><br /><p>It's hard enough to understand that Microsoft Office is moving to XML, and still away from XML at the same time, but the official justification is performance related. Unfortunately, this justification is just a lie since it is already the justification for the extreme akwardness of the SpreadsheetML XML file format itself. Someone from the Microsoft Office team has <a href="http://blogs.msdn.com/brian_jones/archive/2006/10/26/performance-of-xml-file-formats.aspx">this</a> to say about it :</p><br /><pre style="background-color:#EEEEEE">There are a number of things we looked into doing to help improve the performance of <br />these files (SpreadsheetML) both using current technology as well as thinking about <br />future approaches. Some of the approaches we took that are easiest to understand are:<br /><br />1. Reduce the need to fully parse repeating data - In order to cut back on parsing <br />large strings repeated multiple times, or formulas that are repeated down an entire <br />column, spreadsheetML does a lot of sharing. I already talked about the huge benefits <br />we can get from the shared formulas work in a post last spring. The shared string <br />table can give similar results.<br />2. Tag Size - the size of XML elements actually does directly affect performance times. <br />The more text you have to parse, the longer it will take.<br />3. Splitting the XML into multiple parts - In SpreadsheetML we break the XML content <br />out into multiple parts in the ZIP (each worksheet, the string table, pivot table data, <br />etc.). This can help a ton with on demand loading, or even loading multiple pieces at <br />once. For example, it means that if you had a multi-proc machine, you could load each <br />worksheet on a separate thread. It also means that you could decide to only load on <br />sheet and you wouldn't have to parse through all the XML from the other sheets.<br />4. Relationships stored outside of content - By storing all the relationships from one <br />part to another in separate (and much smaller) files, it makes it really easy to see <br />what other parts you should load when you are loading a particular part of the file. <br />If it weren't for the relationships, you'd actually have to parse through the content <br />of the XML to determine what other resources that part used. For example, if you wanted <br />to just load one slide in a presentation, you can also see what images are used on that <br />slide before you start parsing the XML for the slide. <br /></pre><br /><br /><p>At this point, we are left with the obvious question, if the SpreadsheetML is made much more complex than we would have expected only to cope with performance problems, what is the rationale for the binary workbook (.xlsb) ?</p><br /><p>Microsoft won't tell. There are two reasons however : 1) embarassing reality of XML 2) embarassing reality of ECMA 376</p><br /><br /><p>To illustrate the first untold reality, suffice to create a new spreadsheet, and then query external data. As you do this, Excel 2007 creates a part called the connection data source part, where <b>it stores the connection strings in plain text</b>, among other things. It should be clear by now that connection strings contain sensitive information such as server names, login and passwords. OOps!</p><br /><p>The quick Microsoft solution to this? Security by obscurity, just turn this stuff into binary records (BIFF12), and the problem goes away in theory.</p><br /><p>A similar problem with XML parts is that password hashes are stored in plain-text, as we've seen in the previous section, meaning that armed with a simple text editor, both password hashes and connection string passwords can be edited and/or removed. Vulnerabilities by design?</p><br /><p>The embarassing fact for ECMA 376 is that since this is supposed to be XML only, the new binary workbook, despite being the official answer to the "plain-text" problem, cannot be part of said documentation otherwise automatically violating the claim that those documents are made with XML.</p><br /><p>Last but not least, a pseudo-rebuttal was posted to address the lack of availability of the BIFF12 documentation, a minor problem in comparison to the XML violation rules by the way. This pseudo-rebuttal was explaining that the documentation of binary formats can be freely obtained. Here is the corresponding <a href="http://support.microsoft.com/kb/840817">article</a>. In which we learn :</p><br /><pre style="background-color:#EEEEEE">Microsoft makes its .doc, .xls, and .ppt binary file format specifications <br />available under a royalty-free covenant not to sue to anyone who wishes to <br />implement all or part of these specifications in their products. <br /><b>Implementation includes the ability to use the specification documentation <br />for analysis and forensic reference purposes</b>. Microsoft Office Drawing File <br />Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are <br />also available under this program.<br /></pre><br /><p>No right to create a competing product.</p><br /><p>As for whether Microsoft responds at all to any such documentation request, it remains to be seen. Last but not least, whether the documentations contain material that are required even for just analysis or forensic purposes remains to be seen. Microsoft used to distribute those documentations as part of the MSDN Library, until February 1998 (right when Mr Gates had a pinch for Office documents viewed in Internet Explorer). Those documentations are incomplete, often just descriptive, and full of typos. In fact, just as the ECMA 376 documentation itself.</p><br /><p><B>Now enter BIFF11+</B>. Yes, you heard that right! Microsoft created not just one new BIFF file format, it has also taken the time to extend BIFF11 (i.e. Excel 2003's BIFF file format) to include new BIFF records. What for ?</p><br /><p>The reason is round-tripping of spreadsheets. Imagine you are creating a new Excel 2007 spreadsheet using some of the new features, such as the formatting databar (a bar chart drawn inside cells). Then save this file as a Excel 97-2003 compatible file in order to facilitate the collaboration with others (not using Excel 2007). What you still expect is that this file preserves those features. That's exactly what BIFF11+ does, just that Microsoft hasn't considered documenting those new BIFF records, not even talking about it by the way.</p><br /><P>To reproduce the scenario :</P><br /><UL><LI>start Excel 2007 and create a new spreadsheet (or open an existing spreadsheet).<LI>enter values 10,20,30,40,50 in separate cells, then click on the ribbon on the Style button, Conditional Formatting, Data bars, and then pick a color gradient.<li>save the spreadsheet as a <code>Excel 97-2003 Workbook (*.xls)</code><li>a prompt warns that there is going to be a significant loss. Click Continue.<li>open the .xls file with any older version of Excel (97/2000/XP/2003)<li>the file opens with no prompt, and the databar does not show up (which was expected).<li>enter value 60 next to the other cells, and save the file as a regular .xls file, just like it is.<li>open this file in Excel 2007 : the databar magically shows up. The databar however does not include the value we just entered, meaning that we have created a discrepancy.</ul><br /><p>How can the databar show up at all in Excel 2007 since we went through two independent .xls writing phases (one with Excel 2007, one with Excel 97/2000/XP/2003) ?</p><br /><p>You guessed it, new BIFF11+ records are created and preserved. None of that is documented, meaning that this round-trip scenario only works with Microsoft Office. It works as follows :</p><br /><ul><li>when Excel 2007 creates a Excel 97-2003 workbook (*.xls), it actually creates a BIFF11+ file. Because BIFF11+ is an extension of BIFF11, it can be opened in Excel 97, 2000, XP and 2003.<li>and because how BIFF works (any record in the header and footer of internal sections is preserved as is), it allows to do a save in Excel 97, 2000, XP and 2003 without killing those new records.<li>since Microsoft created such new BIFF records to represent the new features : theme, databars, new visual candy in charts, ... but none of it is documented, it follows that interoperability across those file formats is exclusive to them.<li>BIFF11+ records are not a subset of BIFF12 records. Implementers will have to implement/preserve BIFF11+ records and BIFF12 records.<li>we have a discrepancy in the databar that is left for actual spreadsheet users to worry about. The value 60 is left disconnected to the formatting databar, even in Excel 2007.</ul><br /><p>To close up the discussion about BIFF, let's just take a look at an example of BIFF11+, actually just an excerpt from a file (in red, the new BIFF11+ records) :</p><br /><pre style="background-color:#EEEEEE"><br />[DIMENSIONS 000E] 01 00 00 00 06 00 00 00 01 00 02 00 00 00<br />[ROW 0010] 01 00 01 00 02 00 2C 01 00 00 00 00 00 01 0F 00<br />[ROW 0010] 02 00 01 00 02 00 2C 01 00 00 00 00 00 01 0F 00<br />[ROW 0010] 03 00 01 00 02 00 2C 01 00 00 00 00 00 01 0F 00<br />[ROW 0010] 04 00 01 00 02 00 2C 01 00 00 00 00 00 01 0F 00<br />[ROW 0010] 05 00 01 00 02 00 2C 01 00 00 00 00 00 01 0F 00<br />[FLOAT 000A] 01 00 01 00 0F 00 00 00 24 40<br />[FLOAT 000A] 02 00 01 00 0F 00 00 00 34 40<br />[FLOAT 000A] 03 00 01 00 0F 00 00 00 3E 40<br />[FLOAT 000A] 04 00 01 00 0F 00 00 00 44 40<br />[FLOAT 000A] 05 00 01 00 0F 00 00 00 49 40<br />[DBCELL 000E] AA 00 00 00 50 00 0E 00 0E 00 0E 00 0E 00<br />[WINDOW2 0012] B6 06 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00<br />[SELECTION 000F] 03 01 00 01 00 00 00 01 00 01 00 05 00 01 01<br />[00EF 0006] 00 00 37 00 00 00<br /><font color=red><b>[0867 0017] 67 08 00 00 00 00 00 00 00 00 00 00 02 00 01 FF FF FF FF 03 44 00 00</b></font><br /><font color=red><b>[089C 0026] 9C 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <br />00 00 00 3C 33 00 00 00 00 00 00 00 00</b></font><br /><font color=red><b>[088B 0010] 8B 08 00 00 00 00 00 00 00 00 00 00 00 00 02 00</b></font><br /><font color=red><b>[0879 0022] 79 08 01 00 01 00 05 00 01 00 01 00 01 00 03 00 01 00 05 00 01 00 01 00 01 <br />00 01 00 05 00 01 00 01 00</b></font><br /><font color=red><b>[087A 004C] 7A 08 00 00 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 00 <br />00 01 01 00 03 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 2A 00 00 01 0A 5A 02 00 00 00 FF 55 5A <br />FF 00 00 00 00 00 00 00 00 02 00 00 03 00 00</b></font><br />[EOF 0000] <br /></pre><br /><p>We thought BIFF was gone, and in fact now we have BIFF11+ and BIFF12. Great progress on the interoperability front...</p><br /> <P> </P><br /> <H2>13) Document backwards compatibility subject to neutrino radio-activity</H2><br /> <p>Here is a <a href="http://xlsgen.arstdesign.com/special/Chart97_BreakingChange2007.xls">simple chart</a> created with Excel 2003. Open it with any of Excel 97/2000/XP/2003, then open it in Excel 2007. Microsoft said they could not change the internal structure of Excel spreadsheets (and other Office document types) because they had to provide 100% full fidelity otherwise their customers would not want it. Is this true?</p><br /><table><tr><br /><td><img src="http://www.codeproject.com/KB/office/ooxml_is_defective/ChartExcel2003.jpg"><br><i>A simple chart in Excel 2003</i></td><td><img src="http://www.codeproject.com/KB/office/ooxml_is_defective/ChartExcel2007.jpg"><br><i>Same file opened in Excel 2007</i></td><br /></tr></table><br /><p>The differences are :</p><br /><ul><li>vertical axis all set to automatic scale/min/max.<br /><li>also impacts the number of<br />horizontal gridlines in the background.<br /><li>chart title font not the same weight<br /><li>chart title incorrectly positioned vertically.<br /><li>legend border incorrect.<br /><li>legend entries incorrectly positioned.<br /><li>spacing between the plot area and the legend.<br /></ul><br /><p><sarcasm>Programs used by hundreds of millions people need no special attention, that goes without saying....</sarcasm></p><br /><p>Good luck programming charts. Microsoft provides no documented mapping between the Escher library (pre-Office 2007 era) and the DrawingML+VML libraries (Office 2007 era, before it gets dropped to something else in the future). Is there possibly a reason why?</p><br /><p>Answer : Microsoft has dropped the existing chart drawing engine in favor of a new library. It is impossible in practice to rewrite a library with new source code, <b>a new agenda (visual candy)</b>, and manage to get 100% full fidelity with the past. This results in unproper drawing of <b>existing</b> charts. Which means, in other words :</p><br /><ul><li>Microsoft is lying, the part of ECMA 376 documentation describing charts does not 100% map the old chart drawing spec.<br /><li>The implementation itself is buggy, an automatic consequence of the first point<br /></ul><br /><P> </P><br /> <H2>14) ECMA 376 documents just do not exist</H2><br /> <P>Last but not least, ECMA 376 documents just do not exist. <br /> The reason why is manyfolds :</P><br /> <UL><LI>Office 2007 documents are derived from theoretical ECMA 376 documents, to which is added <a href="http://www.codeproject.com/cs/library/office2007bin.asp">binary parts</a>, macros, OLE objects, ActiveX serialization, DRM, sharepoint metadata, ...<br /> <LI>Office 2007 documents are incompatible with theoretical ECMA 376 documents, since the ECMA 376 documentation says among other things that VML is deprecated, and Office 2007 documents are still using plenty of it when creating new documents.<br /> <LI>Microsoft strategy itself is to provide as little information as possible about the huge impedance mismatch between ECMA 376 and the actual implementation. Expect ECMA 376 to evolve only marginally, while Office 2007 next version will come with plenty more of Microsoft proprietary layers, especially those integrating the Microsoft Office suite with Windows (on the client), more undocumented integration points between Microsoft Office, Windows and Sharepoint (on the server), more undocumented client-server protocols (between the Microsoft Office client, and servers running Microsoft server software).</LI></UL><br /><p>These are blatant elements why the ISO submission is actually a diversion. If you think the elements above are not enough, consider the following <a href="http://blogs.msdn.com/brian_jones/archive/2007/07/12/spreadsheet-formula-bugs.aspx">response</a> from Brian Jones, Microsoft employee and authoritative voice on Microsoft Office XML formats : (emphasis mine)</p><br /><pre style="background-color:#EEEEEE"><br />To your last point [It would be good if Microsoft would state offically it's intent <br />to support future development and improvement of the standard in Ecma of new version <br />of the format and that it intents those version to get simular open licensing.], <br /><B>it's hard for Microsoft to commit to what comes out of Ecma</B> in the coming years, <br />because we don't know what direction they will take the formats. We'll of course stay <br />active and propose changes based on where we want to go with Office 14. At the end of <br />the day though, the other Ecma members could decide to take the spec in a completely <br />different direction. Now my impression is that won't happen, as the folks on the TC [<B>Ecma<br />Technical group Committee, in which Microsoft is the...armchair</B>] all have pretty similar <br />visions for the future of the spec, but since it's not guaranteed it would be hard for <br />us to make any sort of official statement.</pre><br /><br /> <P> </P><br /><H2>15) How the ISO OpenDocument format (ODF) compares?</H2><br /><p>When I wrote this article, I wanted it to focus on Microsoft Office XML formats exclusively, but Microsoft apologists out there running out of arguments to find a justification as to why Microsoft Office XML formats are that bad, turned to the typical rhetoric, started sending piques at the ISO OpenDocument format (ODF). I thought, well, let's do a quick comparison on all the points above. Results are as follows :</p><br /><ul><li><i>Self-exploding spreadsheets</i> : I have created a <a href="http://diffopc.arstdesign.com/diffopc_sample_odf_2/diffopc.html">visual diff</a> of two ODF spreadsheets, before and after the modification of a cell containing a formula. And sure enough, there is no calculation chain, which means it's easy to make manual changes to existing ODF files, no hassles, you don't have to care about the format's own dirty laundry. And to remove the formula, you just remove the corresponding XML formula attribute in that cell. <code><font color=red><b>ODF : 1, ECMA 376 : 0</b></font></code><br /><li><i>Entered versus stored values</i> : the entered values are stored as is. There is no rounding error. ODF goes as far as storing both the entered value and the displayed value (after the number format is applied) which is a wonderful thing for those willing to extract data. <code><font color=red><b>ODF : 2, ECMA 376 : 0</b></font></code><br /><li><I>Optimization artefacts</I> : there is no such thing as a shared formula concept in ODF. I guess ODF needs not be encumbered with optimization techniques that were valid 20 years ago, when computers were a thousand times less powerful than right now. <code><font color=red><b>ODF : 3, ECMA 376 : 0</b></font></code><br /><li><I>VML isn't XML</I> : there is no VML or anything fundamentally non-XML in ODF. If you create a note in an ODF file, the corresponding XML fragment is :<br /><pre style="background-color:#EEEEEE"><br /><office:annotation office:display="true" draw:style-name="gr1" draw:text-style-name="P1" <br /> <b>svg:width="2.899cm" svg:height="0.596cm" svg:x="7.374cm" svg:y="0.307cm"</b> <br /> draw:caption-point-x="-0.61cm" draw:caption-point-y="1.5cm"><br /> <dc:creator>sr</dc:creator><br /> <dc:date>2007-09-02T00:00:00</dc:date><br /> <text:p text:style-name="P1">some comment</text:p><br /></office:annotation><br /></pre><br />This is proper XML, one value per attribute, the fragment is full in-context in the stream. Plus, you can grab an existing SVG library to make sense of the XML, no need to implement a single vendor's soup. <code><font color=red><b>ODF : 4, ECMA 376 : 0</b></font></code><br /><li><i>Open packaging parts minefield</i> : there is no such thing as a tree of parts. Instead, ODF has a central directory whose management is deterministic and in fact trivial. In retrospect, why did Microsoft went with such a problematic concept of a tree part with undeterministic relations? Was it because strictly copying ODF was embarassing (Not-Invented-Here syndrome) ? <code><font color=red><b>ODF : 5, ECMA 376 : 0</b></font></code><br /><li><I>International, but US English</I> : everything is stored <i>uniformly</i> in US English, there is no discrepancy. The locale is applied as per the style definitions, separate from the data, which is a good design in decoupling that implementers can take advantage of (smaller implementations, less prone to errors, especially in a global market context). <code><font color=red><b>ODF : 6, ECMA 376 : 0</b></font></code><br /><li><I>Many ways to get in trouble</I> : ODF defines a single style concept across the entire formats. So not only any Word, Spreadsheet or Presentation document has a single style concept instead of many. But both document types share the same style concept. This translates in interoperable formatting across applications and platforms, plus small implementations. Contrary to ECMA 376, implementers don't have to implement more than 15 years of formatting legacy. <code><font color=red><b>ODF : 7, ECMA 376 : 0</b></font></code><br /><li><I>Windows dates</I> : ODF stores ISO 8601 dates, which makes it easy for implementers to interoperate on any application or platform. Here is the following XML fragment when entering date <code>11/01/2001 03:45:20</code> : <code><font color=red><b>ODF : 8, ECMA 376 : 0</b></font></code><br /><pre style="background-color:#EEEEEE"><br /><table:table-cell table:style-name="ce1" office:value-type="<b>date</b>" <br /> office:date-value="<b>2001-01-11T03:45:20</b>"><br /> <text:p>11/01/2001 03:45:20</text:p><br /></table:table-cell><br /></pre><br /><li><I>All roads lead to Office 2007</I> : ODF also destroys any custom part name. <code><font color=red><b>ODF : 8, ECMA 376 : 1</b></font></code><br /><li><I>A world of ZIP+OLE files</I> : After encryption, an ODF file is still an ODF file (i.e. ZIP, XML). ODF encrypts the content part (using a standard algorithm described in the manifest part). The metadata part is left unencrypted. <code><font color=red><b>ODF : 9, ECMA 376 : 1</b></font></code><br /><li><I>Document security</I> : ODF does the same, it saves a password hash that can by manually removed. <code><font color=red><b>ODF : 9, ECMA 376 : 2</b></font></code><br /><li><I>BIFF is gone</I> : ODF is ZIP and XML. If you insert or import an OLE object, it's saved as a binary part, just like ECMA 376 does, but the ODF streams themselves are still XML. <code><font color=red><b>ODF : 10, ECMA 376 : 2</b></font></code><br /><li><I>Document backwards compatibility</I> : I've installed the oldest OpenOffice release I could get (1.0.3, released in 2003), loaded the same chart as in the example, saved as an .SXC filme (this was the name extension before it became an .ODS file later), then loaded the .SXC file in the OpenOffice 2.2. The chart is exactly the same in both versions. <code><font color=red><b>ODF : 11, ECMA 376 : 2</b></font></code><br /><li><I>ECMA 376 documents do not exist</I> : No hidden intellectual property since it's a 100% open-source project in GPL. <code><font color=red><b>ODF : 12, ECMA 376 : 2</b></font></code><br /></ul><br /><H3>Final score : <code><font color=red><b>ODF : 12, ECMA 376 : 2</b></font></code></H3><br /><p>To better cope with those problems, ECMA 376 will have to be redesigned to something that is pretty much what ODF does. Why isn't Microsoft extending ODF instead of reinventing a bad wheel ? Customers would benefit a great deal of a single standard.</p><br /> <P>-Stéphane Rodriguez, <i>August 2007</i><br>Independent software vendor, file format expert<br>Not affiliated to any pro-MS or anti-MS party/org/ass.</P><br /> <P> </P><br /> <i>Update : this article was Slashdotted on Sunday 26 of August.</i><br /> <i>Update2 : this article is taking 300,000 hits a day, and is making it all around the world in all kinds of sites. My web host provider was so angry at the peak in traffic that he threatened to cut me off, so I had to redirect to a blog site such as Google's blogger to host the article.</i><br /> <i>Update3 : wednesday august 29, added a new section on Document security</i><br /> <i>Update4 : friday august 31, added more content to sections US English and Windows dates</i><br /> <i>Update5 : sunday september 2, added a quick comparison between ODF and ECMA 376</i><br /> <P> </P>OOXML is defective by designhttp://www.blogger.com/profile/00378690876107705438noreply@blogger.com