OOXML is defective by design

Thursday, May 22, 2008

Microsoft latest bullshit : native support of ODF in Office 2007

Stéphane Rodriguez, May 2008

Previous articles :
- Custom XML? What Custom XML?
- Backwards compatible? One more lie by omission
- Bad surprise in Microsoft Office binary documents : interoperability remains impossible
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


I wanted to post a quick reaction to the latest Microsoft bullshit announcement, in which they reportedly plan to "add native support for ODF 1.1". The way they put is very succinct, intentionally probably, and it opens the door for wild guesses.

First of all, Microsoft is a huge Office licensing monopoly. It's so big it even surpasses Windows in sales. Any decline in Office licensing would be dramatic for Microsoft's future. With that alone, you know that any announcement from Microsoft that they are willing to interoperate with other people's software, namely applications, should be taken with a grain of salt.

Here is how, with the release of Office 2007, Microsoft intends to keep their monopoly in Office licensing :

Phase 1 - as long as there is not enough Office 2007 documents out there, make sure that customers understand that only Office 2007 can reliably migrate binary files to the new file formats. Hence the backwards compatibility claim which are part of the OOXML ISO marketing diversion (ironically inflated by critics).

During this phase, which began in 2006 and should take at least 5 years (at least one major organic corporation upgrade cycle), Microsoft bottom line is at risk. The strategy was therefore to make sure to retain the exclusivity when it comes to migrating file formats, and spending money to get this message heard well by the customers. This is exactly what the infuriating OOXML ECMA and then OOXML ISO was about, not anything remotely related to an international standard meant to "improve interoperability across platforms".

Technically speaking, only Microsoft can reliably migrate binary files since only they know the implementation required to do so. Have you noticed that Office 97 shipped 11 years ago, and we have yet to hear about a non-Microsoft application that would strictly interoperate with those files? Besides this, the so-called interoperability documents that were made available back in February 2008 are a farce : this is a bare update of the old MSDN documentation, and everything that was undocumented back in Office 97 days remains exactly as much undocumented. No improvement was made with this, as I explained in a previous article, but sure enough Microsoft exploited it to lure the sheeps who did not actually read it. The real message : keep buying Office 2007 licenses, otherwise you will be hurt. Or, somewhat more verbose, if you deploy non-Microsoft Office software, soon enough you'll get document fidelity issues which will damage your business. We provide the only application that ensures full-fidelity, so it's suicidal to use competing products. In other words, competitor exclusion.

Needless to say, this fire and motion strategy has been going on for two decades. Microsoft adds a number of features in their spaghetti codebase, announces it when it ships the product, then competitors have to catch up (years of work usually) instead of concentrating on their products capabilities. The irony is, with Office 2007 for instance, Microsoft has been themselves guilty of lack of full-fidelity with the chart engine that they replaced with a new one killing a number of existing features of the old one (rendering and chart options).

Phase 2 - there is enough Office 2007 documents out there. Game over.



With that said, a few more words.

In regards to competition between file formats, it has to be understood that the point of Microsoft is to ensure that going forward their internal representation of a file in memory remains what encompasses every other file format. In other words, ODF will become a second-class citizen and it is expected that it will be exploited to downgrade or significally hamper the fidelity of open/save ODF files. Classic Embrace, Extend, Extinguish. Likewise, since Office 2007 is not a native XML application (the internal representation is a bunch of binary structure, not XML DOM), it will never be able to become a basis for ODF-based applications that would really take advantage of XML. In other words, XML itself is being deceptively exploited in order to simply preserve the monopoly.

In regards to scenarios that would be enabled by "adding native support for ODF 1.1", what about external references from an ODS file to a XLS/XLSX file? What about copy/pasting back and forth preserving data, context and formatting? The list goes on and on.

"native support for ODF 1.1" would also imply that the open source projects that were heavily part of the marketing machinery were just that, marketing. Of course, the message was heard by corporate customers since day one (to them "open source projects" means "unsupported"), so there is nothing really surprising here.

Last but not least, with ODF 1.2 just around the corner and its dramatic improvements, there has never been a more strategic time for an enemy of ODF to announce support for ODF...1.1. In case you think the last sentence contradicts itself, consider the fact that the massive Microsoft distribution power amassed from two decades of pushing hard Microsoft Office onto everyone's desktop (reportedly 500+ million licenses) guarantees that whatever version of ODF shipping with Microsoft Office, it will be very hard for those actually contributing to ODF in the OpenOffice project to migrate the user base to whatever next version of ODF, say ODF 1.2. This is how you can make any change to ODF and are unable to get it in the hands of the user base. Yet another win for Microsoft.

Friday, March 21, 2008

Custom XML? What Custom XML?

Stéphane Rodriguez, March 2008

Previous articles :
- Backwards compatible? One more lie by omission
- Bad surprise in Microsoft Office binary documents : interoperability remains impossible
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


Other than backwards compatibility, the expression "Custom XML" plays an important role in Microsoft ISO OOXML evangelism. It's interesting that Microsoft bloggers don't even seem to be embarassed by ridiculous expressions such as "Custom XML". Custom XML is indeed just as silly as "Office Open XML" : the reason is X in XML already means Custom. So there cannot be a meaningful sense for Custom XML...unless Custom XML is short for Custom XML applications. And from that it makes sense. But there is a problem : Custom XML is part of the ISO proposal of a document file format, whereas Custom XML applications implies the apparatus and logic related to applications, inherently tied to products, platforms and operating systems, not documents. Did Microsoft feel guilty, removed the word applications knowing it wouldn't stand a chance otherwise ? That's for anyone to guess. In this article, we are going to delve into the so-called "Custom XML", and how little useful it is in practice.



Custom XML definition, as per Microsoft


Straight from the horse behind it, Brian Jones :

Custom XML is the support for custom defined schemas. It's that support that allows you truly integrate your documents with business processes and business data. You can define your data using XML Schema syntax, and then you can use that data in your Office documents. By opening up our formats with our reference schemas, and supporting your custom defined schemas, you get true interoperability of your documents. Sorry if this is currently sounding more like a marketing pitch, but I wanted to make sure I reiterated our vision for XML support in Office documents and hopefully that will help you see the power that we see. (...) Up until now we've talked about all the parts that we in Office have defined to create our documents. You as a developer also have the ability to add your own parts though. You can take any XML file and put it inside the ZIP package.


Translation : by XML, we actually mean several completely different things, and we've put all of it in the same pot. We think storing XML-based data inside the ZIP package is an efficient way to share your confidential corporate data to the outside world and we are proud to make it easy to do so. Likewise, storing business data and the document together illustrates the grand Microsoft vision about independent layers.


Straight from the marketing people at Microsoft, here Doug Mahugh (answering Patrick Durusau during the INCITS V1 review of OOXML back in April 2007) :

DIS 29500 (OOXML) serves other purposes that are not served by ISO/IEC 26300 (ODF), especially in the area of integration options for external schemas. (By "external" I mean schemas that are not part of the spec itself -- in common usage we tend to call these "custom schemas" as opposed to the "reference schemas" in DIS 29500).


Translation : at Microsoft, we ship products to make it possible to do what you can already do without.


Straight from a top Microsoft Office brass :

Open XML allows for custom XML markup within the body of a document which is a handy way to allow users to tag their content for interoperability with other types of software such as a custom line of business system.


Translation : we dare you put foreign XML markup into an existing XML, with no agreed-upon semantics between the two languages, in order to improve the interoperability. What kind of interoperability we are talking about is left as an exercise to the reader.



Enough marketing fluff. What it really is.


First, let's get out of the way that "Custom XML" actually means "Custom" "XML" at all.

  • Start Word 2007.
  • Create a new document.
  • Type "test".
  • Save and close the document.
  • Unzip it.
  • Grab the part word/document.xml, you should see the following :


    <w:p>
    <w:r>
    <w:t>test</w:t>
    </w:r>
    </w:p>

  • Now add some custom XML markup :


    <w:p>
    <w:r>
    <w:t mytag="myvalue">test</w:t>
    </w:r>
    </w:p>

  • Put the edited part back into the ZIP file and open it in Word 2007. It opens perfectly well.
  • Close it. Unzip the file again. Grab the part word/document.xml, you should see the following :


    <w:p>
    <w:r>
    <w:t mytag="myvalue">test</w:t>
    </w:r>
    </w:p>

  • Now remove the custom attribute and instead add some other custom XML markup, this time an element :


    <w:p>
    <w:r>
    <mytag>myvalue</mytag>
    <w:t>test</w:t>
    </w:r>
    </w:p>

  • Put the edited part back into the ZIP file and open it in Word 2007. This time it brings the following error message :


    Corrupting the document by adding custom XML markup

  • Perhaps this is a namespace issue. Let's prefix our custom XML with w so that the fragment looks like the following :


    <w:p>
    <w:r>
    <w:mytag>myvalue</w:mytag>
    <w:t>test</w:t>
    </w:r>
    </w:p>

  • Put the edited part back into the ZIP file and open it in Word 2007, and indeed this time it opens well.



From those simple tests, we can infer the following :

  • There is no such thing as strict XML schema validation in Word 2007.

  • You can add a custom XML attribute with no namespace (therefore the attribute uses the element's namespace, in our case w), and Word 2007 will not complain even though the mytag attribute is not part of the OOXML reference schemas. See ECMA 376, part 4, page 199, section 2.3.2.23, the w:r element is defined as follows :


    <complexType name="CT_R">
    <sequence>
    <group ref="EG_RPr" minOccurs="0"/>
    <group ref="EG_RunInnerContent" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="rsidRPr" type="ST_LongHexNumber"/>
    <attribute name="rsidDel" type="ST_LongHexNumber"/>
    <attribute name="rsidR" type="ST_LongHexNumber"/>
    </complexType>

    and therefore does not allow the mytag attribute to be present.

  • You cannot add a custom XML element unless it is prefixed by one of the namespaces declared in the header of the XML document, in our case w (xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"). Rather interesting again since w:mytag does not belong to the OOXML reference schemas, and yet Word 2007 does not complain about it.


Needless to say, this is a broken implementation of XML that neither satisfies the "strict XML" camp, nor the "loose XML" camp.

It can be summarized by the following table :






Word 2007
TestResultStrict validation
mytag="myvalue"OKNO
<mytag>myvalue</mytag>FAILN/A
<w:mytag>myvalue</w:mytag>OKNO


And for the other two applications, Excel 2007 and Powerpoint 2007, the results are as follows :






TestWord 2007 resultExcel 2007 resultPowerpoint 2007 result
mytag="myvalue"OKFAILFAIL
<mytag>myvalue</mytag>FAILFAILFAIL
<w:mytag>myvalue</w:mytag>OKFAILFAIL


Since Word 2007, Excel 2007 and Powerpoint 2007 do not handle Custom XML the same way, as the table above shows, it's very hard to trust Microsoft when they claim that Custom XML is a feature of OOXML.

Either application will see the document as a corrupt one is guaranteed to be random, Microsoft cannot possibly imply that the Custom XML they are talking about is what we take for granted when we say "Custom XML", i.e. the ability to add our own XML within the document.



Custom XML, as per ECMA 376


Since "Custom XML" does not mean "Custom" "XML", we have to rely on ECMA 376's definition of such thing. The first surprise is that there is a notion of "Custom XML markup" and a notion of "Custom XML data".

The second surprise is that the notion of "Custom XML markup" only appears in the documentation for Word documents. If we assume that this "Custom XML markup" will be used to bind the "Custom XML data" to the document at run-time, we can infer from ECMA 376 that this is only made possible for Word documents.

In other words, whatever "Custom XML" is, it is only fully implemented for Word, so it should be called "Custom XML in Word". When Microsoft marketing people are trying to sell us "Custom XML" as a feature of OOXML, it is a lie. Let's create a table of what we have just learned.





 Word 2007Excel 2007Powerpoint 2007
Custom XML dataYESYESYES
Custom XML markupYESNONO


"Custom XML data", ECMA 376 part 4 section 8, supported in both documents, is the ability to store an independent XML stream in the ZIP package. In fact, it is not a feature of OOXML at all, it is a feature of any ZIP archive. After all, a ZIP entry in a ZIP package can be anything, including an XML stream. We can infer from that, that "Custom XML data" is in fact nothing custom : the ability to store an independent XML stream is not something we should thank Microsoft for allowing us to do so.

"Custom XML markup", ECMA 376 part 4 section 2.5, supported only in Word documents, is the ability to bind the "Custom XML data" to the document's content, at run-time. Interestingly enough, the Word team at Microsoft haven't quite managed to merge this concept and the old "smart tag" concept. That's why in the ECMA 376 specification, we end up with several flavors of "Custom XML markup", one of which is smart tags, the ability to add metadata to content (eg. stock quote).

An interesting element is "run-time". If you write the Custom XML markup that make it possible to do the said data binding, it has to be reminded that the data binding is done by an instance of Word, not a third-party application. So data binding is just a lock-in. That's the difference between standardizing on a document versus standardizing on an application!

There is no mechanism for doing such thing in Excel spreadsheets and Powerpoint presentations. In Excel spreadsheets, the XML data binding, a feature available from the user interface, is a special case of data source querying where the data source is an XML stream. The XML stream is external to the ZIP package. In other words, the "Custom XML data" in spreadsheets is useless. In Powerpoint presentations, it's even more trivial since there is no such thing as an XML data binding mechanism from the user interface.



The merit of "Custom XML data"


Something interesting to note is that Microsoft thinks that storing data inside the ZIP package independently of the document is a good thing. From a pure technical point of view, you can view this "Custom XML data" as a cache of values thanks to which the consumer is able to drill into the data without a connection to the actual data source (corporate data). But there is a major flaw. Anybody using this feature will end up storing arbitrary data in ZIP packages shared across colleagues and others inside and outside the organization. Eventually, confidential information from the corporate databases will end up there, and a PR disaster automatically follows. You don't want to use this feature.



Conclusion


"Custom XML" does not mean much, despite Microsoft ample evangelism of said feature. Technically speaking it has no merit within the enterprise space because you end up sharing corporate data. An interesting fact is that "Custom XML" is actually only implemented in Word 2007. For instance, the ECMA 376 specification provides a data binding for Word 2007 documents, exclusively. Ironically enough, the ability to store an independent XML stream as part of a ZIP package, is just a feature of the ZIP library, not Microsoft's innovation.

Wednesday, March 19, 2008

Backwards compatible? One more lie by omission

Stéphane Rodriguez, March 2008

Previous articles :
- Bad surprise in Microsoft Office binary documents : interoperability remains impossible
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


Ever since Microsoft Office XML file formats were introduced, the excuse for using such poorly engineered markup was that it was designed to be backwards compatible with the so-called billions of existing documents out there.

Is it really the truth? Or is Microsoft using it to fuel more fire and motion?


Lying by omission, as per Wikipedia :

Lying by omission is when an important fact is omitted, deliberately leaving another person with a misconception. This includes failures to correct pre-existing misconceptions. One may by careful speaking contrive to give correct but only partial answers to questions, thus never actually lying.



Backwards and forwards compatibility


What's backwards compatibility first of all? For everyone in the world, except for Microsoft, file backwards compatibility means that product in version N can work seamlessly with files produced by the product in version N+1. For Microsoft however, it's the opposite, backwards compatibility means that product version N+1 can work seamlessly with files produced by the product in version N. By "work seamlessly" is meant no visible degradation, no noticeable loss in functionality.

To claim that a file format is backwards compatible is therefore quite an ambitious goal, especially if the file format in question is complex. It's so ambitious that it's almost an impossible task in practice. In fact, it behaves much like a theorem in mathematics. To prove the theorem is wrong, you only need one counter-example. To prove backwards compatibility is not met, you just need to come up with one counter-example.

So let's take a simple chart and see how backward compatibile it is.

  • Start Excel 2003 (or any Excel version older than Excel 2007)
  • Type jan, hit <TAB>, 3, <TAB> 4, <TAB>, 5
  • Select those cells, click the chart creation button in the main toolbar and choose a column type bar


The chart shows up like this :


A simple chart made with an older Excel version

Now resize the chart, and you can see that everything gets scaled accordingly : text in all chart elements, plot area, ...


Resizing a simple chart made with an older Excel version

  • Now start Excel 2007
  • Open this file
  • The chart shows up



The chart file (.xls) opened in Excel 2007

Now resize the chart, and you can see that fonts don't scale :


Resizing the chart file (.xls) in Excel 2007 does not scale the fonts

If you would like to resize the fonts, you have to select each chart element individually, and then apply the font size with user interface elements. Quite counter-productive. In addition, since that's manually done, proportions are not kept between chart elements and you cannot easily reproduce the same effect than with older versions of Excel.

Now if you save this file as a .xlsx file using Excel 2007, and then convert it back to a .xls file, then open it with an older version of Excel (effectively executing a round-trip scenario), the chart shows up but it does not scale when it is resized :


After the conversion back to (.xls), the fonts do not scale if we resize the chart




So much for backwards compatibility...

In case you wonder what's going on, what happens is that Excel 2007 does not support the auto scale option, something very pervasive in chart formatting options. Here is the guilty option :



Root of the problem : Excel 2007 does not support "auto scale" anymore (screenshot from Excel 2003)

It is not only a feature that is gone from the user interface, it is an option that is lost in the document itself. In other words, sharing this file with any person inside or outside your company will result in discrepancies.

For your information, the internal Excel BIFF record that Excel 2007 does not support is the FBI record (part of the chart records, and therefore left unspecified in Microsoft's binary BIFF documentation, as explained in my previous article) :

(excerpt from MSDN Library, Feb 1998)

FBI: Font Basis (1060h)
The FBI record stores font metrics.

Offset | Name | Size | Contents

4 | dmixBasis | 2 | Width of basis when font was applied

6 | dmiyBasis | 2 | Height of basis when font was applied

8 | twpHeightBasis | 2 | Font height applied

10 | scab | 2 | Scale basis

12 | ifnt | 2 | Index number into the font table



If you are interested in taking a look at the corresponding Excel files :



Note that this is just one simple example. There are many examples of all kinds. For instance, in a previous article, I showed how encrypting a document did encrypt the metadata as well, contrary to older Office application versions, and doing so breaks all applications expecting to access the file's metadata whether the file is encrypted or not.

As for forwards compatibility, the same article already showed (BIFF11+ section) that any Excel 2007 feature would be lost or partially lost when converted back to older Excel versions and part of a round-trip scenario. So forwards compatibility is not true either generally speaking.



Behind the scene, what's happening?


Behind the scene, what's happening is that Microsoft replaced the chart subsystem in Excel 2007 with a new library but failed to test it enough, presumably because they wanted to rush Office 2007 out the door. Since charts can also be linked to Word and Powerpoint documents, this actually affects all 3 applications' backwards compatibility.

The real mistake they made is to kill the old chart subsystem source code. What they should have done to meet their backwards compatibility claim was to use the old chart subsystem for the compatibility mode (old Excel files), and the new chart subsystem for charts in new files. There is no other way they can achieve the claims.



If backwards compatibility is not proven technically speaking, what is Microsoft trying to achieve?


In the previous section, it has been demonstrated with a simple counter-example that backwards compatibility leaves much to desire. So if this capability is not technically sound, how come Microsoft keeps hammering it every single day?

A Microsoft Office spokeperson made the following statement on March 16, 2008 :

An Open Letter from Chris Capossela, Senior Vice President, Microsoft Office

In today's digital world, there is a critical need to access, share and archive information. People want to share information with co-workers, business partners, family members and others regardless of the technology platform or software application being used. People want to be able to store and archive documents so that the information they contain can be accessed well into the future. Long-term access to documents is particularly important in the case of governments given the critical and historical nature of government information and the associated requirements for preservation and access moving forward.

Why choose the Ecma Office Open XML File Formats?
Office Open XML fulfills an important customer need - the file format was designed to be backward compatible with the content and functionality in billions of existing documents. (...)


The same thing has been hammered a gazillion times by Microsoft since the "new" file formats were announced back in 2005. If you are a software developer, then it does not sound like anything special. In fact, you may even think there is no substance at all in it. But if you wear a CIO cap, it's a different thing.

Remember, IT people are responsible for running the systems. For them backwards compatibility is not heard in the technical sense, it is synonym of "won't break systems, no down time". See where it goes?

A little more help. Microsoft, by talking over and over about backwards compatibility is actually telling decision makers that they are pledging not to break systems when everyone in the organization uses the applications and those new file formats (remember that for anyone, .docx .xlsx and .pptx files are alien files unless they install Office 2007).

In fact, it goes a little deeper. And that's when it gets really interesting. Microsoft is not only saying their stuff is backwards compatible, they are actually saying that they are the only backwards compatible when it comes to Office documents.

It is another way to say that the only way to ensure no down time is to purchase licenses of Microsoft Office software. As simple as that.

To better understand the picture, consider the following. During the 2007-20xx period, Microsoft is taking a major risk by introducing new file formats that cannot be natively used by older versions of Microsoft Office. It's something they did back in 1995 and it did not go very well. If you remember, this caused havok in IT organizations because Office 95 and Office 97 file formats were substantially different and Office 95 could not open Office 97 documents. Microsoft did break file formats unprepared.

So it is without surprise that Microsoft is trying to make sure that CIOs get the message right this time. That is the reason why Microsoft has been investing so much in all kinds of additional components :
  • Office 2007 compatibility mode
  • Microsoft Office compatibility pack
  • Office migration manager
  • Office MOICE


By saying what they say, Microsoft is trying to keep the competition aside, despite the very vocal "open standards" claims. The "open standards" claims is actually a diversion. "open" is not in Microsoft Office DNA and culture. For instance, nobody in that team ever said publicly that the only reason we are dealing with VML in the file formats is because they rushed Office 2007 out the door (presumably to ship simultaneously with Windows Vista), and as a result they did not fully transition to DrawingML. All this debate about VML out there is entirely the result of Microsoft's own incompetence in shipping a finished product. Simply shameful because third parties now have to support both VML and DrawingML. There would not be such surprise with a company that does things "openly".

In the period of time (that spans through a number of years) where Microsoft takes the risk of introducing new file formats, CIO could decide to just as well switch to an alternative Office suite for a fraction of the budget. Microsoft knows it very well, hence the media blitz. Hence the characterization of the alternative Office file formats such as ODF as simplistic. Simplistic is supposed to mean : not backwards compatible. That is what CIOs understand, and that is FUD.

In this period of time, Microsoft wants to minimize the ratio of organizations switching to an alternative Office suite. To Microsoft Office, OpenOffice is Netscape. This is not a problem anymore when the new "OOXML" file formats reach critical mass. We are not there yet, but that's the plan : dry the competition as much as possible, which includes fake activities such as starting so-called open source projects. Something that is not exactly music to CIOs' hears, why should CIOs care? It's the opposite. When a CIO hears "open source project", he understands something other than backwards compatible, therefore unreliable stuff. The decision is a no brainer then.

So what we are talking about here is a well orchestred media communication targeted to CIOs, not software developers. Until you get that, you don't understand.



The consequences of this attitude


Now that we know that Microsoft is not interested in building a modern Office document model (after all, there is already an international ISO/IEC standard for that : ODF), it gets easier to understand why OOXML is such a mess.

It was recently brought up from people with vested interests in Office file formats that OOXML is a big mess :

  • Rob Weir (IBM) : "The Disharmony of OOXML"








    FormatText ColorText Alignment
    OOXML Text <w:color w:val="FF0000"/> <w:jc w:val="right"/>
    OOXML Sheet <color rgb="FFFF0000"/> <alignment horizontal="right"/>
    OOXML Presentation <a:srgbClr val="FF0000"/> <a:pPr algn="r"/>
    ODF Text <style:text-properties fo:color="#FF0000"/> <style:paragraph-properties fo:text-align="end" />
    ODF Sheet <style:text-properties fo:color="#FF0000"/> <style:paragraph-properties fo:text-align="end"/>
    ODF Presentation <style:text-properties fo:color="#FF0000"/> <style:paragraph-properties fo:text-align="end"/>


  • Henning Brinkmann (OpenOffice contributor) : "OOXML Import In Writer: A Shape Is a Shape, Is a Shape?"

    When we started importing shapes from OOXML the Impress team already was able to import some shapes from OOXML files produced by PowerPoint 2007. This is DrawingML as described in chapter 5 of the Markup Language Reference for OOXML. But, if you use Word to insert a rectangle into a Word document (DOCX), you end up with VML as described in chapter 6 of the Markup Language Reference for OOXML. The Markup Language Reference tags VML as a deprecated format in OOXML, which is only included to the standard for backward compatibility reasons. Despite, Word 2007 uses VML to store shapes. (...) OOXML seems to be designed with the application model in mind. There may be different syntaxes for the same semantics, if it fits the already present application model better. But, if you want to create an alternative implementation for the format, this introduces additional effort.


Fire and motion, anyone? I also gave a few examples of that as well in my introductory OOXML is defective by design article where I showed that Excel SpreadsheetML uses no less than 6 ways to do basic text formatting, for no reason. If you do anything meaningful with those files, you need to implement both, which clearly means you'll be spending years doing so instead of concentrating on your own business.

For anyone except Microsoft, it's fire and motion and it guarantees there won't be a substantial implementation of file formats before a number of years. That is exactly the same thing than the strategy behind with older Microsoft Office formats.

Note that Microsoft claims a number of third-parties applications support OOXML. But they never go into details such as how much support there is. For a reason that is trivial.

To the CIO's hear, the fact that backwards compatibility is never associated in any shape or form to those third-party applications makes it very clear that they are not worth considering seriously. So there.

Why is it a problem?

Let's take Gnumeric. Gnumeric chose to implement some of OOXML by having import and export filters. This is a very important architecture decision. It means that the in-memory representation is not high-fidelity representation of the imported file until Gnumeric directly supports all tiny details in OOXML. Something that won't happen before a number of years. And it is therefore without much surprise that exporting the in-memory representation to say an .xlsx file will lose all features that Gnumeric does not directly support.

To see the effect of that, simply try to open and save the following .xlsx file in Gnumeric : Book_2.xlsx

Gnumeric's internal architecture is simply wrong : it should instead preserve the imported file. That is after all what XML fragments are for. It should be able to make in-place replacements of XML fragments and leave the rest of the file untouched. That's not what Gnumeric does. Gnumeric is an example of an application that is not XML native.

Why is Microsoft making so much good press for Gnumeric and it's so-called rich OOXML support is left for one to guess...

...especially when Office 2007 is not much XML native either. In fact, that is actually why we end up with so many versions of the truth in Office 2007 files. Why we have many different and incompatible ways to align a piece of text, or describe a vector shape, or text formatting is due to the following :

  • It is a direct reflection of the fact that Microsoft Office engineering teams don't share any of their work across teams. The trivial example of that is that the only shared spec, Office drawing library (mso.dll), ends up being a mix of VML and DrawingML despite the fact that VML is only a markup version of MSO, whereas DrawingML is supposed to be MSO version next (internally known as E2O).

  • What Microsoft calls XML streams are in fact XML fragments where each individual fragment are angle-brackets around a stream of content that reflects the internal binary representation (and in fact, the binary records of older files) to be converted back and forth in full-fidelity.

    Let's say in one of those fragments, an attribute relates to a horizontal text alignment option, one value out of four, we end up with an attribute of the form <w algn="r"> where algn is one of r, l, c, j. In another fragment, a horizontal text alignment option is one out of eight, in which case we may have <a:alignment>left</a:alignment>, where the value is left or general, right, center, justify, distributed, fill or center across selection.

    Here is the problem. The fact that the text alignment has several distinct definitions in the same document is a bug. Engineering, peer-review and interoperability practices inherently put enough pressure to resist from having such thing done in the first place, that's why both alignment options should be the same. There is no reason to have more than one in the document. To standardize bugs is simply not excusable. It has to be stopped.

    The reason why we end up like this with OOXML is because it just reflects the bugs in the binary records, and that it works as if the Office team had developers writing compatibility code independently from each other, that is without ever giving a chance of code reuse.

    So, due to the fragments being just another representation of the binary records, Office 2007 use of XML terminology is absolutely misleading. Press pass filled with the XML acronym all over the place are very appealing to CIOs, however. It is more fair to say it's angle brackets around complex stuff than actual XML. It that were truly native XML, that would be factored in to maximize the reuse of it across libraries, components and applications. Just like ODF does.


So there you have it. It is without much surprise that Office 2007 does not ship with any XML tooling support.

Thursday, March 13, 2008

Bad surprise in Microsoft Office binary documents : interoperability remains impossible

Stéphane Rodriguez, March 2008

Previous articles :
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


When Microsoft made a number of Microsoft Office binary documents available for direct download last month, I thought this was probably a good thing for those (including me) struggling with undocumented pieces with which only Microsoft can reliably migrate legacy documents back and forth. But this sentiment has long passed, what a disappointment.

Direct download, a sharp change from the obligation to...send an email



In fact, this gift is a small change to what was being made available so far. If you read Microsoft knowledge base article 840817, in the binary document specification sections, it says :

Microsoft Office Binary File Formats

Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.

If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)


It is therefore a little disingenuous from Microsoft to make it as if those documents were not available.

Forensic purposes


The above specifically mentions forensic purposes, which is not the same thing than selling a product competing with Microsoft Office. It is unclear whether the new covenant not-to-sue terms are more open than those limiting forensic purposes claims, but clearly it's a problem. This specifically prevents competition.


Where is the VBA specification?


Interestingly enough, you can compare what was available at the expense of an email, and what is now being made available under the open specification promise, where the covenant not-to-sue applies. In particular, even though VBA is listed in the Microsoft knowledge base article,

Microsoft Office Binary File Formats

Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not-to-sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.

If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)


the corresponding specification is not being made available under the open specification promise :

Microsoft Word

Word 97-2007 Binary File Format (.doc) Specification PDF | XPS

Microsoft PowerPoint

PowerPoint 97-2007 Binary File Format (.ppt) Specification PDF | XPS

Microsoft Excel

Excel 97-2007 Binary File Format (.xls) Specification PDF | XPS

Excel 2007 Binary File Format (.xlsb) Specification PDF | XPS

Office Drawing

Office Drawing 97-2007 Binary Format Specification PDF | XPS

Supporting Technologies (Windows structured storage, WMF, Ink)



Missing Excel binary BIFF specification



The Excel binary BIFF specification was made available until 1998, that was when Bill Gates decided that was no longer a good thing to make the file format specification available to third parties. Because the latest public document was made available in 1998, this document included the specification of BIFF8 records, i.e. used in Excel 97. No other specification update had ever been made available publicly, most notably none of the new records added in Excel 2000 (BIFF9), Excel XP (BIFF10), Excel 2003 (BIFF11), Excel 97-2003-2007 compatibility mode (BIFF11+) and Excel 2007 binary spreadsheets (BIFF12, with BIFF12 a new binary file format incompatible with previous BIFF versions) were documented, meaning that it was only possible at most to implement support for Excel file formats up to Excel 97. Not very friendly. But it gets worse.

If you take a look at the BIFF binary document being made available under Microsoft's open specification promise, you'll notice it includes a patchwork of all BIFF versions. Why Microsoft chose to put everything together instead of making sections for each BIFF version is subject to discussion. One can argue that those interested in the very latest don't care when a particular record was introduced or updated. But to anyone willing to read and write BIFF records consistently with a target Excel version, it is impossible to know which record belongs to which version. It gets far worse. If you take a look at what was made available until 1998 in MSDN Library, you'll notice two sections for BIFF :



A screenshot of the old binary specifications (MSDN Library, Feb 1998)

There are two big sections :
  • BIFF main records : cells, pivot tables, formatting styles
  • BIFF chart records


Here is the list of chart records (excerpt from MSDN Library, Feb 1998) :

Chart BIFF Records: Alphabetical Order

Number
Record

103A
3D: Chart Group Is a 3-D Chart Group

1051
AI: Linked Data

1050
ALRUNS: Text Formatting

101A
AREA: Chart Group Is an Area Chart Group

100A
AREAFORMAT: Colors and Patterns for an Area

100C
ATTACHEDLABEL: Series Data/Value Labels

1062
AXCEXT: Axis Options

1046
AXESUSED: Number of Axes Sets

101D
AXIS: Axis Type

1021
AXISLINEFORMAT: Defines a Line That Spans an Axis

1041
AXISPARENT: Axis Size and Location

1017
BAR: Chart Group is a Bar or Column Chart Group

1033
BEGIN: Defines the Beginning of an Object

1061
BOPPOP: Bar of Pie/Pie of Pie Chart Options

1067
BOPPOPCUSTOM: Custom Bar of Pie/Pie of Pie Chart Options

1020
CATSERRANGE: Defines a Category or Series Axis

1002
CHART: Location and Overall Chart Dimensions

1014
CHARTFORMAT: Parent Record for Chart Group

1022
CHARTFORMATLINK: Not Used

101C
CHARTLINE: Drop/Hi-Lo/Series Lines on a Line Chart

1063
DAT: Data Table Options

1006
DATAFORMAT: Series and Data Point Numbers

1024
DEFAULTTEXT: Default Data Label Text Properties

103D
DROPBAR: Defines Drop Bars

1034
END: Defines the End of an Object

1060
FBI: Font Basis

1026
FONTX: Font Index

1032
FRAME: Defines Border Shape Around Displayed Text

1066
GELFRAME: Fill Data

104E
IFMT: Number-Format Index

1015
LEGEND: Legend Type and Position

1043
LEGENDXN: Legend Exception

1018
LINE: Chart Group Is a Line Chart Group

1007
LINEFORMAT: Style of a Line or Border

1009
MARKERFORMAT: Style of a Line Marker

1027
OBJECTLINK: Attaches Text to Chart or to Chart Item

103C
PICF: Picture Format

1019
PIE: Chart Group Is a Pie Chart Group

100B
PIEFORMAT: Position of the Pie Slice

1035
PLOTAREA: Frame Belongs to Plot Area

1064
PLOTGROWTH: Font Scale Factors

104F
POS: Position Information

103E
RADAR: Chart Group Is a Radar Chart Group

1040
RADARAREA: Chart Group Is a Radar Area Chart Group

1048
SBASEREF: PivotTable Reference

101B
SCATTER: Chart Group Is a Scatter Chart Group

105B
SERAUXERRBAR: Series ErrorBar

104B
SERAUXTREND: Series Trendline

105D
SERFMT: Series Format

1003
SERIES: Series Definition

1016
SERIESLIST: Specifies the Series in an Overlay Chart

100D
SERIESTEXT: Legend/Category/Value Text

104A
SERPARENT: Trendline or ErrorBar Series Index

1045
SERTOCRT: Series Chart-Group Index

1044
SHTPROPS: Sheet Properties

1065
SIINDEX: Series Index

103F
SURFACE: Chart Group Is a Surface Chart Group

1025
TEXT: Defines Display of Text Fields

101E
TICK: Tick Marks and Labels Format

1001
UNITS: Chart Units

101F
VALUERANGE: Defines Value Axis Scale



If you take a look at what Microsoft makes available under the covenant not-to-sue, the BIFF documentation and all others, you'll soon realize that chart records are not included. In other words, a third-party can only, at best, implement support for Excel BIFF spreadsheets with no charts. Not exactly good news!

It actually gets worse.


A number of records remain absent



Obvious misses


Records such as those identified by 0x00EF and 0x01BA remain absent from the documentation. Those are records automatically created when a new Excel spreadsheet gets created with any Excel 97 (or more recent) version. These are obvious records, but a number of others are absent.


Trickier misses


Function byte codes are not documented. For instance, if you use a function such as CEILING(number, significance), you have no way to know with the documentation what bytecode this function gets encoded to since the table of function bytecodes isn't provided. There are over 300 built-in functions.

Even trickier is the fact that, in order to read/store or migrate a function through an extension mechanism (NAME or EXTERNALNAME records), or to build an Excel-like application, the actual function name string is used, but :
  • the list of function names is not in the documentation (note : it is given in another document, ECMA 376)
  • the list of actually localized function names isn't provided anywhere. (note : ECMA 376 lists the functions in English-US only)


It is problematic for a third-party not to have those localized function names for instance, since there is no way an application will be able to support such functions like Excel does without that. In fact, in order to know the names, you'll have to purchase Microsoft Office language packs :



Purchasing Office language packs, the only way to know the localized function names in Excel

A language pack is $24.95 each, and there are 36 of them! ($900 in total). Anytime you install a language pack, you can open a spreadsheet where the function of interest will appear under its localized name.

It's interesting that Microsoft claims to be committed to an open standard like that! If you intend to support Excel functions in any meaningful way, don't forget to include a $900 budget.



Many non-trivial records remain partially documented



Some records are missing, but even documented records remain partially documented records. Here are two simple examples :

  • BIFF documentation, OBJ record, page 153,

    ftCblsData (12h) check box data

    Offset | Name | Size | Contents

    0 | ft | 2 | =ftCblsData (12h)

    2 | cb | 2 | Length of ftCblsData

    4 | (Reserved) | var | Reserved


    The check box data remains undocumented. And so does most of internal objects stored inside the OBJ records. The OBJ records holds data for Form controls (listbox, edit, ...).

  • Office drawing documentation, page 7,

    Associated with each shape is a piece of client data that keeps the shape‘s anchor, text and OLE data, as well as host specific properties. The format of this structure is host-defined.


    And indeed, the msofbtClientData (F011) record isn't documented. Page 12, we learn :

    msofbtClientData (F011)
    host-defined
    host-specific data


Note that since this record has no section linked to it in the documentation, unlike a number of other records, it should be obvious for anyone from Microsoft to verify that this record isn't documented. This is unfortunate since this is an important record.

The two examples are a reflection of problems throughout all documents. What is meant by that is that, from Microsoft point of view, those documents may be "reference" documents in the sense that if you are taking a look at the corresponding source code, those documents are handy. But only they can do that. For everybody else, those documents are hints, not actual references. It is fire and motion, and it's been going on for the best part of two decades.


The migration scenario : here comes Microsoft's secret sauce



Let's take two examples showing how impossible a task it is to map binary records to XML elements. In this case, it will be calculation chains in spreadsheets.

The return of the calculation chains


If we do a full text search about calculation chains over the documents Microsoft has made available, we find two occurences in the BIFF documentation :

  • page 35 : ARRAY record, "chn, 4 bytes" ; Ignore the chn field when reading the BIFF file. If a BIFF file is written, the chn field must be 00000000h.
  • page 123 : FORMULA record, "chn, 4 bytes" ; The chn field should be ignored when you read the BIFF file. If you write a BIFF file, the chn field must be 00000000h.


That's about it. There is no actual definition and encoding algorithm.

In ECMA 376, part 4, a full text search about calculation chains returns the following :

  • page 2084, section 3.6 calculation chain ; The cells in a workbook can be calculated in different orders depending on various optimizations and dependencies. The calculation chain specifies the order in which the cells in a workbook were last calculated. The calculation chain only deals with cells that require calculation - i.e., it only deals with cells that contain formulas. It does not track or express dependencies amongst the formulas, but rather only records the order in which the cells were last calculated. (...) Since the xml data described in this section is not strictly required, the spreadsheet application is free to ignore the order in which the calculation chain specifies calculations.


And that "definition" is followed by an example.

So here are the problems :

  • How do you map the "chn" fields to a calculation chain ? In other words, how do you decode these?
  • Since "chn" fields are attached to formula and array records (in binary files), and that the calculation chain part is not in any way associated to the spreadsheet sheet parts (in SpreadsheetML files), it means a mapping needs it occur. What is it?
  • Both documentations make it clear that the calculation chain fields and calculation chain parts are essentially optional. But in practice, this is far from being the truth. In a previous article, I described what happens if you change the value in a cell without touching the calculation chain part : the spreadsheet is seen by Excel 2007 as a corrupt spreadsheet, it self explodes. On the other hand, deleting the calculation chain part will force a recalculation next time the spreadsheet is opened by the user, which is arbitrarily lengthy. If Excel is the only application that can ever avoid making recalculations, because it knows the secret sauce about calculation chains, no other application can be seen as an equivalent application. Quite the opposite, Excel will be seen as a first-class citizen, and everything else will be seen as simplistic (i.e. unreliable) spreadsheet applications.



Caching values to show drop lines in charts!


In binary files, adding a dropline to a chart is managed by a simple chart related record, CHARTLINE. Note that since chart records are not documented in the BIFF documentation, the following CHARTLINE definition comes from MSDN Library, Feb 1998 :

CHARTLINE: Specifies Drop/Hi-Lo/Series Lines on a Line Chart (101Ch)

This record specifies drop lines, hi-lo lines, or series lines on a line chart. If the chart has both drop lines and hi-lo lines, two CHARTLINE records will be present.

Record Data

Offset | Name | Size | Contents
4 | id | 2 | Drop lines/hi-lo lines. 0 = drop lines 1 = hi-lo lines 2 = series lines (the lines that connect the columns in a stacked column chart)


The way droplines are added is by having such record present or not. Everything else in the chart record stream is independent of the dropline.

Now let's do the following :

  • start Excel 2007
  • click on B2 and enter : jan, hit <TAB>, 3, <TAB>, 4, <TAB>, 5
  • select those cells, click on the Insert ribbon tab and click on the line chart button. Select whatever line chart type. The line chart shows up.
  • click on the Layout ribbon tab, then on Analysis, then Lines, then Drop lines. The drop lines shows up. You should see the following (the droplines are the vertical segments) :


    Adding droplines to a chart


  • save this file as Droplines.xlsx, close it, quit Excel 2007.
  • unzip the file and take a look at the xl/charts/chart1.xml part. You should see the following :


    The XML droplines element is indeed added to the chart description


  • The <c:dropLines> element is the exact equivalent of the binary CHARTLINE record. Or so it seems. now edit this part and remove the <c:strCache> and <c:numCache> elements, which are cached values. It should result in the following :


    Removing cached values in a chart description


  • put the edited part back in the zip file and open it in Excel 2007. The following appears :


    Droplines are gone!




Droplines are gone! In other words, droplines in XML charts are linked to cached values, contrary to how droplines used to work in binary spreadsheets (namely, cached values are not needed). Not only is this not documented anywhere, it puts a huge burden on implementers who now have, at minimum, to actually compute and store a cache of values from the chart data source only to be able to show droplines, despite the fact that the XML description is designed in a way such that the presence of element <c:dropLines> is all what is expected to add droplines to a chart.

Here are the Excel 2007 files (.xslx) :


And here are the binary files (.xls) :


In a binary file, cached values are stored in the chart footer, as follows :

[DIMENSIONS 000E] 00 00 00 00 03 00 00 00 00 00 01 00 00 00
[1065 0002] 02 00
[1065 0002] 01 00
[NUMBER 000E] 00 00 00 00 00 00 00 00 00 00 00 00 08 40
[NUMBER 000E] 01 00 00 00 00 00 00 00 00 00 00 00 10 40
[NUMBER 000E] 02 00 00 00 00 00 00 00 00 00 00 00 14 40
[1065 0002] 03 00




Conclusion



A couple of counter examples have demonstrated that Microsoft Office document interoperability is wishful thinking at this point. The documents made available by Microsoft for direct download won't and shouldn't impress third-party implementers since it does not help much.

What is being shown is that in addition to missing documentation, the binary documentations sometimes conflict with the ECMA 376 documentation, itself not a full documentation of the new XML-based formats anyway.

Who other than Microsoft is able to reliably migrate formats from binary to XML, and back ?

Microsoft seems to have anticipated the issue by creating...an open source project in Feb 15 whose task is to migrate files. But it's unclear that, short of rewriting the existing Microsoft Office source code, which could take a decade, how this project is supposed to achieve any meaningful goal. Obviously, ISO national bodies should take this open source project with a grain of salt...

A national body should probably ask instead Microsoft to open up the source code of the Microsoft Office compatibility pack since, after all, that is exactly what this component does. It migrates Microsoft Office documents back and forth, contains what's needed to migrate calculation chains, droplines, etc. and is therefore a perfect candidate for third-party implementations. Even better, Microsoft can make it available now!

Wednesday, February 27, 2008

Typical B.S. in technical articles about OOXML

Stéphane Rodriguez, February 2008

Previous articles :
The truth about Microsoft Office compatibility
OOXML is defective by design



Microsoft has posted an article on their community website related to OOXML. That technical article is supposed to compare OOXML's DrawingML markup for drawing objects and SVG, an open standard (W3C) developed by independent parties for drawing objects.

That article tries to appear neutral and fair, even though the author uses plenty of hyperbole to paint OOXML's DrawingML markup as the best thing since sliced bread, and demonize SVG as a simplistic markup, a point of view strikingly similar to Burton Group's study that painted OOXML as a wonderful file format, and demonized ODF as a simplistic file format. (we know Burton Group is a Microsoft client and business partner).

I recommend to read the pointed article before reading further.

To say it is a mess is a euphemism.

It is supposed to be a technical comparison, but the author never bothers mention what kind of comparison he's trying to make. There are indeed several comparison types :
- strict markup and semantics compatibility
- simple mapping to achieve markup and semantics compatibility
- blocking incompatibilities

Even in the blocking incompatibilies case, since DrawingML and SVG are both based on XML, there cannot be core incompatibilities preventing scenarios such as round-tripping. Indeed, the point of using XML is that you can always add missing markup from one language into another in order to account for the missing markup and semantics.

With that said, in the introduction, the document incorrectly defines SVG, portrayed as "something for the web", and uses this mischaracterization in the conclusion to actually show his affinity towards DrawingML : "a richer and nicer fit for Office documents".

The simple fact that the author uses "Office documents" instead of "Microsoft Office documents" is telling how little the author is willing to take into account the needs of general purpose Office documents, as opposed to a single vendor's Office document model.

What that document actually reflects is the author's own biased knowledge of the domain at hand.


1) The SVG definition : what it's about

According to the author, "Scalable Vector Graphics (SVG) is a graphics file format and Web development language based on XML. With SVG, user can describe the output (i.e. look and feel) of the content. SVG describes the common and extended feature set of today's graphical authoring environments, both tools and program. SVG is a common export format in today's graphical authoring environments. SVG is a language for describing two-dimensional graphics in XML. It allows three types of graphic objects: vector graphic shapes (e.g., paths consisting of straight lines and curves), images and text. With SVG, developers can create Web applications based on data-driven, interactive, and personalized graphics."

On the W3C website (http://www.w3.org/Graphics/SVG/About.html), SVG however is defined as follows :

"SVG is a language for describing two-dimensional graphics and graphical applications in XML. SVG 1.1 is a W3C Recommendation and forms the core of the current SVG developments. SVG Tiny 1.2 is the specification currently being developed as the core of the SVG 1.2 language (comments welcome). The SVG Mobile Profiles: SVG Basic and SVG Tiny are targeted to resource-limited devices and are part of the 3GPP platform for third generation mobile phones. SVG Print is a set of guidelines to produce final-form documents in XML suitible for archiving and printing.

Furthermore,

SVG is a platform for two-dimensional graphics. It has two parts: an XML-based file format and a programming API for graphical applications. Key features include shapes, text and embedded raster graphics, with many different painting styles. It supports scripting through languages such as ECMAScript and has comprehensive support for animation.

SVG is used in many business areas including Web graphics, animation, user interfaces, graphics interchange, print and hardcopy output, mobile applications and high-quality design.

SVG is a royalty-free vendor-neutral open standard developed under the W3C Process. It has strong industry support; Authors of the SVG specification include Adobe, Agfa, Apple, Canon, Corel, Ericsson, HP, IBM, Kodak, Macromedia, Microsoft, Nokia, Sharp and Sun Microsystems. SVG viewers are deployed to over 100 million desktops, and there is a broad range of support in many authoring tools.

SVG builds upon many other successful standards such as XML (SVG graphics are text-based and thus easy to create), JPEG and PNG for image formats, DOM for scripting and interactivity, SMIL for animation and CSS for styling.

SVG is interoperable. The W3C release a test suite and implementation results to ensure conformance.
"


As the definition tells, SVG can be seen as an ongoing work with a core set of markup to which an arbitrary number of so-called profiles can be attached to target specific devices for instance. i.e.

SVG Basic
SVG Tiny
SVG Mobile
SVG Print
...

SVG Basic itself uses fundamental standards such as CSS and Javascript to provide styling as well as programmability.

This definition provides ground for more profiles, such as an hypothetical "SVG chart", a profile that would define a charting package. In other words, SVG is fundamentally extensible to accomodate today's and tomorrow's needs.

Contrary to what the author claims, SVG is not specifically for the web.



2) DrawingML's definition.

In ECMA 376, Part 1

"DrawingML specifies the location and appearance of drawing elements in a package. For example, these elements could be, but are not limited to, shapes, pictures, and tables. The root element of a DrawingML XML fragment specifies the presence of a drawing at this location in the document.
A shape is a geometric object such as a circle, square, or rectangle; a picture is an image presented inside the document; and a table is a two-dimensional grid of cells organized into rows and columns. Cells and whole tables can have associated properties. A cell can contain text, for example. DrawingML also specifies the location and appearance of charts in a package. The root element of a chart part is chart, and specifies the appearance of the chart at this location in the document. In addition, DrawingML specifies package-wide appearance characteristics, such as the package's theme. The theme of a document specifies the color scheme, fonts, and effects, which can be referenced by parts of the document—such as text, drawings, charts, and diagrams—in order to create a consistent visual presentation.
A chart is a presentation of data in a graphical fashion, such as a pie chart, bar chart, line chart, in order to make trends and exceptions in the data more visually apparent.
"

This definition makes it very clear that, in constrast with SVG, DrawingML ties together hardcoded drawing object representations that no matter how they suit a particular vendor's product, entrenches anyone who would rely on DrawingML.

Just like elsewhere in ECMA 376, DrawingML is not based on any ISO standard. For instance, even though SVG can be tied to ISO-based ICC color profiles, used by industrial manufacturers, DrawingML colors are just the reflection of a vendor's own internal API.

No matter how quote, "rich and nice", end quote, this internal API is, it cannot compare with markup that can be adapted to arbitrary independent party profiles.

Since SVG is being built by a number of independent parties, whereas DrawingML is built by a single vendor, comparing SVG to DrawingML only contributes to putting the light on DrawingML proprietary developments. It actually goes against portraying DrawingML as a good thing to have.


3) Comparisons

3.1 Given the extensibility nature of SVG, there is no discussion whether SVG can support all of DrawingML elements. It can. It's a direct consequence of XML-based markup.

Now, that would be for markup compatibility.

3.2 Another comparison type is simple mapping to achieve markup and semantics compatibility. By that, the author would want to know if there is a way to map any DrawingML fragment to any *EXISTING* SVG fragment. But even that draws a few cases : which way, and is it both way?
- which way : if we have a specific markup element in DrawingML that we are willing to represent in existing SVG markup, this task is bound to fail if we are trying to map intricacies of DrawingML as opposed to general purpose drawing markup. For instance, in section 3.1.2.2 of that document, the author tries to match the DrawingML's a:comp markup with an equivalent in the existing SVG markup. And fails to find it. Well, there are good reasons for that. First of all, it puts a severe burden on implementers who have to compute and maintain color stacks (to resolve a:comp into an actual color when it comes to rendering that element), which is probably not desirable on devices where SVG is supposed to be thin and fast. For instance, if SVG is used for printing purposes, the printer driver using the equivalent of a:comp would have to compute and maintain color stacks. It is not sure that it is in either the printer manufacturer's interest to do so (implementation tax), or to manufacturers in general to do so. A preferred way is to use SVG profiles which provide room for selecting explicit colors among alternate lists (CSS based, most notably). What this allows is cheaper device implementation support of SVG based documents.

While this example of the a:comp markup is just one example, the entire document goes like this using the same logic.

- the other way : since DrawingML is hardcoded, an arbitrary number of SVG fragments may not be represented in DrawingML with markup and semantics compatibility. This problem is directly tied to the fact that 1) DrawingML is not based on SVG, it defines his own world 2) DrawingML lacks a number of SVG concepts and therefore cannot directly map markup and semantics. And there is one problem, today's Office 2007 runtime does not allow ANY FOREIGN XML ELEMENT in any of the core ECMA 376 languages, including DrawingML. In other words, if you take a DrawingML fragment, and add a few markup to tag it with information that defines some SVG concept from a SVG fragment, that would be seen by Office 2007 as a case of corrupt document.

- both ways : these are both cases above combined.


3.3 The last case of comparison is strict markup and semantics compatibility. This case is automatically bound to fail since DrawingML was developed by a single vendor and is not based on other standards.



4) Conclusion

It does not demonstrate anything other than :

- there is no direct mapping between DrawingML and SVG, and that has the unfortunate consequence of pointing out the single vendor's proprietary design that went into DrawingML. Should the single vendor have designed DrawingML on top of SVG, the situation would be drastically different.

- the lack of the author's own understanding of the fundamental reason why people use XML markup : in essence you can add all the missing markup from one language into another to satisfy scenarios such as round-tripping (DrawingML ==> SVG ==> DrawingML) without loss.



Some additional background information. The author of that document is Sheela E N, an employee of Sonata software, an Indian consulting company. Sonata software has made a name for themselves since 2006 by contracting with...Microsoft for helping them test a Microsoft-sponsored open source project ODF-OOXML, under the guidance of French-based CleverAge, and a number of other consulting companies. Those ties define a biased source (directly paid) when it comes to discussing Microsoft file formats.

Friday, February 8, 2008

The truth about Microsoft Office compatibility

Stéphane Rodriguez, February 2008

 



The Microsoft Office product team rationale for designing such a poorly engineered file format known as OOXML is that, according to public statements they made, they do this not to advance the state of the art in Office document models, but to bring the two-decade worth of legacy and bugged if not broken features into the future.



They have a commercial reason to do so. According to their numbers, there is a 400,000,000 user install base. But with the introduction of the new file format OOXML and the accompanying application that they rushed out the door in order to ship in line with Windows Vista, is it true that Office 2007 is Microsoft's best thing since sliced bread when it comes to compatibility?



1) There is no Office 2007 64-bit edition

2) No support for 64-bit addins, user defined functions, existing ActiveX controls, OLE servers or even managed code (.NET).

3) No support for VBA in Excel services, part of their server suite

4) No support for VBA in Office 2008, the Mac version

5) Reduced speed of Office 2007 on 64-bit operating systems

6) Limited memory space of Office 2007 on 64-bit operating systems

7) Reduced legacy code compatibility of Office 2007 on 64-bit operating systems

8) OOXML supports 64-bit at all?

9) An open standard excluding competitors

10) Reverse engineering a necessity for compatibility reasons

11) No compliant file format




 


1) There is no Office 2007 64-bit edition



As amazing as it sounds, Microsoft didn't bother ship a 64-bit version of Office 2007, a version that would run natively on one of the 64-bit operating systems that they have made available on Windows (XP 64 edition, 2003 x64 edition, 2003 Itanium edition, 2008 server).



64-bit computers have become mainstream, and demand is growing very fast.



Microsoft mentions it on their website (Office system requirements) :


System requirements overview



The 2007 Microsoft Office system programs client is a 32-bit application and can run on a Windows 64-bit platform (Windows XP, Windows Server 2003, and Windows Vista) but there may be some feature limitations as noted in the system requirements below. (...)


The company that ships both Office and Windows couldn't find a good reason to provide a native support to their customers using 64-bit operating systems. Customers will have to run in emulated mode, otherwise known as WOW64 so that Office runs at all.



A number of limitations make it impossible once for all to use this product with the confidence that existing applications and add-ins work as intended. Despite compatibility claims. To add insult to injury, a number of performance issues arise as well, making hard to talk about compatibility without sounding like a liar. If you are in the process of purchasing Microsoft software, you may want to hold back a bit and perhaps consider an alternative.





The Microsoft Office compatibility matrix (true for all Office versions)


 

We are going to get in such details.



 


2) No support for 64-bit addins, user defined functions, existing ActiveX controls, OLE servers or even managed code (.NET).



Microsoft has not touched VBA, the scripting language that so many business users are fond of, for a number of years. In fact, it's been 7 years that they simply chose to outsource the maintenance of VBA (i.e. no more development) to Summit Software, a consulting company. VBA is at the heart of the so-called Office platform. It's what appeals to customers in need of custom applications built around Office. The fact that VBA is not being developed anymore should give anyone the assurance that anytime Microsoft ships a revision of Office, all applications based on VBA automatically work. Well, there is one problem : there isn't a 64-bit version of VBA.



If you are using a VBA-based third-party solution that depends on anything native of the 64-bit operating system, and get it loaded as part of Office, it simply won't work.



In retrospect, one has to ask, is the lack of 64-bit VBA the reason why Microsoft does not ship a 64-bit edition of Office ? This question is worth asking. If the answer is yes, it means Microsoft is lagging behind its own technology, and they are unable to meet the needs of anyone using a 64-bit operating system to take advantage of the 64-bit improvements over 32-bit.



What applies to VBA, applies to all user defined functions and even managed code (.NET) as well. No 64-bit user defined function or 64-bit managed code assembly can be loaded as part of a 32-bit image (Winword.exe, Excel.exe, Powerpoint.exe). As a result, your solutions and third-party solutions cannot take advantage of Office and the 64-bit operating system advantages simultaneously.



 


3) No support for VBA in Excel services, part of their server suite



Microsoft ships Excel in two editions, a desktop edition (Excel 2007) and a server edition (Sharepoint server 2007 which includes a server, Excel services). Even though Microsoft ships a 64-bit edition of the server suite, you still cannot use VBA as part of Excel services.



Why it is so? Simply VBA was designed as a STA COM apartment, Microsoft novlang to mean single-threaded. Excel services is designed to distribute calculations and serve users across threads (fundamentally multi-threaded architecture) and machines, so a single-threaded component cannot be part of this architecture.



The consequence is, should you consider using Excel services instead of the desktop of Excel to meet your needs, you automatically lose the ability to open reliably any spreadsheet embedding solutions and third-party solutions using VBA. For instance, Financial related add-ins shipping by default with Excel 2007 won't run.



 


4) No support for VBA in Office 2008, the Mac version



When Microsoft shipped the Mac version of Office 2007 in January this year, known as Mac Office 2008, it had no qualms taking away the support for VBA. Making any VBA-based solution broken on the Mac.



In the top 5 issues of the Microsoft website related to Mac Office 2008, it says :

My Visual Basic macros don't work



Cause: Office 2008 for Mac cannot run Visual Basic macros or load add-ins that contain Visual Basic macros.





Solution: Keep the macro in the file.

Solution: Remove the macro from the file.

Solution: Save the macro in another macro-enabled file format.

Solution: Create a new macro by using AppleScript.



Apparently, it does not matter so much that the ease and essence of VBA-embedded macros in documents employed by business users met their needs. Remember, VBA macros are a clever way for users to customize solutions without having IT people interfering. A remarkable feature is that, by embedding VBA macros, the deployment of solutions is trivial and does not involve the hassles of IT department. A sharp contrast with .NET, Microsoft's proposal for the future, which uses external files (assemblies) and additional rules for running : impossible to deploy without IT people.



 


5) Reduced speed of Office 2007 on 64-bit operating systems



When an application runs in WOW64, the emulation mode, it marshals memory access back and forth to adjust pointer sizes. Thereby creating a performance hit. Especially crucial for Word pagination, Excel calculations and Powerpoint animations.



Compatibility is therefore somewhat subjective. If your spreadsheet used to calculate within 10 minutes, and now it take twice the time, it's clear the in companies where speed is crucial, it is regarded as a bug that must be fixed.



One of the advantages of the 64-bit CPU is that whenever an application is natively 64-bit it maximizes the use of the increased register set. But when the application is running in WOW64, it has to restrict itself with the typical 32-bit CPU registers, and therefore unable to take advantage of the speed increase obtained by an increased register sets (as a result of less stacking).



 


6) Limited memory space of Office 2007 on 64-bit operating systems



Another advantage of a 64-bit operating system and CPU is the ability to run processes using a 64-bit wide address space. We are talking an infinite memory space, so to speak. But in 32-bit, and thefore in the WOW64 emulation mode, none of that is possible and a running process is restricted to 4GB of memory space, with 2GB preempted by the kernel.



Expecting to work with larger spreadsheets or documents gets stopped in the starting blocks. Making a purchase of such system is therefore not economically sound.



 


7) Reduced legacy code compatibility of Office 2007 on 64-bit operating systems



As per Microsoft own website,



Q. Are there any features in the 32-bit versions of Windows that are not in Windows Server 2003 x64 editions?


A.



A small number of features are not included in x64 Windows, including DOS, POSIX, 16-bit support, and a few legacy networking protocols no longer in active use. However, we do not expect most customers to be affected by these differences. Based on customer feedback, we expect the initial x64 usage scenarios for Windows Server 2003 to be databases, business applications, Terminal Server, Active Directory, Internet Information Services (IIS), and technical computing.


If you are using old Excel 4.0/5.0 modules, user-defined functions built on 16-bit code, that work with a 32-bit operating system, you better not expect it to work on the 64-bit operating system with Microsoft Office. The reason why is that the WOW64 emulation mode does not marshall 16-bit pointer sizes. No compatibility, period.



In the real world, many businesses are using old 16-bit solutions on a daily basis. There is no migration path for them on 64-bit operating systems.



 


8) OOXML supports 64-bit at all?



If anything, the previous sections have illustrated that "Microsoft Office" and "64-bit operating systems" are strangers. As such, how does one make a relevant compatibility case of OOXML consumed or produced by a native non-Microsoft 64-bit application? How do we know that today's OOXML makes sense at all across platforms?



 


9) An open standard excluding competitors



The ECMA 376 proposal tries to make a case for compatibility with legacy formats, such as binary formats. Forgetting a moment that the new XML formats still contain a number of binary blobs and are therefore in direct contradiction with the proposal, the goal states that :



ECMA 376, Part 1, Section 2.1 The goal of this clause is to define conformance, and to provide interoperability guidelines in a way that fosters broad and innovative use of the Office Open XML file format, while maximizing interoperability and preserving investment in existing files and applications.


It's pretty clear that the justification for the proposal is the preservation of the investment in files and applications. The preservation of the investment in applications has been addressed in the sections above of this article, where we have seen that the reality is a little different : Microsoft can't seem to be able to interoperate across their own platforms. With regards to binary files, it is just the same, as we are going to see.



Microsoft Office 2007 keeps secret how it migrates a binary file into a new file. The use case is neither described, documented, illustrated in ECMA 376 even though that is the stated goal of...ECMA 376. Competitors are excluded from doing the same reliably. Is that what we call exclusive openness?



We are in a case where the stated goal and content of the body of the ECMA 376 proposal don't match. This case was brought by a number of national bodies. The only national body where the discussions were made public (Microsoft originally vehemently opposed the exposure) was the US INCITS review group. This critical question remained unanswered, and it is known and widely reported that the US INCITS ballot brutally changed from a negative vote (overwhelmingly, first ballot in July 2007) to a positive vote (almost unanimously, in August 2007) after Bill Gates made a phone call to the US secretary.



To review the US INCITS discussions, head over to :


US INCITS V1 archive, April
US INCITS V1 archive, May
US INCITS V1 archive, June
US INCITS V1 archive, July
US INCITS V1 archive, August
US INCITS V1 archive, September
US INCITS V1 archive, October



 


10) Reverse engineering a necessity for compatibility reasons



Microsoft made clear publicly that the design of the XML vocabulary was not done at the expense of performance (performance is in the DNA of their engineering techniques, or so they say), and that they worked towards minimizing the performance hits due to the XML parsing, whose footprint is actually hidden thanks to the ZIP compression, a technique introduced by Open Office, the open source Office suite.



While this induced considerations such as keeping XML names as small as possible, and avoiding redundancies through redirections, both at the expense of implementors, what Microsoft won't say is that they went as far as making an extensive use of default values, that they consequently do not even store in the file.



It is a problem. You cannot find information when it is not in the file! Microsoft has thousands of default values that they use in a running instance of Office 2007, values that can be overridden by explicit attribute values, with the implication that non-Microsoft applications must reproduce the same in order to be compatible. All this is implementation-specific at the expense of the file format, that's where document design interferes with application design. Performance is in their DNA, but two decades of closed product development have consequences. Obviously Microsoft never designed Office 2007 with the idea that they would submit their internal specifications to an international organization.



In other words, short of Microsoft providing the exhaustive list of such default values, the combinations not only trivial ones, there is no way a competing application can be reliably compatible without reverse engineering Office 2007. Which defies the point of the ISO proposal.




 


11) No compliant file format



There is no strict "Save As ECMA 376" in Office 2007. There is no way an agent (application, third-party, SOX) is going to be able to produce such files. If that was not enough, ECMA 376 itself is a moving target due to Microsoft's own incompetence in proposing a format that was ready for international standards bare minimum requirements.



How does one achieve a file format compliance under those conditions?



 


Conclusion : this article has demonstrated that any angle you take the problem, compatibility is more wishful thinking than reality. In fact, experienced personnel will recognize much of the evolution of Microsoft Office over the years. The problem is that Microsoft is going to ISO with it. It contradicts even the most trivial assumptions.
Fast-tracking the ECMA 376 proposal made no sense at all. Microsoft is urged to go back to the drawing board.


Tuesday, August 28, 2007

Microsoft Office XML formats? Defective by design

 


Microsoft is trying to push new file formats that are using ZIP and XML. Are those new file formats any good for Office developers ? In other words, should anyone feel safe to make direct access to file parts, and start getting free of running instances of Microsoft Office and its COM object model, usually through VBA ?


Microsoft does not run out of teasing. There is ton of videos, see here, and here for example, screencasts, articles and blog posts (self-serving Microsoft blog posts mostly) about how much they are opening up. It boils down to the following, excerpt from Microsoft Office 12 introduction white paper  :


(...)
The use of XML offers the benefits of greater transparency and openness than were possible with the previous binary file formats. The new formats allow Office documents to easily integrate with existing and future line-of-business systems, as the contents are now open and accessible. The new formats are also
designed with long-term robustness and accessibility in mind. (...)
The binary file formats in use currently were designed in 1994—before the advent of XML and before widespread exchange of documents and data that is common today. These file formats, .doc, .xls, and .ppt, were introduced with the release of Microsoft Office 97, at a time when it was important to optimize the files for storage on slow hard drives and “floppy” disks; it was not as crucial to focus on easy access to data within the files for better content
reuse, document generation, and seamless integration of the documents into business processes
. (...)
The new XML-based file formats in these programs enable broader integration and interoperability between Office documents and enterprise applications.
Additionally, “Office 12” files are all wrapped using ZIP technologies, which allows for easy access to the content parts as well as standard compression, reducing file sizes and improving reliability and data recovery. (...) Because documents stored in the Open XML Formats are machine-readable and editable by any text editor or XML processor, solutions need not use
Microsoft Office programs to view or edit content within the documents.
Enterprise business solutions can access document contents easily and efficiently. Technology providers can utilize the Microsoft Office System and Office authoring applications within their solutions, reuse Microsoft Office documents as other Office documents, or open and act on Office documents on other platforms and in other applications.


 


They insist on the fact that, provided you make a valid use of the XML, pretty much changing the content of anything in an existing document can be achieved by sequentially 1) unzipping the content 2) making appropriate changes to one or more XML parts that are compatible with the provided XML schemas and open packaging relationships 3) zipping the content back .


Let's see if that's true.


1) Self-exploding spreadsheets


2) Entered versus stored values


3) Optimization artefacts become a feature instead of an embarrasment


4) VML isn't XML


5) Open packaging parts minefield


6) International, but US English first and foremost


7) Many ways to get in trouble


8) Windows dates


9) All roads lead to Office 2007


10) A world of ZIP+OLE files


11) Document security is a (bad) joke


12) BIFF is gone...not!


13) Document backwards compatibility subject to neutrino radio-activity


14) ECMA 376 documents just do not exist


15) How the ISO OpenDocument format (ODF) compares?


 



1) Self-exploding spreadsheets


To reproduce the scenario :


  • start Excel 2007 and create a new spreadsheet
  • insert a value 10 in a cell
  • insert a value 20 next to it
  • select the two cells and click on the "sum" button to create a sum of the two cells
  • save the spreadsheet (xlsx file)
  • close it and unzip it

The relevant XML in the corresponding part xl/worksheets/sheet1.xml is :


<row r="2" spans="3:5">
 <c r="C2">
  <v>10</v>
 </c>
 <c r="D2">
  <v>20</v>
 </c>
 <c r="E2">
  <f>SUM(C2:D2)</f>
  <v>30</v>
 </c>
</row>

Pretty simple XML. Now say we want to edit cell E2 and set a constant value of
40 in place of a formula. But instead of doing that with Excel 2007 interactively, we are going to do it manually :


  • unzip the file
  • grab a zip part known as xl/worksheets/sheet1.xml
  • make the edit described below
  • put the updated zip part back in the zip package

The corresponding valid (and carefully changed) XML for setting the constant
value of 40 in cell E2 is :


<row r="2" spans="3:5">
 <c r="C2">
  <v>10</v>
 </c>
 <c r="D2">
  <v>20</v>
 </c>
 <c r="E2">
  <v>40</v>
 </c>
</row>

Now open the file in Excel 2007. You get a blocking error message which says :


 

Excel 2007 cannot open the file we have manually updated


Followed by another even more frightening message :





According to Excel 2007, the problem is that the calculation chain is now corrupt


Interestingly enough, we thought parts of a spreadsheet file were updatable as long as we did not touch elements that are, according to the ECMA 376 documentation (the official paper from Microsoft), indexes to other parts.


Now that's an interesting issue. The ECMA 376 documentation says that the
calculation chain is the graph of formulas, sorted wrt to their dependencies. Suddenly, the little change we would like to make looks way more expensive. Rebuilding the graph of formulas ourselves is what Excel itself does, and sure enough it involves parsing the entire spreadsheet, discovering formulas, and applying formula parsing algorithms to induce a graph of the dependencies. It certainly sounds like we are going to have to rewrite a portion of Excel itself. Not every employer can afford waiting that much...Or perhaps that's Microsoft's way to tell us : you shall not touch this without our approbation.


We can perhaps get rid of the calculation chain, if we carefully delete the part and associated relationship. But then, there are three problems :


- Parts are intertwined together through implicit and explicit relationships
(relationships are separate zip entries). It gets even more grainy, and the risk of corruption increases. Again, we have to further take into account a number of details that are out of our scope, like how the calculation chain is defined as per the workbook part.


- If I were a programmer, the question would be : Microsoft gives no library
that I can use, at least not a library that I could use no matter the execution environment. Microsoft provides an API which works in a recent .NET run-time, and installs on Windows XP SP2 and Windows Vista. There goes the platform independence.


- If I delete the calculation chain, I am admitting that the resulting spreadsheet is being degraded, something that Excel 2007 does not suffer from since it takes care of that thanks to its infrastructure. In other words, making changes outside of Excel 2007 doesn't look as first-class citizen and safe and robust as it sounded first, and in either case either the application takes care of a lot of details, replicating what Excel itself does (there is no known non-Microsoft Excel 2007 implementation available out there), or Excel 2007 will have to do that for me next time the spreadsheet is opened. If the changes are made on a server without Excel 2007 installed and the resulting spreadsheet is distributed to employees throughout the organization, then every single employee will have to get through a full spreadsheet recalculation (arbitrarily lengthy) next time they open the spreadsheet. The application is a second-class citizen compared to Excel 2007, that's what everybody thinks : me and the employees.


What this shows pretty clearly is that either we lack the tools, or Microsoft
does not think we should be doing that in the first place. With that being said, if making a simple change to a cell is too much to ask, then what is this new format good for? The prospect of getting our recipients facing those dreaded message boxes is not exactly a change compared to the well known ugly stories with corrupted binary file formats.


Let's play the devil's advocate now and see how far it will take us. Here is the contents of the calculation chain, xl/calcChain.xml :



<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<calcChain xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
 <c r="E2" i="1"/>
</calcChain>

The reference to formula in cell E2 is what seems to be causing the problem. All right, then since it's just XML, let's remove that reference. Edit the calculation chain xl/calcChain.xml so it looks like this :



<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<calcChain xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
</calcChain>

After the modification, if we open the file in Excel 2007, it still complains :


 

Excel 2007 cannot open the file even if the calculation chain is cleaned up manually


I particularly like the wording : catastrophic failure. Now it looks like a lot of blood is spilling. Well, then let's stop the hemorrhage, and remove the contents of this part altogether. Make the update, and open the file in Excel 2007, it continues to complain :


 

Excel 2007 cannot open the file even if the calculation chain is made empty


I guess it's time to take a look at the ECMA 376 documentation. In Part 4, page 2087, it says (emphasis mine) :



3.6.2 calcChain (Calculation Chain Info)
This element represents the root of the calculation chain.

<complexType name="CT_CalcChain">
 <sequence>
  <element name="c" type="CT_CalcCell" minOccurs="1" maxOccurs="unbounded"/>
  <element name="extLst" minOccurs="0" type="CT_ExtensionList"/>
 </sequence>
</complexType>

I guess there we have it, we can't have a calculation chain part with no cell reference in it. Excel guts spilling through the specs here, aren't it? Wait a minute, is this supposed to make into an international standard?


Let's do a quick summary : we had Excel 2007 complain that the calculation chain was left with a reference to a formula that did not exist anymore, but in fact Excel 2007 complains even with this manually fixed. The solution is to delete the physical calculation chain part and the relationship it has with the workbook (defined in the workbook relationship part xl/_rels/workbook.rels), and to update some other parts as well.



Let's get a visual diff of what it takes to make a "proper" change in a cell :



BookSample.xlsx (original)DIFFBookSample.xlsx (updated)
[Content_Types].xml1[Content_Types].xml
_rels/.rels