Friday, March 21, 2008

Custom XML? What Custom XML?

Stéphane Rodriguez, March 2008

More recent article on Custom XML (Jan 2010) :
- Shaving off standard XML for proprietary stuff
Previous articles :
- Backwards compatible? One more lie by omission
- Bad surprise in Microsoft Office binary documents : interoperability remains impossible
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


Other than backwards compatibility, the expression "Custom XML" plays an important role in Microsoft ISO OOXML evangelism. It's interesting that Microsoft bloggers don't even seem to be embarassed by ridiculous expressions such as "Custom XML". Custom XML is indeed just as silly as "Office Open XML" : the reason is X in XML already means Custom. So there cannot be a meaningful sense for Custom XML...unless Custom XML is short for Custom XML applications. And from that it makes sense. But there is a problem : Custom XML is part of the ISO proposal of a document file format, whereas Custom XML applications implies the apparatus and logic related to applications, inherently tied to products, platforms and operating systems, not documents. Did Microsoft feel guilty, removed the word applications knowing it wouldn't stand a chance otherwise ? That's for anyone to guess. In this article, we are going to delve into the so-called "Custom XML", and how little useful it is in practice.



Custom XML definition, as per Microsoft


Straight from the horse behind it, Brian Jones :

Custom XML is the support for custom defined schemas. It's that support that allows you truly integrate your documents with business processes and business data. You can define your data using XML Schema syntax, and then you can use that data in your Office documents. By opening up our formats with our reference schemas, and supporting your custom defined schemas, you get true interoperability of your documents. Sorry if this is currently sounding more like a marketing pitch, but I wanted to make sure I reiterated our vision for XML support in Office documents and hopefully that will help you see the power that we see. (...) Up until now we've talked about all the parts that we in Office have defined to create our documents. You as a developer also have the ability to add your own parts though. You can take any XML file and put it inside the ZIP package.


Translation : by XML, we actually mean several completely different things, and we've put all of it in the same pot. We think storing XML-based data inside the ZIP package is an efficient way to share your confidential corporate data to the outside world and we are proud to make it easy to do so. Likewise, storing business data and the document together illustrates the grand Microsoft vision about independent layers.


Straight from the marketing people at Microsoft, here Doug Mahugh (answering Patrick Durusau during the INCITS V1 review of OOXML back in April 2007) :

DIS 29500 (OOXML) serves other purposes that are not served by ISO/IEC 26300 (ODF), especially in the area of integration options for external schemas. (By "external" I mean schemas that are not part of the spec itself -- in common usage we tend to call these "custom schemas" as opposed to the "reference schemas" in DIS 29500).


Translation : at Microsoft, we ship products to make it possible to do what you can already do without.


Straight from a top Microsoft Office brass :

Open XML allows for custom XML markup within the body of a document which is a handy way to allow users to tag their content for interoperability with other types of software such as a custom line of business system.


Translation : we dare you put foreign XML markup into an existing XML, with no agreed-upon semantics between the two languages, in order to improve the interoperability. What kind of interoperability we are talking about is left as an exercise to the reader.



Enough marketing fluff. What it really is.


First, let's get out of the way that "Custom XML" actually means "Custom" "XML" at all.

  • Start Word 2007.
  • Create a new document.
  • Type "test".
  • Save and close the document.
  • Unzip it.
  • Grab the part word/document.xml, you should see the following :


    <w:p>
    <w:r>
    <w:t>test</w:t>
    </w:r>
    </w:p>

  • Now add some custom XML markup :


    <w:p>
    <w:r>
    <w:t mytag="myvalue">test</w:t>
    </w:r>
    </w:p>

  • Put the edited part back into the ZIP file and open it in Word 2007. It opens perfectly well.
  • Close it. Unzip the file again. Grab the part word/document.xml, you should see the following :


    <w:p>
    <w:r>
    <w:t mytag="myvalue">test</w:t>
    </w:r>
    </w:p>

  • Now remove the custom attribute and instead add some other custom XML markup, this time an element :


    <w:p>
    <w:r>
    <mytag>myvalue</mytag>
    <w:t>test</w:t>
    </w:r>
    </w:p>

  • Put the edited part back into the ZIP file and open it in Word 2007. This time it brings the following error message :


    Corrupting the document by adding custom XML markup

  • Perhaps this is a namespace issue. Let's prefix our custom XML with w so that the fragment looks like the following :


    <w:p>
    <w:r>
    <w:mytag>myvalue</w:mytag>
    <w:t>test</w:t>
    </w:r>
    </w:p>

  • Put the edited part back into the ZIP file and open it in Word 2007, and indeed this time it opens well.



From those simple tests, we can infer the following :

  • There is no such thing as strict XML schema validation in Word 2007.

  • You can add a custom XML attribute with no namespace (therefore the attribute uses the element's namespace, in our case w), and Word 2007 will not complain even though the mytag attribute is not part of the OOXML reference schemas. See ECMA 376, part 4, page 199, section 2.3.2.23, the w:r element is defined as follows :


    <complexType name="CT_R">
    <sequence>
    <group ref="EG_RPr" minOccurs="0"/>
    <group ref="EG_RunInnerContent" minOccurs="0" maxOccurs="unbounded"/>
    </sequence>
    <attribute name="rsidRPr" type="ST_LongHexNumber"/>
    <attribute name="rsidDel" type="ST_LongHexNumber"/>
    <attribute name="rsidR" type="ST_LongHexNumber"/>
    </complexType>

    and therefore does not allow the mytag attribute to be present.

  • You cannot add a custom XML element unless it is prefixed by one of the namespaces declared in the header of the XML document, in our case w (xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"). Rather interesting again since w:mytag does not belong to the OOXML reference schemas, and yet Word 2007 does not complain about it.


Needless to say, this is a broken implementation of XML that neither satisfies the "strict XML" camp, nor the "loose XML" camp.

It can be summarized by the following table :






Word 2007
TestResultStrict validation
mytag="myvalue"OKNO
<mytag>myvalue</mytag>FAILN/A
<w:mytag>myvalue</w:mytag>OKNO


And for the other two applications, Excel 2007 and Powerpoint 2007, the results are as follows :






TestWord 2007 resultExcel 2007 resultPowerpoint 2007 result
mytag="myvalue"OKFAILFAIL
<mytag>myvalue</mytag>FAILFAILFAIL
<w:mytag>myvalue</w:mytag>OKFAILFAIL


Since Word 2007, Excel 2007 and Powerpoint 2007 do not handle Custom XML the same way, as the table above shows, it's very hard to trust Microsoft when they claim that Custom XML is a feature of OOXML.

Either application will see the document as a corrupt one is guaranteed to be random, Microsoft cannot possibly imply that the Custom XML they are talking about is what we take for granted when we say "Custom XML", i.e. the ability to add our own XML within the document.



Custom XML, as per ECMA 376


Since "Custom XML" does not mean "Custom" "XML", we have to rely on ECMA 376's definition of such thing. The first surprise is that there is a notion of "Custom XML markup" and a notion of "Custom XML data".

The second surprise is that the notion of "Custom XML markup" only appears in the documentation for Word documents. If we assume that this "Custom XML markup" will be used to bind the "Custom XML data" to the document at run-time, we can infer from ECMA 376 that this is only made possible for Word documents.

In other words, whatever "Custom XML" is, it is only fully implemented for Word, so it should be called "Custom XML in Word". When Microsoft marketing people are trying to sell us "Custom XML" as a feature of OOXML, it is a lie. Let's create a table of what we have just learned.





 Word 2007Excel 2007Powerpoint 2007
Custom XML dataYESYESYES
Custom XML markupYESNONO


"Custom XML data", ECMA 376 part 4 section 8, supported in both documents, is the ability to store an independent XML stream in the ZIP package. In fact, it is not a feature of OOXML at all, it is a feature of any ZIP archive. After all, a ZIP entry in a ZIP package can be anything, including an XML stream. We can infer from that, that "Custom XML data" is in fact nothing custom : the ability to store an independent XML stream is not something we should thank Microsoft for allowing us to do so.

"Custom XML markup", ECMA 376 part 4 section 2.5, supported only in Word documents, is the ability to bind the "Custom XML data" to the document's content, at run-time. Interestingly enough, the Word team at Microsoft haven't quite managed to merge this concept and the old "smart tag" concept. That's why in the ECMA 376 specification, we end up with several flavors of "Custom XML markup", one of which is smart tags, the ability to add metadata to content (eg. stock quote).

An interesting element is "run-time". If you write the Custom XML markup that make it possible to do the said data binding, it has to be reminded that the data binding is done by an instance of Word, not a third-party application. So data binding is just a lock-in. That's the difference between standardizing on a document versus standardizing on an application!

There is no mechanism for doing such thing in Excel spreadsheets and Powerpoint presentations. In Excel spreadsheets, the XML data binding, a feature available from the user interface, is a special case of data source querying where the data source is an XML stream. The XML stream is external to the ZIP package. In other words, the "Custom XML data" in spreadsheets is useless. In Powerpoint presentations, it's even more trivial since there is no such thing as an XML data binding mechanism from the user interface.



The merit of "Custom XML data"


Something interesting to note is that Microsoft thinks that storing data inside the ZIP package independently of the document is a good thing. From a pure technical point of view, you can view this "Custom XML data" as a cache of values thanks to which the consumer is able to drill into the data without a connection to the actual data source (corporate data). But there is a major flaw. Anybody using this feature will end up storing arbitrary data in ZIP packages shared across colleagues and others inside and outside the organization. Eventually, confidential information from the corporate databases will end up there, and a PR disaster automatically follows. You don't want to use this feature.



Conclusion


"Custom XML" does not mean much, despite Microsoft ample evangelism of said feature. Technically speaking it has no merit within the enterprise space because you end up sharing corporate data. An interesting fact is that "Custom XML" is actually only implemented in Word 2007. For instance, the ECMA 376 specification provides a data binding for Word 2007 documents, exclusively. Ironically enough, the ability to store an independent XML stream as part of a ZIP package, is just a feature of the ZIP library, not Microsoft's innovation.

Wednesday, March 19, 2008

Backwards compatible? One more lie by omission

Stéphane Rodriguez, March 2008

Previous articles :
- Bad surprise in Microsoft Office binary documents : interoperability remains impossible
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


Ever since Microsoft Office XML file formats were introduced, the excuse for using such poorly engineered markup was that it was designed to be backwards compatible with the so-called billions of existing documents out there.

Is it really the truth? Or is Microsoft using it to fuel more fire and motion?


Lying by omission, as per Wikipedia :
Lying by omission is when an important fact is omitted, deliberately leaving another person with a misconception. This includes failures to correct pre-existing misconceptions. One may by careful speaking contrive to give correct but only partial answers to questions, thus never actually lying.



Backwards and forwards compatibility


What's backwards compatibility first of all? For everyone in the world, except for Microsoft, file backwards compatibility means that product in version N can work seamlessly with files produced by the product in version N+1. For Microsoft however, it's the opposite, backwards compatibility means that product version N+1 can work seamlessly with files produced by the product in version N. By "work seamlessly" is meant no visible degradation, no noticeable loss in functionality.

To claim that a file format is backwards compatible is therefore quite an ambitious goal, especially if the file format in question is complex. It's so ambitious that it's almost an impossible task in practice. In fact, it behaves much like a theorem in mathematics. To prove the theorem is wrong, you only need one counter-example. To prove backwards compatibility is not met, you just need to come up with one counter-example.

So let's take a simple chart and see how backward compatibile it is.

  • Start Excel 2003 (or any Excel version older than Excel 2007)
  • Type jan, hit <TAB>, 3, <TAB> 4, <TAB>, 5
  • Select those cells, click the chart creation button in the main toolbar and choose a column type bar


The chart shows up like this :


A simple chart made with an older Excel version

Now resize the chart, and you can see that everything gets scaled accordingly : text in all chart elements, plot area, ...


Resizing a simple chart made with an older Excel version

  • Now start Excel 2007
  • Open this file
  • The chart shows up



The chart file (.xls) opened in Excel 2007

Now resize the chart, and you can see that fonts don't scale :


Resizing the chart file (.xls) in Excel 2007 does not scale the fonts

If you would like to resize the fonts, you have to select each chart element individually, and then apply the font size with user interface elements. Quite counter-productive. In addition, since that's manually done, proportions are not kept between chart elements and you cannot easily reproduce the same effect than with older versions of Excel.

Now if you save this file as a .xlsx file using Excel 2007, and then convert it back to a .xls file, then open it with an older version of Excel (effectively executing a round-trip scenario), the chart shows up but it does not scale when it is resized :


After the conversion back to (.xls), the fonts do not scale if we resize the chart




So much for backwards compatibility...

In case you wonder what's going on, what happens is that Excel 2007 does not support the auto scale option, something very pervasive in chart formatting options. Here is the guilty option :



Root of the problem : Excel 2007 does not support "auto scale" anymore (screenshot from Excel 2003)

It is not only a feature that is gone from the user interface, it is an option that is lost in the document itself. In other words, sharing this file with any person inside or outside your company will result in discrepancies.

For your information, the internal Excel BIFF record that Excel 2007 does not support is the FBI record (part of the chart records, and therefore left unspecified in Microsoft's binary BIFF documentation, as explained in my previous article) :

(excerpt from MSDN Library, Feb 1998)

FBI: Font Basis (1060h)
The FBI record stores font metrics.

Offset | Name | Size | Contents

4 | dmixBasis | 2 | Width of basis when font was applied

6 | dmiyBasis | 2 | Height of basis when font was applied

8 | twpHeightBasis | 2 | Font height applied

10 | scab | 2 | Scale basis

12 | ifnt | 2 | Index number into the font table



If you are interested in taking a look at the corresponding Excel files :



Note that this is just one simple example. There are many examples of all kinds. For instance, in a previous article, I showed how encrypting a document did encrypt the metadata as well, contrary to older Office application versions, and doing so breaks all applications expecting to access the file's metadata whether the file is encrypted or not.

As for forwards compatibility, the same article already showed (BIFF11+ section) that any Excel 2007 feature would be lost or partially lost when converted back to older Excel versions and part of a round-trip scenario. So forwards compatibility is not true either generally speaking.



Behind the scene, what's happening?


Behind the scene, what's happening is that Microsoft replaced the chart subsystem in Excel 2007 with a new library but failed to test it enough, presumably because they wanted to rush Office 2007 out the door. Since charts can also be linked to Word and Powerpoint documents, this actually affects all 3 applications' backwards compatibility.

The real mistake they made is to kill the old chart subsystem source code. What they should have done to meet their backwards compatibility claim was to use the old chart subsystem for the compatibility mode (old Excel files), and the new chart subsystem for charts in new files. There is no other way they can achieve the claims.



If backwards compatibility is not proven technically speaking, what is Microsoft trying to achieve?


In the previous section, it has been demonstrated with a simple counter-example that backwards compatibility leaves much to desire. So if this capability is not technically sound, how come Microsoft keeps hammering it every single day?

A Microsoft Office spokeperson made the following statement on March 16, 2008 :

An Open Letter from Chris Capossela, Senior Vice President, Microsoft Office

In today's digital world, there is a critical need to access, share and archive information. People want to share information with co-workers, business partners, family members and others regardless of the technology platform or software application being used. People want to be able to store and archive documents so that the information they contain can be accessed well into the future. Long-term access to documents is particularly important in the case of governments given the critical and historical nature of government information and the associated requirements for preservation and access moving forward.

Why choose the Ecma Office Open XML File Formats?
Office Open XML fulfills an important customer need - the file format was designed to be backward compatible with the content and functionality in billions of existing documents. (...)


The same thing has been hammered a gazillion times by Microsoft since the "new" file formats were announced back in 2005. If you are a software developer, then it does not sound like anything special. In fact, you may even think there is no substance at all in it. But if you wear a CIO cap, it's a different thing.

Remember, IT people are responsible for running the systems. For them backwards compatibility is not heard in the technical sense, it is synonym of "won't break systems, no down time". See where it goes?

A little more help. Microsoft, by talking over and over about backwards compatibility is actually telling decision makers that they are pledging not to break systems when everyone in the organization uses the applications and those new file formats (remember that for anyone, .docx .xlsx and .pptx files are alien files unless they install Office 2007).

In fact, it goes a little deeper. And that's when it gets really interesting. Microsoft is not only saying their stuff is backwards compatible, they are actually saying that they are the only backwards compatible when it comes to Office documents.

It is another way to say that the only way to ensure no down time is to purchase licenses of Microsoft Office software. As simple as that.

To better understand the picture, consider the following. During the 2007-20xx period, Microsoft is taking a major risk by introducing new file formats that cannot be natively used by older versions of Microsoft Office. It's something they did back in 1995 and it did not go very well. If you remember, this caused havok in IT organizations because Office 95 and Office 97 file formats were substantially different and Office 95 could not open Office 97 documents. Microsoft did break file formats unprepared.

So it is without surprise that Microsoft is trying to make sure that CIOs get the message right this time. That is the reason why Microsoft has been investing so much in all kinds of additional components :
  • Office 2007 compatibility mode
  • Microsoft Office compatibility pack
  • Office migration manager
  • Office MOICE


By saying what they say, Microsoft is trying to keep the competition aside, despite the very vocal "open standards" claims. The "open standards" claims is actually a diversion. "open" is not in Microsoft Office DNA and culture. For instance, nobody in that team ever said publicly that the only reason we are dealing with VML in the file formats is because they rushed Office 2007 out the door (presumably to ship simultaneously with Windows Vista), and as a result they did not fully transition to DrawingML. All this debate about VML out there is entirely the result of Microsoft's own incompetence in shipping a finished product. Simply shameful because third parties now have to support both VML and DrawingML. There would not be such surprise with a company that does things "openly".

In the period of time (that spans through a number of years) where Microsoft takes the risk of introducing new file formats, CIO could decide to just as well switch to an alternative Office suite for a fraction of the budget. Microsoft knows it very well, hence the media blitz. Hence the characterization of the alternative Office file formats such as ODF as simplistic. Simplistic is supposed to mean : not backwards compatible. That is what CIOs understand, and that is FUD.

In this period of time, Microsoft wants to minimize the ratio of organizations switching to an alternative Office suite. To Microsoft Office, OpenOffice is Netscape. This is not a problem anymore when the new "OOXML" file formats reach critical mass. We are not there yet, but that's the plan : dry the competition as much as possible, which includes fake activities such as starting so-called open source projects. Something that is not exactly music to CIOs' hears, why should CIOs care? It's the opposite. When a CIO hears "open source project", he understands something other than backwards compatible, therefore unreliable stuff. The decision is a no brainer then.

So what we are talking about here is a well orchestred media communication targeted to CIOs, not software developers. Until you get that, you don't understand.



The consequences of this attitude


Now that we know that Microsoft is not interested in building a modern Office document model (after all, there is already an international ISO/IEC standard for that : ODF), it gets easier to understand why OOXML is such a mess.

It was recently brought up from people with vested interests in Office file formats that OOXML is a big mess :

  • Rob Weir (IBM) : "The Disharmony of OOXML"








    FormatText ColorText Alignment
    OOXML Text <w:color w:val="FF0000"/> <w:jc w:val="right"/>
    OOXML Sheet <color rgb="FFFF0000"/> <alignment horizontal="right"/>
    OOXML Presentation <a:srgbClr val="FF0000"/> <a:pPr algn="r"/>
    ODF Text <style:text-properties fo:color="#FF0000"/> <style:paragraph-properties fo:text-align="end" />
    ODF Sheet <style:text-properties fo:color="#FF0000"/> <style:paragraph-properties fo:text-align="end"/>
    ODF Presentation <style:text-properties fo:color="#FF0000"/> <style:paragraph-properties fo:text-align="end"/>


  • Henning Brinkmann (OpenOffice contributor) : "OOXML Import In Writer: A Shape Is a Shape, Is a Shape?"

    When we started importing shapes from OOXML the Impress team already was able to import some shapes from OOXML files produced by PowerPoint 2007. This is DrawingML as described in chapter 5 of the Markup Language Reference for OOXML. But, if you use Word to insert a rectangle into a Word document (DOCX), you end up with VML as described in chapter 6 of the Markup Language Reference for OOXML. The Markup Language Reference tags VML as a deprecated format in OOXML, which is only included to the standard for backward compatibility reasons. Despite, Word 2007 uses VML to store shapes. (...) OOXML seems to be designed with the application model in mind. There may be different syntaxes for the same semantics, if it fits the already present application model better. But, if you want to create an alternative implementation for the format, this introduces additional effort.


Fire and motion, anyone? I also gave a few examples of that as well in my introductory OOXML is defective by design article where I showed that Excel SpreadsheetML uses no less than 6 ways to do basic text formatting, for no reason. If you do anything meaningful with those files, you need to implement both, which clearly means you'll be spending years doing so instead of concentrating on your own business.

For anyone except Microsoft, it's fire and motion and it guarantees there won't be a substantial implementation of file formats before a number of years. That is exactly the same thing than the strategy behind with older Microsoft Office formats.

Note that Microsoft claims a number of third-parties applications support OOXML. But they never go into details such as how much support there is. For a reason that is trivial.

To the CIO's hear, the fact that backwards compatibility is never associated in any shape or form to those third-party applications makes it very clear that they are not worth considering seriously. So there.

Why is it a problem?

Let's take Gnumeric. Gnumeric chose to implement some of OOXML by having import and export filters. This is a very important architecture decision. It means that the in-memory representation is not high-fidelity representation of the imported file until Gnumeric directly supports all tiny details in OOXML. Something that won't happen before a number of years. And it is therefore without much surprise that exporting the in-memory representation to say an .xlsx file will lose all features that Gnumeric does not directly support.

To see the effect of that, simply try to open and save the following .xlsx file in Gnumeric : Book_2.xlsx

Gnumeric's internal architecture is simply wrong : it should instead preserve the imported file. That is after all what XML fragments are for. It should be able to make in-place replacements of XML fragments and leave the rest of the file untouched. That's not what Gnumeric does. Gnumeric is an example of an application that is not XML native.

Why is Microsoft making so much good press for Gnumeric and it's so-called rich OOXML support is left for one to guess...

...especially when Office 2007 is not much XML native either. In fact, that is actually why we end up with so many versions of the truth in Office 2007 files. Why we have many different and incompatible ways to align a piece of text, or describe a vector shape, or text formatting is due to the following :

  • It is a direct reflection of the fact that Microsoft Office engineering teams don't share any of their work across teams. The trivial example of that is that the only shared spec, Office drawing library (mso.dll), ends up being a mix of VML and DrawingML despite the fact that VML is only a markup version of MSO, whereas DrawingML is supposed to be MSO version next (internally known as E2O).

  • What Microsoft calls XML streams are in fact XML fragments where each individual fragment are angle-brackets around a stream of content that reflects the internal binary representation (and in fact, the binary records of older files) to be converted back and forth in full-fidelity.

    Let's say in one of those fragments, an attribute relates to a horizontal text alignment option, one value out of four, we end up with an attribute of the form <w algn="r"> where algn is one of r, l, c, j. In another fragment, a horizontal text alignment option is one out of eight, in which case we may have <a:alignment>left</a:alignment>, where the value is left or general, right, center, justify, distributed, fill or center across selection.

    Here is the problem. The fact that the text alignment has several distinct definitions in the same document is a bug. Engineering, peer-review and interoperability practices inherently put enough pressure to resist from having such thing done in the first place, that's why both alignment options should be the same. There is no reason to have more than one in the document. To standardize bugs is simply not excusable. It has to be stopped.

    The reason why we end up like this with OOXML is because it just reflects the bugs in the binary records, and that it works as if the Office team had developers writing compatibility code independently from each other, that is without ever giving a chance of code reuse.

    So, due to the fragments being just another representation of the binary records, Office 2007 use of XML terminology is absolutely misleading. Press pass filled with the XML acronym all over the place are very appealing to CIOs, however. It is more fair to say it's angle brackets around complex stuff than actual XML. It that were truly native XML, that would be factored in to maximize the reuse of it across libraries, components and applications. Just like ODF does.


So there you have it. It is without much surprise that Office 2007 does not ship with any XML tooling support.

Thursday, March 13, 2008

Bad surprise in Microsoft Office binary documents : interoperability remains impossible

Stéphane Rodriguez, March 2008

Previous articles :
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


When Microsoft made a number of Microsoft Office binary documents available for direct download last month, I thought this was probably a good thing for those (including me) struggling with undocumented pieces with which only Microsoft can reliably migrate legacy documents back and forth. But this sentiment has long passed, what a disappointment.

Direct download, a sharp change from the obligation to...send an email



In fact, this gift is a small change to what was being made available so far. If you read Microsoft knowledge base article 840817, in the binary document specification sections, it says :

Microsoft Office Binary File Formats

Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.

If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)


It is therefore a little disingenuous from Microsoft to make it as if those documents were not available.

Forensic purposes


The above specifically mentions forensic purposes, which is not the same thing than selling a product competing with Microsoft Office. It is unclear whether the new covenant not-to-sue terms are more open than those limiting forensic purposes claims, but clearly it's a problem. This specifically prevents competition.


Where is the VBA specification?


Interestingly enough, you can compare what was available at the expense of an email, and what is now being made available under the open specification promise, where the covenant not-to-sue applies. In particular, even though VBA is listed in the Microsoft knowledge base article,

Microsoft Office Binary File Formats

Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not-to-sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.

If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)


the corresponding specification is not being made available under the open specification promise :

Microsoft Word

Word 97-2007 Binary File Format (.doc) Specification PDF | XPS

Microsoft PowerPoint

PowerPoint 97-2007 Binary File Format (.ppt) Specification PDF | XPS

Microsoft Excel

Excel 97-2007 Binary File Format (.xls) Specification PDF | XPS

Excel 2007 Binary File Format (.xlsb) Specification PDF | XPS

Office Drawing

Office Drawing 97-2007 Binary Format Specification PDF | XPS

Supporting Technologies (Windows structured storage, WMF, Ink)



Missing Excel binary BIFF specification



The Excel binary BIFF specification was made available until 1998, that was when Bill Gates decided that was no longer a good thing to make the file format specification available to third parties. Because the latest public document was made available in 1998, this document included the specification of BIFF8 records, i.e. used in Excel 97. No other specification update had ever been made available publicly, most notably none of the new records added in Excel 2000 (BIFF9), Excel XP (BIFF10), Excel 2003 (BIFF11), Excel 97-2003-2007 compatibility mode (BIFF11+) and Excel 2007 binary spreadsheets (BIFF12, with BIFF12 a new binary file format incompatible with previous BIFF versions) were documented, meaning that it was only possible at most to implement support for Excel file formats up to Excel 97. Not very friendly. But it gets worse.

If you take a look at the BIFF binary document being made available under Microsoft's open specification promise, you'll notice it includes a patchwork of all BIFF versions. Why Microsoft chose to put everything together instead of making sections for each BIFF version is subject to discussion. One can argue that those interested in the very latest don't care when a particular record was introduced or updated. But to anyone willing to read and write BIFF records consistently with a target Excel version, it is impossible to know which record belongs to which version. It gets far worse. If you take a look at what was made available until 1998 in MSDN Library, you'll notice two sections for BIFF :



A screenshot of the old binary specifications (MSDN Library, Feb 1998)

There are two big sections :
  • BIFF main records : cells, pivot tables, formatting styles
  • BIFF chart records


Here is the list of chart records (excerpt from MSDN Library, Feb 1998) :

Chart BIFF Records: Alphabetical Order

Number
Record

103A
3D: Chart Group Is a 3-D Chart Group

1051
AI: Linked Data

1050
ALRUNS: Text Formatting

101A
AREA: Chart Group Is an Area Chart Group

100A
AREAFORMAT: Colors and Patterns for an Area

100C
ATTACHEDLABEL: Series Data/Value Labels

1062
AXCEXT: Axis Options

1046
AXESUSED: Number of Axes Sets

101D
AXIS: Axis Type

1021
AXISLINEFORMAT: Defines a Line That Spans an Axis

1041
AXISPARENT: Axis Size and Location

1017
BAR: Chart Group is a Bar or Column Chart Group

1033
BEGIN: Defines the Beginning of an Object

1061
BOPPOP: Bar of Pie/Pie of Pie Chart Options

1067
BOPPOPCUSTOM: Custom Bar of Pie/Pie of Pie Chart Options

1020
CATSERRANGE: Defines a Category or Series Axis

1002
CHART: Location and Overall Chart Dimensions

1014
CHARTFORMAT: Parent Record for Chart Group

1022
CHARTFORMATLINK: Not Used

101C
CHARTLINE: Drop/Hi-Lo/Series Lines on a Line Chart

1063
DAT: Data Table Options

1006
DATAFORMAT: Series and Data Point Numbers

1024
DEFAULTTEXT: Default Data Label Text Properties

103D
DROPBAR: Defines Drop Bars

1034
END: Defines the End of an Object

1060
FBI: Font Basis

1026
FONTX: Font Index

1032
FRAME: Defines Border Shape Around Displayed Text

1066
GELFRAME: Fill Data

104E
IFMT: Number-Format Index

1015
LEGEND: Legend Type and Position

1043
LEGENDXN: Legend Exception

1018
LINE: Chart Group Is a Line Chart Group

1007
LINEFORMAT: Style of a Line or Border

1009
MARKERFORMAT: Style of a Line Marker

1027
OBJECTLINK: Attaches Text to Chart or to Chart Item

103C
PICF: Picture Format

1019
PIE: Chart Group Is a Pie Chart Group

100B
PIEFORMAT: Position of the Pie Slice

1035
PLOTAREA: Frame Belongs to Plot Area

1064
PLOTGROWTH: Font Scale Factors

104F
POS: Position Information

103E
RADAR: Chart Group Is a Radar Chart Group

1040
RADARAREA: Chart Group Is a Radar Area Chart Group

1048
SBASEREF: PivotTable Reference

101B
SCATTER: Chart Group Is a Scatter Chart Group

105B
SERAUXERRBAR: Series ErrorBar

104B
SERAUXTREND: Series Trendline

105D
SERFMT: Series Format

1003
SERIES: Series Definition

1016
SERIESLIST: Specifies the Series in an Overlay Chart

100D
SERIESTEXT: Legend/Category/Value Text

104A
SERPARENT: Trendline or ErrorBar Series Index

1045
SERTOCRT: Series Chart-Group Index

1044
SHTPROPS: Sheet Properties

1065
SIINDEX: Series Index

103F
SURFACE: Chart Group Is a Surface Chart Group

1025
TEXT: Defines Display of Text Fields

101E
TICK: Tick Marks and Labels Format

1001
UNITS: Chart Units

101F
VALUERANGE: Defines Value Axis Scale



If you take a look at what Microsoft makes available under the covenant not-to-sue, the BIFF documentation and all others, you'll soon realize that chart records are not included. In other words, a third-party can only, at best, implement support for Excel BIFF spreadsheets with no charts. Not exactly good news!

It actually gets worse.


A number of records remain absent



Obvious misses


Records such as those identified by 0x00EF and 0x01BA remain absent from the documentation. Those are records automatically created when a new Excel spreadsheet gets created with any Excel 97 (or more recent) version. These are obvious records, but a number of others are absent.


Trickier misses


Function byte codes are not documented. For instance, if you use a function such as CEILING(number, significance), you have no way to know with the documentation what bytecode this function gets encoded to since the table of function bytecodes isn't provided. There are over 300 built-in functions.

Even trickier is the fact that, in order to read/store or migrate a function through an extension mechanism (NAME or EXTERNALNAME records), or to build an Excel-like application, the actual function name string is used, but :
  • the list of function names is not in the documentation (note : it is given in another document, ECMA 376)
  • the list of actually localized function names isn't provided anywhere. (note : ECMA 376 lists the functions in English-US only)


It is problematic for a third-party not to have those localized function names for instance, since there is no way an application will be able to support such functions like Excel does without that. In fact, in order to know the names, you'll have to purchase Microsoft Office language packs :



Purchasing Office language packs, the only way to know the localized function names in Excel

A language pack is $24.95 each, and there are 36 of them! ($900 in total). Anytime you install a language pack, you can open a spreadsheet where the function of interest will appear under its localized name.

It's interesting that Microsoft claims to be committed to an open standard like that! If you intend to support Excel functions in any meaningful way, don't forget to include a $900 budget.



Many non-trivial records remain partially documented



Some records are missing, but even documented records remain partially documented records. Here are two simple examples :

  • BIFF documentation, OBJ record, page 153,

    ftCblsData (12h) check box data

    Offset | Name | Size | Contents

    0 | ft | 2 | =ftCblsData (12h)

    2 | cb | 2 | Length of ftCblsData

    4 | (Reserved) | var | Reserved


    The check box data remains undocumented. And so does most of internal objects stored inside the OBJ records. The OBJ records holds data for Form controls (listbox, edit, ...).

  • Office drawing documentation, page 7,

    Associated with each shape is a piece of client data that keeps the shape‘s anchor, text and OLE data, as well as host specific properties. The format of this structure is host-defined.


    And indeed, the msofbtClientData (F011) record isn't documented. Page 12, we learn :

    msofbtClientData (F011)
    host-defined
    host-specific data


Note that since this record has no section linked to it in the documentation, unlike a number of other records, it should be obvious for anyone from Microsoft to verify that this record isn't documented. This is unfortunate since this is an important record.

The two examples are a reflection of problems throughout all documents. What is meant by that is that, from Microsoft point of view, those documents may be "reference" documents in the sense that if you are taking a look at the corresponding source code, those documents are handy. But only they can do that. For everybody else, those documents are hints, not actual references. It is fire and motion, and it's been going on for the best part of two decades.


The migration scenario : here comes Microsoft's secret sauce



Let's take two examples showing how impossible a task it is to map binary records to XML elements. In this case, it will be calculation chains in spreadsheets.

The return of the calculation chains


If we do a full text search about calculation chains over the documents Microsoft has made available, we find two occurences in the BIFF documentation :

  • page 35 : ARRAY record, "chn, 4 bytes" ; Ignore the chn field when reading the BIFF file. If a BIFF file is written, the chn field must be 00000000h.
  • page 123 : FORMULA record, "chn, 4 bytes" ; The chn field should be ignored when you read the BIFF file. If you write a BIFF file, the chn field must be 00000000h.


That's about it. There is no actual definition and encoding algorithm.

In ECMA 376, part 4, a full text search about calculation chains returns the following :

  • page 2084, section 3.6 calculation chain ; The cells in a workbook can be calculated in different orders depending on various optimizations and dependencies. The calculation chain specifies the order in which the cells in a workbook were last calculated. The calculation chain only deals with cells that require calculation - i.e., it only deals with cells that contain formulas. It does not track or express dependencies amongst the formulas, but rather only records the order in which the cells were last calculated. (...) Since the xml data described in this section is not strictly required, the spreadsheet application is free to ignore the order in which the calculation chain specifies calculations.


And that "definition" is followed by an example.

So here are the problems :

  • How do you map the "chn" fields to a calculation chain ? In other words, how do you decode these?
  • Since "chn" fields are attached to formula and array records (in binary files), and that the calculation chain part is not in any way associated to the spreadsheet sheet parts (in SpreadsheetML files), it means a mapping needs it occur. What is it?
  • Both documentations make it clear that the calculation chain fields and calculation chain parts are essentially optional. But in practice, this is far from being the truth. In a previous article, I described what happens if you change the value in a cell without touching the calculation chain part : the spreadsheet is seen by Excel 2007 as a corrupt spreadsheet, it self explodes. On the other hand, deleting the calculation chain part will force a recalculation next time the spreadsheet is opened by the user, which is arbitrarily lengthy. If Excel is the only application that can ever avoid making recalculations, because it knows the secret sauce about calculation chains, no other application can be seen as an equivalent application. Quite the opposite, Excel will be seen as a first-class citizen, and everything else will be seen as simplistic (i.e. unreliable) spreadsheet applications.



Caching values to show drop lines in charts!


In binary files, adding a dropline to a chart is managed by a simple chart related record, CHARTLINE. Note that since chart records are not documented in the BIFF documentation, the following CHARTLINE definition comes from MSDN Library, Feb 1998 :

CHARTLINE: Specifies Drop/Hi-Lo/Series Lines on a Line Chart (101Ch)

This record specifies drop lines, hi-lo lines, or series lines on a line chart. If the chart has both drop lines and hi-lo lines, two CHARTLINE records will be present.

Record Data

Offset | Name | Size | Contents
4 | id | 2 | Drop lines/hi-lo lines. 0 = drop lines 1 = hi-lo lines 2 = series lines (the lines that connect the columns in a stacked column chart)


The way droplines are added is by having such record present or not. Everything else in the chart record stream is independent of the dropline.

Now let's do the following :

  • start Excel 2007
  • click on B2 and enter : jan, hit <TAB>, 3, <TAB>, 4, <TAB>, 5
  • select those cells, click on the Insert ribbon tab and click on the line chart button. Select whatever line chart type. The line chart shows up.
  • click on the Layout ribbon tab, then on Analysis, then Lines, then Drop lines. The drop lines shows up. You should see the following (the droplines are the vertical segments) :


    Adding droplines to a chart


  • save this file as Droplines.xlsx, close it, quit Excel 2007.
  • unzip the file and take a look at the xl/charts/chart1.xml part. You should see the following :


    The XML droplines element is indeed added to the chart description


  • The <c:dropLines> element is the exact equivalent of the binary CHARTLINE record. Or so it seems. now edit this part and remove the <c:strCache> and <c:numCache> elements, which are cached values. It should result in the following :


    Removing cached values in a chart description


  • put the edited part back in the zip file and open it in Excel 2007. The following appears :


    Droplines are gone!




Droplines are gone! In other words, droplines in XML charts are linked to cached values, contrary to how droplines used to work in binary spreadsheets (namely, cached values are not needed). Not only is this not documented anywhere, it puts a huge burden on implementers who now have, at minimum, to actually compute and store a cache of values from the chart data source only to be able to show droplines, despite the fact that the XML description is designed in a way such that the presence of element <c:dropLines> is all what is expected to add droplines to a chart.

Here are the Excel 2007 files (.xslx) :


And here are the binary files (.xls) :


In a binary file, cached values are stored in the chart footer, as follows :

[DIMENSIONS 000E] 00 00 00 00 03 00 00 00 00 00 01 00 00 00
[1065 0002] 02 00
[1065 0002] 01 00
[NUMBER 000E] 00 00 00 00 00 00 00 00 00 00 00 00 08 40
[NUMBER 000E] 01 00 00 00 00 00 00 00 00 00 00 00 10 40
[NUMBER 000E] 02 00 00 00 00 00 00 00 00 00 00 00 14 40
[1065 0002] 03 00




Conclusion



A couple of counter examples have demonstrated that Microsoft Office document interoperability is wishful thinking at this point. The documents made available by Microsoft for direct download won't and shouldn't impress third-party implementers since it does not help much.

What is being shown is that in addition to missing documentation, the binary documentations sometimes conflict with the ECMA 376 documentation, itself not a full documentation of the new XML-based formats anyway.

Who other than Microsoft is able to reliably migrate formats from binary to XML, and back ?

Microsoft seems to have anticipated the issue by creating...an open source project in Feb 15 whose task is to migrate files. But it's unclear that, short of rewriting the existing Microsoft Office source code, which could take a decade, how this project is supposed to achieve any meaningful goal. Obviously, ISO national bodies should take this open source project with a grain of salt...

A national body should probably ask instead Microsoft to open up the source code of the Microsoft Office compatibility pack since, after all, that is exactly what this component does. It migrates Microsoft Office documents back and forth, contains what's needed to migrate calculation chains, droplines, etc. and is therefore a perfect candidate for third-party implementations. Even better, Microsoft can make it available now!