Thursday, March 13, 2008

Bad surprise in Microsoft Office binary documents : interoperability remains impossible

Stéphane Rodriguez, March 2008

Previous articles :
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design


When Microsoft made a number of Microsoft Office binary documents available for direct download last month, I thought this was probably a good thing for those (including me) struggling with undocumented pieces with which only Microsoft can reliably migrate legacy documents back and forth. But this sentiment has long passed, what a disappointment.

Direct download, a sharp change from the obligation to...send an email



In fact, this gift is a small change to what was being made available so far. If you read Microsoft knowledge base article 840817, in the binary document specification sections, it says :

Microsoft Office Binary File Formats

Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.

If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)


It is therefore a little disingenuous from Microsoft to make it as if those documents were not available.

Forensic purposes


The above specifically mentions forensic purposes, which is not the same thing than selling a product competing with Microsoft Office. It is unclear whether the new covenant not-to-sue terms are more open than those limiting forensic purposes claims, but clearly it's a problem. This specifically prevents competition.


Where is the VBA specification?


Interestingly enough, you can compare what was available at the expense of an email, and what is now being made available under the open specification promise, where the covenant not-to-sue applies. In particular, even though VBA is listed in the Microsoft knowledge base article,

Microsoft Office Binary File Formats

Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not-to-sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.

If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)


the corresponding specification is not being made available under the open specification promise :

Microsoft Word

Word 97-2007 Binary File Format (.doc) Specification PDF | XPS

Microsoft PowerPoint

PowerPoint 97-2007 Binary File Format (.ppt) Specification PDF | XPS

Microsoft Excel

Excel 97-2007 Binary File Format (.xls) Specification PDF | XPS

Excel 2007 Binary File Format (.xlsb) Specification PDF | XPS

Office Drawing

Office Drawing 97-2007 Binary Format Specification PDF | XPS

Supporting Technologies (Windows structured storage, WMF, Ink)



Missing Excel binary BIFF specification



The Excel binary BIFF specification was made available until 1998, that was when Bill Gates decided that was no longer a good thing to make the file format specification available to third parties. Because the latest public document was made available in 1998, this document included the specification of BIFF8 records, i.e. used in Excel 97. No other specification update had ever been made available publicly, most notably none of the new records added in Excel 2000 (BIFF9), Excel XP (BIFF10), Excel 2003 (BIFF11), Excel 97-2003-2007 compatibility mode (BIFF11+) and Excel 2007 binary spreadsheets (BIFF12, with BIFF12 a new binary file format incompatible with previous BIFF versions) were documented, meaning that it was only possible at most to implement support for Excel file formats up to Excel 97. Not very friendly. But it gets worse.

If you take a look at the BIFF binary document being made available under Microsoft's open specification promise, you'll notice it includes a patchwork of all BIFF versions. Why Microsoft chose to put everything together instead of making sections for each BIFF version is subject to discussion. One can argue that those interested in the very latest don't care when a particular record was introduced or updated. But to anyone willing to read and write BIFF records consistently with a target Excel version, it is impossible to know which record belongs to which version. It gets far worse. If you take a look at what was made available until 1998 in MSDN Library, you'll notice two sections for BIFF :



A screenshot of the old binary specifications (MSDN Library, Feb 1998)

There are two big sections :
  • BIFF main records : cells, pivot tables, formatting styles
  • BIFF chart records


Here is the list of chart records (excerpt from MSDN Library, Feb 1998) :

Chart BIFF Records: Alphabetical Order

Number
Record

103A
3D: Chart Group Is a 3-D Chart Group

1051
AI: Linked Data

1050
ALRUNS: Text Formatting

101A
AREA: Chart Group Is an Area Chart Group

100A
AREAFORMAT: Colors and Patterns for an Area

100C
ATTACHEDLABEL: Series Data/Value Labels

1062
AXCEXT: Axis Options

1046
AXESUSED: Number of Axes Sets

101D
AXIS: Axis Type

1021
AXISLINEFORMAT: Defines a Line That Spans an Axis

1041
AXISPARENT: Axis Size and Location

1017
BAR: Chart Group is a Bar or Column Chart Group

1033
BEGIN: Defines the Beginning of an Object

1061
BOPPOP: Bar of Pie/Pie of Pie Chart Options

1067
BOPPOPCUSTOM: Custom Bar of Pie/Pie of Pie Chart Options

1020
CATSERRANGE: Defines a Category or Series Axis

1002
CHART: Location and Overall Chart Dimensions

1014
CHARTFORMAT: Parent Record for Chart Group

1022
CHARTFORMATLINK: Not Used

101C
CHARTLINE: Drop/Hi-Lo/Series Lines on a Line Chart

1063
DAT: Data Table Options

1006
DATAFORMAT: Series and Data Point Numbers

1024
DEFAULTTEXT: Default Data Label Text Properties

103D
DROPBAR: Defines Drop Bars

1034
END: Defines the End of an Object

1060
FBI: Font Basis

1026
FONTX: Font Index

1032
FRAME: Defines Border Shape Around Displayed Text

1066
GELFRAME: Fill Data

104E
IFMT: Number-Format Index

1015
LEGEND: Legend Type and Position

1043
LEGENDXN: Legend Exception

1018
LINE: Chart Group Is a Line Chart Group

1007
LINEFORMAT: Style of a Line or Border

1009
MARKERFORMAT: Style of a Line Marker

1027
OBJECTLINK: Attaches Text to Chart or to Chart Item

103C
PICF: Picture Format

1019
PIE: Chart Group Is a Pie Chart Group

100B
PIEFORMAT: Position of the Pie Slice

1035
PLOTAREA: Frame Belongs to Plot Area

1064
PLOTGROWTH: Font Scale Factors

104F
POS: Position Information

103E
RADAR: Chart Group Is a Radar Chart Group

1040
RADARAREA: Chart Group Is a Radar Area Chart Group

1048
SBASEREF: PivotTable Reference

101B
SCATTER: Chart Group Is a Scatter Chart Group

105B
SERAUXERRBAR: Series ErrorBar

104B
SERAUXTREND: Series Trendline

105D
SERFMT: Series Format

1003
SERIES: Series Definition

1016
SERIESLIST: Specifies the Series in an Overlay Chart

100D
SERIESTEXT: Legend/Category/Value Text

104A
SERPARENT: Trendline or ErrorBar Series Index

1045
SERTOCRT: Series Chart-Group Index

1044
SHTPROPS: Sheet Properties

1065
SIINDEX: Series Index

103F
SURFACE: Chart Group Is a Surface Chart Group

1025
TEXT: Defines Display of Text Fields

101E
TICK: Tick Marks and Labels Format

1001
UNITS: Chart Units

101F
VALUERANGE: Defines Value Axis Scale



If you take a look at what Microsoft makes available under the covenant not-to-sue, the BIFF documentation and all others, you'll soon realize that chart records are not included. In other words, a third-party can only, at best, implement support for Excel BIFF spreadsheets with no charts. Not exactly good news!

It actually gets worse.


A number of records remain absent



Obvious misses


Records such as those identified by 0x00EF and 0x01BA remain absent from the documentation. Those are records automatically created when a new Excel spreadsheet gets created with any Excel 97 (or more recent) version. These are obvious records, but a number of others are absent.


Trickier misses


Function byte codes are not documented. For instance, if you use a function such as CEILING(number, significance), you have no way to know with the documentation what bytecode this function gets encoded to since the table of function bytecodes isn't provided. There are over 300 built-in functions.

Even trickier is the fact that, in order to read/store or migrate a function through an extension mechanism (NAME or EXTERNALNAME records), or to build an Excel-like application, the actual function name string is used, but :
  • the list of function names is not in the documentation (note : it is given in another document, ECMA 376)
  • the list of actually localized function names isn't provided anywhere. (note : ECMA 376 lists the functions in English-US only)


It is problematic for a third-party not to have those localized function names for instance, since there is no way an application will be able to support such functions like Excel does without that. In fact, in order to know the names, you'll have to purchase Microsoft Office language packs :



Purchasing Office language packs, the only way to know the localized function names in Excel

A language pack is $24.95 each, and there are 36 of them! ($900 in total). Anytime you install a language pack, you can open a spreadsheet where the function of interest will appear under its localized name.

It's interesting that Microsoft claims to be committed to an open standard like that! If you intend to support Excel functions in any meaningful way, don't forget to include a $900 budget.



Many non-trivial records remain partially documented



Some records are missing, but even documented records remain partially documented records. Here are two simple examples :

  • BIFF documentation, OBJ record, page 153,

    ftCblsData (12h) check box data

    Offset | Name | Size | Contents

    0 | ft | 2 | =ftCblsData (12h)

    2 | cb | 2 | Length of ftCblsData

    4 | (Reserved) | var | Reserved


    The check box data remains undocumented. And so does most of internal objects stored inside the OBJ records. The OBJ records holds data for Form controls (listbox, edit, ...).

  • Office drawing documentation, page 7,

    Associated with each shape is a piece of client data that keeps the shape‘s anchor, text and OLE data, as well as host specific properties. The format of this structure is host-defined.


    And indeed, the msofbtClientData (F011) record isn't documented. Page 12, we learn :

    msofbtClientData (F011)
    host-defined
    host-specific data


Note that since this record has no section linked to it in the documentation, unlike a number of other records, it should be obvious for anyone from Microsoft to verify that this record isn't documented. This is unfortunate since this is an important record.

The two examples are a reflection of problems throughout all documents. What is meant by that is that, from Microsoft point of view, those documents may be "reference" documents in the sense that if you are taking a look at the corresponding source code, those documents are handy. But only they can do that. For everybody else, those documents are hints, not actual references. It is fire and motion, and it's been going on for the best part of two decades.


The migration scenario : here comes Microsoft's secret sauce



Let's take two examples showing how impossible a task it is to map binary records to XML elements. In this case, it will be calculation chains in spreadsheets.

The return of the calculation chains


If we do a full text search about calculation chains over the documents Microsoft has made available, we find two occurences in the BIFF documentation :

  • page 35 : ARRAY record, "chn, 4 bytes" ; Ignore the chn field when reading the BIFF file. If a BIFF file is written, the chn field must be 00000000h.
  • page 123 : FORMULA record, "chn, 4 bytes" ; The chn field should be ignored when you read the BIFF file. If you write a BIFF file, the chn field must be 00000000h.


That's about it. There is no actual definition and encoding algorithm.

In ECMA 376, part 4, a full text search about calculation chains returns the following :

  • page 2084, section 3.6 calculation chain ; The cells in a workbook can be calculated in different orders depending on various optimizations and dependencies. The calculation chain specifies the order in which the cells in a workbook were last calculated. The calculation chain only deals with cells that require calculation - i.e., it only deals with cells that contain formulas. It does not track or express dependencies amongst the formulas, but rather only records the order in which the cells were last calculated. (...) Since the xml data described in this section is not strictly required, the spreadsheet application is free to ignore the order in which the calculation chain specifies calculations.


And that "definition" is followed by an example.

So here are the problems :

  • How do you map the "chn" fields to a calculation chain ? In other words, how do you decode these?
  • Since "chn" fields are attached to formula and array records (in binary files), and that the calculation chain part is not in any way associated to the spreadsheet sheet parts (in SpreadsheetML files), it means a mapping needs it occur. What is it?
  • Both documentations make it clear that the calculation chain fields and calculation chain parts are essentially optional. But in practice, this is far from being the truth. In a previous article, I described what happens if you change the value in a cell without touching the calculation chain part : the spreadsheet is seen by Excel 2007 as a corrupt spreadsheet, it self explodes. On the other hand, deleting the calculation chain part will force a recalculation next time the spreadsheet is opened by the user, which is arbitrarily lengthy. If Excel is the only application that can ever avoid making recalculations, because it knows the secret sauce about calculation chains, no other application can be seen as an equivalent application. Quite the opposite, Excel will be seen as a first-class citizen, and everything else will be seen as simplistic (i.e. unreliable) spreadsheet applications.



Caching values to show drop lines in charts!


In binary files, adding a dropline to a chart is managed by a simple chart related record, CHARTLINE. Note that since chart records are not documented in the BIFF documentation, the following CHARTLINE definition comes from MSDN Library, Feb 1998 :

CHARTLINE: Specifies Drop/Hi-Lo/Series Lines on a Line Chart (101Ch)

This record specifies drop lines, hi-lo lines, or series lines on a line chart. If the chart has both drop lines and hi-lo lines, two CHARTLINE records will be present.

Record Data

Offset | Name | Size | Contents
4 | id | 2 | Drop lines/hi-lo lines. 0 = drop lines 1 = hi-lo lines 2 = series lines (the lines that connect the columns in a stacked column chart)


The way droplines are added is by having such record present or not. Everything else in the chart record stream is independent of the dropline.

Now let's do the following :

  • start Excel 2007
  • click on B2 and enter : jan, hit <TAB>, 3, <TAB>, 4, <TAB>, 5
  • select those cells, click on the Insert ribbon tab and click on the line chart button. Select whatever line chart type. The line chart shows up.
  • click on the Layout ribbon tab, then on Analysis, then Lines, then Drop lines. The drop lines shows up. You should see the following (the droplines are the vertical segments) :


    Adding droplines to a chart


  • save this file as Droplines.xlsx, close it, quit Excel 2007.
  • unzip the file and take a look at the xl/charts/chart1.xml part. You should see the following :


    The XML droplines element is indeed added to the chart description


  • The <c:dropLines> element is the exact equivalent of the binary CHARTLINE record. Or so it seems. now edit this part and remove the <c:strCache> and <c:numCache> elements, which are cached values. It should result in the following :


    Removing cached values in a chart description


  • put the edited part back in the zip file and open it in Excel 2007. The following appears :


    Droplines are gone!




Droplines are gone! In other words, droplines in XML charts are linked to cached values, contrary to how droplines used to work in binary spreadsheets (namely, cached values are not needed). Not only is this not documented anywhere, it puts a huge burden on implementers who now have, at minimum, to actually compute and store a cache of values from the chart data source only to be able to show droplines, despite the fact that the XML description is designed in a way such that the presence of element <c:dropLines> is all what is expected to add droplines to a chart.

Here are the Excel 2007 files (.xslx) :


And here are the binary files (.xls) :


In a binary file, cached values are stored in the chart footer, as follows :

[DIMENSIONS 000E] 00 00 00 00 03 00 00 00 00 00 01 00 00 00
[1065 0002] 02 00
[1065 0002] 01 00
[NUMBER 000E] 00 00 00 00 00 00 00 00 00 00 00 00 08 40
[NUMBER 000E] 01 00 00 00 00 00 00 00 00 00 00 00 10 40
[NUMBER 000E] 02 00 00 00 00 00 00 00 00 00 00 00 14 40
[1065 0002] 03 00




Conclusion



A couple of counter examples have demonstrated that Microsoft Office document interoperability is wishful thinking at this point. The documents made available by Microsoft for direct download won't and shouldn't impress third-party implementers since it does not help much.

What is being shown is that in addition to missing documentation, the binary documentations sometimes conflict with the ECMA 376 documentation, itself not a full documentation of the new XML-based formats anyway.

Who other than Microsoft is able to reliably migrate formats from binary to XML, and back ?

Microsoft seems to have anticipated the issue by creating...an open source project in Feb 15 whose task is to migrate files. But it's unclear that, short of rewriting the existing Microsoft Office source code, which could take a decade, how this project is supposed to achieve any meaningful goal. Obviously, ISO national bodies should take this open source project with a grain of salt...

A national body should probably ask instead Microsoft to open up the source code of the Microsoft Office compatibility pack since, after all, that is exactly what this component does. It migrates Microsoft Office documents back and forth, contains what's needed to migrate calculation chains, droplines, etc. and is therefore a perfect candidate for third-party implementations. Even better, Microsoft can make it available now!