Previous articles :
- Typical B.S. in technical articles about OOXML
- The truth about Microsoft Office compatibility
- OOXML is defective by design
When Microsoft made a number of Microsoft Office binary documents available for direct download last month, I thought this was probably a good thing for those (including me) struggling with undocumented pieces with which only Microsoft can reliably migrate legacy documents back and forth. But this sentiment has long passed, what a disappointment.
Direct download, a sharp change from the obligation to...send an email
In fact, this gift is a small change to what was being made available so far. If you read Microsoft knowledge base article 840817, in the binary document specification sections, it says :
Microsoft Office Binary File Formats
Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.
If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)
It is therefore a little disingenuous from Microsoft to make it as if those documents were not available.
Forensic purposes
The above specifically mentions forensic purposes, which is not the same thing than selling a product competing with Microsoft Office. It is unclear whether the new covenant not-to-sue terms are more open than those limiting forensic purposes claims, but clearly it's a problem. This specifically prevents competition.
Where is the VBA specification?
Interestingly enough, you can compare what was available at the expense of an email, and what is now being made available under the open specification promise, where the covenant not-to-sue applies. In particular, even though VBA is listed in the Microsoft knowledge base article,
Microsoft Office Binary File Formats
Microsoft makes its .doc, .xls, .xlsb, and .ppt binary file format specifications available under a royalty-free covenant not-to-sue to anyone who wishes to implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are also available under this program.
If you want to receive the documentation, contact Microsoft at the
following e-mail address to initiate the agreement sign-up process:
officeff@microsoft.com (mailto:officeff@microsoft.com)
the corresponding specification is not being made available under the open specification promise :
Microsoft Word
Word 97-2007 Binary File Format (.doc) Specification PDF | XPS
Microsoft PowerPoint
PowerPoint 97-2007 Binary File Format (.ppt) Specification PDF | XPS
Microsoft Excel
Excel 97-2007 Binary File Format (.xls) Specification PDF | XPS
Excel 2007 Binary File Format (.xlsb) Specification PDF | XPS
Office Drawing
Office Drawing 97-2007 Binary Format Specification PDF | XPS
Supporting Technologies (Windows structured storage, WMF, Ink)
Missing Excel binary BIFF specification
The Excel binary BIFF specification was made available until 1998, that was when Bill Gates decided that was no longer a good thing to make the file format specification available to third parties. Because the latest public document was made available in 1998, this document included the specification of BIFF8 records, i.e. used in Excel 97. No other specification update had ever been made available publicly, most notably none of the new records added in Excel 2000 (BIFF9), Excel XP (BIFF10), Excel 2003 (BIFF11), Excel 97-2003-2007 compatibility mode (BIFF11+) and Excel 2007 binary spreadsheets (BIFF12, with BIFF12 a new binary file format incompatible with previous BIFF versions) were documented, meaning that it was only possible at most to implement support for Excel file formats up to Excel 97. Not very friendly. But it gets worse.
If you take a look at the BIFF binary document being made available under Microsoft's open specification promise, you'll notice it includes a patchwork of all BIFF versions. Why Microsoft chose to put everything together instead of making sections for each BIFF version is subject to discussion. One can argue that those interested in the very latest don't care when a particular record was introduced or updated. But to anyone willing to read and write BIFF records consistently with a target Excel version, it is impossible to know which record belongs to which version. It gets far worse. If you take a look at what was made available until 1998 in MSDN Library, you'll notice two sections for BIFF :
A screenshot of the old binary specifications (MSDN Library, Feb 1998)
There are two big sections :
- BIFF main records : cells, pivot tables, formatting styles
- BIFF chart records
Here is the list of chart records (excerpt from MSDN Library, Feb 1998) :
Chart BIFF Records: Alphabetical Order
Number
Record
103A
3D: Chart Group Is a 3-D Chart Group
1051
AI: Linked Data
1050
ALRUNS: Text Formatting
101A
AREA: Chart Group Is an Area Chart Group
100A
AREAFORMAT: Colors and Patterns for an Area
100C
ATTACHEDLABEL: Series Data/Value Labels
1062
AXCEXT: Axis Options
1046
AXESUSED: Number of Axes Sets
101D
AXIS: Axis Type
1021
AXISLINEFORMAT: Defines a Line That Spans an Axis
1041
AXISPARENT: Axis Size and Location
1017
BAR: Chart Group is a Bar or Column Chart Group
1033
BEGIN: Defines the Beginning of an Object
1061
BOPPOP: Bar of Pie/Pie of Pie Chart Options
1067
BOPPOPCUSTOM: Custom Bar of Pie/Pie of Pie Chart Options
1020
CATSERRANGE: Defines a Category or Series Axis
1002
CHART: Location and Overall Chart Dimensions
1014
CHARTFORMAT: Parent Record for Chart Group
1022
CHARTFORMATLINK: Not Used
101C
CHARTLINE: Drop/Hi-Lo/Series Lines on a Line Chart
1063
DAT: Data Table Options
1006
DATAFORMAT: Series and Data Point Numbers
1024
DEFAULTTEXT: Default Data Label Text Properties
103D
DROPBAR: Defines Drop Bars
1034
END: Defines the End of an Object
1060
FBI: Font Basis
1026
FONTX: Font Index
1032
FRAME: Defines Border Shape Around Displayed Text
1066
GELFRAME: Fill Data
104E
IFMT: Number-Format Index
1015
LEGEND: Legend Type and Position
1043
LEGENDXN: Legend Exception
1018
LINE: Chart Group Is a Line Chart Group
1007
LINEFORMAT: Style of a Line or Border
1009
MARKERFORMAT: Style of a Line Marker
1027
OBJECTLINK: Attaches Text to Chart or to Chart Item
103C
PICF: Picture Format
1019
PIE: Chart Group Is a Pie Chart Group
100B
PIEFORMAT: Position of the Pie Slice
1035
PLOTAREA: Frame Belongs to Plot Area
1064
PLOTGROWTH: Font Scale Factors
104F
POS: Position Information
103E
RADAR: Chart Group Is a Radar Chart Group
1040
RADARAREA: Chart Group Is a Radar Area Chart Group
1048
SBASEREF: PivotTable Reference
101B
SCATTER: Chart Group Is a Scatter Chart Group
105B
SERAUXERRBAR: Series ErrorBar
104B
SERAUXTREND: Series Trendline
105D
SERFMT: Series Format
1003
SERIES: Series Definition
1016
SERIESLIST: Specifies the Series in an Overlay Chart
100D
SERIESTEXT: Legend/Category/Value Text
104A
SERPARENT: Trendline or ErrorBar Series Index
1045
SERTOCRT: Series Chart-Group Index
1044
SHTPROPS: Sheet Properties
1065
SIINDEX: Series Index
103F
SURFACE: Chart Group Is a Surface Chart Group
1025
TEXT: Defines Display of Text Fields
101E
TICK: Tick Marks and Labels Format
1001
UNITS: Chart Units
101F
VALUERANGE: Defines Value Axis Scale
If you take a look at what Microsoft makes available under the covenant not-to-sue, the BIFF documentation and all others, you'll soon realize that chart records are not included. In other words, a third-party can only, at best, implement support for Excel BIFF spreadsheets with no charts. Not exactly good news!
It actually gets worse.
A number of records remain absent
Obvious misses
Records such as those identified by 0x00EF and 0x01BA remain absent from the documentation. Those are records automatically created when a new Excel spreadsheet gets created with any Excel 97 (or more recent) version. These are obvious records, but a number of others are absent.
Trickier misses
Function byte codes are not documented. For instance, if you use a function such as
CEILING(number, significance)
, you have no way to know with the documentation what bytecode this function gets encoded to since the table of function bytecodes isn't provided. There are over 300 built-in functions.Even trickier is the fact that, in order to read/store or migrate a function through an extension mechanism (NAME or EXTERNALNAME records), or to build an Excel-like application, the actual function name string is used, but :
- the list of function names is not in the documentation (note : it is given in another document, ECMA 376)
- the list of actually localized function names isn't provided anywhere. (note : ECMA 376 lists the functions in English-US only)
It is problematic for a third-party not to have those localized function names for instance, since there is no way an application will be able to support such functions like Excel does without that. In fact, in order to know the names, you'll have to purchase Microsoft Office language packs :
Purchasing Office language packs, the only way to know the localized function names in Excel
A language pack is $24.95 each, and there are 36 of them! ($900 in total). Anytime you install a language pack, you can open a spreadsheet where the function of interest will appear under its localized name.
It's interesting that Microsoft claims to be committed to an open standard like that! If you intend to support Excel functions in any meaningful way, don't forget to include a $900 budget.
Many non-trivial records remain partially documented
Some records are missing, but even documented records remain partially documented records. Here are two simple examples :
- BIFF documentation, OBJ record, page 153,
ftCblsData (12h) check box data
Offset | Name | Size | Contents
0 | ft | 2 | =ftCblsData (12h)
2 | cb | 2 | Length of ftCblsData
4 | (Reserved) | var | Reserved
The check box data remains undocumented. And so does most of internal objects stored inside the OBJ records. The OBJ records holds data for Form controls (listbox, edit, ...). - Office drawing documentation, page 7,
Associated with each shape is a piece of client data that keeps the shape‘s anchor, text and OLE data, as well as host specific properties. The format of this structure is host-defined.
And indeed, the msofbtClientData (F011) record isn't documented. Page 12, we learn :msofbtClientData (F011)
host-defined
host-specific data
Note that since this record has no section linked to it in the documentation, unlike a number of other records, it should be obvious for anyone from Microsoft to verify that this record isn't documented. This is unfortunate since this is an important record.
The two examples are a reflection of problems throughout all documents. What is meant by that is that, from Microsoft point of view, those documents may be "reference" documents in the sense that if you are taking a look at the corresponding source code, those documents are handy. But only they can do that. For everybody else, those documents are hints, not actual references. It is fire and motion, and it's been going on for the best part of two decades.
The migration scenario : here comes Microsoft's secret sauce
Let's take two examples showing how impossible a task it is to map binary records to XML elements. In this case, it will be calculation chains in spreadsheets.
The return of the calculation chains
If we do a full text search about calculation chains over the documents Microsoft has made available, we find two occurences in the BIFF documentation :
- page 35 : ARRAY record, "chn, 4 bytes" ; Ignore the chn field when reading the BIFF file. If a BIFF file is written, the chn field must be 00000000h.
- page 123 : FORMULA record, "chn, 4 bytes" ; The chn field should be ignored when you read the BIFF file. If you write a BIFF file, the chn field must be 00000000h.
That's about it. There is no actual definition and encoding algorithm.
In ECMA 376, part 4, a full text search about calculation chains returns the following :
- page 2084, section 3.6 calculation chain ; The cells in a workbook can be calculated in different orders depending on various optimizations and dependencies. The calculation chain specifies the order in which the cells in a workbook were last calculated. The calculation chain only deals with cells that require calculation - i.e., it only deals with cells that contain formulas. It does not track or express dependencies amongst the formulas, but rather only records the order in which the cells were last calculated. (...) Since the xml data described in this section is not strictly required, the spreadsheet application is free to ignore the order in which the calculation chain specifies calculations.
And that "definition" is followed by an example.
So here are the problems :
- How do you map the "chn" fields to a calculation chain ? In other words, how do you decode these?
- Since "chn" fields are attached to formula and array records (in binary files), and that the calculation chain part is not in any way associated to the spreadsheet sheet parts (in SpreadsheetML files), it means a mapping needs it occur. What is it?
- Both documentations make it clear that the calculation chain fields and calculation chain parts are essentially optional. But in practice, this is far from being the truth. In a previous article, I described what happens if you change the value in a cell without touching the calculation chain part : the spreadsheet is seen by Excel 2007 as a corrupt spreadsheet, it self explodes. On the other hand, deleting the calculation chain part will force a recalculation next time the spreadsheet is opened by the user, which is arbitrarily lengthy. If Excel is the only application that can ever avoid making recalculations, because it knows the secret sauce about calculation chains, no other application can be seen as an equivalent application. Quite the opposite, Excel will be seen as a first-class citizen, and everything else will be seen as simplistic (i.e. unreliable) spreadsheet applications.
Caching values to show drop lines in charts!
In binary files, adding a dropline to a chart is managed by a simple chart related record, CHARTLINE. Note that since chart records are not documented in the BIFF documentation, the following CHARTLINE definition comes from MSDN Library, Feb 1998 :
CHARTLINE: Specifies Drop/Hi-Lo/Series Lines on a Line Chart (101Ch)
This record specifies drop lines, hi-lo lines, or series lines on a line chart. If the chart has both drop lines and hi-lo lines, two CHARTLINE records will be present.
Record Data
Offset | Name | Size | Contents
4 | id | 2 | Drop lines/hi-lo lines. 0 = drop lines 1 = hi-lo lines 2 = series lines (the lines that connect the columns in a stacked column chart)
The way droplines are added is by having such record present or not. Everything else in the chart record stream is independent of the dropline.
Now let's do the following :
- start Excel 2007
- click on B2 and enter : jan, hit <TAB>, 3, <TAB>, 4, <TAB>, 5
- select those cells, click on the Insert ribbon tab and click on the line chart button. Select whatever line chart type. The line chart shows up.
- click on the Layout ribbon tab, then on Analysis, then Lines, then Drop lines. The drop lines shows up. You should see the following (the droplines are the vertical segments) :
Adding droplines to a chart - save this file as Droplines.xlsx, close it, quit Excel 2007.
- unzip the file and take a look at the xl/charts/chart1.xml part. You should see the following :
The XML droplines element is indeed added to the chart description - The <c:dropLines> element is the exact equivalent of the binary CHARTLINE record. Or so it seems. now edit this part and remove the <c:strCache> and <c:numCache> elements, which are cached values. It should result in the following :
Removing cached values in a chart description - put the edited part back in the zip file and open it in Excel 2007. The following appears :
Droplines are gone!
Droplines are gone! In other words, droplines in XML charts are linked to cached values, contrary to how droplines used to work in binary spreadsheets (namely, cached values are not needed). Not only is this not documented anywhere, it puts a huge burden on implementers who now have, at minimum, to actually compute and store a cache of values from the chart data source only to be able to show droplines, despite the fact that the XML description is designed in a way such that the presence of element <c:dropLines> is all what is expected to add droplines to a chart.
Here are the Excel 2007 files (.xslx) :
- Chart+droplines, original .xlsx file (i.e. with cached values)
- Chart+droplines, manually updated .xlsx file (i.e. cached values removed)
And here are the binary files (.xls) :
- Chart+droplines, original .xls file (i.e. with cached values)
- Chart+droplines, manually updated .xls file (i.e. cached values removed)
In a binary file, cached values are stored in the chart footer, as follows :
[DIMENSIONS 000E] 00 00 00 00 03 00 00 00 00 00 01 00 00 00
[1065 0002] 02 00
[1065 0002] 01 00
[NUMBER 000E] 00 00 00 00 00 00 00 00 00 00 00 00 08 40
[NUMBER 000E] 01 00 00 00 00 00 00 00 00 00 00 00 10 40
[NUMBER 000E] 02 00 00 00 00 00 00 00 00 00 00 00 14 40
[1065 0002] 03 00
Conclusion
A couple of counter examples have demonstrated that Microsoft Office document interoperability is wishful thinking at this point. The documents made available by Microsoft for direct download won't and shouldn't impress third-party implementers since it does not help much.
What is being shown is that in addition to missing documentation, the binary documentations sometimes conflict with the ECMA 376 documentation, itself not a full documentation of the new XML-based formats anyway.
Who other than Microsoft is able to reliably migrate formats from binary to XML, and back ?
Microsoft seems to have anticipated the issue by creating...an open source project in Feb 15 whose task is to migrate files. But it's unclear that, short of rewriting the existing Microsoft Office source code, which could take a decade, how this project is supposed to achieve any meaningful goal. Obviously, ISO national bodies should take this open source project with a grain of salt...
A national body should probably ask instead Microsoft to open up the source code of the Microsoft Office compatibility pack since, after all, that is exactly what this component does. It migrates Microsoft Office documents back and forth, contains what's needed to migrate calculation chains, droplines, etc. and is therefore a perfect candidate for third-party implementations. Even better, Microsoft can make it available now!