Friday, May 14, 2010

Microsoft Excel is not quite as accepting

One of the rules of thumb of good programming is: be liberal in what you accept (in terms of input data) and be conservative in what you generate. In other words, don't be fussy with the data coming in because the poor folks who are preparing the data may not understand all of your validation rules. However, be very particular when creating output, because the poor folks consuming your output may not understand all of the possible combinations of your data.

Microsoft didn't follow this rule with their OOXML parsers in MS-Excel. At least, that how it appears to me.

The OOXML formats are advertised as "open", meaning that they specifications are available and people are free to use the files outside of MS-Excel. The OOXML specification is not for a the faint of heart, though. With determination and stamina, one learns that the MS-Excel .xlsx file is a ZIP file with several contained XML files.

Reading an .xlsx file is fairly straightforward. You unzip the container file and then pick through the unpacked XML files with whatever parser you like.

Creating an .xlsx file is a bit trickier. I ran into problems last year, when I was using Ruby to create such a file. I used an MS-Excel-generated file as a model, and created the necessary XML files with the content and cross-references within files. I then zipped the files together and handed the resulting file to MS-Excel. Excel wanted nothing to do with the file. (Yet Open Office would happily open the file and render the content properly.)

Even the simple act of unzipping an .xlex file and then re-zipping it yields a file unacceptable to MS-Excel.

Eventually I was able to create an .xlsx file, by creating an empty ZIP file and adding files to it, one at a time.

The problem is not the contents of the XML files, but the contents of the ZIP file. ZIP files contain not only the (compressed) files but also directory entries. The popular WinZIP program, when zipping up a tree of files, records the files and the directory names of the files, and the directory names get their own entry in the ZIP file.

MS-Excel is unprepared to handle directory entries in the .xlsx file. Creating the ZIP file "by hand" (and not with WinZIP) yields a file without directory entries.

The interesting part in all of this is the behavior of Open Office. It handled the "incorrect" file correctly. Open Office is more accepting of its input than Microsoft Excel. If I wanted to, I could create an .xlex file that was readable by Open Office and not readable by Microsoft Office.


No comments:

Post a Comment