Wednesday, April 27, 2016

Data tools pay for themselves over time

Several years ago I worked on a project that required we pick apart Excel .XLS files at the byte level. We didn't use Microsoft's COM components to read the file. We didn't use a third-party library. We read the file (in C++ and in binary mode) and did exactly what Microsoft recommends we do not do: we parsed the file ourselves.

As part of this effort, I built a small utility to "dump" the contents of an Excel file. The common programs extract cell values, but I wanted something more detailed. I wanted a list of the various BIFF records in the file. (Microsoft Excel .XLS files consist of a series of binary records, each record describing some aspect of the contents. Each record contains a type field, a length field, and a set of fields that depends on the type of record, and some fields have a variable size. It's a compact and powerful format.)

The utility program (called "DumpXls", cleverly enough) was useful to identify the different records needed to construct a proper .XLS file. It took some time to create, time that turned out to be an investment.

It was a useful investment, because this past week I had a problem with the Python 'xlwt' library. (The 'xlwt' library lets Python create an .XLS file.) My old "DumpXls" utility helped me diagnose the problem and find a solution... and quickly. The time spent creating that program years ago was more than made up by the time saved in finding this week's problem.

The "dump" program was simple, yet it required a good understanding of the file format. When working with a new file format, it helps one understand the format to build such a "dump" program. The program is useful, as is the knowledge of the format.