Microsoft's Open XML standard for Office documents is controversial. Depending on your point of view, it is either a breakthrough in opening up the format in which the world's most popular office suite saves documents, or a cynical ploy to counter the momentum behind OpenOffice and its rival OpenDocument Format (ODF) standard. Amidst all the arguments, it is easy to lose sight of the simple fact that, whatever its merits, Microsoft Office does now save by default in an XML format.
When Office 2007 came out, it was not really sensible to use the new XML formats, such as .docx for Word and .xlsx for Excel, outside the safety of an intranet. If you sent one as an email attachment, there was a good chance the recipient would not be able to read it. Now, though, we have Office 2010 and SharePoint 2010, and Open XML has become both more important and less troublesome. It is the only format fully supported by the new Office Web Apps, for example. Further, most of us have had to come to terms with Open XML documents, at least to the extent that we can open them. Maybe it is time to see if the new formats might actually be useful.
The most obvious advantage of XML is that you can read and write documents using standard XML libraries. There is no need to have Microsoft Office installed, or to try to run it behind the scenes on a web server in order to parse or generate documents. That said, an Open XML document is not a single file akin to an HTML page. Instead, it is a set of related XML files bundled together as a ZIP archive, using another Microsoft standard called the Open Packaging Conventions (OPC). If you are curious, you can rename a .docx to .zip, extract the files, and have a look.
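To make the "ZIP archive of XML files" point concrete, here is a minimal sketch in Python using only the standard library. It is not taken from the SDK. It assembles what I understand to be the three parts a bare WordprocessingML package needs (the content types list, the root relationships file and the main document part), then reads the package back as an ordinary ZIP. The function names are my own, and I have not checked the result against every version of Word.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

# Maps file extensions and part names to content types.
CONTENT_TYPES = XML_DECL + (
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Package-level relationship pointing at the main document part.
ROOT_RELS = XML_DECL + (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

# One paragraph containing one run containing one text element.
DOCUMENT = XML_DECL + (
    '<w:document xmlns:w="{ns}"><w:body><w:p><w:r>'
    '<w:t>{text}</w:t>'
    '</w:r></w:p></w:body></w:document>'
)


def build_minimal_docx(text):
    """Return the bytes of a bare .docx package holding one paragraph."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as pkg:
        pkg.writestr("[Content_Types].xml", CONTENT_TYPES)
        pkg.writestr("_rels/.rels", ROOT_RELS)
        pkg.writestr("word/document.xml",
                     DOCUMENT.format(ns=W_NS, text=escape(text)))
    return buf.getvalue()


def list_parts(docx_bytes):
    """List the files inside the package, as renaming to .zip would show."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        return pkg.namelist()


def read_text(docx_bytes):
    """Collect the content of every w:t element in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as pkg:
        root = ET.fromstring(pkg.read("word/document.xml"))
    return [t.text for t in root.iter("{%s}t" % W_NS)]
```

Writing the bytes from build_minimal_docx to a file with a .docx extension is the reverse of the rename-to-.zip trick. A document saved by Word itself contains many more parts than these three.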
If you have in mind to generate Office documents using an XML library, the bad news is that it is not a trivial thing to get right. The good news, at least for .NET developers, is that Microsoft has an Open XML SDK that wraps this complexity. The further bad news, though, is that the SDK does not free you from having to understand how an Open XML document is put together.
I discovered this for myself after downloading the latest version of the SDK, newly updated for Office 2010. I wanted to investigate how easy it would be to have a web application generate Word documents for download. It turned out to be frustrating. There is what looks like a nice help document with the SDK, including Getting Started articles and a complete reference. Unfortunately, it does not tell the whole story. I was able to generate a simple document in no time, but was soon troubleshooting issues. For example, I found that whitespace at the end of text snippets was getting truncated, so that words ran together, until I figured out that a text element needs the xml:space="preserve" attribute, wrapped by the .NET library as SpaceProcessingModeValues.Preserve. Using styles was also tricky. I followed the supplied article on applying a style to a paragraph, but although I used a standard Word style it was ignored. The reason, it turned out, is that styles are no use without a StyleDefinitionsPart that contains definitions for the styles you will use.
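The whitespace problem is easier to see at the XML level. The following is a rough sketch in Python rather than the SDK's C#, with helper names of my own invention. It marks a text element with xml:space="preserve" whenever its content starts or ends with whitespace, which is the attribute the SDK exposes as SpaceProcessingModeValues.Preserve.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize WordprocessingML elements with the conventional "w" prefix.
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag name with the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, tag)


def make_run(text):
    """Build a w:r element holding one w:t element."""
    run = ET.Element(w("r"))
    t = ET.SubElement(run, w("t"))
    t.text = text
    # Leading or trailing whitespace is only kept if the text element
    # carries xml:space="preserve".
    if text != text.strip():
        t.set("{%s}space" % XML_NS, "preserve")
    return run


def make_paragraph(snippets):
    """Build a w:p element with one run per text snippet."""
    p = ET.Element(w("p"))
    for snippet in snippets:
        p.append(make_run(snippet))
    return p


paragraph_xml = ET.tostring(make_paragraph(["Hello ", "world"]),
                            encoding="unicode")
```

Here only the first run, "Hello " with its trailing space, gets the attribute. Setting it on every text element would presumably be harmless and simpler; the conditional just shows where it matters.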
In the end, the SDK documentation is little more than the famously lengthy Open XML specification, presented as classes and members instead of plain XML. There are few useful code examples.
Open XML is also verbose, and makes you realise how concise HTML is by comparison. You can embolden a word in HTML with a simple <b> element, but in Open XML you have to add a RunProperties element to the Run element that has your text, and then add the Bold element to RunProperties. Microsoft mitigates this verbosity by using very short element names; it is also one of the reasons for using ZIP compression on the final bundle.
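As an illustration of that nesting, this Python sketch builds the raw XML for a bold run: a run (w:r) containing run properties (w:rPr), which contain the bold element (w:b), followed by the text (w:t). The helper names are mine, and the SDK's RunProperties and Bold classes wrap the same elements.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag name with the WordprocessingML namespace."""
    return "{%s}%s" % (W_NS, tag)


def bold_run(text):
    """Build a run whose text is bold."""
    run = ET.Element(w("r"))
    props = ET.SubElement(run, w("rPr"))   # run properties come first
    ET.SubElement(props, w("b"))           # the bold flag is an empty element
    t = ET.SubElement(run, w("t"))         # the text follows the properties
    t.text = text
    return run


html = "<b>important</b>"
open_xml = ET.tostring(bold_run("important"), encoding="unicode")
```

Comparing the two strings shows the difference in size even for a single word, and that is with the short element names.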
It does get better, once you have done your homework on the OPC and Open XML, and created your own wrapper code for the .NET API. There is also a fantastic Productivity Tool included with the SDK, with which you can open and inspect Open XML documents. Its best feature is called Reflect Code, and generates the C# which creates the document you have opened. This means you can get a working example for any document, which can also be used as a starting point for your own document generation. For example, you can set up a document with some dummy text using the styles you need, generate the code, and then amend it to replace the dummy text with your own dynamic content. The only snag is that an innocent-looking and nearly blank Word document includes a large amount of metadata, themes and other stuff, so the generated code is over 2,000 lines long.
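The dummy-text idea can also be tried without generated code, by working on the package directly. The sketch below is not what Reflect Code produces. It is a cruder alternative of my own that copies every part of a template .docx unchanged, except for the main document part, where it swaps placeholders of the form {{name}} for real values. The placeholder syntax and the fill_template function are assumptions for illustration. It also assumes the document part is UTF-8 and that each placeholder sits within a single run, which Word does not guarantee because it may split text across runs.

```python
import io
import zipfile
from xml.sax.saxutils import escape

DOCUMENT_PART = "word/document.xml"


def fill_template(template_bytes, values):
    """Return a copy of the package with {{key}} placeholders replaced."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == DOCUMENT_PART:
                xml = data.decode("utf-8")
                for key, value in values.items():
                    # Escape the value so characters such as & and <
                    # cannot break the XML.
                    xml = xml.replace("{{%s}}" % key, escape(value))
                data = xml.encode("utf-8")
            # Every other part (styles, themes, metadata) is copied as-is.
            dst.writestr(info.filename, data)
    return out.getvalue()
```

Because the styles, themes and metadata travel along untouched, this avoids the 2,000 lines of generated code, at the cost of being limited to simple text substitution.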
Microsoft could do a better job with the SDK documentation, but this is still a powerful tool for parsing and generating documents, and delivers a means of processing Microsoft Office files on the server without Office or SharePoint installed. If you want to know more, the official resource site is here. Finally, a commercial alternative with a more developer-friendly API for both binary and Open XML Office documents is Syncfusion's Essential DocIO; I've not used it but have heard good reports.