Coding Recipes: Understanding basic office document manipulation

This post is for anyone who wants to understand, at a very basic level, how Office documents are structured as XML.

The first thing you should know is that modern Office documents (.docx, .xlsx, .pptx) are actually ZIP archives containing a collection of XML files. The names of the files and folders inside the archive differ depending on the type of Office document you are working with (Word, PowerPoint, Excel, etc.). In this tutorial we are going to work with a Word document.

If you open a very basic Word document with a decompression tool (I used WinRAR in this example), you will see the following contents.
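If you would rather not use a GUI tool, the same listing can be produced with a few lines of Python. This is only a minimal sketch using the standard zipfile module; the function name list_parts is my own and not part of any Office API.

```python
import zipfile


def list_parts(docx_path):
    """Return the names of all files stored inside an Office document."""
    # A .docx is an ordinary ZIP archive, so the standard zipfile module can read it.
    # docx_path may be a file path or an open binary file object.
    with zipfile.ZipFile(docx_path) as package:
        return sorted(package.namelist())
```

Calling list_parts on a simple Word document typically returns entries such as [Content_Types].xml, _rels/.rels and word/document.xml.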

The _rels folder contains all the relationship files. These relationships are used to link the XML files to one another.

If you go into the word folder you will see a few different XML files.

Inside the word folder there is another _rels folder, where you will find a file corresponding to document.xml called document.xml.rels.

I added a chart to my document, and you can see that some new folders have been added. Opening document.xml, you will see the following XML structure.

The root node is "document" and directly underneath it you will find the "body". This is the standard structure for this file. The body mainly contains paragraphs (w:p), but there are a few other nodes; as you can see, at the bottom there is a "w:sectPr" node. This is a section properties node, which contains information about the page layout (size, margins, columns, headers, footers, etc.). This node is always found at the bottom of the body node and describes the final section. If you insert a section break inside your document, you will find additional w:sectPr nodes, each stored in the paragraph properties (w:pPr) of the paragraph that ends its section.
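To see this structure without reading the raw XML, you can list the direct children of the body node. The following is a rough sketch using Python's built-in ElementTree; SAMPLE is a hand-written, heavily trimmed stand-in for a real document.xml, and body_child_tags is a name I made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def body_child_tags(document_xml):
    """Return the local tag names of the direct children of w:body, in order."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return []
    # ElementTree stores tags as "{namespace}local"; keep only the local part.
    return [child.tag.split("}", 1)[-1] for child in body]


# Hand-written sample, far smaller than what Word actually produces.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Hello</w:t></w:r></w:p>
    <w:p/>
    <w:sectPr/>
  </w:body>
</w:document>"""

print(body_child_tags(SAMPLE))  # ['p', 'p', 'sectPr']
```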

In this example I have inserted a chart. This added a w:drawing element, which contains information about the chart. The data for the actual chart, however, is stored elsewhere. To reference this data, the c:chart element has an r:id attribute with a value of "rId5". If I open the document.xml.rels file, I can see that this id points to the file charts/chart1.xml.
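The lookup from an id to a file can be sketched in Python as well. SAMPLE_RELS below is a hand-written fragment modelled on the rId5 entry described above, not the full contents of a real document.xml.rels, and relationship_targets is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Default namespace used by every .rels file in the package.
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def relationship_targets(rels_xml):
    """Map each relationship Id to the Target it points at."""
    root = ET.fromstring(rels_xml)
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.findall(f"{{{REL_NS}}}Relationship")
    }


# Hand-written sample with a single chart relationship.
SAMPLE_RELS = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId5"
    Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/chart"
    Target="charts/chart1.xml"/>
</Relationships>"""

print(relationship_targets(SAMPLE_RELS)["rId5"])  # charts/chart1.xml
```

Note that the target is relative to the folder of the part that owns the relationships, so here it resolves to word/charts/chart1.xml inside the archive.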

When you open this file in Word, it deserializes these XML files into COM objects and displays the document. If the XML markup does not correspond to the objects, you will get a corruption error in Word. For example, if I delete the charts folder and try to open the file, I will get the following error:
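One way to catch this kind of problem before opening the file in Word is to check that every relationship target actually exists in the archive. The sketch below only looks at word/_rels/document.xml.rels and makes no claim about how Word itself validates a document; missing_targets is a name I invented for this example.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def missing_targets(docx_path, rels_name="word/_rels/document.xml.rels"):
    """Return (Id, part name) pairs for internal targets absent from the archive."""
    with zipfile.ZipFile(docx_path) as package:
        names = set(package.namelist())
        root = ET.fromstring(package.read(rels_name))

    # Relative targets are resolved against the folder of the owning part,
    # which is the parent of the _rels folder ("word" for document.xml).
    base = posixpath.dirname(posixpath.dirname(rels_name))

    missing = []
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        if rel.get("TargetMode") == "External":
            continue  # hyperlinks and similar point outside the package
        target = rel.get("Target")
        if target.startswith("/"):
            part = target.lstrip("/")  # absolute path from the package root
        else:
            part = posixpath.normpath(posixpath.join(base, target))
        if part not in names:
            missing.append((rel.get("Id"), part))
    return missing
```

For the document described above with the charts folder deleted, this should report rId5 together with word/charts/chart1.xml.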

You can also modify the contents of an XML file manually by extracting it from the docx, modifying the contents, and dragging it back into WinRAR. This is quite handy when you are trying to troubleshoot.
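The same round trip can be scripted. Python's zipfile module cannot overwrite an entry in place, so this sketch copies every entry into a new archive and swaps in the edited one along the way. The function name replace_part and its parameters are my own, and entry timestamps are not preserved.

```python
import zipfile


def replace_part(src_path, dst_path, part_name, new_bytes):
    """Copy an Office document, replacing the content of one part."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == part_name:
                data = new_bytes  # the edited XML
            else:
                data = src.read(item.filename)  # everything else is copied as-is
            dst.writestr(item.filename, data)
```

For example, replace_part("in.docx", "out.docx", "word/document.xml", edited_xml) would write a copy of in.docx with only document.xml changed; the file names here are placeholders.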

Tuesday, December 3, 2013
