Wednesday, November 5, 2008

Adding metadata inside an OOXML document

I spent the day getting some metadata into a .docx file by poking it into the underlying WordprocessingML files -- principally the document.xml file. It's been pretty miserable, so I'm posting a few things in case you're having the same kind of day.

First, be aware that Word 2007 throws non-descriptive errors if the document contains any extra tags (it tolerates extra attributes but promptly throws them away on save), and that the .docx file has to be re-created with very specific .zip options. For development I'm viewing and editing the .docx files with a nice free tool called Package Explorer; for production I'm inserting them into my company's product, MarkLogic Server, which also unpacks and manages the .docx archive.
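
If you'd rather script the re-zipping than do it by hand, here is a minimal Python sketch. I haven't listed the exact zip options Word insists on, so the choices here (DEFLATE compression, same entry names in the same order, no directory entries) are assumptions, and replace_part is just a name made up for this example.

```python
import io
import zipfile


def replace_part(src_docx: bytes, part_name: str, new_xml: bytes) -> bytes:
    """Return a copy of a .docx package with one part swapped out."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_docx)) as src, \
            zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():  # infolist() keeps the original entry order
            if info.is_dir():
                continue  # assumption: explicit directory entries are not needed
            if info.filename == part_name:
                data = new_xml
            else:
                data = src.read(info.filename)
            # Write by name only, so sizes and CRCs are recomputed for the new data.
            dst.writestr(info.filename, data)
    return out.getvalue()
```

Typical use would be to read the .docx bytes from disk, call replace_part(data, "word/document.xml", edited_xml), and write the result back out under a new name.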

Secondly, there are three kinds of tags for custom-extension metadata: smartTag, customXml and sdt. My brief description of all three:

<sdt>
A reference to some data in another XML file inside the .docx zip archive. I didn't do much with this format.


<customXml>
This is a (possibly validated?) representation of XML where you can specify a schema of your choice for validating your metadata. This does NOT allow you to put the XML you want directly inside the OOXML; it only lets you encode your own XML using OOXML tags. E.g. instead of <myUri:myElem>myData</myUri:myElem>

you have to do something like:


<w:customXml w:uri="myUri" w:element="myElem">
<w:r><w:t>myData which I'm annoyed is showing up in the word doc</w:t></w:r>
</w:customXml>
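
Going the other way (from the wrapper back to the element you actually meant) is mechanical. Here's a rough Python sketch using the standard library's ElementTree; decode_custom_xml is a hypothetical helper name, and I'm assuming the payload is just the concatenated text of the w:t runs inside the wrapper.

```python
import xml.etree.ElementTree as ET

# The WordprocessingML main namespace (the "w:" prefix).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def decode_custom_xml(node: ET.Element) -> ET.Element:
    """Turn one w:customXml wrapper into the element it stands for."""
    uri = node.get(f"{{{W}}}uri", "")
    name = node.get(f"{{{W}}}element")
    # Put the element in the w:uri namespace, or in no namespace if uri is empty.
    plain = ET.Element(f"{{{uri}}}{name}" if uri else name)
    # Assumption: the data is the visible text of every w:t inside the wrapper.
    plain.text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
    return plain


SAMPLE = f"""<w:customXml xmlns:w="{W}" w:uri="myUri" w:element="myElem">
  <w:r><w:t>myData</w:t></w:r>
</w:customXml>"""

elem = decode_custom_xml(ET.fromstring(SAMPLE))
print(elem.tag, elem.text)
```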


One catch with customXml is that you need to put an entry into settings.xml for every URI that you use. E.g. even to skip the uri attribute (uri=""), you must add <w:attachedSchema w:val=""/> to settings.xml. Doug Mahugh says that is a bug in Word 2007, btw.

<smartTag>
smartTag seems to be what I was looking for, but it is the least blogged about or otherwise documented of the three outside the spec. It's in part 3 of the spec on page 19. One trick here is that MS Word 2007 (all hail) seems to discard smartTag elements that wrap paragraphs when it saves. Instead, I put the tag inside a paragraph (w:p) as a sibling of the run (w:r) tags, and it worked.

With smartTag you can specify some meaningless URI as a namespace and an arbitrary element name, then put in whatever data you want, and it won't show up in your Word doc. OTOH, you still have to use WordprocessingML/OOXML to awkwardly encode your XML as attributes:

<w:smartTag w:uri="http://schemas.openxmlformats.org/2006/smarttags"
w:element="stockticker">
<w:smartTagPr>
<w:attr w:name="fullCompanyName" w:val="Google"/>
</w:smartTagPr>
</w:smartTag>

In the above example, you really mean to say: <stockticker fullCompanyName="Google"> but you have to meta-encode it into the other XML format instead. Fortunately, you can use pretty trivial XQuery (or XSLT if you prefer) to convert it back.
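
I'd do the conversion in XQuery myself, but to show how little there is to it, here's the same idea as a Python sketch with ElementTree. decode_smart_tag is a made-up name, and I'm assuming the plain element simply drops the w:uri namespace, as in the <stockticker> example above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def decode_smart_tag(node: ET.Element) -> ET.Element:
    """Rebuild the plain element that a w:smartTag is meta-encoding."""
    plain = ET.Element(node.get(f"{{{W}}}element"))
    # Each w:attr under w:smartTagPr becomes an ordinary attribute.
    for attr in node.findall("w:smartTagPr/w:attr", NS):
        plain.set(attr.get(f"{{{W}}}name"), attr.get(f"{{{W}}}val"))
    return plain


SAMPLE = f"""<w:smartTag xmlns:w="{W}"
    w:uri="http://schemas.openxmlformats.org/2006/smarttags"
    w:element="stockticker">
  <w:smartTagPr>
    <w:attr w:name="fullCompanyName" w:val="Google"/>
  </w:smartTagPr>
</w:smartTag>"""

plain = decode_smart_tag(ET.fromstring(SAMPLE))
print(ET.tostring(plain, encoding="unicode"))
```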

For completeness I should mention that you can also squirrel additional data into the .docx zip archive if the data is at the document level rather than paragraph or block level.

BTW, the overall point of all this is that I can now search the XML that is implicitly authored with MS Word for specific paragraphs based on my custom tags. I'm going to use XQuery (including XPath) to do this against an XML database that holds both the binary zipped form of the .docx files and the unzipped XML content from Word.
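
The real thing will be XQuery running inside the database, but the shape of the search is easy to show in a few lines of Python. This is only a sketch: paragraphs_tagged is a hypothetical name, the sample document and its http://example.com/tags URI are invented, and it assumes the smartTag sits directly inside w:p as a sibling of the runs (the placement that survived a save for me).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs_tagged(document_xml: str, element_name: str) -> list[str]:
    """Return the text of every w:p that carries a smartTag with this element name."""
    root = ET.fromstring(document_xml)
    hits = []
    for p in root.iter(f"{{{W}}}p"):
        tags = p.findall("w:smartTag", NS)  # direct children of the paragraph only
        if any(t.get(f"{{{W}}}element") == element_name for t in tags):
            hits.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return hits


SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
  <w:p>
    <w:smartTag w:uri="http://example.com/tags" w:element="stockticker">
      <w:smartTagPr><w:attr w:name="fullCompanyName" w:val="Google"/></w:smartTagPr>
    </w:smartTag>
    <w:r><w:t>Search ads grew.</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Unrelated paragraph.</w:t></w:r></w:p>
</w:body></w:document>"""

print(paragraphs_tagged(SAMPLE, "stockticker"))
```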
