Genii Weblog

OpenXML4J Scenarios

Fri 1 Jun 2007, 06:01 PM

by Ben Langhinrichs
Brian Jones wrote today about the Office Open XML API for Java, or OpenXML4J, which he describes as an <blockquote>open source project to create a Java library for consuming and generating files in the Open XML formats</blockquote> The most interesting part of the site to me is the scenarios page, which has some potential scenarios for how you would use the toolkit, along with very spiffy graphics. For example, the following two scenarios are included:

Confidential information removal
Remove comments, annotations, document properties, personal information, presentation notes, tracked changes, ... from outbound documents.
[Image: Scenario remove content outbound]
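
To make the outbound scenario concrete, here is a minimal sketch of one small piece of it: blanking personal fields in the document properties of an Open XML package. It uses only Python's standard library, not OpenXML4J (whose API I am not assuming anything about). The function name `scrub_core_properties` and the choice of fields in `PERSONAL_FIELDS` are my own assumptions for illustration; comments, tracked changes, and presentation notes live in other parts and are not touched here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core properties part of an Open XML package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

# Hypothetical policy: which fields count as personal information.
PERSONAL_FIELDS = ["dc:creator", "cp:lastModifiedBy"]
CORE_PART = "docProps/core.xml"


def scrub_core_properties(package_bytes: bytes) -> bytes:
    """Return a copy of the package with personal fields in core.xml emptied."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                for field in PERSONAL_FIELDS:
                    # These fields are direct children of cp:coreProperties.
                    for element in root.findall(field, NS):
                        element.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            # Every other part is copied through unchanged.
            dst.writestr(item, data)
    return out.getvalue()
```

The part is edited in place rather than deleted, so the package's content types and relationships stay consistent. Calling `scrub_core_properties(open("report.docx", "rb").read())` (with a hypothetical file name) returns the cleaned package as bytes.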

Macros removal
For security purposes, remove macros, inappropriate language and content from inbound documents.
[Image: Scenario remove content inbound]
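
For the inbound scenario, the first step is simply noticing that a document carries macros at all. The sketch below, again in plain Python rather than OpenXML4J, flags an Open XML package that contains a VBA project part. It assumes the conventional part name `vbaProject.bin` that Office uses in macro-enabled files; a more robust check would follow the package's content types and relationships instead of trusting the file name. The function name `has_vba_macros` is hypothetical. Actually removing the macros is more involved, since the content types and relationships that point at the part must be updated too.

```python
import io
import zipfile

# Conventional file name of the VBA project part in macro-enabled Open XML files
# (for example word/vbaProject.bin or xl/vbaProject.bin). Assumed, not guaranteed.
VBA_PART_SUFFIX = "vbaproject.bin"


def has_vba_macros(package_bytes: bytes) -> bool:
    """Return True if any part in the package looks like a VBA project."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return any(
            name.lower().endswith(VBA_PART_SUFFIX)
            for name in package.namelist()
        )
```

A mail gateway or import agent could call this on each inbound attachment and quarantine anything for which it returns `True`.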

The question in my mind is, are these scenarios generalizable to other office formats, e.g., ODF?  Mostly, they are.  So, what are various ways to address these same general scenarios with Notes 8 and ODF?  Are they applicable (not just these two, but all of them)?  If so, are they solvable with current toolkits, or toolkits currently planned, or would they require something more, whether open source or proprietary, such as OpenSesame?

Copyright 2007 Genii Software Ltd.

What has been said:

596.1. Ian Randall
(06/03/2007 08:49 PM)

I assume that this blog topic was triggered by the changes (that came into effect on December 1st 2006) to the Federal Rules of Civil Procedure (FRCP), which require organizations to adapt how they manage, retain, store and deliver electronically stored information (ESI) during the eDiscovery phase of legal proceedings.

In particular, organizations need to ensure that corporate policies and procedures for document and email retention, eDiscovery readiness, and metadata management comply with the new FRCP rules.

This is likely to be a very major international issue, impacting Lotus Notes email, integration with MS Office files, ODF Editors etc.

If multi-national organisations get this wrong, then the FRCP changes can have huge consequences for their global infrastructure and may also stop organisations (even outside of the US) from carrying on normal IT operations such as deleting spam, setting storage limits on email etc.

Deleting metadata from MS Office Documents (or any other file format) is only the tip of the iceberg.

596.2. Ian Randall
(06/07/2007 01:13 AM)

Modifying the file format of some large Word documents might not be such a bad idea.

I did a test today converting a 60MB Word document into ODF format using the IBM ODF editor. Ended up being 5.6MB, with reasonable fidelity after the conversion.

The only major issue that I had with formatting was the Table of Contents. OpenOffice 2.0 did a better job of converting the Table of Contents and reduced the file to almost exactly the same size (5.6MB).

I did, however, make the mistake of trying to convert the same open (unsaved) document to PDF using the IBM ODF editor and then trying to print it out with a PDF print driver (rather than the built-in PDF converter), which exploded the file to over 2GB before I had to abort the process after 15 minutes. I guess that is what Beta code is meant to do.

I wonder if converting Word .doc format to OpenXML is going to produce similar results?