Genii Weblog

OpenSesame: David vs. Goliath

Mon 11 Jun 2007, 08:51 PM

by Ben Langhinrichs
Speed isn't everything, and I was a bit worried about how to test OpenSesame with large files.  The question is, what is large but not simply fabricated?  It is not too hard to create a large spreadsheet (see OpenSesame: Edit view in spreadsheet), but that is not terribly complex content.  Besides, I want to work with word processing documents.  I decided to work with the ODF 1.0 specification itself, which is about 691 pages long with different formatting, images, etc.  If I could get a copy of Microsoft's Office Open XML 1.0 specification, I'd try that, because at 6000+ pages, you don't get much bigger, but I don't have a copy in ODF format.

Anyway, the task OpenSesame has to accomplish is to unzip the various XML parts in the OpenDocument-v1.odt and load them into memory.  The main piece is the content.xml, which is 4,517,062 bytes, and it has to be parsed and loaded into a DOM tree.  OpenSesame performed beautifully, unzipping, parsing, loading, traversing the tree to retrieve a bit of text (just to be sure it was working properly), and removing the tree from memory, all in under a second and without any glitches.  I wondered how the NotesDOMParser would handle a similar task.  Since there is no class for unzipping, I just started with the content.xml and had the NotesDOMParser load, parse and unload.  It took 9 seconds.  OK, speed isn't everything, but it doesn't hurt.
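For readers who want to try the same kind of measurement themselves, here is a minimal sketch of the steps described above (unzip content.xml, parse it into a DOM tree, pull out a bit of text, free the tree) using only the Python standard library. It is not how OpenSesame or the NotesDOMParser does it; the function names and the tiny in-memory sample package are made up for illustration, and the timings will not be comparable to the numbers above.

```python
import io
import time
import zipfile
from xml.dom import minidom


def first_text(node):
    """Depth-first search for the first non-blank text node under `node`."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE and child.data.strip():
            return child.data.strip()
        found = first_text(child)
        if found:
            return found
    return None


def time_dom_load(odt_file):
    """Unzip content.xml, parse it into a DOM tree, read some text, free the tree.

    `odt_file` is a path or a file-like object. Returns (text, elapsed_seconds).
    """
    start = time.perf_counter()
    with zipfile.ZipFile(odt_file) as package:
        raw = package.read("content.xml")   # decompress the main part into memory
    dom = minidom.parseString(raw)          # build the full DOM tree
    text = first_text(dom.documentElement)  # traverse to confirm the tree is usable
    dom.unlink()                            # release the tree
    return text, time.perf_counter() - start


def make_sample_odt():
    """Build a tiny stand-in package in memory (hypothetical content, not a valid ODF file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("content.xml", "<doc><p>Hello ODF</p></doc>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    text, elapsed = time_dom_load(make_sample_odt())
    print(f"first text: {text!r}, elapsed: {elapsed:.4f}s")
```

To time a real document, pass the path to an .odt file to time_dom_load in place of the sample package.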

So, now I need some good stress tests.  Should I do a ReplaceText and replace every letter 'a' with the word 'zoo' (since that would expand the text values, it would put more stress than replacing the letter 'a' with the letter 'z')?  Should I do a LinkMatching across the whole document and provide contextual links to websites?  Should I create a really mungo Notes document by converting the ODF to rich text?  What would be good tests to run to see how well OpenSesame works under pressure?
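To make the first idea concrete, here is a rough sketch of the 'a' to 'zoo' replacement over a DOM tree, again in plain Python with minidom. It is only an illustration of the kind of test, not OpenSesame's ReplaceText; the function name is hypothetical, and it touches text nodes only, so element names and attribute values are left alone.

```python
from xml.dom import minidom


def replace_in_text_nodes(node, old, new):
    """Replace `old` with `new` in every text node under `node`.

    Returns the number of replacements made. Attributes are not modified.
    """
    count = 0
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            count += child.data.count(old)
            child.data = child.data.replace(old, new)  # text grows when new is longer
        else:
            count += replace_in_text_nodes(child, old, new)
    return count


if __name__ == "__main__":
    dom = minidom.parseString("<doc><p>a banana</p><p class='a'>tea</p></doc>")
    replaced = replace_in_text_nodes(dom.documentElement, "a", "zoo")
    print(replaced, dom.documentElement.toxml())
    dom.unlink()
```

Because every replacement makes the text two characters longer, a run over a 4.5 MB content.xml forces a lot of string reallocation, which is the point of choosing 'zoo' over 'z'.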

Copyright 2007 Genii Software Ltd.

What has been said:

599.1. Nathan T. Freeman
(06/12/2007 07:02 AM)

Consolidate every document in last year's Partner Forum into one long ODT.

By the way, are you using a standard parsing library? A 3rd party tool? How are you doing it?

Oh, and you should compare how long content.xml takes to load into the NotesSAXParser as well.

599.2. Ben Langhinrichs
(06/12/2007 07:25 AM)

Nathan - I am using a highly customized version of an open source parser which I adapted for my own use (a very liberal open source license is part of the reason, highly efficient and understandable C code is the other). I am also using a customized (but not as heavily) version of the zlib library.

I thought about the SAX parser, but it would be apples and oranges. Since OpenSesame uses a full tree structure, and needs to for the kind of processing I do, there isn't much point in comparing with a SAX parser. Part of my goal is seeing how the total memory impact works out for large documents, and that is only an issue with a DOM tree.

As for consolidating documents, I may well do that, but I think I will use the Designer Help, as it has many more links and images and tables, which the partner forum does not. Still, I like the general idea. - Ben

599.3. Ben Langhinrichs
(06/12/2007 07:40 AM)

OK, just for curiosity's sake, I tried this with the SAX parser, and it took 7 seconds, but it also leaked memory massively, which neither the NotesDOMParser nor OpenSesame did. Interesting.

I am doing this in Notes 6.5.3, by the way. I'll try again with the Beta 3, but was determined to do tests with a production version of Notes to be fair.