Genii Weblog
Rich Text 101 - Rich Text Itself
Tue 7 Oct 2003, 11:59 PM

Wait! Don't adjust your picture! I really did mean to title this article "Rich Text Itself". Instead of talking about a specific rich text construct, I thought I'd address the idea of rich text itself, and especially where it is used in Lotus Notes/Domino. I should warn you up front, this may be the most technical and confusing of the Rich Text 101 topics, but for completeness' sake, it needs to be covered. Don't worry, there won't be a quiz at the end.
Where is rich text used? Rich text is often thought of as the contents of a rich text field, such as the body of a mail message or forum post or an article such as this. While that is true, it misses the full scope of rich text and its importance to Notes/Domino. Whereas the "note" is the fundamental building block of the Lotus Notes database structure, with every document and design element made up of a note of one type or another, "rich text" is the fundamental building block used inside the "note". Forms are notes with a special rich text field called $Body, which contains the design of the form. Similarly, Subforms and Pages have a $Body rich text field containing their design, Image Resources have a $ImageData rich text field containing the image data, File Resources and Style Sheets have a $FileData rich text field containing the file resource data, Frame Sets have a $FrameSet rich text field containing the frame set data, Shared Fields have a $Body rich text field with the field definition, and even Shared Actions have a $Body rich text field with design data. Other design elements such as Agents, Applets, Navigators, Outlines and Views use other forms of data storage rather than rich text fields, but you can see that rich text is used widely.
What is rich text, really? What we call "rich text" is known internally as "composite data records" or "CD records". It is merely a stream of record structures with a standardized header that defines where the record structure begins, what type of record it is, and how many bytes it uses. Some record structures are self-contained, while some have data stuffed after the record in the stream. For example, a CDTEXT record holds text, but the record itself is only 8 bytes, covering the color, font, attributes and point size along with the header containing the "signature" and length. The actual text is kept in the stream after the CDTEXT record.
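To make the "stream of records" idea concrete, here is a minimal sketch in Python. It is not the Notes C API: the 4-byte header (a 2-byte signature followed by a 2-byte total length), the signature values and the helper names are all simplifying assumptions for illustration. The text record mirrors the CDTEXT description above, with 8 fixed bytes followed by the text itself.

```python
import struct

# Assumed simplified header: 2-byte signature, 2-byte total record length,
# little-endian. Real CD records are more varied than this.
HEADER = struct.Struct("<HH")

# Hypothetical signature values, chosen only for this example.
SIG_PARAGRAPH = 0x0081
SIG_TEXT = 0x0085


def make_record(signature, payload=b""):
    """Build one record: header plus whatever bytes follow it."""
    return HEADER.pack(signature, HEADER.size + len(payload)) + payload


def make_text_record(font_id, text):
    """Header (4 bytes) + font info (4 bytes) = 8 fixed bytes, then the text."""
    return make_record(SIG_TEXT, struct.pack("<I", font_id) + text.encode("ascii"))


def walk_records(stream):
    """Yield (signature, payload) for each record, using the length to step."""
    offset = 0
    while offset < len(stream):
        signature, length = HEADER.unpack_from(stream, offset)
        if length < HEADER.size or offset + length > len(stream):
            raise ValueError("corrupt record at offset %d" % offset)
        yield signature, stream[offset + HEADER.size:offset + length]
        offset += length


stream = make_record(SIG_PARAGRAPH) + make_text_record(0x0A000001, "Hello")
records = list(walk_records(stream))
```

The reader never needs to understand a record's contents to find the next one; the length in the header is enough.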
Why are structures used instead of a more standard format? The reason for this storage format is that it is extremely easy and fast to load the bytes into actual records in memory. Since CD records are a completely proprietary storage mechanism, documented but owned by IBM and under their complete control, there is no need for the generalized data structures inherent in "standard" formats such as HTML or XML. Of course, as computers have become faster and standards have become available, there is a lot of push to use more standard formats to store data. One of those formats is MIME/HTML (see my Rich Text 101 article on MIME/HTML). Another option would be XML.
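The "easy and fast to load" point can be sketched with Python's ctypes module, which lets a block of bytes be laid directly over a structure with no parsing step. The structure below is a hypothetical 8-byte record (signature, length, font information), not a definition taken from the Notes C API, and the byte values are made up for the example.

```python
import ctypes


class TextRecordHeader(ctypes.LittleEndianStructure):
    # Hypothetical layout for illustration: 2 + 2 + 4 bytes. These fields
    # are naturally aligned, so no padding is inserted.
    _fields_ = [
        ("signature", ctypes.c_uint16),
        ("length", ctypes.c_uint16),
        ("font_id", ctypes.c_uint32),
    ]


def load_record(raw):
    """Copy the stored bytes straight into the structure; no parsing needed."""
    return TextRecordHeader.from_buffer_copy(raw)


# Eight stored bytes, little-endian (example values only).
raw = bytes([0x85, 0x00, 0x0D, 0x00, 0x01, 0x00, 0x00, 0x0A])
rec = load_record(raw)
```

Compare this with HTML or XML, where the same information would have to be tokenized and converted from text before it could be used.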
How can record structures be used on other operating systems? It may occur to some to wonder how this works between operating systems. Record structures are packed on some systems and even-byte aligned on others, while integer sizes vary as well. This brings up one of the oddest aspects of CD record storage. The actual format is the packed, byte-aligned format available on Intel-based platforms such as Windows and OS/2 (the original OS for Notes). On those systems, the records can be used directly. On other operating systems such as AS/400 and AIX, a conversion must happen to move the data into appropriately sized and positioned structures. This adds a small amount of overhead, but since the reading into memory happens just once as the form or field is loaded, it is not a major issue. The bytes are shifted, expanded or otherwise forced into the appropriate record structure, which can then be used as it is. This is mostly only an issue for API developers, but as more people try to use the C API from LotusScript, it is an important consideration.
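The conversion step can be illustrated with Python's struct module, which distinguishes a packed little-endian layout (the "<" prefix) from the host's native byte order and alignment (the "@" prefix). The three-field record here is hypothetical, picked because its packed form is 7 bytes while a native layout typically adds padding. Treating the stored format as little-endian is an assumption based on its Intel origins.

```python
import struct

# Hypothetical record: a 1-byte flag, a 2-byte count, a 4-byte size.
CANONICAL = "<BHI"  # stored form: little-endian, no padding between fields
NATIVE = "@BHI"     # in-memory form: host byte order and host alignment


def canonical_to_host(raw):
    """Unpack the stored bytes and repack them in the host's own layout."""
    flag, count, size = struct.unpack(CANONICAL, raw)
    return struct.pack(NATIVE, flag, count, size)


def host_to_canonical(raw):
    """The reverse trip, needed before the record is written back."""
    flag, count, size = struct.unpack(NATIVE, raw)
    return struct.pack(CANONICAL, flag, count, size)


stored = struct.pack(CANONICAL, 1, 2, 3)
in_memory = canonical_to_host(stored)
```

On a little-endian host the only difference is padding; on a big-endian host the bytes within each field are also swapped. Either way the values survive the round trip, which is all the conversion has to guarantee.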
What happens when record structures change in future releases of Notes? This touches on one of the most amazing features of Notes. From the very beginning, the design of Notes has ensured that all rich text, both as design and as content, is backward and forward compatible. Rich text created in Notes R1 can be read and edited and rendered to HTML by Domino. Similarly, a Notes R1 client can read and display rich text created in Notes 6, albeit only with whatever structures were available in R1. This is unprecedented in the database world, where data has to be normalized and restructured for every table change. Microsoft does a horrible job of reading Word 97 documents in Word 2000, and there is no way for Word 97 to read Word 2000 documents. Similar things can be said about almost every database or word processing product/company.
While this is a phenomenal feat, the way it is accomplished is also a key to its great weakness. Every time a major change is made to rich text, another level of redundancy or complexity is added. Prior to R5, tables were allowed in rich text, but could never be nested. When nested tables were added in R5, a different set of CD structures and CD "signatures", the IDs which identify the type of record, had to be added so that the inside tables would be ignored by earlier versions of Notes. Instead of a fairly simple series of CDTABLEBEGIN, multiple CDTABLECELLs and a CDTABLEEND, there can now be CDNESTEDTABLEBEGIN, CDNESTEDTABLECELL and CDNESTEDTABLEEND records. In addition, when tabbed tables were implemented in R5, a new CDPRETABLEBEGIN record had to be added to contain the additional information, since the CDTABLEBEGIN record did not have enough room.
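A small sketch shows why new signatures keep older readers working. The record names come from the paragraph above, but the numeric values, the 4-byte header and the reader itself are assumptions made up for this example, not how Notes is actually implemented. A reader acts on the signatures it knows and uses the length field to step over everything else.

```python
import struct

HEADER = struct.Struct("<HH")  # assumed: 2-byte signature, 2-byte total length

# Hypothetical numeric values; only the names come from the article.
CDTABLEBEGIN, CDTABLECELL, CDTABLEEND = 1, 2, 3
CDNESTEDTABLEBEGIN, CDNESTEDTABLECELL, CDNESTEDTABLEEND = 11, 12, 13

OLD_READER = {CDTABLEBEGIN, CDTABLECELL, CDTABLEEND}
NEW_READER = OLD_READER | {CDNESTEDTABLEBEGIN, CDNESTEDTABLECELL, CDNESTEDTABLEEND}


def record(signature, payload=b""):
    return HEADER.pack(signature, HEADER.size + len(payload)) + payload


def read_known(stream, known):
    """Collect signatures the reader understands; skip the rest by length."""
    seen = []
    offset = 0
    while offset < len(stream):
        signature, length = HEADER.unpack_from(stream, offset)
        if length < HEADER.size:
            raise ValueError("corrupt record at offset %d" % offset)
        if signature in known:
            seen.append(signature)
        offset += length  # unknown records are stepped over, not rejected
    return seen


# An outer table with one cell that contains a nested one-cell table.
stream = b"".join([
    record(CDTABLEBEGIN),
    record(CDTABLECELL),
    record(CDNESTEDTABLEBEGIN),
    record(CDNESTEDTABLECELL),
    record(CDNESTEDTABLEEND),
    record(CDTABLEEND),
])

old_view = read_known(stream, OLD_READER)
new_view = read_known(stream, NEW_READER)
```

The old reader sees a well-formed simple table and nothing else. The flip side is that any code written to look only for CDTABLECELL stays blind to nested cells until it is taught the new signatures.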
While these are easily handled by the rich text editor (what you see when you open a rich text field or form in the Notes client), they are not so easily handled by either third-party products or by LotusScript classes. Even in Notes 6.5, the NotesRichTextNavigator class cannot see or handle nested tables, because internally it was implemented to recognize CDTABLECELL as the start of a table cell, but doesn't know how to handle CDNESTEDTABLECELL. What is worse, if a rich text field is inside a table on the form, any table in that rich text field is made entirely of CDNESTEDTABLECELLs, and none of the tables can be seen by the NotesRichTextNavigator.
By the way, I don't mean to suggest that our Midas Rich Text LSX has not also had to adjust. Midas was created for R4.1, and updated for R4.5 and R4.6 as they came along, but it took eight months of development to handle the changes for R5, even though the Midas methods looked exactly the same in R5 as in R4.6.
The addition of new record structures to handle new features sometimes means that Notes is less efficient about storage than it would be if it were designed from scratch now. Take a rich text field created in 4.6 and re-save it in R5, and it often grows by about 15% to 20%, just from the additional structures. The growth in ND6 is less significant, more like 5% to 10%, but that is on top of the growth from R5. Rich text is starting to look pretty bloated. Unfortunately, there is little that can be done unless IBM wants to give up the wonderful strength of backwards compatibility.
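To see how the two rounds of growth stack up, here is a quick calculation. It assumes the ND6 percentage applies to the already-grown R5 size, which is how I read "on top of" above. The function name is just for this example.

```python
def compounded_growth(*rates):
    """Total fractional growth after applying each rate in turn."""
    size = 1.0
    for rate in rates:
        size *= 1.0 + rate
    return size - 1.0


low = compounded_growth(0.15, 0.05)   # 1.15 * 1.05 = 1.2075, about 21% larger
high = compounded_growth(0.20, 0.10)  # 1.20 * 1.10 = 1.32, about 32% larger
```

So under that assumption, a field that started life in 4.6 could end up roughly 21% to 32% larger in ND6 without a single word of content changing.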
A quick quiz? Just kidding. While this topic is complex and messy, it should help to explain some of the oddities of rich text, such as how it manages to work between releases, and why it keeps growing. As for the puzzles this doesn't solve, I address many of those in my other Rich Text 101 articles, and I encourage you to read them as well, and to watch for future articles. Let me know if there are topics you want to see, or anything you want covered in more (or less) depth. Cheers!