The Duplicate Folders Mystery, Part I

The Folder Is a Lie

To paraphrase Douglas Adams: Documentum is big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. All that bigness can lead to some unexpected behavior as gears deep in the belly of the clockwork monstrosity grind against each other. Sometimes the impossible happens.

I’m going to tell you a story–it’s a mystery about folders and databases and conclusion jumping and the relativism of obviousness, but appreciating the story requires some understanding of how folders really work. Be warned: things underneath look completely different from how they appear on the surface.

Documents and folders make up the bulk of the visible universe in Documentum. One thing they have in common is the need to “be somewhere”, inside a folder or a cabinet. Here’s where things start going Looking Glass: An object keeps a list of the folders and cabinets that contain it. Most people would expect parents (cabinets or folders) to keep lists of their children (folders and documents), but no! There’s no mammalian parental love here; children must fend for themselves.

That list is stored in a repeating attribute called i_folder_id. It contains one or more of Documentum’s internal identifiers, each being the r_object_id of one of the object’s parents. (Every Documentum object has an r_object_id, including plenty of “dark matter” objects that casual users never see.) Object IDs are those sixteen-digit hexadecimal numbers. They’re great for the system because they’re guaranteed unique and never get reused, but not so nice on human eyes. In fact, the “i_” prefix here is Documentum’s shorthand for saying this is an internal attribute and people really shouldn’t look at it–and they absolutely shouldn’t ever try to change it themselves. Unless they’re mad, totally mad! Bwah-hahahaha! But I digress.

There are some good reasons to do it this way. Repeating attributes can really kill performance if they get too big, so it’s better to have a bunch of small lists on the children than a few really long lists on the parents. It also makes more sense to deal with containment on the child when you start thinking about other behaviors like change tracking, permissioning, and versioning. This is also why the folder metaphor really falls apart when things start getting interesting in a document management sense.

It does create some problems, the biggest of which is that it’s very hard to use a single query to walk back up this list or find things at an arbitrary depth. Walking back up a reverse linked list is an iterative (procedural) process, something that (declarative) query languages don’t do very well, which is why database stored procedures exist, by the way. Documentum can’t make the folder metaphor work at all without something besides i_folder_id.
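To make the iterative part concrete, here is a minimal Python sketch of walking up parent lists. It is a toy model, not Documentum code: the dictionary stands in for i_folder_id, and the ids ("report", "plans", "private", "cabinet") are made up for illustration.

```python
# Toy stand-in for i_folder_id: each object id maps to the ids of its
# immediate parents. All ids here are hypothetical.
I_FOLDER_ID = {
    "report": ["plans"],
    "plans": ["private"],
    "private": ["cabinet"],
    "cabinet": [],  # a cabinet has no parent
}


def ancestors(object_id, parents=I_FOLDER_ID):
    """Collect every container above object_id, one hop at a time."""
    seen = set()
    frontier = list(parents[object_id])
    while frontier:  # the loop a single flat query cannot express
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(parents[current])
    return seen


print(sorted(ancestors("report")))  # ['cabinet', 'plans', 'private']
```

Each pass through the loop is one more lookup, and the number of passes depends on how deep the object sits, which is exactly what you can’t know when writing the query.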

The solution was for each folder to keep a list of its own explicit paths–one kind of location in current webtop speak–in another repeating attribute called r_folder_path. Unlike i_folder_id, this is something you can see very easily by choosing “View > Locations” and looks like “/John Kominetz/Private Documents/World Domination Plans”. If you know that you’re looking for things in that exact location, you can write a query that finds them in a snap by adding where folder('/yadda/yadda/yadda') to it. No arcane object IDs or iterative processes required. It’s even what makes it easy to “do a descend”, as in folder('/yadda/yadda/yadda', descend), and find everything inside all the other folders inside the folder at “/yadda/yadda/yadda”.

This works only if each explicit path is unique like a phone number or a mailing address. No two child folders in the same parent can have the same name. Documents don’t care; you can (and people often do) have hundreds of “report.xls” documents in the same folder. They also don’t have r_folder_paths–only folders and cabinets do–which is why they can get away with that. (If you think they should, then you need to think about versioning and how enforcing unique document names would make things really unpleasant.) So Documentum makes sure you can’t have two folders in the same location with the same name. Not unless something goes horribly, horribly wrong.

I mentioned both attributes are repeating–they’re lists of values–which means that one thing can be in more than one place at a time. It’s more like the UNIX idea of a hard link than a Windows shortcut. The latter is really a separate file that points to another file which is really only in one place. (Documentum does have something like a shortcut, but it’s for pointing to objects in different docbases.) Here again the traditional folder metaphor breaks down and leads to confusion. Some users say “link” and mean the additional locations where they put something. Except for one very special case during object creation, every location is a link and no one link is more significant than another.

There’s another consequence to these two lists that will again seem irrational to the uninitiated. Most people would expect that both lists would have the same number of values. If i_folder_id tells me this folder is linked to two parents, then it should have two folder paths, right? Wrong. Let’s say folder (A) is in two folders (B and C) and each of those folders is in two cabinets (D and E, F and G respectively).

Here’s what i_folder_id looks like on folder A:

  1. B.r_object_id
  2. C.r_object_id

Here’s what r_folder_path looks like on folder A:

  1. /D/B/A
  2. /E/B/A
  3. /F/C/A
  4. /G/C/A

This gets back to the fact that i_folder_id is the independent variable, the true representation of where something is. Documentum derives r_folder_path from i_folder_id. When you save a folder, it populates r_folder_path by getting all the r_folder_paths of its immediate parents and stapling its name onto the end of each. Assuming the r_folder_paths on its immediate parents are correct, it’s a great optimization to avoid having to walk up who-knows-how-many levels of that reverse linked list. Caveats and aphorisms about assumptions do apply.
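That derivation can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the server’s implementation: the function name derive_paths is made up, folder names double as ids to keep the example short, and a cabinet is assumed to have the single path “/” plus its name.

```python
def derive_paths(name, parent_ids, folder_paths):
    """Build a folder's paths by appending its name to each parent path."""
    if not parent_ids:
        return ["/" + name]  # assumption: a cabinet sits at the root
    return [
        parent_path + "/" + name
        for parent_id in parent_ids
        for parent_path in folder_paths[parent_id]
    ]


# Rebuild the example: A is in B and C; B is in cabinets D and E,
# C is in cabinets F and G.
folder_paths = {cab: derive_paths(cab, [], {}) for cab in "DEFG"}
folder_paths["B"] = derive_paths("B", ["D", "E"], folder_paths)
folder_paths["C"] = derive_paths("C", ["F", "G"], folder_paths)
folder_paths["A"] = derive_paths("A", ["B", "C"], folder_paths)

print(folder_paths["A"])  # ['/D/B/A', '/E/B/A', '/F/C/A', '/G/C/A']
```

Note that A is computed from the paths already stored on B and C and nothing higher up. If those stored paths were wrong, A would faithfully inherit the mistake.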

The server also verifies during save that there aren’t any other folders in the same place with the same name. The save is an atomic operation, meaning that it either succeeds completely or fails completely. There’s no danger of having the folder left in some horrible transitional state like a Brundle folder. The server does this work, and it tells clients when things like saves fail so they (and their users) have a chance to correct the problem and try again.
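Here is a rough Python sketch of that check-then-save behavior. How the server really does it isn’t covered here; save_folder, DuplicateFolderError, and the store dictionary are all hypothetical names used only to show the shape of an all-or-nothing save.

```python
class DuplicateFolderError(Exception):
    """Raised when a path is already taken by a different folder."""


def save_folder(store, folder_id, new_paths):
    """Save new_paths for folder_id, or change nothing at all.

    store maps a folder id to its list of paths (a toy r_folder_path).
    """
    taken = {
        path
        for other_id, paths in store.items()
        if other_id != folder_id
        for path in paths
    }
    clashes = [path for path in new_paths if path in taken]
    if clashes:
        # Fail before touching the store, so no half-saved state exists.
        raise DuplicateFolderError(clashes)
    store[folder_id] = list(new_paths)


store = {"f1": ["/Cab/Plans"], "f2": ["/Cab/Notes"]}
try:
    save_folder(store, "f2", ["/Cab/Plans"])
except DuplicateFolderError as err:
    print("save rejected:", err)
print(store["f2"])  # still ['/Cab/Notes']
```

The sketch gets its all-or-nothing behavior simply by validating everything before writing anything. A real server has to do the same job under concurrent access, which is a much harder problem.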

One final point here. Have you ever noticed how long it takes to save a cabinet or high-level folder with lots of children when you change its name or link/unlink it? That’s because the save has to update all the r_folder_path strings of all of the folders it contains as well as its own r_folder_paths. That cascade update is also hitting a repeating attribute, and repeating attributes are notoriously unforgiving on performance when poked en masse like this. (Nested groups had a similar problem until Documentum deprecated dm_group’s equivalent to r_folder_path, i_all_users_names.) It takes so long because you’re not just updating that one object, you’re updating every single folder it contains! Either be patient or get the name right before filling it up.
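A toy Python version of that cascade shows why the cost scales with the number of contained folders. Again, this is a sketch under assumptions: rename_prefix and the store dictionary are invented for illustration and say nothing about how the server actually performs the update.

```python
def rename_prefix(store, old_prefix, new_prefix):
    """Rewrite every stored path at or under old_prefix.

    Returns how many folders had to be updated.
    """
    touched = 0
    for folder_id, paths in store.items():
        updated = [
            new_prefix + path[len(old_prefix):]
            if path == old_prefix or path.startswith(old_prefix + "/")
            else path
            for path in paths
        ]
        if updated != paths:
            store[folder_id] = updated
            touched += 1
    return touched


store = {
    "cab": ["/Old"],
    "f1": ["/Old/x"],
    "f2": ["/Old/x/y"],
    "other": ["/Older/z"],  # shares a string prefix but is not inside /Old
}
print(rename_prefix(store, "/Old", "/New"))  # 3
print(store["f2"])  # ['/New/x/y']
```

One rename, three objects rewritten. Scale that up to a cabinet holding tens of thousands of folders and the wait starts to make sense.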

Congratulations! You have now seen the man behind the curtain. This is how Documentum creates the illusion of folders and why sometimes the metaphor breaks down. Pretty clever really, but that’s exactly why it took me three years to solve the case of the duplicate folders, coming in Part II.