
Spandex... it's a privilege, *not* a right

I've had an idea that's been kicking around in my head for a few years now. But I think I've started thinking about this semi-seriously recently. I certainly have no time to implement such an idea, but it's something that has intrigued me for quite a while -- it's an itch that I'd really like to scratch.

This is not necessarily a product or a get-rich-quick-killer-app, but it's something that bothers the crap outta me, and I wish I had a better tool.

The basic premise is that I'd like to have a proper knowledge management solution for my e-mail.

Mail clients today do not offer flexible enough filing systems for mail messages. Specifically, the concept of "mail folders" is no longer good enough for a society that has become highly dependent upon e-mail. Users tend to create large, complex hierarchies of folders reflecting intricate filing systems that inevitably contain inadvertent (and typically widely disparate) redundancies. For example, at the time of this writing, I have 392 folders in my personal mail store. Such complex folder hierarchies are now so commonplace that most people think that they're "good enough", when actually they just aren't aware that it can be better.

Indeed, the whole concept of an electronic file folder is modeled off the physical reality of a Manila folder in a filing cabinet. You take a memo (i.e., piece of paper), put it in a single folder, which goes in a single drawer, which goes in a single cabinet, which goes in a single row of cabinets, and so on. This means that there is one path to get to that particular piece of information. You can photocopy that paper and put it in other folders to make multiple paths to the information, but that's pretty inefficient and you have obvious problems such as what happens if someone updates the original memo? You then have to go update each copy -- which could be pretty labor-intensive.

While this is a perfectly valid and reasonable approach to filing information, I claim that such a limiting system (i.e., only one path to a given piece of information) is not necessary in the electronic world. Indeed, this limitation is based on a physical model -- why carry it over to the electronic world?

Instead, look at the collection of your mail as a knowledge repository. It contains vast stores of information. The challenge is not only to keep this information organized so that you can quickly find the data that you need, but also to be able to dynamically change the filing system as the need arises.

Tenet #1: Provide multiple paths to information.

A basic precept of many Knowledge Management (KM) solutions today is that information should be reachable by many different paths. In order to accomplish this, one must look at filing information "from the other way" -- instead of putting large amounts of information in a filing system, attach large numbers of filing systems to each piece of information.

For example, say you receive an e-mail from your friend Bob in Human Resources. Bob's mail tells you the specifics of a job opening in the Finance department that you are interested in, but also makes a friendly wager of $10 on the outcome of a football game this weekend.

How do you file this message? There's at least two different ways to look at it -- job related and personal. You obviously want to keep both pieces of information and be able to find them later. Do you file it under "job prospects" or "personal:bets with bob"? Clearly, you'd really want to file it under both.

Granted, under today's mail clients, although you can copy a message and put it in both places, the underlying assumption is that you'll normally file a message in one location -- so copying/re-filing messages is not as easy as it should be. Regardless, abstractly speaking, you've now got two copies of the information, and simply made distinct single paths to each. In a more practical sense -- you've now doubled the storage required for that message. What if Bob had attached a 2MB document detailing the Finance job? In days of shrinking IT budgets, you've just used 4MB of your valuable personal disk quota simply because you want to find the information two different ways.

Instead, it would be much more efficient (and potentially easier on the user) to file the one message in both places. You should not be penalized (in terms of space, for example) for wanting to find the same piece of information in multiple ways. Not only should it be possible, it should be trivial to file that mail in multiple places.

Multiple paths to information are also important because of the passage of time. What seems like a logical way to file information today may be completely forgotten tomorrow. So if a user files a message in many different ways today, their chances of finding it tomorrow are much greater because it may be found in multiple different places (as opposed to finding the one-and-only-one location where that information was filed).

Keep in mind that I'm talking about features that don't [yet] exist -- bear with me, and assume that the actions that I'm talking about will have a simple and easy-to-understand graphical interface for users to use.

Tenet #2: Separate the filing system from the information.

Most users' mail folders hierarchy started off as a small, simple set of folders. But it evolved over time -- new messages arrived that didn't quite fit into the existing neat and clean division of folders, so add a folder here, add another there, etc. After a while, adding folders in this piecemeal fashion results in a "big ball of mud" -- kludge after kludge after kludge inevitably results in inadvertent redundancies, ambiguities, and downright misplaced messages and folders.

A user's mail folders hierarchy becomes so large and complex that it effectively becomes a "legacy system". Even if the user wants to reorganize everything into a "better" filing system, it would take enormous amounts of time and effort to do so because each message would have to be examined, re-categorized, and then dragged-n-dropped in from the old filing system to the new. Hence, people tend to stay with their "big ball of mud" model, even if they know that it's inadequate, inefficient, or otherwise sub-optimal.

Specifically: with mail folders, the information is in the filing system rather than the other way around. Instead, the filing system should be completely divorced from the data that it contains. As with tenet number 1, information is king -- the filing system (although it can be considered information itself) should not only be dynamic and changeable, it is definitely secondary to the information itself.

Using such an approach would allow users to reorganize all of their mail without significant effort. Granted, it would still require some effort on the user's part (there's no such thing as a free lunch, after all), but the threshold of effort is significantly less if the filing system can be created and destroyed at will with no risk of loss of information. Consider: with mail folders, if you destroy the filing system, you may accidentally destroy information as well. If the filing system is totally separate from the information, then accidental data loss cannot occur.

With a separated filing system, not only can the information be reorganized on the fly, there can also be multiple simultaneous filing systems. Consider the example from above -- Bob's mail to you about a job posting in Finance and a friendly wager on this weekend's game. Taking that mail and attaching two different filing systems to it would allow you to file it under both "job prospects" and "personal:bets with bob" -- yet still only be a single message (rather than two copies of the same message in two different file folders).
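
To make the idea a little more concrete, here is a minimal Python sketch of "filing by reference". It is only an illustration under stated assumptions: the `MailStore` class, its method names, and the message id "bob-1" are all made up for this example and do not come from any existing mail client.

```python
from collections import defaultdict


class MailStore:
    """Hypothetical in-memory store: each message is kept once, categories point at it."""

    def __init__(self):
        self.messages = {}                      # message id -> message text (stored once)
        self.by_category = defaultdict(set)     # category name -> set of message ids

    def add(self, msg_id, text):
        self.messages[msg_id] = text

    def attach(self, msg_id, category):
        # Filing only records a reference; the message itself is never copied.
        self.by_category[category].add(msg_id)

    def detach(self, msg_id, category):
        self.by_category[category].discard(msg_id)

    def drop_category(self, category):
        # Destroying part of the filing system never touches the messages.
        self.by_category.pop(category, None)

    def in_category(self, category):
        return [self.messages[i] for i in sorted(self.by_category.get(category, ()))]


store = MailStore()
store.add("bob-1", "Finance job opening, and $10 on this weekend's game?")
store.attach("bob-1", "job prospects")
store.attach("bob-1", "personal:bets with bob")

store.drop_category("job prospects")            # reorganize at will
print(len(store.messages))                      # the message itself is still there
print(store.in_category("personal:bets with bob"))
```

Because categories hold only references, filing Bob's message twice costs one extra set entry rather than a second copy of the message and its attachment.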

Another common example is outgoing e-mail. With current mail systems, there is an arbitrary (IMHO) separation between incoming and outgoing mail. If you want to group all the messages of a given conversation together, you have to move or copy your outgoing messages out of your "sent mail" folder into the destination folder where you stored Bob's incoming messages (or some variation on this, such as CC'ing yourself, etc.). If you separate the filing system, you can simply select to see "all messages to or from Bob" -- there's no more artificial separation between incoming and outgoing e-mail. Of course, it is trivial to select one or the other if the user wants to -- outgoing mail is every message that has a "From" of their address, and incoming mail is everything else. It's just important that the possibility of combined listings becomes available under a separated filing system.

Tenet #3: Let the computer do the menial work.

E-mail has become so important to industry and society that users are often flooded with incoming mail every day. Answering and keeping up with e-mail has become a significant portion of people's jobs. This results in some of the problems described above (e.g., the "big ball of mud" approach to organizing e-mail). One way to help is to let the computer handle as much of the menial work associated with e-mail as possible.

Rules and filters are two common features in e-mail clients today. These are actually Good Things. Unfortunately, few users actually understand or use them. And even among those who do, there is always the fear that important e-mails will get lost or otherwise go unnoticed. This goes back to the fact that the mail folders and filing system is [currently] the primary concern -- the actual individual e-mails are not the focus.

Using a separated filing system with the concept of rules, filters, and scoring will help make them "easy" from the user's perspective. Specifically, a separated filing system can guarantee that no e-mail will ever be lost, and that using rules/filters/scoring will actually increase the possibility that important e-mails will be noticed. The goal should be to make it common to have lots of rules/filters -- the more, the merrier. Indeed, let the computer mark each incoming (and outgoing!) e-mail in 20 different relevant categories such that there are now 20 different ways for the user to notice that important mail, not just one (i.e., the old concept of an "inbox"). Filters can then be used to ensure that the important e-mail -- even though it shows up in 20 different categories -- is only brought to the user's attention and viewed once (rather than 20 times).

Granted, some interface work and user education will probably have to take place to make rules and filters understandable to most users, but re-orienting the filing system will guarantee that no e-mail will ever be lost due to faulty rules (something that is not necessarily true today). This may help reluctant users to "take the plunge" and actually start using rules/filters.

Of course, users should also be able to manually categorize/organize a message. Even though much of the processing can happen automatically, there will always be a need to manually classify a specific message, and/or re-designate a given message to be in a new (set of) category(ies).

Combining all these ideas together, the end result is that all messages (both incoming and outgoing) can be automatically categorized and organized when they arrive. Important messages can actually filter up to the top. Messages can be fully reorganized and recategorized on the fly. Arbitrary searches can be executed to find any given message with no searching restrictions. Searches can be performed on results of other searches. And so on.

The point is that this would be a fundamentally better filing system -- one that is flexible and powerful. Of course, it can be simplified down for those who don't want that kind of power (e.g., Gramma, who only gets 2-3 e-mails a day). Indeed, the entire "mail folder" concept can be fully emulated with the ideas described above. But for those who need it, this gives them a much better toolset to organize the information contained in their e-mail.

So let's make this a little more specific. From the user's perspective, let's define a few terms before we start talking about features and capabilities of such a system:

  • Category. A category is essentially what most people currently think of as a mail folder. Categories have names and are hierarchical. For example:

    • mailing lists

    • mailing lists:LAM users list

    • mailing lists:53 listserv

    • job postings

    • job postings:finance

    • job postings:finance:northeast

    • job postings:finance:southeast

    • job postings:accounting

    (where the ":" character separates categories and sub-categories)

    However, a huge difference between categories and mail folders is that any number of categories can be associated with each message. So Bob's e-mail to you about the job posting in finance may actually be in "job postings", "job postings:finance", and "job postings:finance:northeast" (the job is in Pennsylvania).

  • View. A view is essentially the result of a search. Views are named as well, but begin with the special character "#". Like categories, views are hierarchical -- sub-views are further searches on the parent view. Here's some examples of common views, and descriptions of them:

    • #sent mail - all messages sent by me

    • #sent mail:to bob - all messages sent by me to Bob

    • #sent mail:to bob:this week - all messages sent by me to Bob during the past 7 days

    • #sent mail:yesterday - everything that I sent yesterday

    • #mail with bob - any message to or from Bob

    • #yesterday - all mail sent and received yesterday

    • #yesterday:sent mail - same as "#sent mail:yesterday"

    Again, the ":" character separates views and sub-views.

    Note that views are continually updated -- they are not the results of a one-time search. So when you send a mail to Bob, it immediately shows up in #sent mail, #sent mail:to bob, and #mail with bob. This completely destroys the artificial separation between outgoing and incoming mail -- users can now view entire conversations (including their own replies) with ease.
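
One possible way to picture views is as named, composable predicates that are re-evaluated against the current pool of messages each time they are shown. The sketch below assumes exactly that; the `View` and `Message` classes, the `ME` address, and the fixed "today" date are hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

ME = "me@mycompany.com"          # assumed address of the mailbox owner
BOB = "bob@mycompany.com"


@dataclass
class Message:
    sender: str
    recipients: tuple
    sent_on: date
    subject: str


class View:
    """A named predicate; a sub-view narrows its parent with a further predicate."""

    def __init__(self, name, predicate, parent=None):
        self.name = name if parent is None else f"{parent.name}:{name}"
        self.predicate = predicate
        self.parent = parent

    def matches(self, msg):
        if self.parent is not None and not self.parent.matches(msg):
            return False
        return self.predicate(msg)

    def subview(self, name, predicate):
        return View(name, predicate, parent=self)

    def messages(self, pool):
        # Evaluated against the current pool every time, so the view is never stale.
        return [m for m in pool if self.matches(m)]


today = date(2002, 9, 23)        # fixed so the example is reproducible
sent_mail = View("#sent mail", lambda m: m.sender == ME)
sent_to_bob = sent_mail.subview("to bob", lambda m: BOB in m.recipients)
mail_with_bob = View("#mail with bob", lambda m: BOB in (m.sender, *m.recipients))
yesterday = View("#yesterday", lambda m: m.sent_on == today - timedelta(days=1))

pool = [
    Message(BOB, (ME,), date(2002, 9, 22), "Finance job posting"),
    Message(ME, ("alice@mycompany.com",), date(2002, 9, 22), "Lunch?"),
]
pool.append(Message(ME, (BOB,), today, "Re: Finance job posting"))  # a new outgoing mail

print(sent_to_bob.name, len(sent_to_bob.messages(pool)))
print([m.subject for m in mail_with_bob.messages(pool)])
```

The reply to Bob shows up in "#sent mail:to bob" and "#mail with bob" as soon as it is in the pool, with no re-filing step. A real implementation would presumably push these predicates down into database queries rather than scanning a Python list.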

Categories and views can be navigated and browsed just like a conventional mail folder tree. So the usage scenarios are actually fairly similar to existing mail clients.

Since nothing like this exists in current client software, remember to take it on faith for the moment that we can make a nice, easy to use interface to support all this functionality. Use your imagination. :-)

The main use of rules will be to assign categories (other actions are also possible, such as deleting). Rules can search / match any aspect of a new message (incoming or outgoing) and assign categories as appropriate. It will be common to have lots of rules. For management purposes, rules can also be named (starting with the special character "%") and be hierarchical. Here's some examples:

  • %bob - matches any message that has a From, To, CC, or BCC of bob@mycompany.com.

  • %bob:bets - any message that is to or from Bob and contains the word "bet" or "wager", assign the category "personal:bets with bob" to it.

  • %bob:jobs - any message that is to or from Bob and contains the words "job posting", assign the category "jobs"

  • %bob:jobs:finance - any message that is to or from Bob and contains the words "job posting" and "finance", assign the category "jobs:finance"

  • %spam - delete any message that has a subject beginning with "ADV:"

  • %spam from foo.com - delete any message that relays through the "foo.com" mail server
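
The rule examples above could be modeled roughly as follows. This is a sketch, not a proposed design: the `Rule` and `Mail` classes and the `contains` helper are invented for illustration, and the assumption that a sub-rule fires only when all of its parent rules match is one reading of "hierarchical".

```python
import re
from dataclasses import dataclass, field

BOB = "bob@mycompany.com"


@dataclass
class Mail:
    sender: str
    recipients: list
    subject: str
    body: str
    categories: set = field(default_factory=lambda: {"inbox"})  # default category
    deleted: bool = False


class Rule:
    """Hypothetical rule: a sub-rule only fires if every ancestor rule matches too."""

    def __init__(self, name, match, category=None, delete=False, parent=None):
        self.name = name if parent is None else f"{parent.name}:{name}"
        self.match, self.category, self.delete, self.parent = match, category, delete, parent

    def applies(self, mail):
        if self.parent is not None and not self.parent.applies(mail):
            return False
        return self.match(mail)

    def run(self, mail):
        if not self.applies(mail):
            return
        if self.delete:
            mail.deleted = True
        elif self.category:
            mail.categories.add(self.category)


def contains(mail, phrase):
    # Case-insensitive whole-word (or whole-phrase) match on subject and body.
    text = mail.subject + "\n" + mail.body
    return re.search(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE) is not None


bob = Rule("%bob", lambda m: BOB in [m.sender, *m.recipients])
bets = Rule("bets", lambda m: contains(m, "bet") or contains(m, "wager"),
            category="personal:bets with bob", parent=bob)
jobs = Rule("jobs", lambda m: contains(m, "job posting"), category="jobs", parent=bob)
finance = Rule("finance", lambda m: contains(m, "finance"), category="jobs:finance", parent=jobs)
spam = Rule("%spam", lambda m: m.subject.startswith("ADV:"), delete=True)
RULES = [bob, bets, jobs, finance, spam]


def process(mail):
    for rule in RULES:
        rule.run(mail)
    return mail


mail = process(Mail(BOB, ["me@mycompany.com"], "Job posting in Finance",
                    "Details attached. Also, a friendly wager: $10 on the game?"))
print(sorted(mail.categories))
```

Note that every matching rule adds a category rather than moving the message, so a faulty rule can at worst add a wrong label; only an explicit delete rule such as %spam removes anything.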

Consider the following usage scenario: All new mail will have the default "inbox" category attached to it. User-created rules will attach additional categories to each incoming message. Finally, the user may manually assign more categories when viewing the individual message.

This means that to see new mail, the user simply views the "inbox" category. Since most users treat their inbox as "messages I have not processed yet", once the user reads and processes a message in the inbox, the user can simply detach the "inbox" category and it disappears from the "messages I have not processed" list. Note that the message still remains filed away in all of its other categories.

This will be an important distinction, actually -- the difference between deleting a message (which completely destroys the message), or removing it from a given category / view.

The capability to search (i.e., define a view) on anything is one of the key concepts of this system. Users can search on category names, any field in the message header (e.g., From, To, CC, Subject, Message-Id, etc.), and any combination thereof. For example, you don't actually need a category for "mail from Bob" -- such a view will be available because the underlying system automatically indexes on the "From" field -- you can simply have a view of the value "bob@mycompany.com" in the From field. Categories are more intended to further organize messages in addition to examining all fields in each message's header.
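
As an illustration of "automatically index every header field", the sketch below uses Python's standard `email` package to parse a message and build an in-memory index. The index layout, the `index_headers` function, and the sample message are assumptions made for this example; a real system would keep the index in the database.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import getaddresses

RAW = """\
From: Bob <bob@mycompany.com>
To: me@mycompany.com
Cc: Alice <alice@mycompany.com>
Subject: Job posting in Finance
Message-ID: <1@mycompany.com>

There is an opening in Finance you might like.
"""

ADDRESS_FIELDS = {"from", "to", "cc", "bcc"}


def index_headers(index, msg_key, raw_text):
    """Add every header of one message to index[(field, value)] -> {message keys}."""
    msg = message_from_string(raw_text)
    for name, value in msg.items():
        field = name.lower()
        if field in ADDRESS_FIELDS:
            # Index bare addresses so "Bob <bob@...>" and "bob@..." land together.
            values = [addr.lower() for _, addr in getaddresses([value])]
        else:
            values = [value.strip()]
        for v in values:
            index[(field, v)].add(msg_key)


index = defaultdict(set)
index_headers(index, "msg-1", RAW)

print(index[("from", "bob@mycompany.com")])
print(index[("subject", "Job posting in Finance")])
```

With an index like this, "mail from Bob" is a lookup on ("from", "bob@mycompany.com") and needs no category at all.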

Most modern mail clients offer some form of search capability (ranging from very primitive keyword searches to sophisticated field text pattern matching searches), but most are still bound to the mail folders concept -- searching scopes cannot be dynamic (i.e., based on a view), and the results of a search cannot themselves be searched. Plus, they're one-time searches, not continual views into the current pool of messages.

Basically, what it comes down to is removing some of the arbitrary artificial constraints concerning the storage and retrieval of mail messages -- allow any given message to be filed in any number of ways, combined with the idea of a high degree of automation such that the incoming flood tide of mail can be automatically (and manually) organized in a dynamic manner.

Technical details

All of the above can be described with a few basic precepts:

  • using an RDBMS to store messages

  • using the full power of SQL to search for messages

  • indexing messages by reference, not by value

  • allowing an arbitrary number of user-defined, hierarchical categories to be attached to a message, and indexing on those categories

  • automatically indexing each message by every field in the message header
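
The precepts above might translate into a schema along these lines. This is a deliberately small sketch: SQLite (via Python's built-in `sqlite3` module) stands in for whatever RDBMS would really be used, and every table, column, and function name here is an assumption made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE messages (
    id  INTEGER PRIMARY KEY,
    raw TEXT NOT NULL                      -- the full message, stored exactly once
);
CREATE TABLE headers (                     -- one row per header field of each message
    message_id INTEGER NOT NULL REFERENCES messages(id) ON DELETE CASCADE,
    field      TEXT NOT NULL,
    value      TEXT NOT NULL
);
CREATE INDEX headers_by_field ON headers(field, value);
CREATE TABLE categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE              -- e.g. 'job postings:finance'
);
CREATE TABLE message_categories (          -- many-to-many: filing by reference
    message_id  INTEGER NOT NULL REFERENCES messages(id) ON DELETE CASCADE,
    category_id INTEGER NOT NULL REFERENCES categories(id) ON DELETE CASCADE,
    PRIMARY KEY (message_id, category_id)
);
"""

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript(SCHEMA)


def file_under(db, message_id, category):
    """Attach a category to a message, creating the category on first use."""
    db.execute("INSERT OR IGNORE INTO categories(name) VALUES (?)", (category,))
    db.execute(
        "INSERT OR IGNORE INTO message_categories "
        "SELECT ?, id FROM categories WHERE name = ?",
        (message_id, category),
    )


msg_id = db.execute("INSERT INTO messages(raw) VALUES (?)",
                    ("(message text with a 2MB attachment)",)).lastrowid
db.execute("INSERT INTO headers VALUES (?, 'from', 'bob@mycompany.com')", (msg_id,))
for name in ("inbox", "job postings:finance", "personal:bets with bob"):
    file_under(db, msg_id, name)

# A category plus all of its sub-categories: match the name itself or the 'name:' prefix.
rows = db.execute(
    """SELECT DISTINCT m.id FROM messages m
       JOIN message_categories mc ON mc.message_id = m.id
       JOIN categories c ON c.id = mc.category_id
       WHERE c.name = :cat OR c.name LIKE :cat || ':%'""",
    {"cat": "job postings"},
).fetchall()
print(rows)
print(db.execute("SELECT COUNT(*) FROM messages").fetchone()[0])
```

Storing the hierarchy as a ":"-separated name keeps the sketch short, but the prefix match treats "%" and "_" in category names as wildcards; a parent-id column would be a more robust choice.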

Although there are at least several mail servers that use a real RDBMS on the back-end (many of us have been conditioned to think in terms of sendmail, which uses /var/mail-style flat files -- but not all mail servers do this), this is not quite what I'm talking about. The client needs to have visibility into the message store database. So even if the server uses an RDBMS back-end, if the client connects via IMAP or POP, it won't have access to the power of the RDBMS. Hence, we need something more.

A few approaches come to mind:

  1. Make all servers standardize on a common database schema. Then we can have open mail clients that can talk to any server (probably via some kind of ODBC connection), and life is good.

    But the practical possibility of getting this to happen is slim to none. Not only because back-end RDBMS schemas are proprietary and closed (and probably rightfully so), but also because trying to get all vendors to agree on a common database schema would be next to impossible.

  2. Make all servers standardize a common protocol to access the back-end database. Hence, open clients can connect to any server.

    Although this seems tempting, recall that SQL effectively fills this requirement (tunneled over whatever network protocol is appropriate, such as some flavor of ODBC). So abstracting away the SQL while still giving all the power of searching and whatnot (that SQL is designed to do), we'd really only be going a half step above SQL itself. So while I don't want to discount this possibility (since it would be much easier to get vendors to support a protocol than to force them in a specific database schema), I think some experience needs to be gained with the whole RDBMS approach first before anyone could understand enough to design such a protocol.

  3. Separate the mail server from the message store. An easy example of this would be to have a sendmail server with a customized mail.local (or every user has a .forward) that inserts the incoming message into a database instead of /var/mail. A separate RDBMS server can be running (and not necessarily even on the same machine as sendmail) to accept both the incoming messages, as well as listen/respond to ODBC connections from clients.

    Yep -- that's right -- mail clients use ODBC to retrieve their mail. Forget opening /var/mail/username, and forget using mh-style folders. Just open up an ODBC connection (which can even be across the network -- no need for it to be local).

I'm thinking that #3 is the easiest to implement first. #2 might be possible after we understand #3 and gain some experience with database schemas that would be required to implement it.

Indeed, to implement #3, all you need is the following:

  • design a database schema that can handle all the requirements described above (this will actually take a considerable amount of thought and design to do properly)
  • an agent to insert new messages into the database (either a mail.local or an executable to be invoked from .forward), probably with a default "inbox" category attached to it
  • take an open source mail client, and, assuming that it has at least a semi-modular approach (and at best, a formal API) to reading/writing mail messages from/to mail folders, rip out the guts of the mail folders access routines and replace them with database calls
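
The second item, the delivery agent, could be sketched as below. Again this is only an assumption-laden illustration: SQLite stands in for the separate RDBMS server, the three-table schema is a simplified placeholder, and the `deliver` function is a made-up name. A real agent would also need locking, error reporting back to sendmail, and so on.

```python
import sqlite3
from email import message_from_bytes

SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (id INTEGER PRIMARY KEY, raw BLOB NOT NULL);
CREATE TABLE IF NOT EXISTS headers (message_id INTEGER, field TEXT, value TEXT);
CREATE TABLE IF NOT EXISTS message_categories (
    message_id INTEGER, category TEXT, PRIMARY KEY (message_id, category));
"""


def deliver(db, raw, default_category="inbox"):
    """Store one raw message, index its headers, and attach the default category."""
    msg = message_from_bytes(raw)
    with db:  # one transaction: the message is stored completely or not at all
        msg_id = db.execute("INSERT INTO messages(raw) VALUES (?)", (raw,)).lastrowid
        db.executemany(
            "INSERT INTO headers VALUES (?, ?, ?)",
            [(msg_id, name.lower(), str(value)) for name, value in msg.items()],
        )
        db.execute("INSERT INTO message_categories VALUES (?, ?)",
                   (msg_id, default_category))
    return msg_id


db = sqlite3.connect(":memory:")   # a real agent would open the shared message store
db.executescript(SCHEMA)

# When invoked from .forward or as a mail.local replacement, the raw message
# would come from sys.stdin.buffer.read(); a literal stands in for it here.
incoming = (b"From: bob@mycompany.com\r\nTo: me@mycompany.com\r\n"
            b"Subject: Job posting in Finance\r\n\r\nSee attached.\r\n")
new_id = deliver(db, incoming)
print(db.execute("SELECT category FROM message_categories WHERE message_id = ?",
                 (new_id,)).fetchall())
```

User-defined rules would run right after the insert, inside the same transaction, so that a message never becomes visible without at least its "inbox" category.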

That's the basics. There's millions of features and details to be worked out, but that's the gist of it.

And here's some random thoughts / implications of what this all could mean:

  • Assumedly, the back-end database can either be a per-user database or a one-database-for-all-users. It would be nice to allow it both ways. But if one or the other has to be chosen, I think the all-users DB would be much more useful and user friendly. It would also allow "public folders" kind of functionality (see below), since everyone shares the same DB message store.
  • Threading messages by Message-ID, without the artificial separation of the sent-mail folder, will now actually show the whole conversation, not just the messages that people sent to you (I love this idea!).
  • If incoming messages automatically have an "inbox" category attached to them, users can safely detach the "inbox" category and leave the message filed away in other categories. i.e., there needs to be a clear / easy way to do this that is distinct from "delete message".
  • Spam busting has great potential here -- you can even filter based on any machine that the message relayed through, not just originating e-mail addresses, etc.
  • Think of it the other way around -- take a single message, and show its relations to other messages. For example, message X has these categories, is part of this(these) thread(s), is one of 38 messages from Bob that you received today, and is the 25th out of 48 messages on the LAM listserv that you have received in the last week. And so on.
  • Basically -- anything you can do in SQL, you can do in a view. You can set the scope of a view to be arbitrarily large (all messages in the database) or arbitrarily small (a single message, or a single thread).
  • High-quality clients can still do local caching of messages (a la high-quality IMAP clients today) and views (i.e., results of searches) to improve client performance.
  • Key to all of this will be a simple and powerful interface. Create/edit/delete categories is simple enough. But making an interface that makes views and rules easy to create/edit/delete will be absolutely essential.
  • Views should be stored in the database itself. That is, whatever SQL or search string is necessary to execute the "#sent-mail" view should be stored in the database itself. Hence, if I connect with client A or with client B, I can still see the same "#sent-mail" view.
  • "Public folders" (a la MS Exchange, or any IMAP server) can be implemented with special, reserved categories. It may be a good thing to define some "system reserved" category prefixes that cannot be defined by a user.
  • If a single back-end database is used to store all user messages, system administrators actually have a larger degree of control over user mail spools. Consider -- many companies have a "max e-mail age" policy, such that mails over age X should not be kept. With a RDBMS back-end, a search and removal of messages older than X is trivial.
  • Some kind of message export from the database will probably need to be supported, such as dumping to /var/mail-style mail folders, mh-style folders, XML, or perhaps to another database.
  • Consider making a second ODBC connection to another server to be able to access other message stores. There are oodles of web-based listserv archives out there, why not give people raw access to a database containing the archives instead of forcing a web interface? The possibilities here are very interesting... Consider a mailing list where no mail is sent out via SMTP. Subscribers still submit mail via SMTP (i.e., conventional mail clients), but they simply make ODBC connections to "receive" mail from the list. As a subscriber, I would configure my mail client to make an ODBC connection not only to my "home" mail server, but also to the LAM listserv ODBC. Messages to the LAM list would still show up in my inbox view (if I wanted them to, that is), but they were never actually pushed via SMTP to every subscriber on the list -- they just appeared in the database, and clients pulled them. Granted, this has obvious scalability problems, so a more realistic example might be providing ODBC connections in a read-only fashion for archive searching, etc. (vs. everyday usage). But it's still interesting. :-)
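
Two of the points above, storing view definitions in the database and enforcing a "max e-mail age" policy, are easy to sketch. As before, SQLite stands in for the real RDBMS, and the table layout, the `run_view` and `purge_older_than` functions, and the sample data are all hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT,
                       received TEXT, subject TEXT);   -- received: ISO-8601 date
CREATE TABLE views (name TEXT PRIMARY KEY, where_clause TEXT NOT NULL);
""")
db.executemany(
    "INSERT INTO messages(sender, recipient, received, subject) VALUES (?, ?, ?, ?)",
    [
        ("bob@mycompany.com", "me@mycompany.com", "2002-09-20", "Finance job posting"),
        ("me@mycompany.com", "bob@mycompany.com", "2002-09-23", "Re: Finance job posting"),
        ("list@example.org", "me@mycompany.com", "2001-01-05", "Old list traffic"),
    ],
)

# The view definition lives in the database, so every client sees the same views.
db.execute("INSERT INTO views VALUES (?, ?)",
           ("#sent mail", "sender = 'me@mycompany.com'"))
db.execute("INSERT INTO views VALUES (?, ?)",
           ("#mail with bob", "'bob@mycompany.com' IN (sender, recipient)"))


def run_view(db, name):
    # The stored clause is spliced into SQL, so only trusted users may define views.
    (where,) = db.execute("SELECT where_clause FROM views WHERE name = ?",
                          (name,)).fetchone()
    return db.execute(f"SELECT subject FROM messages WHERE {where} ORDER BY id").fetchall()


def purge_older_than(db, cutoff_date):
    """Administrative 'max e-mail age' policy: remove everything received before cutoff."""
    with db:
        return db.execute("DELETE FROM messages WHERE received < ?",
                          (cutoff_date,)).rowcount


print(run_view(db, "#mail with bob"))
print(purge_older_than(db, "2002-01-01"))
```

Storing a raw WHERE clause is the simplest thing that could work, but it also means a view definition is effectively executable code; a real system would more likely store a structured search description and generate the SQL from it.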

The whole point here is that mail clients today are bound by artificially limiting data stores. If we remove those limiting factors and instead use a very powerful data store and start using KM kinds of tools with e-mail, the possibilities are truly interesting.

None of this is holy writ. Like I mentioned in the beginning, this was an idea brewing in my subconscious for a few years, and it only just took on words and active dialogue with others within the last week. So although this idea intrigues me greatly, if I ever get around to implementing it, it may be substantially different from what I have outlined above. :-)

That being said, comments and suggestions on this are welcome.



This page contains a single entry from the blog posted on September 23, 2002 8:56 PM.
